FT-Grid: A Fault-Tolerance System for e-Science

P. Townend, P. Groth, N. Looker, and J. Xu, “FT-Grid: A Fault-Tolerance System for e-Science,” presented at UK e-Science Programme All Hands Meeting, Nottingham, UK, 2005.

Abstract: The FT-Grid system introduces a fault-tolerance framework based on multi-version design that allows faults occurring in service-based systems to be tolerated, thus increasing the dependability of such systems. This paper details the progress made in the development of FT-Grid, including both a GUI client and a web service interface. We present empirical evidence of the dependability benefits offered by FT-Grid through a dependability analysis based on fault injection testing with the WS-FIT tool. We then illustrate a potential problem with voting-based fault-tolerance approaches in the service-oriented paradigm: individual channels within fault-tolerant systems may invoke common services as part of their workflow, thus increasing the potential for common-mode failure. We propose a solution to this issue that uses provenance to provide FT-Grid with topological awareness. We implement a large test system and, using the PreServ provenance system developed as part of the PASOA project at the University of Southampton, perform a large number of experiments which show that a topologically aware FT-Grid system is much more dependable than any other configuration tested, whilst imposing a negligible timing overhead.
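
The common-mode failure problem described in the abstract can be illustrated with a small sketch. The code below is not FT-Grid's implementation and does not use the PreServ API; the function names, the channel and service names, and the weighting policy (each channel's vote is divided by the number of channels that share its most widely shared service) are all assumptions chosen for illustration. It compares a plain majority vote with a vote that uses provenance-style records of which services each channel invoked.

```python
from collections import Counter, defaultdict


def majority_vote(results):
    """Return the value reported by a strict majority of channels, else None."""
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0]
    return value if votes > len(results) / 2 else None


def topology_aware_vote(results, provenance):
    """Weighted vote that discounts channels sharing a backend service.

    `provenance` maps each channel to the services it invoked (hypothetical
    records, standing in for what a provenance store might return).
    """
    # Count how many channels invoked each service.
    usage = Counter(s for ch in results for s in set(provenance[ch]))
    # Assumed policy: a channel's weight is 1 / (size of its largest sharing group).
    weights = {
        ch: 1.0 / max((usage[s] for s in provenance[ch]), default=1)
        for ch in results
    }
    tally = defaultdict(float)
    for ch, value in results.items():
        tally[value] += weights[ch]
    value, weight = max(tally.items(), key=lambda kv: kv[1])
    # Accept only if the winner holds more than half of the total weight.
    return value if weight > sum(weights.values()) / 2 else None


# Hypothetical scenario: channels A, B and C all invoke the same faulty
# service "s1" and return 41; D and E use independent services and return 42.
results = {"A": 41, "B": 41, "C": 41, "D": 42, "E": 42}
provenance = {
    "A": ["s1"], "B": ["s1"], "C": ["s1"],
    "D": ["s2"], "E": ["s3"],
}

print(majority_vote(results))                    # 41: the shared fault wins
print(topology_aware_vote(results, provenance))  # 42: shared channels are discounted
```

In this scenario the three channels that depend on the same service outvote the two independent channels under plain majority voting, whereas the weighted vote treats them as roughly one source of evidence. Other policies, such as discarding channels that share services entirely, would be equally plausible under the same assumptions.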