Abstract
The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false-positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two settings: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver.
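The trade-off between timeliness and false positives can be illustrated with a minimal sketch of a timeout-based unreliable detector. This is not the service's actual design: the heartbeat mechanism, the class name `HeartbeatDetector`, and the parameters `interval` and `missed_allowed` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatDetector:
    """Toy unreliable failure detector (hypothetical, for illustration only)."""
    interval: float        # assumed seconds between heartbeats
    missed_allowed: int    # heartbeats that may be missed before suspicion
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from a component.
        self.last_seen[component] = now

    def suspected(self, now: float) -> List[str]:
        # A component is suspected once it has been silent for longer than
        # the timeout. Suspicion is only a guess: a late heartbeat clears it.
        timeout = self.interval * (self.missed_allowed + 1)
        return sorted(c for c, t in self.last_seen.items() if now - t > timeout)


# An aggressive setting reports quickly but misjudges a delayed heartbeat;
# a tolerant setting waits longer and avoids that false positive.
aggressive = HeartbeatDetector(interval=1.0, missed_allowed=0)
tolerant = HeartbeatDetector(interval=1.0, missed_allowed=2)
for det in (aggressive, tolerant):
    det.heartbeat("worker", now=0.0)

early_aggressive = aggressive.suspected(now=1.2)
early_tolerant = tolerant.suspected(now=1.2)

# The delayed heartbeat finally arrives, and the suspicion is withdrawn.
aggressive.heartbeat("worker", now=1.5)
after_arrival = aggressive.suspected(now=1.6)
```

Raising `missed_allowed` lowers the chance of wrongly suspecting a slow but live component, at the cost of a longer delay before a real failure is reported.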
Stelling, P., DeMatteis, C., Foster, I. et al. A fault detection service for wide area distributed computations. Cluster Computing 2, 117–128 (1999). https://doi.org/10.1023/A:1019070407281