A fault detection service for wide area distributed computations

Paul Stelling¹,
Cheryl DeMatteis¹,
Ian Foster²,
Carl Kesselman³,
Craig Lee¹ &
Gregor von Laszewski²

Abstract

The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and operating as part of the NetSolve network-enabled numerical solver.
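
The trade-off between timely reporting and false positives can be illustrated with a minimal sketch. It assumes a simple heartbeat-and-timeout detector, which is one common way to build an unreliable fault detector; the abstract does not specify the service's actual interface, so the class `HeartbeatMonitor`, its parameters `interval` and `missed_allowed`, and the helper `run` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical timeout-based detector (illustration only, not the paper's API)."""
    interval: float        # assumed: expected seconds between heartbeats
    missed_allowed: int    # assumed: heartbeats that may be missed before suspecting
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        """Record that a heartbeat from `component` arrived at time `now`."""
        self.last_seen[component] = now

    def suspected(self, now: float) -> List[str]:
        """Return components silent for longer than the timeout, sorted by name."""
        timeout = self.interval * self.missed_allowed
        return sorted(c for c, t in self.last_seen.items() if now - t > timeout)


def run(monitor: HeartbeatMonitor, arrivals: List[float], checks: List[float],
        component: str = "worker") -> List[float]:
    """Replay heartbeat arrival times; return the check times at which
    the component was suspected of having failed."""
    pending = sorted(arrivals)
    suspected_at = []
    for now in sorted(checks):
        # Deliver every heartbeat that has arrived by this check time.
        while pending and pending[0] <= now:
            monitor.heartbeat(component, pending.pop(0))
        if component in monitor.suspected(now):
            suspected_at.append(now)
    return suspected_at


# Scenario A: the component is alive, but one heartbeat is delayed until t=3.5.
DELAYED = [0.0, 1.0, 3.5]
# Scenario B: the component crashes after its heartbeat at t=1.0.
CRASHED = [0.0, 1.0]
CHECKS = [0.5, 1.5, 2.5, 3.0, 4.0, 4.5]

if __name__ == "__main__":
    for name, missed in (("eager", 1), ("patient", 3)):
        for label, arrivals in (("delayed", DELAYED), ("crashed", CRASHED)):
            times = run(HeartbeatMonitor(1.0, missed), arrivals, CHECKS)
            print(name, label, times)
```

In this sketch the eager detector (one missed heartbeat tolerated) reports the crash at the first check after the timeout, but it also wrongly suspects the live component while its heartbeat is delayed. The patient detector (three missed heartbeats tolerated) avoids that false positive at the cost of reporting the real crash later.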

Author information

Authors and Affiliations

The Aerospace Corporation, El Segundo, CA, 90245-4691, USA
Paul Stelling, Cheryl DeMatteis & Craig Lee
Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL, 60439, USA
Ian Foster & Gregor von Laszewski
Information Sciences Institute, University of Southern California, Marina del Rey, CA, 90292, USA
Carl Kesselman

About this article

Cite this article

Stelling, P., DeMatteis, C., Foster, I. et al. A fault detection service for wide area distributed computations. Cluster Computing 2, 117–128 (1999). https://doi.org/10.1023/A:1019070407281

Issue Date: September 1999
DOI: https://doi.org/10.1023/A:1019070407281
