Abstract
We study the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first propose new failure detectors that are particularly suitable to the crash-recovery model. We next determine under what conditions stable storage is necessary to solve consensus in this model. Using the new failure detectors, we give two consensus algorithms that match these conditions: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice — those with no failures or failure detector mistakes. In such runs, consensus is achieved within 3δ time and with 4n messages, where δ is the maximum message delay and n is the number of processes in the system.
Research partially supported by NSF grant CCR-9402896 and CCR-9711403, by ARPA/ONR grant N00014-96-1-1014, and by an Olin Fellowship.
Preview
Unable to display preview. Download preview PDF.
References
M. K. Aguilera, W. Chen, and S. Toueg. Heartbeat: a timeout-free failure detector for quiescent reliable communication. In Proceedings of the 11th International Workshop on Distributed Algorithms, Lecture Notes on Computer Science. Springer-Verlag, Sept. 1997. A full version is also available as Technical Report 97-1631, Computer Science Department, Cornell University, Ithaca, New York, May 1997.
M. K. Aguilera, W. Chen, and S. Toueg. Failure detection and consensus in the crash-recovery model. Technical Report 98-1676, Department of Computer Science, Cornell University, April 1998.
T. D. Chandra, V. Hadzilacos, and S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 43(4):685–722, July 1996.
T. D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.
D. Dolev, R. Friedman, I. Keidar, and D. Malkhi. Failure detectors in omission failure environments. Technical Report 96-1608, Department of Computer Science, Cornell University, Ithaca, New York, Sept. 1996.
R. Guerraoui, R. Oliveira, and A. Schiper. Stubborn communication channels. Technical report, Département d'Informatique, Ecole Polytechnique Fédérale, Lausanne, Switzerland, Dec. 1996.
M. Hurfin, A. Mostefaoui, and M. Raynal. Consensus in asynchronous systems where processes can crash and recover. Technical Report 1144, Institut de Recherche en Informatique et Systèmes Aléatoires, Université de Rennes, Nov. 1997.
G. Neiger and S. Toueg. Automatically increasing the fault-tolerance of distributed algorithms. Journal of Algorithms, 11(3):374–419, 1990.
R. Oliveira, R. Guerraoui, and A. Schiper. Consensus in the crash-recover model. Technical Report 97-239, Département d'Informatique, Ecole Polytechnique Fédérale, Lausanne, Switzerland, Aug. 1997.
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Aguilera, M.K., Chen, W., Toueg, S. (1998). Failure detection and consensus in the crash-recovery model. In: Kutten, S. (eds) Distributed Computing. DISC 1998. Lecture Notes in Computer Science, vol 1499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0056486
Download citation
DOI: https://doi.org/10.1007/BFb0056486
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65066-9
Online ISBN: 978-3-540-49693-9
eBook Packages: Springer Book Archive