The overhead of consensus failure recovery

Partha Dutta¹,
Rachid Guerraoui² &
Idit Keidar³

105 Accesses
11 Citations
Explore all metrics

Abstract

Many reliable distributed systems are consensus-based and typically operate under two modes: a fast normal mode in failure-free synchronous periods, and a slower recovery mode following asynchrony and failures. A lot of work has been devoted to optimize the normal mode, but little has focused on optimizing the recovery mode. This paper seeks to understand whether the recovery mode is inherently slower than the normal mode. In particular, we consider consensus algorithms in the round-based eventually synchronous model of [11], where t out of n processes may fail by crashing, messages may be lost, and the system may be asynchronous for arbitrarily long, but eventually the system becomes synchronous and no new failure occurs (we say that the system becomes stable). For t ≥ n/3, we prove a lower bound of three rounds for achieving a global decision whenever the system becomes stable, and we contrast this with a bound of two rounds when t < n/3. We then give matching algorithms for both t ≥ n/3 and t < n/3.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

References

ACM: Special issue on group communications systems. Commun. ACM 39(4) (1996)
Amir, Y., Tutu, C.: From total order to database replication. In: Proceedings of the 22nd IEEE International Conference on Distributed Computing Systems (ICDCS-22) (2002)
Birman, K., van Renessee, R.: Reliable Distributed Computing with the Isis Toolkit. IEEE Computer Society Press (1993)
Chandra T.D., Toueg S. (1996) Unreliable failure detectors for reliable distributed systems. J. ACM 43(2): 225–267
Article MATH MathSciNet Google Scholar
Charron-Bost B., Schiper A. (2004) Uniform consensus is harder than consensus. J. Algorithms 51(1):15–37
Article MATH MathSciNet Google Scholar
Chockler G.V., Keidar I., Vitenberg R. (2001) Group communication specifications: a comprehensive study. ACM Comput. Surv. 33(4): 1–43
Article Google Scholar
Cristian, F., Fetzer, C.: The timed asynchronous distributed system model. IEEE Trans. Parallel Distrib. Syst. 10(6) (1999)
Dolev D., Reischuk R., Strong R. (1990) Early stopping in byzantine agreement. J. ACM 37(4): 720–741
Article MATH MathSciNet Google Scholar
Dutta, P., Guerraoui, R.: Fast indulgent consensus with zero degradation. In: Proceedings of the Fourth European Dependable Computing Conference (EDCC-4). Toulouse, France (2002)
Dutta, P., Guerraoui, R.: The inherent price of indulgence. Distrib. Comput. 18(1), 85–98 (2005). A preliminary version appeared in the Proceedings of the 21st ACM Symposium on Principles of Distributed Computing (PODC-21), 2002
Google Scholar
Dwork C., Lynch N.A., Stockmeyer L. (1988) Consensus in the presence of partial synchrony. J. ACM 35(2): 288–323
Article MathSciNet Google Scholar
El Abbadi, A., Skeen, D., Cristian, F.: An efficient fault-tolerant protocol for replicated data management. In: Proceedings of the 4th ACM Conference on Principles of Database Systems (1985)
Fischer M.J., Lynch N.A. (1982) A lower bound for the time to assure interactive consistency. Inform. Process. Lett. 14(4): 183–186
Article MATH MathSciNet Google Scholar
Fischer M.J., Lynch N.A., Paterson M.S. (1985) Impossibility of distributed consensus with one faulty process. J. ACM 32(2): 374–382
Article MATH MathSciNet Google Scholar
Friedman, R., Vaysburd, A.: Fast replicated state machines over partitionable networks. In: Proceedings of the 16th IEEE Symposium on Reliable Distributed Systems (SRDS-16), pp. 130–137. IEEE Computer Society (1997)
Gafni, E.: Round-by-round fault detectors: Unifying synchrony and asynchrony. In: Proceedings of the 17th ACM Symposium on Principles of Distributed Computing (PODC-17), pp. 143–152. Puerto Vallarta, Mexico (1998)
Guerraoui, R.: Revisiting the relationship between non blocking atomic commitment and consensus problems. In: Proceedings of the 9th International Workshop on Distributed Algorithms (WDAG-9) (1995)
Keidar, I., Dolev, D.: Efficient message ordering in dynamic networks. In: Proceedings of the 15th ACM Symposium on Principles of Distributed Computing (PODC-15), pp. 68–76. New York, NY (1996)
Keidar, I., Rajsbaum, S.: On the cost of fault-tolerant consensus when there are no faults – a tutorial. Tech. Rep. MIT-LCS-TR-821, MIT (2001). PODC 2002 Tutorial
Keidar I., Rajsbaum S. (2003) A simple proof of the uniform consensus synchronous lower bound. Inform. Process. Lett. 85(1): 47–52
Article MATH MathSciNet Google Scholar
Lamport L. (1978) Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21(7): 558–565
Article MATH Google Scholar
Lamport, L.: The part-time parliament. Tech. Rep. 49, Systems Research Center, Digital Equipment Corp, Palo Alto (1989). A revised version of the paper also appeared in ACM Trans. Comput. Syst. 16(2), 133–169 (1998)
Google Scholar
Lamport, L., Fischer, M.: Byzantine generals and transaction commit protocols. Technical Report 62, SRI International (1982)
Lamport L., Shostak R., Pease M. (1982) The byzantine generals problem. ACM Trans. Program. Lang. Syst. 4(3): 382–401
Article MATH Google Scholar
Lampson, B.: How to build a highly available system using consensus. In: Proceedings of the 10th International Workshop on Distributed Algorithms (WDAG-10), pp. 1–15. Bologna, Italy (1996)
Mostefaoui A., Raynal M. (2001) Leader-based consensus. Parallel Process. Lett. 11(1): 95–107
Article MathSciNet Google Scholar
Oki, B., Liskov, B.: Viewstamped replication: a general primary copy method to support highly available distributed systems. In: Proceedings of the 7th ACM Symposium on Principles of Distributed Computing (PODC-7), pp. 8–17. Toronto, Ontario, Canada (1988)
Santoro, N., Widmayer, P.: Time is not a healer. In: 6th Annual Symp. Theor. Aspects of Computer Science, LNCS, vol. 349, pp. 304–313. Springer, Berlin Heidelberg New York (1989)
Schneider F.B. (1990) Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv. 22(4): 299–319
Article Google Scholar
Thekkath, C.A., Mann, T., Lee, E.K.: Frangipani: A scalable distributed file system. In: ACM SIGOPS Symposium on Operating Systems Principles (SOSP), pp. 224–237 (1997)

Download references

Author information

Authors and Affiliations

Bell Labs Research, Bangalore, India
Partha Dutta
School of Computer and Communication Sciences, EPFL, Lausanne, 1015, Switzerland
Rachid Guerraoui
Department of Electrical Engineering, The Technion, Haifa, 32000, Israel
Idit Keidar

Authors

Partha Dutta
View author publications
You can also search for this author in PubMed Google Scholar
Rachid Guerraoui
View author publications
You can also search for this author in PubMed Google Scholar
Idit Keidar
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Partha Dutta.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Dutta, P., Guerraoui, R. & Keidar, I. The overhead of consensus failure recovery. Distrib. Comput. 19, 373–386 (2007). https://doi.org/10.1007/s00446-006-0017-6

Download citation

Received: 05 April 2004
Accepted: 31 July 2006
Published: 28 October 2006
Issue Date: April 2007
DOI: https://doi.org/10.1007/s00446-006-0017-6

The overhead of consensus failure recovery

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Asynchronous Consensus in Synchronous Systems Using send_to_all Primitive

Optimistically tuning synchronous byzantine consensus: another win for null messages

Adaptive Consensus: Enhancing Robustness in Dynamic Environments

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

The overhead of consensus failure recovery

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Asynchronous Consensus in Synchronous Systems Using send_to_all Primitive

Optimistically tuning synchronous byzantine consensus: another win for null messages

Adaptive Consensus: Enhancing Robustness in Dynamic Environments

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation