Article

A low-overhead recovery technique using quasi-synchronous checkpointing

ICDCS '96: Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)

Page 100

Published: 27 May 1996

Abstract

In this paper, we propose a quasi-synchronous checkpointing algorithm and a low-overhead recovery algorithm based on it. The checkpointing algorithm preserves process autonomy by allowing processes to take checkpoints asynchronously, and it uses communication-induced checkpoint coordination to advance the recovery line, which helps bound rollback propagation during recovery. Thus, it combines the simplicity and low overhead of asynchronous checkpointing with the recovery-time advantages of synchronous checkpointing. Checkpointing involves no extra messages, and the additional checkpointing overhead is nominal. The algorithm ensures that, at all times, a recovery line exists that is consistent with the latest checkpoint of any process. The recovery algorithm exploits this property to restore the system to a state consistent with the latest checkpoint of a failed process. Recovery is free of the domino effect: a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. To avoid the domino effect, the recovery algorithm uses selective pessimistic message logging at the receiver end. Recovery is asynchronous for a single process failure. Neither the recovery algorithm nor the checkpointing algorithm requires the channels to be FIFO. We do not use vector timestamps to determine dependencies between checkpoints, since vector timestamps generally result in high message overhead during failure-free operation.
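The abstract does not spell out how communication-induced coordination works, so the following is only a minimal sketch of one common way to realize the idea, not the authors' algorithm. It assumes each process keeps an integer checkpoint sequence number that is piggybacked on every application message (so no extra messages are needed), and that a receiver takes a forced checkpoint before processing a message whose sequence number is ahead of its own. The names `Process`, `Message`, `basic_checkpoint`, and `receive` are hypothetical; message logging and the recovery procedure are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint sequence number, piggybacked (assumed scheme)


@dataclass
class Process:
    pid: int
    sn: int = 0                                      # current checkpoint sequence number
    state: list = field(default_factory=list)        # toy application state
    checkpoints: list = field(default_factory=list)  # (sn, kind, state snapshot)

    def __post_init__(self):
        self._save("initial")

    def _save(self, kind):
        # Record a snapshot of the state under the current sequence number.
        self.checkpoints.append((self.sn, kind, list(self.state)))

    def basic_checkpoint(self):
        # Taken autonomously, at a time of the process's own choosing.
        self.sn += 1
        self._save("basic")

    def send(self, payload):
        # No control messages: the sequence number rides on the application message.
        return Message(payload, self.sn)

    def receive(self, msg):
        # Forced checkpoint: catch up with the sender BEFORE processing the
        # message, so the message is not received before a checkpoint that
        # shares the sequence number of a checkpoint preceding its send.
        if msg.sn > self.sn:
            self.sn = msg.sn
            self._save("forced")
        self.state.append(msg.payload)


if __name__ == "__main__":
    p, q = Process(0), Process(1)
    p.basic_checkpoint()      # p advances to sequence number 1 on its own
    q.receive(p.send("x"))    # q is forced to checkpoint before processing "x"
    print(q.checkpoints)
```

Under these assumptions, checkpoints carrying the same sequence number on different processes line up as candidates for a recovery line, which is the intuition behind bounding rollback propagation without extra messages or vector timestamps.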

Cited By

Kiehn A, Raj P, Singh P (2014). A Causal Checkpointing Algorithm for Mobile Computing Environments. Proceedings of the 15th International Conference on Distributed Computing and Networking - Volume 8314, pp. 134-148. Online publication date: 4-Jan-2014.
https://dl.acm.org/doi/10.1007/978-3-642-45249-9_9
Perumalla K, Protopopescu V (2013). Reversible simulations of elastic collisions. ACM Transactions on Modeling and Computer Simulation 23:2, pp. 1-25. Online publication date: 10-May-2013.
https://dl.acm.org/doi/10.1145/2457459.2457461
Luo Y, Manivannan D (2011). Theoretical and experimental evaluation of communication-induced checkpointing protocols in FE and FLazy-E families. Performance Evaluation 68:5, pp. 429-445. Online publication date: 1-May-2011.
https://dl.acm.org/doi/10.1016/j.peva.2011.01.005

A low-overhead recovery technique using quasi-synchronous checkpointing
1. Computer systems organization
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures

Information

Published In

ICDCS '96: Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96)

May 1996

ISBN: 0818673982

Publisher

IEEE Computer Society

United States

Qualifiers

Article

Cited By

Bosilca G, Bouteiller A, Herault T, Lemarinier P, Dongarra J (2010). Dodging the cost of unavoidable memory copies in message logging protocols. Proceedings of the 17th European MPI users' group meeting conference on Recent advances in the message passing interface, pp. 189-197. Online publication date: 12-Sep-2010.
https://dl.acm.org/doi/10.5555/1894122.1894148
Manivannan D (2008). Checkpointing and rollback recovery in distributed systems. Proceedings of the 12th WSEAS international conference on Systems, pp. 569-574. Online publication date: 22-Jul-2008.
https://dl.acm.org/doi/10.5555/1580134.1580272
Manivannan D, Jiang Q, Yang J, Singhal M (2008). A quasi-synchronous checkpointing algorithm that prevents contention for stable storage. Information Sciences: an International Journal 178:15, pp. 3110-3117. Online publication date: 1-Aug-2008.
https://dl.acm.org/doi/10.5555/1379466.1383676
Jiang Q, Luo Y, Manivannan D (2008). An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems. Journal of Parallel and Distributed Computing 68:12, pp. 1575-1589. Online publication date: 1-Dec-2008.
https://dl.acm.org/doi/10.1016/j.jpdc.2008.08.003
Baudé F, Caromel D, Delbé C, Henrio L, Yelick K, Mellor-Crummey J (2007). Promised messages. Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 154-155. Online publication date: 14-Mar-2007.
https://dl.acm.org/doi/10.1145/1229428.1229463
Sekhar Paul H, Gupta A, Sharma A (2006). Finding a suitable checkpoint and recovery protocol for a distributed application. Journal of Parallel and Distributed Computing 66:5, pp. 732-749. Online publication date: 1-May-2006.
https://dl.acm.org/doi/10.1016/j.jpdc.2005.12.008
Kumar K, Hansdah R (2006). An efficient and scalable checkpointing and recovery algorithm for distributed systems. Proceedings of the 8th international conference on Distributed Computing and Networking, pp. 94-99. Online publication date: 27-Dec-2006.
https://dl.acm.org/doi/10.1007/11947950_11