More Web Proxy on the site http://driver.im/

research-article

Quasi-Synchronous Checkpointing: Models, Characterization, and Classification

Authors:

Mukesh SinghalAuthors Info & Claims

IEEE Transactions on Parallel and Distributed Systems, Volume 10, Issue 7

Pages 703 - 713

https://doi.org/10.1109/71.780865

Published: 01 July 1999 Publication History

Abstract

Checkpointing algorithms are classified as synchronous and asynchronous in the literature. In synchronous checkpointing, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system. Synchronizing checkpointing activity involves message overhead and process execution may have to be suspended during the checkpointing coordination, resulting in performance degradation. In asynchronous checkpointing, processes take checkpoints without any coordination with others. Asynchronous checkpointing provides maximum autonomy for processes to take checkpoints; however, some of the checkpoints taken may not lie on any consistent global checkpoint, thus making the checkpointing efforts useless. Asynchronous checkpointing algorithms in the literature can reduce the number of useless checkpoints by making processes take communication induced checkpoints besides asynchronous checkpoints. We call such algorithms quasi-synchronous. In this paper, we present a theoretical framework for characterizing and classifying such algorithms. The theory not only helps to classify and characterize the quasi-synchronous checkpointing algorithms, but also helps to analyze the properties and limitations of the algorithms belonging to each class. It also provides guidelinesfor designing and evaluating such algorithms.

References

[1]

A. Acharya and B.R. Badrinath, “Checkpointing Distributed Applications on Mobile Computer,” Proc. Third Int'l Conf. Parallel and Distributed Information Systems, Sept. 1994.

Digital Library

[2]

R. Baldoni J.M. Helary A. Mostefaoui and M. Raynal, “Consistent Checkpoints in Message Passing Distributed Systems,” Rapporte de Recherche no. 2564, INRIA, France, June 1995.

[3]

R. Baldoni J.M. Helary A. Mostefaoui and M. Raynal, “A Communication Induced Algorithm that Ensures the Rollback Dependency Trackability,” Proc. 27th Int'l Symp. Fault-Tolerant Computing, Seattle, Wash., July 1997.

Digital Library

[4]

B. Bhargava and S.R. Lian, “Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems—An Optimistic Approach,” Proc. Seventh IEEE Symp. Reliable Distributed Systems, pp. 3-12, 1988.

[5]

D. Briatico A. Ciuffoloetti and L. Simoncini, “A Distributed Domino-Effect Free Recovery Algorithm,” Proc. IEEE Fourth Symp. Reliability in Distributed Software and Database Systems, pp. 207-215, 1984.

[6]

L. Moura e Silva and J.G. Silva, “Global Checkpointing for Distributed Programs,” Proc. Symp. Reliable Distributed Systems, pp. 155-162, 1992.

[7]

K.H. Kim, “Programmer-Transparent Coordination of Recovering Concurrent Processes: Philosophy and Rules for Efficient Implementation,” IEEE Trans. Software Eng., vol. 14, no. 6, pp. 810-821, June 1988.

Digital Library

[8]

R. Koo and S. Toueg, “Checkpointing and Roll-Back Recovery for Distributed Systems,” IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.

Digital Library

[9]

A.D. Kshemkalyani M. Raynal and M. Singhal, “An Introduction to Snapshot Algorithms in Distributed Computing,” Distributed Systems Eng. J., vol. 2, no. 4, pp. 224-233, Dec. 1995.

[10]

K. Tsuruoka A. Kaneko and Y. Nishihara, “Dynamic Recovery Schemes for Distributed Process,” Proc. IEEE Second Symp. Reliability in Distributed Software and Database Systems, pp. 124-130, 1981.

[11]

L. Lamport, “Time, Clocks and Ordering of Events in Distributed Systems,” Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.

Digital Library

[12]

K. Li J.F. Naughton and J.S. Plank, “Checkpointing Multicomputer Application,” Proc. 10th Symp. Reliable Distributed Systems, pp. 2-11, 1991.

[13]

D. Manivannan R.H.B. Netzer and M. Singhal, “Finding Consistent Global Checkpoints in a Distributed Computation,” Technical Report OSU-CISRC-3/96-TR16, The Ohio State Univ., Dept. of Computer and Information Science, 1996.

[14]

D. Manivannan R.H.B. Netzer and M. Singhal, “Finding Consistent Global Checkpoints in a Distributed Computation,” IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 6, pp. 623-627, June 1997.

Digital Library

[15]

D. Manivannan and M. Singhal, “A Low-Overhead Recovery Technique Using Quasi-Synchronous Checkpointing,” Proc. 16th Int'l Conf. Distributed Computing Sytems, pp. 100-107, Hong Kong, May 1996.

Digital Library

[16]

F. Mattern, “Virtual Time and Global States of Distributed Systems,” Parallel and Distributed Algorithms, M. Cosnard et al., eds., pp. 215-226. North Holland: Elsevier Science, 1989.

[17]

R.H.B. Netzer and J. Xu, “Necessary and Sufficient Conditions for Consistent Global Snapshots,” IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 2, pp. 165-169, Feb. 1995.

Digital Library

[18]

M. Raynal and M. Singhal, “Logical Time: Capturing Causality in Distributed Systems,” Computer, vol. 29, no. 2, pp. 49-56, Feb. 1996.

Digital Library

[19]

D. Russel, “State Restoration in Systems of Communicating Processes,” IEEE Trans. Software Eng., vol. 6, no. 2, pp. 183-194, 1980.

Digital Library

[20]

J. Tsai S.-Y. Kuo and Y.-M. Wang, “Theorectical Analysis for Communication-Induced Checkpointing Protocols with Rollback-Dependency Trackability,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 963-971, Oct. 1998.

Digital Library

[21]

J. Tsai Y.-M. Wang and S.-Y. Kuo, “Evaluation of Domino-Free Communication-Induced Checkpointing Protocols,” Research Report MSR-TR-98-43, Microsoft Corp., http://www.research.microsoft.com/scripts/pubdb/trpub.asp, Sept. 1998.

[22]

K. Venkatesh T. Radhakrishnan and H.F. Li, “Optimal Checkpointing and Local Encoding for Domino-Free Rollback Recovery,” Information Processing Letters, vol. 25, pp. 295-303, July 1987.

Digital Library

[23]

Y.-M. Wang, “Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints,” IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, Apr. 1997.

Digital Library

[24]

J. Xu and R.H.B. Netzer, “Adaptive Independent Checkpointing for Reducing Rollback Propagation,” Proc. Fifth IEEE Symp. Parallel and Distributed Processing, Dec. 1993.

Digital Library

Cited By

Dathathri RGill GHoang LPingali KBahar IHerlihy MWitchel ELebeck A(2019)PhoenixProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304056(615-630)Online publication date: 4-Apr-2019
https://dl.acm.org/doi/10.1145/3297858.3304056
Vora KTian CGupta RHu Z(2017)CoRALACM SIGARCH Computer Architecture News10.1145/3093337.303774745:1(223-236)Online publication date: 4-Apr-2017
https://dl.acm.org/doi/10.1145/3093337.3037747
Vora KTian CGupta RHu Z(2017)CoRALACM SIGPLAN Notices10.1145/3093336.303774752:4(223-236)Online publication date: 4-Apr-2017
https://dl.acm.org/doi/10.1145/3093336.3037747
Show More Cited By

Index Terms

Quasi-Synchronous Checkpointing: Models, Characterization, and Classification

Recommendations

A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes checkpoint almost simultaneously. This causes ...
A Communication-Induced Checkpointing and Asynchronous Recovery Protocol for Mobile Computing Systems
PDCAT '05: Proceedings of the Sixth International Conference on Parallel and Distributed Computing Applications and Technologies

Mobile computing systems have many constraints such as low battery power, low bandwidth , high mobility and lack of stable storage which are not presented in static distributed systems. In this paper, we propose an efficient communication-induced ...
Quasi-synchronous checkpointing and failure recovery in distributed systems

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Parallel and Distributed Systems

IEEE Transactions on Parallel and Distributed Systems Volume 10, Issue 7

July 1999

96 pages

ISSN:1045-9219

Editor:
John A. Stankovic
Univ. of Virginia

Issue’s Table of Contents

Copyright © Copyright © 1999 IEEE. All Rights Reserved.

Publisher

IEEE Press

Publication History

Published: 01 July 1999

Author Tags

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

36
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 20 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Dathathri RGill GHoang LPingali KBahar IHerlihy MWitchel ELebeck A(2019)PhoenixProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304056(615-630)Online publication date: 4-Apr-2019
https://dl.acm.org/doi/10.1145/3297858.3304056
Vora KTian CGupta RHu Z(2017)CoRALACM SIGARCH Computer Architecture News10.1145/3093337.303774745:1(223-236)Online publication date: 4-Apr-2017
https://dl.acm.org/doi/10.1145/3093337.3037747
Vora KTian CGupta RHu Z(2017)CoRALACM SIGPLAN Notices10.1145/3093336.303774752:4(223-236)Online publication date: 4-Apr-2017
https://dl.acm.org/doi/10.1145/3093336.3037747
Vora KTian CGupta RHu ZChen YTemam OCarter J(2017)CoRALProceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3037697.3037747(223-236)Online publication date: 4-Apr-2017
https://dl.acm.org/doi/10.1145/3037697.3037747
Qian XSen KHargrove PIancu C(2016)SReplayProceedings of the 2016 International Conference on Supercomputing10.1145/2925426.2926264(1-13)Online publication date: 1-Jun-2016
https://dl.acm.org/doi/10.1145/2925426.2926264
Khlif HHadj Kacem HHernandez SHadj Kacem AEichler CSimón A(2016)An efficient validation approach for quasi-synchronous checkpointing oriented to distributed diagnosabilityJournal of Systems and Software10.1016/j.jss.2016.04.070122:C(364-377)Online publication date: 1-Dec-2016
https://dl.acm.org/doi/10.1016/j.jss.2016.04.070
Gupta GSridharan SSohi G(2014)Globally precise-restartable execution of parallel programsACM SIGPLAN Notices10.1145/2666356.259430649:6(181-192)Online publication date: 9-Jun-2014
https://dl.acm.org/doi/10.1145/2666356.2594306
Gupta GSridharan SSohi GO'Boyle MPingali K(2014)Globally precise-restartable execution of parallel programsProceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/2594291.2594306(181-192)Online publication date: 9-Jun-2014
https://dl.acm.org/doi/10.1145/2594291.2594306
Luo YManivannan D(2011)Theoretical and experimental evaluation of communication-induced checkpointing protocols in FE and FLazy-E familiesPerformance Evaluation10.1016/j.peva.2011.01.00568:5(429-445)Online publication date: 1-May-2011
https://dl.acm.org/doi/10.1016/j.peva.2011.01.005
Wu JManivannan DThuraisingham B(2009)Necessary and sufficient conditions for transaction-consistent global checkpoints in a distributed database systemInformation Sciences: an International Journal10.1016/j.ins.2009.06.016179:20(3659-3672)Online publication date: 1-Sep-2009
https://dl.acm.org/doi/10.1016/j.ins.2009.06.016
Show More Cited By

View Options

View options

Media

Figures

Other

Tables

View Issue’s Table of Contents