[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Quasi-Synchronous Checkpointing: Models, Characterization, and Classification

Published: 01 July 1999 Publication History

Abstract

Checkpointing algorithms are classified as synchronous and asynchronous in the literature. In synchronous checkpointing, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system. Synchronizing checkpointing activity involves message overhead and process execution may have to be suspended during the checkpointing coordination, resulting in performance degradation. In asynchronous checkpointing, processes take checkpoints without any coordination with others. Asynchronous checkpointing provides maximum autonomy for processes to take checkpoints; however, some of the checkpoints taken may not lie on any consistent global checkpoint, thus making the checkpointing efforts useless. Asynchronous checkpointing algorithms in the literature can reduce the number of useless checkpoints by making processes take communication induced checkpoints besides asynchronous checkpoints. We call such algorithms quasi-synchronous. In this paper, we present a theoretical framework for characterizing and classifying such algorithms. The theory not only helps to classify and characterize the quasi-synchronous checkpointing algorithms, but also helps to analyze the properties and limitations of the algorithms belonging to each class. It also provides guidelinesfor designing and evaluating such algorithms.

References

[1]
A. Acharya and B.R. Badrinath, “Checkpointing Distributed Applications on Mobile Computer,” Proc. Third Int'l Conf. Parallel and Distributed Information Systems, Sept. 1994.
[2]
R. Baldoni J.M. Helary A. Mostefaoui and M. Raynal, “Consistent Checkpoints in Message Passing Distributed Systems,” Rapporte de Recherche no. 2564, INRIA, France, June 1995.
[3]
R. Baldoni J.M. Helary A. Mostefaoui and M. Raynal, “A Communication Induced Algorithm that Ensures the Rollback Dependency Trackability,” Proc. 27th Int'l Symp. Fault-Tolerant Computing, Seattle, Wash., July 1997.
[4]
B. Bhargava and S.R. Lian, “Independent Checkpointing and Concurrent Rollback for Recovery in Distributed Systems—An Optimistic Approach,” Proc. Seventh IEEE Symp. Reliable Distributed Systems, pp. 3-12, 1988.
[5]
D. Briatico A. Ciuffoloetti and L. Simoncini, “A Distributed Domino-Effect Free Recovery Algorithm,” Proc. IEEE Fourth Symp. Reliability in Distributed Software and Database Systems, pp. 207-215, 1984.
[6]
L. Moura e Silva and J.G. Silva, “Global Checkpointing for Distributed Programs,” Proc. Symp. Reliable Distributed Systems, pp. 155-162, 1992.
[7]
K.H. Kim, “Programmer-Transparent Coordination of Recovering Concurrent Processes: Philosophy and Rules for Efficient Implementation,” IEEE Trans. Software Eng., vol. 14, no. 6, pp. 810-821, June 1988.
[8]
R. Koo and S. Toueg, “Checkpointing and Roll-Back Recovery for Distributed Systems,” IEEE Trans. Software Eng., vol. 13, no. 1, pp. 23-31, Jan. 1987.
[9]
A.D. Kshemkalyani M. Raynal and M. Singhal, “An Introduction to Snapshot Algorithms in Distributed Computing,” Distributed Systems Eng. J., vol. 2, no. 4, pp. 224-233, Dec. 1995.
[10]
K. Tsuruoka A. Kaneko and Y. Nishihara, “Dynamic Recovery Schemes for Distributed Process,” Proc. IEEE Second Symp. Reliability in Distributed Software and Database Systems, pp. 124-130, 1981.
[11]
L. Lamport, “Time, Clocks and Ordering of Events in Distributed Systems,” Comm. ACM, vol. 21, no. 7, pp. 558-565, July 1978.
[12]
K. Li J.F. Naughton and J.S. Plank, “Checkpointing Multicomputer Application,” Proc. 10th Symp. Reliable Distributed Systems, pp. 2-11, 1991.
[13]
D. Manivannan R.H.B. Netzer and M. Singhal, “Finding Consistent Global Checkpoints in a Distributed Computation,” Technical Report OSU-CISRC-3/96-TR16, The Ohio State Univ., Dept. of Computer and Information Science, 1996.
[14]
D. Manivannan R.H.B. Netzer and M. Singhal, “Finding Consistent Global Checkpoints in a Distributed Computation,” IEEE Trans. Parallel and Distributed Systems, vol. 8, no. 6, pp. 623-627, June 1997.
[15]
D. Manivannan and M. Singhal, “A Low-Overhead Recovery Technique Using Quasi-Synchronous Checkpointing,” Proc. 16th Int'l Conf. Distributed Computing Sytems, pp. 100-107, Hong Kong, May 1996.
[16]
F. Mattern, “Virtual Time and Global States of Distributed Systems,” Parallel and Distributed Algorithms, M. Cosnard et al., eds., pp. 215-226. North Holland: Elsevier Science, 1989.
[17]
R.H.B. Netzer and J. Xu, “Necessary and Sufficient Conditions for Consistent Global Snapshots,” IEEE Trans. Parallel and Distributed Systems, vol. 6, no. 2, pp. 165-169, Feb. 1995.
[18]
M. Raynal and M. Singhal, “Logical Time: Capturing Causality in Distributed Systems,” Computer, vol. 29, no. 2, pp. 49-56, Feb. 1996.
[19]
D. Russel, “State Restoration in Systems of Communicating Processes,” IEEE Trans. Software Eng., vol. 6, no. 2, pp. 183-194, 1980.
[20]
J. Tsai S.-Y. Kuo and Y.-M. Wang, “Theorectical Analysis for Communication-Induced Checkpointing Protocols with Rollback-Dependency Trackability,” IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 963-971, Oct. 1998.
[21]
J. Tsai Y.-M. Wang and S.-Y. Kuo, “Evaluation of Domino-Free Communication-Induced Checkpointing Protocols,” Research Report MSR-TR-98-43, Microsoft Corp., http://www.research.microsoft.com/scripts/pubdb/trpub.asp, Sept. 1998.
[22]
K. Venkatesh T. Radhakrishnan and H.F. Li, “Optimal Checkpointing and Local Encoding for Domino-Free Rollback Recovery,” Information Processing Letters, vol. 25, pp. 295-303, July 1987.
[23]
Y.-M. Wang, “Consistent Global Checkpoints that Contain a Given Set of Local Checkpoints,” IEEE Trans. Computers, vol. 46, no. 4, pp. 456-468, Apr. 1997.
[24]
J. Xu and R.H.B. Netzer, “Adaptive Independent Checkpointing for Reducing Rollback Propagation,” Proc. Fifth IEEE Symp. Parallel and Distributed Processing, Dec. 1993.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE Transactions on Parallel and Distributed Systems
IEEE Transactions on Parallel and Distributed Systems  Volume 10, Issue 7
July 1999
96 pages
ISSN:1045-9219
Issue’s Table of Contents

Publisher

IEEE Press

Publication History

Published: 01 July 1999

Author Tags

  1. Causality
  2. consistent global checkpoint
  3. distributed checkpointing
  4. failure recovery
  5. fault tolerance
  6. zigzag paths.

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 20 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2019)PhoenixProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304056(615-630)Online publication date: 4-Apr-2019
  • (2017)CoRALACM SIGARCH Computer Architecture News10.1145/3093337.303774745:1(223-236)Online publication date: 4-Apr-2017
  • (2017)CoRALACM SIGPLAN Notices10.1145/3093336.303774752:4(223-236)Online publication date: 4-Apr-2017
  • (2017)CoRALProceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3037697.3037747(223-236)Online publication date: 4-Apr-2017
  • (2016)SReplayProceedings of the 2016 International Conference on Supercomputing10.1145/2925426.2926264(1-13)Online publication date: 1-Jun-2016
  • (2016)An efficient validation approach for quasi-synchronous checkpointing oriented to distributed diagnosabilityJournal of Systems and Software10.1016/j.jss.2016.04.070122:C(364-377)Online publication date: 1-Dec-2016
  • (2014)Globally precise-restartable execution of parallel programsACM SIGPLAN Notices10.1145/2666356.259430649:6(181-192)Online publication date: 9-Jun-2014
  • (2014)Globally precise-restartable execution of parallel programsProceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/2594291.2594306(181-192)Online publication date: 9-Jun-2014
  • (2011)Theoretical and experimental evaluation of communication-induced checkpointing protocols in FE and FLazy-E familiesPerformance Evaluation10.1016/j.peva.2011.01.00568:5(429-445)Online publication date: 1-May-2011
  • (2009)Necessary and sufficient conditions for transaction-consistent global checkpoints in a distributed database systemInformation Sciences: an International Journal10.1016/j.ins.2009.06.016179:20(3659-3672)Online publication date: 1-Sep-2009
  • Show More Cited By

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media