
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

Published: 01 August 2008

Abstract

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm that reduces contention for network stable storage without any synchronization overhead.
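
The contention described in the abstract can be illustrated with a toy model. The sketch below is not the algorithm proposed in the paper, which this page does not reproduce; it only assumes a stable-storage server that writes one checkpoint at a time in arrival order, and compares processes that all start writing at once with processes whose start times are staggered. The function name `total_wait`, the process count, and the write time are hypothetical values chosen for illustration.

```python
def total_wait(start_times, write_time):
    """Total time processes spend queued at a storage server that
    writes one checkpoint at a time, in order of arrival (toy model)."""
    storage_free_at = 0.0
    waited = 0.0
    for start in sorted(start_times):
        begin = max(start, storage_free_at)  # wait while storage is busy
        waited += begin - start
        storage_free_at = begin + write_time
    return waited


n, write_time = 4, 2.0  # hypothetical: 4 processes, 2 time units per write

# All processes checkpoint at (almost) the same instant.
simultaneous = [0.0] * n
# Each process starts once the previous write would have finished.
staggered = [i * write_time for i in range(n)]

print(total_wait(simultaneous, write_time))  # 12.0
print(total_wait(staggered, write_time))     # 0.0
```

In this simplified model the simultaneous case makes the processes queue behind one another, while the staggered case removes the queueing entirely. A real system would also have to account for message traffic, checkpoint consistency, and variable write times.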

Information & Contributors

Information

Published In

Information Sciences: an International Journal  Volume 178, Issue 15
August, 2008
178 pages

Publisher

Elsevier Science Inc.

United States

Publication History

Published: 01 August 2008

Author Tags

  1. Checkpoint staggering
  2. Communication-induced checkpointing
  3. Distributed checkpointing
  4. Failure-recovery
  5. Fault-tolerance
  6. Rollback recovery
  7. Staggered checkpointing
  8. Uncoordinated

Qualifiers

  • Article

Cited By

  • (2017) Adaptive checkpointing with reliable storage in cloud environment. Multiagent and Grid Systems 13:3 (253-268). DOI: 10.3233/MGS-170270. Online publication date: 1-Jan-2017.
  • (2013) AI-Ckpt. Proceedings of the 22nd international symposium on High-performance parallel and distributed computing (155-166). DOI: 10.1145/2493123.2462918. Online publication date: 17-Jun-2013.
  • (2013) AI-Ckpt. Proceedings of the 22nd international symposium on High-performance parallel and distributed computing (155-166). DOI: 10.1145/2462902.2462918. Online publication date: 17-Jun-2013.
  • (2009) Necessary and sufficient conditions for transaction-consistent global checkpoints in a distributed database system. Information Sciences: an International Journal 179:20 (3659-3672). DOI: 10.1016/j.ins.2009.06.016. Online publication date: 1-Sep-2009.
  • (2008) Checkpointing and rollback recovery in distributed systems. Proceedings of the 12th WSEAS international conference on Systems (569-574). DOI: 10.5555/1580134.1580272. Online publication date: 22-Jul-2008.
