A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

Authors:

Jianchang Yang,

M. Singhal

Information Sciences—Informatics and Computer Science, Intelligent Systems, Applications: An International Journal, Volume 178, Issue 15

Pages 3110 - 3117

Published: 01 August 2008

Abstract

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint at almost the same time. This causes contention for network stable storage and degrades performance, because processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm that reduces contention for network stable storage without any synchronization overhead.
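
The contention described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the algorithm proposed in the paper: it assumes a single stable-storage server that handles one checkpoint write at a time in arrival order, and a fixed write time per checkpoint. The function name `waiting_times` and all parameter values are hypothetical.

```python
def waiting_times(start_times, write_time):
    """Return how long each process waits before its checkpoint write begins.

    Assumption (not from the paper): the storage server handles one write at
    a time, in order of arrival, and every write takes `write_time` units.
    """
    waits = [0.0] * len(start_times)
    server_free_at = 0.0
    # Serve requests in arrival order; ties are broken by process index.
    order = sorted(range(len(start_times)), key=lambda i: (start_times[i], i))
    for pid in order:
        begin = max(start_times[pid], server_free_at)
        waits[pid] = begin - start_times[pid]
        server_free_at = begin + write_time
    return waits


n, write_time = 4, 2.0

# All processes checkpoint at the same instant: later ones queue up.
simultaneous = waiting_times([0.0] * n, write_time)

# Checkpoints are staggered by one write time each: nobody queues.
staggered = waiting_times([i * write_time for i in range(n)], write_time)

print(simultaneous)  # [0.0, 2.0, 4.0, 6.0]
print(staggered)     # [0.0, 0.0, 0.0, 0.0]
```

In this model the last of four simultaneous writers waits three full write times, while staggered writers never wait. How the paper achieves staggering without synchronization messages is not stated in the abstract and is not modeled here.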

Published In

Information Sciences: an International Journal Volume 178, Issue 15

August, 2008

178 pages

ISSN: 0020-0255

Copyright © 2008 Elsevier Inc.

Publisher

Elsevier Science Inc.

United States

Qualifiers

Article

Cited By

Meroufel B, Belalem G (2017) Adaptive checkpointing with reliable storage in cloud environment. Multiagent and Grid Systems, 13:3 (253-268). Online publication date: 1-Jan-2017
https://dl.acm.org/doi/10.3233/MGS-170270

Nicolae B, Cappello F, Parashar M, Weissman J, Epema D, Figueiredo R (2013) AI-Ckpt. Proceedings of the 22nd international symposium on High-performance parallel and distributed computing (155-166). Online publication date: 17-Jun-2013
https://dl.acm.org/doi/10.1145/2493123.2462918
https://dl.acm.org/doi/10.1145/2462902.2462918

Wu J, Manivannan D, Thuraisingham B (2009) Necessary and sufficient conditions for transaction-consistent global checkpoints in a distributed database system. Information Sciences: an International Journal, 179:20 (3659-3672). Online publication date: 1-Sep-2009
https://dl.acm.org/doi/10.1016/j.ins.2009.06.016

Manivannan D (2008) Checkpointing and rollback recovery in distributed systems. Proceedings of the 12th WSEAS international conference on Systems (569-574). Online publication date: 22-Jul-2008
https://dl.acm.org/doi/10.5555/1580134.1580272
