
Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++

Published: 01 April 2006

Abstract

As the size of high performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are effective approaches to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time systems and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.
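To make the mechanism concrete, below is a minimal Charm++-flavored sketch (not taken from the paper) of the only code an application typically supplies: one pup() serialization routine per object plus a call asking the runtime to checkpoint. CkStartCheckpoint, CkCallback, CkExit, and the PUP::er framework are part of the public Charm++ API; the chare name Main, the entry methods iterate and resumeFromCheckpoint, the proxy mainProxy, the checkpoint interval, and the directory name are illustrative assumptions, and the .decl.h/.def.h headers would be generated by charmxi from a corresponding (assumed) main.ci interface file.

#include "main.decl.h"              // generated from the assumed main.ci interface file

/*readonly*/ CProxy_Main mainProxy; // global proxy to the mainchare, set in its constructor

class Main : public CBase_Main {
  int step;                         // application state that must survive a failure
  double result;

 public:
  Main(CkArgMsg *msg) : step(0), result(0.0) {
    mainProxy = thisProxy;
    delete msg;
    thisProxy.iterate();            // start the computation
  }
  Main(CkMigrateMessage *msg) { }   // constructor used when the object is restored

  // A single pup() routine describes the object's state. The runtime reuses it
  // for load-balancing migration, disk checkpoints, and in-memory checkpoints,
  // so no separate checkpoint or restart code has to be written.
  void pup(PUP::er &p) {
    p | step;
    p | result;
  }

  void iterate() {
    ++step;
    result += 1.0;                  // stand-in for real computation
    if (step % 100 == 0) {
      // Ask the runtime to checkpoint all objects to disk; execution resumes
      // at the callback entry method once the checkpoint has completed.
      CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
      CkStartCheckpoint("checkpoint_dir", cb);
    } else {
      thisProxy.resumeFromCheckpoint();
    }
  }

  void resumeFromCheckpoint() {
    if (step < 1000) thisProxy.iterate();
    else CkExit();
  }
};

#include "main.def.h"

A restart, possibly on a different number of processors, would then be requested at launch time in the style of ./charmrun +p8 ./app +restart checkpoint_dir; the +restart flag is the Charm++ restart mechanism, while the binary name and processor count shown here are placeholders.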

Published In

ACM SIGOPS Operating Systems Review, Volume 40, Issue 2
April 2006
107 pages
ISSN: 0163-5980
DOI: 10.1145/1131322

Publisher

Association for Computing Machinery, New York, NY, United States

Cited By

  • (2020) The Template Task Graph (TTG) - an emerging practical dataflow programming paradigm for scientific simulation at extreme scale. 2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pages 1-7, November 2020. DOI: 10.1109/ESPM251964.2020.00011
  • (2020) Reinit: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance. High Performance Computing, pages 536-554, June 2020. DOI: 10.1007/978-3-030-50743-5_27
  • (2019) Checkpoint/restart approaches for a thread-based MPI runtime. Parallel Computing, 85:C, pages 204-219, July 2019. DOI: 10.1016/j.parco.2019.02.006
  • (2019) Improving resilience of scientific software through a domain-specific approach. Journal of Parallel and Distributed Computing, 128:C, pages 99-114, June 2019. DOI: 10.1016/j.jpdc.2019.01.015
  • (2018) Transparent High-Speed Network Checkpoint/Restart in MPI. Proceedings of the 25th European MPI Users' Group Meeting, pages 1-11, September 2018. DOI: 10.1145/3236367.3236383
  • (2018) Managing key multicasting through orthogonal systems. Journal of Discrete Mathematical Sciences and Cryptography, 20:8, pages 1721-1740, January 2018. DOI: 10.1080/09720529.2016.1190563
  • (2017) Integrating External Resources with a Task-Based Programming Model. 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pages 307-316, December 2017. DOI: 10.1109/HiPC.2017.00043
  • (2016) Resilience for Massively Parallel Multigrid Solvers. SIAM Journal on Scientific Computing, 38:5, pages S217-S239, January 2016. DOI: 10.1137/15M1026122
  • (2014) Toward Exascale Resilience. Supercomputing Frontiers and Innovations: an International Journal, 1:1, pages 5-28, April 2014. DOI: 10.14529/jsfi140101
  • (2014) Dynamic load balancing in GPU-based systems for a MPI program. 2014 International Conference on High Performance Computing & Simulation (HPCS), pages 154-161, July 2014. DOI: 10.1109/HPCSim.2014.6903681
