
Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++

Published: 01 April 2006

Abstract

As the size of high performance clusters multiplies, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. Checkpoint-based fault tolerance methods are effective approaches to dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a fault occurs, the application is restarted from a recent checkpoint. However, the application developer is required to write significant additional code for checkpointing and restarting. This paper describes disk-based and memory-based checkpointing fault tolerance schemes that automate the task of checkpointing and restarting. The schemes also allow the program to be restarted on a different number of processors. These schemes are based on self-checkpointable, migratable objects supported by the Adaptive MPI (AMPI) and Charm++ run-time systems and can be applied to a wide class of applications written using MPI or message-driven languages. We demonstrate the effectiveness of the strategies and evaluate their performance.
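To make the mechanism concrete, below is a minimal Charm++-flavored sketch (not taken from the paper) of the only code an application typically supplies: one pup() serialization routine per object plus a call asking the runtime to checkpoint. CkStartCheckpoint, CkCallback, CkExit, and the PUP::er framework are part of the public Charm++ API; the chare name Main, the entry methods iterate and resumeFromCheckpoint, the proxy mainProxy, the checkpoint interval, and the directory name are illustrative assumptions, and the .decl.h/.def.h headers would be generated by charmxi from a corresponding (assumed) main.ci interface file.

#include "main.decl.h"              // generated from the assumed main.ci interface file

/*readonly*/ CProxy_Main mainProxy; // global proxy to the mainchare, set in its constructor

class Main : public CBase_Main {
  int step;                         // application state that must survive a failure
  double result;

 public:
  Main(CkArgMsg *msg) : step(0), result(0.0) {
    mainProxy = thisProxy;
    delete msg;
    thisProxy.iterate();            // start the computation
  }
  Main(CkMigrateMessage *msg) { }   // constructor used when the object is restored

  // A single pup() routine describes the object's state. The runtime reuses it
  // for load-balancing migration, disk checkpoints, and in-memory checkpoints,
  // so no separate checkpoint or restart code has to be written.
  void pup(PUP::er &p) {
    p | step;
    p | result;
  }

  void iterate() {
    ++step;
    result += 1.0;                  // stand-in for real computation
    if (step % 100 == 0) {
      // Ask the runtime to checkpoint all objects to disk; execution resumes
      // at the callback entry method once the checkpoint has completed.
      CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
      CkStartCheckpoint("checkpoint_dir", cb);
    } else {
      thisProxy.resumeFromCheckpoint();
    }
  }

  void resumeFromCheckpoint() {
    if (step < 1000) thisProxy.iterate();
    else CkExit();
  }
};

#include "main.def.h"

A restart, possibly on a different number of processors, would then be requested at launch time in the style of ./charmrun +p8 ./app +restart checkpoint_dir; the +restart flag is the Charm++ restart mechanism, while the binary name and processor count shown here are placeholders.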

Published In

ACM SIGOPS Operating Systems Review, Volume 40, Issue 2
April 2006
107 pages
ISSN: 0163-5980
DOI: 10.1145/1131322

Publisher

Association for Computing Machinery, New York, NY, United States

Cited By

  • (2020) The Template Task Graph (TTG) - an emerging practical dataflow programming paradigm for scientific simulation at extreme scale. 2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pages 1-7, November 2020. DOI: 10.1109/ESPM251964.2020.00011
  • (2020) Reinit: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance. High Performance Computing, pages 536-554, June 2020. DOI: 10.1007/978-3-030-50743-5_27
  • (2019) Checkpoint/restart approaches for a thread-based MPI runtime. Parallel Computing, 85:C, pages 204-219, July 2019. DOI: 10.1016/j.parco.2019.02.006
  • (2019) Improving resilience of scientific software through a domain-specific approach. Journal of Parallel and Distributed Computing, 128:C, pages 99-114, June 2019. DOI: 10.1016/j.jpdc.2019.01.015
  • (2018) Transparent High-Speed Network Checkpoint/Restart in MPI. Proceedings of the 25th European MPI Users' Group Meeting, pages 1-11, September 2018. DOI: 10.1145/3236367.3236383
  • (2018) Managing key multicasting through orthogonal systems. Journal of Discrete Mathematical Sciences and Cryptography, 20:8, pages 1721-1740, January 2018. DOI: 10.1080/09720529.2016.1190563
  • (2017) Integrating External Resources with a Task-Based Programming Model. 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pages 307-316, December 2017. DOI: 10.1109/HiPC.2017.00043
  • (2016) Resilience for Massively Parallel Multigrid Solvers. SIAM Journal on Scientific Computing, 38:5, pages S217-S239, January 2016. DOI: 10.1137/15M1026122
  • (2014) Toward Exascale Resilience. Supercomputing Frontiers and Innovations: an International Journal, 1:1, pages 5-28, April 2014. DOI: 10.14529/jsfi140101
  • (2014) Dynamic load balancing in GPU-based systems for a MPI program. 2014 International Conference on High Performance Computing & Simulation (HPCS), pages 154-161, July 2014. DOI: 10.1109/HPCSim.2014.6903681
