Coordinated Checkpoint/Restart Process Fault Tolerance for MPI Applications on HPC Systems
Publisher:
  • Indiana University
  • Indianapolis, IN
  • United States
ISBN: 978-1-124-24735-9
Order Number: AAI3423687
Pages: 202
Abstract

Scientists use advanced computing techniques to assist in answering the complex questions at the forefront of discovery. The High Performance Computing (HPC) scientific applications created by these scientists are running longer and scaling to larger systems. These applications must be able to tolerate the inevitable failure of a subset of processes (process failures) that occurs as a result of pushing the reliability boundaries of HPC systems. System reliability is emerging as a problem for future exascale systems, where the time to failure will be measured in minutes or hours instead of days or months. Resilient applications (i.e., applications that can continue to run despite process failures) depend on resilient communication and runtime environments to sustain the application across process failures. Unfortunately, such environments are not typically available on HPC systems. To preserve performance, scalability, and scientific accuracy, a resilient application may choose the invasiveness of the recovery solution, from completely transparent to completely application-directed. Resilient communication and runtime environments must therefore provide customizable fault recovery mechanisms.

Resilient applications often use rollback recovery techniques for fault tolerance: particularly popular are checkpoint/restart (C/R) techniques. HPC applications commonly use the Message Passing Interface (MPI) standard for communication. This thesis identifies a complete set of capabilities that compose to form a coordinated C/R infrastructure for MPI applications running on HPC systems. These capabilities, when integrated into an MPI implementation, provide applications with transparent, yet optionally application configurable, fault tolerance. By adding these capabilities to Open MPI we demonstrate support for C/R process fault tolerance, automatic recovery, proactive process migration, and parallel debugging. We also discuss how this infrastructure is being used to support further research into fault tolerance.
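The checkpoint/restart pattern the abstract describes can be sketched in a loose, application-level analogue. This is not the transparent, MPI-integrated infrastructure the thesis builds into Open MPI; it is a minimal single-process illustration in which an iterative computation periodically saves its state and, after a (simulated) process failure, restarts from the last saved checkpoint rather than from scratch. The function names, the checkpoint interval, and the failure-injection hook are all hypothetical choices for the sketch.

```python
import os
import pickle
import tempfile

# Illustrative sketch only: in a coordinated C/R protocol, all MPI processes
# would reach a coordination point (e.g., a barrier), drain in-flight
# messages, and save a mutually consistent state; here one "rank" is shown.

CKPT_INTERVAL = 10  # checkpoint every 10 iterations (arbitrary for the sketch)

def run(total_iters, ckpt_path, fail_at=None):
    """Run an iterative computation, checkpointing state periodically.

    If a checkpoint file exists, restart from it instead of from iteration 0.
    `fail_at` injects a simulated process failure for demonstration.
    """
    # Restart path: recover the last saved state if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"iter": 0, "acc": 0}

    while state["iter"] < total_iters:
        if fail_at is not None and state["iter"] == fail_at:
            raise RuntimeError("simulated process failure")
        state["acc"] += state["iter"]       # the "computation"
        state["iter"] += 1
        if state["iter"] % CKPT_INTERVAL == 0:
            # In a coordinated protocol, a barrier would precede this write.
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)
    return state["acc"]

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "rank0.ckpt")
    try:
        run(100, path, fail_at=57)   # fails after the checkpoint at iteration 50
    except RuntimeError:
        pass
    # Restart resumes from iteration 50, not from zero, and still
    # produces the same result as an uninterrupted run.
    print(run(100, path))            # prints 4950, i.e. sum(range(100))
```

The trade-off the abstract highlights is visible even here: this application-directed style requires the programmer to identify what state to save and where restart points are safe, whereas the transparent, system-level approach developed in the thesis captures process state without such application changes.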

Cited By

  1. Stals L (2019). Algorithm-based fault recovery of adaptively refined parallel multilevel grids. International Journal of High Performance Computing Applications, 33:1, 189-211.
  2. Shahzad F, Kreutzer M, Zeiser T, Machado R, Pieper A, Hager G and Wellein G (2018). Building and utilizing fault tolerance support tools for the GASPI applications. International Journal of High Performance Computing Applications, 32:5, 613-626.
  3. Abeyratne N, Chen H, Oh B, Dreslinski R, Chakrabarti C and Mudge T. Checkpointing Exascale Memory Systems with Existing Memory Technologies. Proceedings of the Second International Symposium on Memory Systems, 18-29.
  4. Gamell M, Teranishi K, Heroux M, Mayo J, Kolla H, Chen J and Parashar M. Exploring Failure Recovery for Stencil-based Applications at Extreme Scales. Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 279-282.
  5. Gamell M, Teranishi K, Heroux M, Mayo J, Kolla H, Chen J and Parashar M. Local recovery and failure masking for stencil-based applications at extreme scales. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-12.
  6. Gamell M, Katz D, Kolla H, Chen J, Klasky S and Parashar M. Exploring automatic, online failure recovery for scientific applications at extreme scales. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 895-906.
  7. Shahzad F, Wittmann M, Zeiser T and Wellein G. Asynchronous checkpointing by dedicated checkpoint threads. Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface, 289-290.
  8. Taifi M, Shi J and Khreishah A. SpotMPI. Proceedings of the 11th International Conference on Algorithms and Architectures for Parallel Processing - Volume Part II, 109-120.
Contributors
  • University of Washington
  • International Business Machines
