Scientists use advanced computing techniques to assist in answering the complex questions at the forefront of discovery. The High Performance Computing (HPC) scientific applications created by these scientists are running longer and scaling to larger systems. These applications must tolerate the process failures that inevitably occur when pushing the reliability boundaries of HPC systems. HPC system reliability is emerging as a problem in future exascale systems, where the time to failure is expected to be measured in minutes or hours instead of days or months. Resilient applications (i.e., applications that can continue to run despite process failures) depend on resilient communication and runtime environments to sustain the application across process failures. Unfortunately, such environments are not typically available on HPC systems. In order to preserve performance, scalability, and scientific accuracy, a resilient application should be able to choose the degree of invasiveness of the recovery solution, from completely transparent to completely application-directed. Therefore, resilient communication and runtime environments must provide customizable fault recovery mechanisms.
Resilient applications often rely on rollback recovery for fault tolerance, most commonly checkpoint/restart (C/R) techniques. HPC applications commonly use the Message Passing Interface (MPI) standard for communication. This thesis identifies a complete set of capabilities that compose to form a coordinated C/R infrastructure for MPI applications running on HPC systems. These capabilities, when integrated into an MPI implementation, provide applications with transparent, yet optionally application-configurable, fault tolerance. By adding these capabilities to Open MPI, we demonstrate support for C/R process fault tolerance, automatic recovery, proactive process migration, and parallel debugging. We also discuss how this infrastructure is being used to support further research into fault tolerance.
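To make the recovery spectrum concrete, the sketch below shows the fully application-directed end: an MPI program that periodically writes its own state so a restarted run can resume from the last consistent checkpoint. This is a minimal illustration, not the thesis infrastructure; the file name, checkpoint interval, and the trivial per-rank state are assumptions introduced for the example.

```c
/* Minimal sketch of application-directed checkpoint/restart with MPI.
 * The checkpoint file name, interval, and "state" are illustrative
 * assumptions; a real application would save its full working set. */
#include <mpi.h>
#include <stdio.h>

#define CKPT_FILE  "rank_%04d.ckpt"  /* hypothetical per-rank checkpoint file */
#define CKPT_EVERY 100               /* iterations between checkpoints */
#define NUM_STEPS  1000

int main(int argc, char **argv)
{
    int rank, start = 0;
    double state = 0.0;
    char fname[64];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    snprintf(fname, sizeof(fname), CKPT_FILE, rank);

    /* On restart, reload the last saved iteration and state, if any. */
    if ((fp = fopen(fname, "r")) != NULL) {
        if (fscanf(fp, "%d %lf", &start, &state) != 2) {
            start = 0;
            state = 0.0;
        }
        fclose(fp);
    }

    for (int step = start; step < NUM_STEPS; step++) {
        state += 1.0;                            /* stand-in for real work */
        MPI_Allreduce(MPI_IN_PLACE, &state, 1,   /* stand-in communication */
                      MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

        if (step % CKPT_EVERY == 0) {
            /* Coordinated checkpoint: every rank reaches the same point
             * before saving, so the saved states are mutually consistent. */
            MPI_Barrier(MPI_COMM_WORLD);
            fp = fopen(fname, "w");
            if (fp != NULL) {
                fprintf(fp, "%d %.17g\n", step + 1, state);
                fclose(fp);
            }
        }
    }

    MPI_Finalize();
    return 0;
}
```

At the transparent end of the spectrum, none of this code is needed: the C/R-enabled MPI implementation coordinates the ranks and captures process images itself. In Open MPI releases of that era, an unmodified application could typically be checkpointed and resumed with the `ompi-checkpoint` and `ompi-restart` commands, though the exact commands and options depend on the Open MPI version and its configured checkpointing support.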