Cited By
Georgakoudis G., Guo L., Laguna I. (2020). Reinit: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance. High Performance Computing, pp. 536-554. doi:10.1007/978-3-030-50743-5_27. Online publication date: 22-Jun-2020.
Fault tolerance has always been an important topic when running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems comprising millions of computing units. Moreover, ...
As high performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for ...
With the massive scale of high-performance computing systems, long-running scientific parallel applications periodically save the state of their execution to files called checkpoints to recover from system failures. Checkpoints are stored on external ...
Elsevier Science Publishers B. V.
Netherlands