
Checkpoint/restart approaches for a thread-based MPI runtime

Published: 01 July 2019

Highlights

Transparent checkpoint/restart can be applied to high-speed networks with collaboration from the MPI runtime, in particular a modular network layer.
Thread-based MPI runtimes can be checkpointed both transparently and at the application level without the blocking difficulties faced by their process-based counterparts.
We introduce an asynchronous checkpointing interface for transparent checkpointing; an illustrative sketch of the general shape such an interface could take is given below.
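
The interface itself is not reproduced on this page, so the following is only a minimal sketch written in the familiar MPI nonblocking style (start a checkpoint, keep computing, poll for completion). MPIX_Checkpoint_start and MPIX_Checkpoint_test are invented placeholder names, not the runtime's actual API, and the stub bodies exist only so the sketch compiles standalone.

```c
/* Hypothetical shape of an asynchronous, transparent checkpointing
 * interface in the MPI nonblocking style. The MPIX_* names are
 * invented for illustration; they are not real MPI or MPC calls. */
#include <mpi.h>

/* Stubs so the sketch compiles; a real runtime would snapshot
 * process and network state in the background here. */
static int MPIX_Checkpoint_start(MPI_Comm comm, MPI_Request *req) {
    (void)comm;
    *req = MPI_REQUEST_NULL;
    return MPI_SUCCESS;
}

static int MPIX_Checkpoint_test(MPI_Request *req, int *done) {
    (void)req;
    *done = 1; /* stub: report immediate completion */
    return MPI_SUCCESS;
}

void solver_loop(MPI_Comm comm, int max_iter) {
    MPI_Request ckpt_req = MPI_REQUEST_NULL;
    int ckpt_in_flight = 0;

    for (int iter = 0; iter < max_iter; iter++) {
        /* ... computation and communication for one step ... */

        if (!ckpt_in_flight && iter % 100 == 0) {
            /* Launch a checkpoint without blocking the application. */
            MPIX_Checkpoint_start(comm, &ckpt_req);
            ckpt_in_flight = 1;
        }
        if (ckpt_in_flight) {
            int done = 0;
            MPIX_Checkpoint_test(&ckpt_req, &done); /* poll for completion */
            if (done)
                ckpt_in_flight = 0;
        }
    }
}
```

The point of this asynchronous shape is that snapshot I/O can overlap the application's own computation, which is how a runtime can hide most of the checkpoint overhead from the critical path.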

Abstract

Fault tolerance has always been an important concern when running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems that gather millions of computing units; moreover, the larger a job is, the more computing hours a crash wastes. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets Checkpoint/Restart alone and leaves aside other features such as resiliency. We show how existing checkpointing methods can be applied in practice to a thread-based MPI implementation, given sufficient collaboration from the runtime. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication, enabled by dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh, and Heatdis, and the associated overheads and trade-offs are discussed.
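
As background for the distinction the abstract draws: in application-level checkpointing the program itself decides what state to serialize and when, whereas transparent checkpointing snapshots the process from outside. The sketch below shows the generic application-level pattern for an iterative MPI code; the per-rank file naming, the checkpoint interval, and the write_checkpoint/read_checkpoint helpers are assumptions made for illustration, not the mechanisms evaluated in the paper.

```c
/* Generic application-level checkpoint/restart pattern for an iterative
 * MPI solver. File layout, interval, and helper names are illustrative
 * assumptions, not the paper's mechanism. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CKPT_INTERVAL 100 /* checkpoint every 100 iterations (assumed) */

static void checkpoint_path(char *buf, size_t len, int rank) {
    snprintf(buf, len, "ckpt_rank%d.bin", rank); /* one file per rank */
}

static void write_checkpoint(int rank, int iter, const double *field, size_t n) {
    char path[256];
    checkpoint_path(path, sizeof(path), rank);
    FILE *f = fopen(path, "wb");
    if (!f) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }
    fwrite(&iter, sizeof(iter), 1, f);   /* iteration counter */
    fwrite(field, sizeof(double), n, f); /* solver state */
    fclose(f);
}

static int read_checkpoint(int rank, int *iter, double *field, size_t n) {
    char path[256];
    checkpoint_path(path, sizeof(path), rank);
    FILE *f = fopen(path, "rb");
    if (!f) return 0; /* no checkpoint found: cold start */
    if (fread(iter, sizeof(*iter), 1, f) != 1 ||
        fread(field, sizeof(double), n, f) != n) {
        fclose(f);
        return 0; /* partial or corrupt checkpoint: fall back to cold start */
    }
    fclose(f);
    return 1;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t n = 1 << 20; /* per-rank problem size (arbitrary) */
    double *field = calloc(n, sizeof(double));
    int iter = 0;
    read_checkpoint(rank, &iter, field, n); /* resume if a snapshot exists */

    for (; iter < 1000; iter++) {
        /* ... one solver step and its halo exchanges would go here ... */
        if (iter % CKPT_INTERVAL == 0) {
            MPI_Barrier(MPI_COMM_WORLD); /* quiesce at an iteration boundary */
            write_checkpoint(rank, iter, field, n);
        }
    }

    free(field);
    MPI_Finalize();
    return 0;
}
```

On restart, each rank reloads its last snapshot and resumes from the saved iteration; for the bulk-synchronous loop sketched here, checkpointing at an iteration boundary after a barrier keeps the set of per-rank files mutually consistent.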

Cited By

  • (2020) Reinit: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance. High Performance Computing, pp. 536–554. DOI: 10.1007/978-3-030-50743-5_27. Online publication date: 22-Jun-2020.

Information

Published In

Parallel Computing, Volume 85, Issue C
Jul 2019
243 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Checkpoint-restart
  2. Fault-tolerance
  3. DMTCP
  4. Infiniband
  5. Multilevel checkpointing
  6. MPI oversubscribing

Qualifiers

  • Research-article
