
Checkpoint/restart approaches for a thread-based MPI runtime

Published: 01 July 2019

Highlights

Transparent checkpoint/restart can be applied to high-speed networks with collaboration from the MPI runtime, in particular a modular network layer.
Thread-based MPI runtimes can be checkpointed both transparently and at the application level without the blocking difficulties faced by their process-based counterparts.
We introduce an asynchronous checkpointing interface for transparent checkpointing; an illustrative sketch of the general shape such an interface could take is given below.
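
The interface itself is not reproduced on this page, so the following is only a minimal sketch written in the familiar MPI nonblocking style (start a checkpoint, keep computing, poll for completion). MPIX_Checkpoint_start and MPIX_Checkpoint_test are invented placeholder names, not the runtime's actual API, and the stub bodies exist only so the sketch compiles standalone.

```c
/* Hypothetical shape of an asynchronous, transparent checkpointing
 * interface in the MPI nonblocking style. The MPIX_* names are
 * invented for illustration; they are not real MPI or MPC calls. */
#include <mpi.h>

/* Stubs so the sketch compiles; a real runtime would snapshot
 * process and network state in the background here. */
static int MPIX_Checkpoint_start(MPI_Comm comm, MPI_Request *req) {
    (void)comm;
    *req = MPI_REQUEST_NULL;
    return MPI_SUCCESS;
}

static int MPIX_Checkpoint_test(MPI_Request *req, int *done) {
    (void)req;
    *done = 1; /* stub: report immediate completion */
    return MPI_SUCCESS;
}

void solver_loop(MPI_Comm comm, int max_iter) {
    MPI_Request ckpt_req = MPI_REQUEST_NULL;
    int ckpt_in_flight = 0;

    for (int iter = 0; iter < max_iter; iter++) {
        /* ... computation and communication for one step ... */

        if (!ckpt_in_flight && iter % 100 == 0) {
            /* Launch a checkpoint without blocking the application. */
            MPIX_Checkpoint_start(comm, &ckpt_req);
            ckpt_in_flight = 1;
        }
        if (ckpt_in_flight) {
            int done = 0;
            MPIX_Checkpoint_test(&ckpt_req, &done); /* poll for completion */
            if (done)
                ckpt_in_flight = 0;
        }
    }
}
```

The point of this asynchronous shape is that snapshot I/O can overlap the application's own computation, which is how a runtime can hide most of the checkpoint overhead from the critical path.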

Abstract

Fault tolerance has always been an important concern when running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems that gather millions of computing units; moreover, the larger a job is, the more computing hours a crash wastes. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets Checkpoint/Restart alone and leaves aside other features such as resiliency. We show how existing checkpointing methods can be applied in practice to a thread-based MPI implementation, given sufficient collaboration from the runtime. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication, enabled by dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh, and Heatdis, and the associated overheads and trade-offs are discussed.
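
As background for the distinction the abstract draws: in application-level checkpointing the program itself decides what state to serialize and when, whereas transparent checkpointing snapshots the process from outside. The sketch below shows the generic application-level pattern for an iterative MPI code; the per-rank file naming, the checkpoint interval, and the write_checkpoint/read_checkpoint helpers are assumptions made for illustration, not the mechanisms evaluated in the paper.

```c
/* Generic application-level checkpoint/restart pattern for an iterative
 * MPI solver. File layout, interval, and helper names are illustrative
 * assumptions, not the paper's mechanism. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CKPT_INTERVAL 100 /* checkpoint every 100 iterations (assumed) */

static void checkpoint_path(char *buf, size_t len, int rank) {
    snprintf(buf, len, "ckpt_rank%d.bin", rank); /* one file per rank */
}

static void write_checkpoint(int rank, int iter, const double *field, size_t n) {
    char path[256];
    checkpoint_path(path, sizeof(path), rank);
    FILE *f = fopen(path, "wb");
    if (!f) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }
    fwrite(&iter, sizeof(iter), 1, f);   /* iteration counter */
    fwrite(field, sizeof(double), n, f); /* solver state */
    fclose(f);
}

static int read_checkpoint(int rank, int *iter, double *field, size_t n) {
    char path[256];
    checkpoint_path(path, sizeof(path), rank);
    FILE *f = fopen(path, "rb");
    if (!f) return 0; /* no checkpoint found: cold start */
    if (fread(iter, sizeof(*iter), 1, f) != 1 ||
        fread(field, sizeof(double), n, f) != n) {
        fclose(f);
        return 0; /* partial or corrupt checkpoint: fall back to cold start */
    }
    fclose(f);
    return 1;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t n = 1 << 20; /* per-rank problem size (arbitrary) */
    double *field = calloc(n, sizeof(double));
    int iter = 0;
    read_checkpoint(rank, &iter, field, n); /* resume if a snapshot exists */

    for (; iter < 1000; iter++) {
        /* ... one solver step and its halo exchanges would go here ... */
        if (iter % CKPT_INTERVAL == 0) {
            MPI_Barrier(MPI_COMM_WORLD); /* quiesce at an iteration boundary */
            write_checkpoint(rank, iter, field, n);
        }
    }

    free(field);
    MPI_Finalize();
    return 0;
}
```

On restart, each rank reloads its last snapshot and resumes from the saved iteration; for the bulk-synchronous loop sketched here, checkpointing at an iteration boundary after a barrier keeps the set of per-rank files mutually consistent.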

Cited By

  • (2020) Reinit: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance. High Performance Computing, pp. 536–554. DOI: 10.1007/978-3-030-50743-5_27. Online publication date: 22-Jun-2020.

Information

Published In

Parallel Computing, Volume 85, Issue C
Jul 2019
243 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Checkpoint-restart
  2. Fault-tolerance
  3. DMTCP
  4. Infiniband
  5. Multilevel checkpointing
  6. MPI oversubscribing

Qualifiers

  • Research-article
