DOI: 10.5555/2388996.2389022

Design and modeling of a non-blocking checkpointing system

Published: 10 November 2012

Abstract

As the capability and component count of systems increase, the mean time between failures (MTBF) decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to be sufficient for exascale-class machines, which are predicted to have orders of magnitude larger memory sizes and failure rates. Our solution combines the benefits of non-blocking and multi-level checkpointing. In this paper, we present the design of our system and model its performance. Our experiments show that our system can improve efficiency by 1.1x to 2.0x on future machines. Additionally, applications using our checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
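
To make the relationship between MTBF, checkpoint cost, and efficiency concrete, the following is a minimal sketch that uses Young's first-order approximation for the checkpoint interval together with a simple first-order waste estimate. It is not the model developed in the paper (the author tags indicate a Markov model); the function names and all numeric values are hypothetical and chosen only for illustration. Non-blocking checkpointing is crudely represented here as a smaller application stall per checkpoint, which is an assumption of this sketch.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimum interval between checkpoints: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_efficiency(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Estimated fraction of wall-clock time spent on useful work.

    waste = C / tau        (time stalled writing checkpoints)
          + tau / (2 * M)  (expected rework after a failure)
    evaluated at Young's interval. Restart time is ignored, and the
    estimate is only meaningful when C is much smaller than M.
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / tau + tau / (2.0 * mtbf_s)
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Hypothetical stall times per checkpoint, not taken from the paper.
    blocking_cost_s = 600.0    # application waits for the full PFS write
    overlapped_cost_s = 60.0   # most of the write overlaps with computation
    for mtbf_hours in (24.0, 6.0, 1.0):
        mtbf_s = mtbf_hours * 3600.0
        print(
            f"MTBF {mtbf_hours:5.1f} h: "
            f"blocking {first_order_efficiency(blocking_cost_s, mtbf_s):.3f}, "
            f"overlapped {first_order_efficiency(overlapped_cost_s, mtbf_s):.3f}"
        )
```

Under these assumptions the estimate falls as MTBF shrinks and rises as the per-checkpoint stall shrinks, which is the qualitative trend the abstract describes. The actual gains reported in the paper come from its own model and experiments.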

Published In

SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2012
1161 pages
ISBN: 9781467308045

Publisher

IEEE Computer Society Press

Washington, DC, United States

Author Tags

  1. Markov model
  2. checkpoint/restart
  3. fault tolerance

Qualifiers

  • Research-article

Conference

SC '12
Acceptance Rates

SC '12 Paper Acceptance Rate 100 of 461 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Downloads (last 12 months): 10
  • Downloads (last 6 weeks): 0
Reflects downloads up to 03 Jan 2025

Cited By

  • (2024) Reinforcement Learning-based Adaptive Mitigation of Uncorrected DRAM Errors in the Field. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, pp. 240-252. DOI: 10.1145/3625549.3658686. Online publication date: 3-Jun-2024
  • (2020) Overhead of using spare nodes. International Journal of High Performance Computing Applications 34:2, pp. 208-226. DOI: 10.1177/1094342020901885. Online publication date: 1-Mar-2020
  • (2020) Orchestrating Fault Prediction with Live Migration and Checkpointing. Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pp. 167-171. DOI: 10.1145/3369583.3392672. Online publication date: 23-Jun-2020
  • (2019) Failure Recovery in Resilient X10. ACM Transactions on Programming Languages and Systems 41:3, pp. 1-30. DOI: 10.1145/3332372. Online publication date: 2-Jul-2019
  • (2019) GPU snapshot. Proceedings of the ACM International Conference on Supercomputing, pp. 171-183. DOI: 10.1145/3330345.3330361. Online publication date: 26-Jun-2019
  • (2019) A Minimally Intrusive Low-Memory Approach to Resilience for Existing Transient Solvers. Journal of Scientific Computing 78:1, pp. 565-581. DOI: 10.1007/s10915-018-0778-7. Online publication date: 1-Jan-2019
  • (2018) Building and utilizing fault tolerance support tools for the GASPI applications. International Journal of High Performance Computing Applications 32:5, pp. 613-626. DOI: 10.1177/1094342016677085. Online publication date: 1-Sep-2018
  • (2017) Supporting Fault-Tolerance in Presence of In-Situ Analytics. Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 304-313. DOI: 10.5555/3101112.3101155. Online publication date: 14-May-2017
  • (2017) AllConcur. Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pp. 205-218. DOI: 10.1145/3078597.3078598. Online publication date: 26-Jun-2017
  • (2016) Granularity and the cost of error recovery in resilient AMR scientific applications. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-10. DOI: 10.5555/3014904.3014961. Online publication date: 13-Nov-2016
