research-article

Application monitoring and checkpointing in HPC: looking towards exascale systems

Authors:

William M. Jones,

Nathan DeBardeleben

ACMSE '12: Proceedings of the 50th annual ACM Southeast Conference

Pages 262 - 267

https://doi.org/10.1145/2184512.2184574

Published: 29 March 2012

Abstract

As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors in maintaining efficiency and delivering better compute performance than predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing.

We demonstrate the impact of sub-optimal checkpoint intervals on application efficiency via simulation with real workload data. We find that application efficiency is relatively insensitive to error in estimation of an application's mean time to interrupt (AMTTI), a parameter central to calculating the optimal checkpoint interval. This result corroborates the trends predicted by previous analytical models. We also find that erring on the side of overestimation may be preferable to underestimation.
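The insensitivity described above can be illustrated with a small sketch. It is not the simulation used in the paper: it assumes Young's first-order approximation for the optimal interval, tau = sqrt(2 * delta * M), and the matching first-order waste model, delta / tau + tau / (2 * M), where delta is the time to write one checkpoint and M is the AMTTI. The function names and the input values (a 5-minute dump time and a 24-hour AMTTI) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(dump_time, amtti):
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * dump_time * amtti)


def first_order_waste(interval, dump_time, amtti):
    """Approximate fraction of time lost under the first-order model:
    checkpoint overhead (delta / tau) plus expected rework (tau / (2 * M)).
    Restart time is ignored in this simplified model."""
    return dump_time / interval + interval / (2.0 * amtti)


def efficiency_with_estimate(dump_time, true_amtti, estimated_amtti):
    """Efficiency when the interval is chosen from a (possibly wrong)
    AMTTI estimate but interrupts arrive at the true AMTTI."""
    tau = young_interval(dump_time, estimated_amtti)
    return 1.0 - first_order_waste(tau, dump_time, true_amtti)


if __name__ == "__main__":
    dump_time = 5.0 / 60.0   # hours; hypothetical checkpoint write time
    true_amtti = 24.0        # hours; hypothetical application MTTI
    for factor in (0.25, 0.5, 1.0, 2.0, 4.0):
        eff = efficiency_with_estimate(dump_time, true_amtti, factor * true_amtti)
        print(f"AMTTI estimate x{factor:<4}: efficiency ~ {eff:.4f}")
```

With these assumed inputs, a fourfold error in the AMTTI estimate costs roughly two percentage points of efficiency in this model. The first-order model is symmetric in over- and underestimation, so it does not reproduce the preference for overestimation reported above; that finding comes from the paper's simulation with real workload data.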

We further discuss how application monitoring and resilience frameworks can benefit from this insensitivity to error in AMTTI estimates. Finally, we discuss the importance of application monitoring at exascale and conclude with a discussion of challenges faced in the use of checkpointing at such extreme scales.

Cited By

Iyer R, Kalbarczyk Z, Nakka N (2024). Reliable Networked and Distributed Systems. Dependable Computing, pages 337-411. Online publication date: 26-Apr-2024.
https://doi.org/10.1002/9781119743453.ch8
Zeng J, Jeong J, Jung C (2023). Persistent Processor Architecture. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1075-1091. Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3623772
Li L, Znati T, Melhem R (2021). Differential Shadowing: A Resilience Framework for Extreme-scale, Heterogeneous Environments with Non-Uniform Node Failure Distribution. 2021 IEEE International Performance, Computing, and Communications Conference (IPCCC), pages 1-8. Online publication date: 29-Oct-2021.
https://doi.org/10.1109/IPCCC51483.2021.9679435

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published In

ACMSE '12: Proceedings of the 50th annual ACM Southeast Conference

March 2012

424 pages

ISBN:9781450312035

DOI:10.1145/2184512

Conference Chair: Randy K. Smith, University of Alabama

Program Chair: Susan V. Vrbsky, University of Alabama

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ACM SE '12: ACM Southeast Regional Conference

March 29 - 31, 2012

Tuscaloosa, Alabama

Sponsor: ACM

Acceptance Rates

ACMSE '12 Paper Acceptance Rate 28 of 56 submissions, 50%;

Overall Acceptance Rate 502 of 1,023 submissions, 49%

Bibliometrics & Citations

Bibliometrics

Total Citations: 19
Total Downloads: 305
Downloads (Last 12 months): 7
Downloads (Last 6 weeks): 0

Reflects downloads up to 11 Dec 2024

Citations

Cited By

Cascajo A, Singh D, Carretero J (2019). Performance-Aware Scheduling of Parallel Applications on Non-Dedicated Clusters. Electronics, 8:9 (982). Online publication date: 2-Sep-2019.
https://doi.org/10.3390/electronics8090982
Netti A, Müller M, Auweter A, Guillen C, Ott M, Tafani D, Schulz M, Taufer M, Balaji P, Peña A (2019). From facility to application sensor data. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-27. Online publication date: 17-Nov-2019.
https://dl.acm.org/doi/10.1145/3295500.3356191
Alazzawe A, Kant K (2019). Mimic: Fast Recovery from Data Corruption Errors in Stencil Computations. 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC), pages 1-8. Online publication date: Oct-2019.
https://doi.org/10.1109/IPCCC47392.2019.8958749
Iuhasz G, Petcu D (2019). Monitoring of Exascale data processing. 2019 IEEE International Conference on Advanced Scientific Computing (ICASC), pages 1-5. Online publication date: Sep-2019.
https://doi.org/10.1109/ICASC48083.2019.8946279
Frank A, Yang D, Brinkmann A, Schulz M, Suss T (2019). Reducing False Node Failure Predictions in HPC. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 323-332. Online publication date: Dec-2019.
https://doi.org/10.1109/HiPC.2019.00047
Babaoglu O, Sirbu A (2018). Cognified Distributed Computing. 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pages 1180-1191. Online publication date: Jul-2018.
https://doi.org/10.1109/ICDCS.2018.00118
Gupta S, Patel T, Engelmann C, Tiwari D, Mohr B, Raghavan P (2017). Failures in large scale systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-12. Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3126908.3126937
