DOI: 10.1145/2184512.2184574
research-article

Application monitoring and checkpointing in HPC: looking towards exascale systems

Published: 29 March 2012

Abstract

As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important to maintaining efficiency and improving compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing.
We demonstrate the impact of sub-optimal checkpoint intervals on application efficiency via simulation with real workload data. We find that application efficiency is relatively insensitive to error in the estimate of an application's mean time to interrupt (AMTTI), a parameter central to calculating the optimal checkpoint interval. This result corroborates the trends predicted by previous analytical models. We also find that overestimating AMTTI may be preferable to underestimating it.
We further discuss how application monitoring and resilience frameworks can benefit from this insensitivity to error in AMTTI estimates. Finally, we address the importance of application monitoring at exascale and conclude with the challenges that checkpointing faces at such extreme scales.
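
The insensitivity described in the abstract can be illustrated with a simple analytical sketch. This is not the paper's simulation, which uses real workload data. It assumes Young's first-order approximation for the optimum checkpoint interval, tau = sqrt(2 * delta * M), where delta is the time to write one checkpoint and M is the AMTTI, together with a first-order waste model (checkpoint overhead plus expected rework, restart time ignored). The function names and the numeric values (a 5-minute checkpoint, a 24-hour AMTTI) are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost: float, amtti: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_cost * amtti)


def waste_fraction(interval: float, checkpoint_cost: float, amtti: float) -> float:
    """First-order fraction of time lost to checkpoints and rework.

    checkpoint_cost / interval : overhead of one checkpoint per interval
    interval / (2 * amtti)     : expected rework, assuming an interrupt
                                 lands on average halfway through an interval
    Restart time is ignored; the model is only reasonable when interval << amtti.
    """
    return checkpoint_cost / interval + interval / (2.0 * amtti)


def relative_waste(estimate_ratio: float, checkpoint_cost: float, amtti: float) -> float:
    """Waste when AMTTI is mis-estimated, relative to waste at the true optimum.

    estimate_ratio is (estimated AMTTI) / (true AMTTI).
    """
    tau_true = young_interval(checkpoint_cost, amtti)
    tau_used = young_interval(checkpoint_cost, estimate_ratio * amtti)
    return (waste_fraction(tau_used, checkpoint_cost, amtti)
            / waste_fraction(tau_true, checkpoint_cost, amtti))


if __name__ == "__main__":
    delta = 5.0 * 60        # hypothetical checkpoint cost: 5 minutes, in seconds
    true_amtti = 24 * 3600  # hypothetical true AMTTI: 24 hours, in seconds
    for ratio in (0.25, 0.5, 1.0, 1.5, 2.0, 4.0):
        print(f"estimate = {ratio:4.2f} x AMTTI -> "
              f"waste = {relative_waste(ratio, delta, true_amtti):.3f} x optimum")
```

Under these assumptions the relative waste reduces to (sqrt(k) + 1/sqrt(k)) / 2 for an estimate that is k times the true AMTTI, so a factor-of-two error in either direction raises waste by only about 6%, and a factor-of-four error by 25%. The model is symmetric in k and 1/k, but in absolute terms a 50% overestimate (k = 1.5) costs less than a 50% underestimate (k = 0.5). That is at least consistent with the abstract's observation, although the paper's finding rests on its simulation rather than on this simplified model.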

    Published In

    ACMSE '12: Proceedings of the 50th annual ACM Southeast Conference
    March 2012
    424 pages
    ISBN:9781450312035
    DOI:10.1145/2184512

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. checkpointing
    2. exascale
    3. prediction
    4. resilience
    5. simulation

    Qualifiers

    • Research-article

    Conference

    ACM SE '12: ACM Southeast Regional Conference
    March 29 - 31, 2012
    Tuscaloosa, Alabama

    Acceptance Rates

    ACMSE '12 Paper Acceptance Rate 28 of 56 submissions, 50%;
    Overall Acceptance Rate 502 of 1,023 submissions, 49%

    Cited By

    • (2024) Reliable Networked and Distributed Systems. Dependable Computing, pp. 337-411. DOI: 10.1002/9781119743453.ch8. Online publication date: 26-Apr-2024
    • (2023) Persistent Processor Architecture. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1075-1091. DOI: 10.1145/3613424.3623772. Online publication date: 28-Oct-2023
    • (2021) Differential Shadowing: A Resilience Framework for Extreme-scale, Heterogeneous Environments with Non-Uniform Node Failure Distribution. 2021 IEEE International Performance, Computing, and Communications Conference (IPCCC), pp. 1-8. DOI: 10.1109/IPCCC51483.2021.9679435. Online publication date: 29-Oct-2021
    • (2019) Performance-Aware Scheduling of Parallel Applications on Non-Dedicated Clusters. Electronics, 8:9 (982). DOI: 10.3390/electronics8090982. Online publication date: 2-Sep-2019
    • (2019) From facility to application sensor data. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-27. DOI: 10.1145/3295500.3356191. Online publication date: 17-Nov-2019
    • (2019) Mimic: Fast Recovery from Data Corruption Errors in Stencil Computations. 2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC), pp. 1-8. DOI: 10.1109/IPCCC47392.2019.8958749. Online publication date: Oct-2019
    • (2019) Monitoring of Exascale data processing. 2019 IEEE International Conference on Advanced Scientific Computing (ICASC), pp. 1-5. DOI: 10.1109/ICASC48083.2019.8946279. Online publication date: Sep-2019
    • (2019) Reducing False Node Failure Predictions in HPC. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 323-332. DOI: 10.1109/HiPC.2019.00047. Online publication date: Dec-2019
    • (2018) Cognified Distributed Computing. 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), pp. 1180-1191. DOI: 10.1109/ICDCS.2018.00118. Online publication date: Jul-2018
    • (2017) Failures in large scale systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/3126908.3126937. Online publication date: 12-Nov-2017
