DOI: 10.1109/DSN.2015.52
Article

Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems

Published: 22 June 2015

Abstract

As we approach exascale, scientific simulations are expected to experience more interruptions due to increased system failures. Designing better HPC resilience techniques requires understanding the key characteristics of system failures on these systems. While the temporal properties of system failures on HPC systems have been well investigated, there is limited understanding of the spatial characteristics of system failures and their impact on resilience mechanisms. Therefore, we examine the spatial characteristics and behavior of system failures. We investigate the interaction between the spatial and temporal characteristics of failures and its implications for system operations and resilience mechanisms on large-scale HPC systems. We show that system failures have "spatial locality" at different granularities in the system, study the impact of different failure types, and investigate the correlation among different failure types. Finally, we propose a novel scheme that exploits the spatial locality in failures to improve application and system performance. Our evaluation shows that the proposed scheme significantly improves system performance in a dynamic, production-level HPC system.

    Published In

    DSN '15: Proceedings of the 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks
    June 2015
    573 pages
    ISBN: 9781479986293

    Publisher

    IEEE Computer Society

    United States

    Author Tags

    1. Fault tolerance
    2. High Performance Computing
    3. Resilience
    4. Spatial Locality
    5. System Failures

    Qualifiers

    • Article

    Cited By

    • (2023) Predicting GPU Failures With High Precision Under Deep Learning Workloads. Proceedings of the 16th ACM International Conference on Systems and Storage (124-135). DOI: 10.1145/3579370.3594777. Online publication date: 5-Jun-2023
    • (2022) Clairvoyant. Proceedings of the 36th ACM International Conference on Supercomputing (1-14). DOI: 10.1145/3524059.3532374. Online publication date: 28-Jun-2022
    • (2022) Understanding Memory Failures on a Petascale Arm System. Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing (84-96). DOI: 10.1145/3502181.3531465. Online publication date: 27-Jun-2022
    • (2020) Job characteristics on large-scale systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-17). DOI: 10.5555/3433701.3433812. Online publication date: 9-Nov-2020
    • (2020) GPU lifetimes on titan supercomputer. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-14). DOI: 10.5555/3433701.3433755. Online publication date: 9-Nov-2020
    • (2020) CoREC. ACM Transactions on Parallel Computing 7:2 (1-29). DOI: 10.1145/3391448. Online publication date: 18-May-2020
    • (2020) A Methodology for Comparing the Reliability of GPU-Based and CPU-Based HPCs. ACM Computing Surveys 53:1 (1-33). DOI: 10.1145/3372790. Online publication date: 6-Feb-2020
    • (2019) FailAmp. ACM Transactions on Architecture and Code Optimization 16:4 (1-21). DOI: 10.1145/3369381. Online publication date: 18-Dec-2019
    • (2019) Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. IEEE Transactions on Parallel and Distributed Systems 30:2 (361-374). DOI: 10.1109/TPDS.2018.2864184. Online publication date: 1-Feb-2019
    • (2018) Partial redundancy in HPC systems with non-uniform node reliabilities. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (1-11). DOI: 10.5555/3291656.3291715. Online publication date: 11-Nov-2018
