Research article. DOI: 10.1145/2802658.2802665

Detecting Silent Data Corruption for Extreme-Scale MPI Applications

Published: 21 September 2015

Abstract

Next-generation supercomputers are expected to have more components and, at the same time, to consume several times less energy per operation. These trends are pushing supercomputer construction to the limits of miniaturization and energy-saving strategies. Consequently, the number of soft errors is expected to increase dramatically in the coming years. While mechanisms are in place to correct or at least detect some soft errors, a significant percentage of those errors pass unnoticed by the hardware. Such silent errors are extremely damaging because they can make applications produce wrong results without any warning. In this work we propose a technique that leverages certain properties of high-performance computing applications in order to detect silent errors at the application level. Our technique detects corruption based solely on the behavior of the application datasets and is application-agnostic. We propose multiple corruption detectors, and we couple them to work together in a fashion transparent to the user. We demonstrate that this strategy can detect over 80% of corruptions while incurring less than 1% overhead. We show that the false positive rate is less than 1% and that, when multi-bit corruptions are taken into account, the detection recall increases to over 95%.
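To make the idea of detecting corruption from dataset behavior concrete, the following is a minimal Python sketch. It is not one of the detectors proposed in the paper, whose details the abstract does not give. It assumes a single smoothly evolving value, predicts each new sample by linear extrapolation from the two previous ones, and flags a sample that lands outside a fixed tolerance. The names `flip_bit`, `ExtrapolationDetector`, and `run`, the sine test signal, and the tolerance of `1e-3` are all illustrative assumptions.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of an IEEE 754 binary64 value (0 = lowest mantissa bit, 63 = sign)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


class ExtrapolationDetector:
    """Hypothetical detector: flags samples far from a linear extrapolation."""

    def __init__(self, tolerance: float):
        self.tolerance = tolerance
        self.history = []  # the last two accepted samples

    def check(self, value: float) -> bool:
        """Return True if `value` looks corrupted."""
        if len(self.history) < 2:
            # Not enough history to predict yet; accept the sample.
            self.history.append(value)
            return False
        older, newer = self.history
        predicted = 2.0 * newer - older
        suspicious = (not math.isfinite(value)
                      or abs(value - predicted) > self.tolerance)
        # On a detection, keep the prediction instead of the suspect value
        # so that later predictions are not polluted by it.
        self.history = [newer, predicted if suspicious else value]
        return suspicious


def run(bit: int, corrupt_step: int = 50, steps: int = 100) -> list:
    """Monitor a smooth signal, injecting one bit flip; return flagged steps."""
    detector = ExtrapolationDetector(tolerance=1e-3)
    flagged = []
    for step in range(steps):
        value = math.sin(0.01 * step)  # stand-in for one application variable
        if step == corrupt_step:
            value = flip_bit(value, bit)
        if detector.check(value):
            flagged.append(step)
    return flagged


if __name__ == "__main__":
    print("exponent-bit flip flagged at steps:", run(62))
    print("lowest-mantissa-bit flip flagged at steps:", run(0))
```

In this sketch a flip in a high exponent bit changes the value by many orders of magnitude and is flagged, whereas a flip in the lowest mantissa bit stays far inside the tolerance and goes unnoticed. That trade-off between tolerance, false positives, and missed low-impact flips is one plausible reason a behavior-based detector would report recall below 100%.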


Published In

EuroMPI '15: Proceedings of the 22nd European MPI Users' Group Meeting
September 2015
149 pages
ISBN: 9781450337953
DOI: 10.1145/2802658
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

  • Conseil Régional d'Aquitaine
  • Communauté Urbaine de Bordeaux
  • INRIA: INRIA Rhône-Alpes

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. High-performance computing
  2. anomaly detection
  3. fault tolerance
  4. silent data corruption
  5. soft errors
  6. supercomputers

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroMPI '15
EuroMPI '15: The 22nd European MPI Users' Group Meeting
September 21 - 23, 2015
Bordeaux, France

Acceptance Rates

EuroMPI '15 Paper Acceptance Rate 14 of 29 submissions, 48%;
Overall Acceptance Rate 66 of 139 submissions, 47%

Cited By

  • (2024) A Visual Comparison of Silent Error Propagation. IEEE Transactions on Visualization and Computer Graphics 30:7 (3268-3282). DOI: 10.1109/TVCG.2022.3230636. Online publication date: Jul-2024
  • (2023) Recovering Detectable Uncorrectable Errors via Spatial Data Prediction. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (507-515). DOI: 10.1145/3624062.3624120. Online publication date: 12-Nov-2023
  • (2021) Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction. The International Journal of High Performance Computing Applications (109434202199043). DOI: 10.1177/1094342021990433. Online publication date: 8-Feb-2021
  • (2021) Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03892-4. Online publication date: 9-Jun-2021
  • (2021) Sensitivity of computational fluid dynamics simulations against soft errors. Computing. DOI: 10.1007/s00607-021-00976-0. Online publication date: 13-Jul-2021
  • (2019) Addressing data resiliency for staging based scientific workflows. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-22). DOI: 10.1145/3295500.3356158. Online publication date: 17-Nov-2019
  • (2019) FaultSight: A Fault Analysis Tool for HPC Researchers. 2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) (21-30). DOI: 10.1109/FTXS49593.2019.00008. Online publication date: Nov-2019
  • (2018) Scalable Algorithmic Detection of Silent Data Corruption for High-Dimensional PDEs. Sparse Grids and Applications - Miami 2016 (93-115). DOI: 10.1007/978-3-319-75426-0_5. Online publication date: 21-Jun-2018
  • (2018) EXAHD: An Exa-Scalable Two-Level Sparse Grid Approach for Higher-Dimensional Problems in Plasma Physics and Beyond. High Performance Computing in Science and Engineering '17 (513-529). DOI: 10.1007/978-3-319-68394-2_31. Online publication date: 17-Feb-2018
  • (2017) Towards a More Complete Understanding of SDC Propagation. Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing (131-142). DOI: 10.1145/3078597.3078617. Online publication date: 26-Jun-2017
