Evaluating the Viability of Application-Driven Cooperative CPU/GPU Fault Detection

Dong Li, Seyong Lee & Jeffrey S. Vetter

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8374)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

Trends in high-performance computing are increasing the heterogeneity of the computational resources within a single machine. Heterogeneous CPU/GPU platforms, however, exacerbate the resilience problems that current large-scale systems already face. Designing efficient resilience strategies is therefore critical to the wider adoption of heterogeneous platforms in future exascale systems. Conventional resilience strategies for GPUs incur significant performance and power overhead because they take a one-size-fits-all approach and enforce uniform data protection. In addition, protecting the CPU and GPU in isolation forgoes optimization opportunities that heterogeneous CPU/GPU platforms provide. In this paper, we explore the viability of an application-driven, cooperative CPU/GPU method for detecting faults that occur in GPU global memory. By selectively protecting application-critical data and leveraging time and space redundancy on the CPU to detect faults, we incur only 2.2% performance overhead while capturing more than 90% of the errors that cause incorrect application results.
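The abstract does not spell out the detection mechanism, so the following is only an illustrative sketch of the general idea of selective, CPU-side protection, not the authors' implementation. It assumes that GPU global memory can be modelled as plain byte buffers, that the application marks which buffers are critical, and that the CPU keeps a CRC32 checksum of each critical buffer as its redundant copy. The names `ProtectedRegion`, `flip_bit`, and `detect_faults` are hypothetical, and time redundancy (re-executing work on the CPU) is not shown.

```python
import zlib


class ProtectedRegion:
    """Host-side record for one application-critical buffer (hypothetical helper)."""

    def __init__(self, name, data):
        self.name = name
        # Space redundancy: the checksum lives in CPU memory, not on the device.
        self.checksum = zlib.crc32(data)


def flip_bit(buf, byte_index, bit):
    """Simulate a soft error by flipping one bit in a buffer."""
    buf[byte_index] ^= 1 << bit


def detect_faults(device_buffers, regions):
    """Return the names of protected buffers whose contents no longer match
    the checksum recorded on the CPU."""
    corrupted = []
    for region in regions:
        if zlib.crc32(device_buffers[region.name]) != region.checksum:
            corrupted.append(region.name)
    return corrupted


# Simulated device global memory (assumption: plain bytearrays stand in for
# buffers that would really live on the GPU).
device = {
    "critical_matrix": bytearray(range(64)),
    "scratch": bytearray(64),
}

# Only the buffer the application marks as critical is protected.
regions = [ProtectedRegion("critical_matrix", device["critical_matrix"])]

# A fault in unprotected data is ignored by design.
flip_bit(device["scratch"], 3, 0)
clean = detect_faults(device, regions)

# A fault in protected data is reported.
flip_bit(device["critical_matrix"], 10, 5)
faulty = detect_faults(device, regions)
```

In this sketch the cost of protection scales with the size of the critical buffers only, which is the intuition behind avoiding uniform, one-size-fits-all protection.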

Author information

Authors and Affiliations

Oak Ridge National Laboratory, Oak Ridge, TN, USA
Dong Li, Seyong Lee & Jeffrey S. Vetter

Editor information

Editors and Affiliations

Rechen- und Kommunikationszentrum, RWTH Aachen, Seffenter Weg 23, 52074, Aachen, Germany
Dieter an Mey
TU Vienna, 1040, Vienna, Austria
Michael Alexander
RWTH Aachen University, Seffenter Weg 23, 52074, Aachen, Germany
Paolo Bientinesi & Carsten Clauss
University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
Mario Cannataro
Inria Rennes - Bretagne Atlantique, 35042, Rennes, France
Alexandru Costan & Christine Morin
University of Innsbruck, 6020, Innsbruck, Austria
Gabor Kecskemeti
Department of Computer Science, University of Pisa, 56126, Pisa, Italy
Laura Ricci
Universitat Politècnica de València, 46022, València, Spain
Julio Sahuquillo
LLNL, USA
Martin Schulz
Dipartimento di Informatica, Università di Salerno, 84084, Salerno, Italy
Vittorio Scarano
Tennessee Tech University and Oak Ridge National Laboratory, 38505, Cookeville, TN, USA
Stephen L. Scott
Technische Universität München, 80333, Munich, Germany
Josef Weidendorfer

About this paper

Cite this paper

Li, D., Lee, S., Vetter, J.S. (2014). Evaluating the Viability of Application-Driven Cooperative CPU/GPU Fault Detection. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_65

DOI: https://doi.org/10.1007/978-3-642-54420-0_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science, Computer Science (R0)