Abstract
Trends in high-performance computing are increasing the heterogeneity of computational resources within a single machine. Heterogeneous CPU/GPU platforms, however, exacerbate the resilience problems faced by current large-scale systems. Designing efficient resilience strategies is therefore critical for the wider adoption of heterogeneous platforms in future exascale systems. Conventional resilience strategies for GPUs incur significant performance and power overhead because they employ a one-size-fits-all approach that enforces uniform data protection. In addition, protecting the CPU and GPU in isolation forgoes optimization opportunities that heterogeneous CPU/GPU platforms provide. In this paper, we explore the viability of an application-driven, cooperative CPU/GPU method for detecting faults that occur in GPU global memory. By selectively protecting application-critical data and leveraging time and space redundancy on the CPU to detect faults, we incur only 2.2% performance overhead while capturing more than 90% of the errors that cause incorrect application results.
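The idea of selective protection can be illustrated with a small host-side sketch. It is not the paper's implementation: the buffer names, the `CriticalDataGuard` class, the `flip_bit` fault injector, and the use of CRC32 as the check are all assumptions made for illustration, and GPU global memory is simulated with plain byte arrays. The sketch only shows that a fault in a protected, application-critical buffer is flagged by the CPU-side check, while a fault in an unprotected buffer goes unchecked by design.

```python
import zlib


class CriticalDataGuard:
    """Hypothetical CPU-side checker that tracks only application-critical buffers."""

    def __init__(self):
        # Space redundancy kept on the CPU: one CRC32 per protected buffer.
        self._expected = {}

    def protect(self, name, buf):
        """Record a reference checksum for a buffer the application marks as critical."""
        self._expected[name] = zlib.crc32(bytes(buf))

    def corrupted(self, buffers):
        """Recompute checksums on the CPU and return the names of mismatching buffers."""
        return sorted(
            name
            for name, crc in self._expected.items()
            if zlib.crc32(bytes(buffers[name])) != crc
        )


def flip_bit(buf, byte_index, bit):
    """Inject a single-bit fault into a simulated device buffer (illustrative only)."""
    buf[byte_index] ^= 1 << bit


# Simulated GPU global memory: one critical read-only input, one scratch area.
device_memory = {
    "coefficients": bytearray(range(64)),  # assumed to be application-critical
    "scratch": bytearray(64),              # assumed to be non-critical
}

guard = CriticalDataGuard()
guard.protect("coefficients", device_memory["coefficients"])

clean_report = guard.corrupted(device_memory)

# A fault in unprotected data is not checked, which is where the overhead is saved.
flip_bit(device_memory["scratch"], 3, 0)
after_scratch_fault = guard.corrupted(device_memory)

# A fault in protected data is caught by the CPU-side comparison.
flip_bit(device_memory["coefficients"], 10, 5)
after_critical_fault = guard.corrupted(device_memory)
```

Which data counts as critical is an application-level decision in this sketch; the guard checks only what it is told to protect, which is the trade-off between coverage and overhead that the abstract describes.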
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, D., Lee, S., Vetter, J.S. (2014). Evaluating the Viability of Application-Driven Cooperative CPU/GPU Fault Detection. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_65
DOI: https://doi.org/10.1007/978-3-642-54420-0_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science, Computer Science (R0)