DOI: 10.1145/3581784.3607078
research-article
Open access

Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication

Published: 11 November 2023

Abstract

Soft errors are prevalent in modern High-Performance Computing (HPC) systems; they cause silent data corruptions (SDCs) that compromise system reliability. Instruction duplication is a widely used software-based technique for protecting against SDCs. Existing instruction duplication techniques are mostly implemented at the LLVM level and may suffer from low SDC coverage at the assembly level. In this paper, we evaluate instruction duplication at both the LLVM and assembly levels. Our study shows that existing instruction duplication techniques have protection deficiencies at the assembly level and are usually over-optimistic about the protection they provide. We investigate the root causes of these deficiencies and propose a mitigation technique, Flowery, to address them. Our evaluation shows that Flowery effectively protects programs from SDCs when evaluated at the assembly level.
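
For readers unfamiliar with the idea, the following is a minimal Python sketch of instruction duplication in general. It is not the paper's implementation and not Flowery. It assumes a soft error can be modeled as a single bit flip in one computed value, and every name in it (`flip_bit`, `add_unprotected`, `add_duplicated`, `sdc_count`, `DetectedError`) is hypothetical, introduced only for illustration.

```python
import random


class DetectedError(Exception):
    """Raised when the original and duplicated results disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Model a soft error as a single bit flip (an assumption of this sketch)."""
    return value ^ (1 << bit)


def add_unprotected(a: int, b: int, fault_bit=None) -> int:
    """Plain computation: an injected fault silently corrupts the result."""
    result = a + b
    if fault_bit is not None:
        result = flip_bit(result, fault_bit)
    return result


def add_duplicated(a: int, b: int, fault_bit=None) -> int:
    """Duplicated computation: compute twice, compare before using the result."""
    primary = a + b
    shadow = a + b  # the duplicated "instruction"
    if fault_bit is not None:
        primary = flip_bit(primary, fault_bit)  # fault hits only one copy
    if primary != shadow:
        raise DetectedError("mismatch between original and duplicate")
    return primary


def sdc_count(fn, trials: int = 1000, seed: int = 0) -> int:
    """Inject one random bit flip per trial and count silent data corruptions."""
    rng = random.Random(seed)
    sdcs = 0
    for _ in range(trials):
        a, b = rng.randrange(1 << 16), rng.randrange(1 << 16)
        bit = rng.randrange(32)
        try:
            if fn(a, b, fault_bit=bit) != a + b:
                sdcs += 1  # wrong output and nothing noticed: an SDC
        except DetectedError:
            pass  # the error was caught, so it is not silent
    return sdcs


if __name__ == "__main__":
    print("unprotected SDCs:", sdc_count(add_unprotected))
    print("duplicated SDCs:", sdc_count(add_duplicated))
```

The sketch is deliberately idealized: the fault always lands on a value that the comparison covers, so every corruption is detected. The abstract's point is that this picture can be over-optimistic once the program is evaluated at the assembly level, and the sketch does not attempt to model that.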

Supplemental Material

MP4 File - SC23 paper presentation recording for "Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication", by Zhengyang He, Yafan Huang, Hui Xu, Dingwen Tao and Guanpeng Li

    Published In

    SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2023
    1428 pages
    ISBN: 9798400701092
    DOI: 10.1145/3581784
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. system reliability
    2. hardware transient faults
    3. instruction duplication
    4. compiler transformation
    5. architecture
    6. fault injection

    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Cited By

    • (2024) Versatile Datapath Soft Error Detection on the Cheap for HPC Applications. SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1109/SC41406.2024.00061. Online publication date: 17-Nov-2024
    • (2024) Exploring the Behavior of Soft-Error Rate Reduction Algorithms in Digital Circuits. 2024 International Conference on Optimization Computing and Wireless Communication (ICOCWC), 1-5. DOI: 10.1109/ICOCWC60930.2024.10470803. Online publication date: 29-Jan-2024
    • (2024) HPC-Crash: Characterizing Crash-Proneness of HPC Programs from Various Perspectives. 2024 10th IEEE International Conference on High Performance and Smart Computing (HPSC), 89-94. DOI: 10.1109/HPSC62738.2024.00023. Online publication date: 10-May-2024
    • (2024) A Fast Low-Level Error Detection Technique. 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 90-98. DOI: 10.1109/DSN58291.2024.00023. Online publication date: 24-Jun-2024
    • (2024) GPU Reliability Assessment: Insights Across the Abstraction Layers. 2024 IEEE International Conference on Cluster Computing (CLUSTER), 1-13. DOI: 10.1109/CLUSTER59578.2024.00008. Online publication date: 24-Sep-2024
