research-article
Open access

FailAmp: Relativization Transformation for Soft Error Detection in Structured Address Generation

Published: 18 December 2019

Abstract

We present FailAmp, a novel LLVM program transformation algorithm that makes programs employing structured index calculations more robust against soft errors. Without FailAmp, an offset error can go undetected; with FailAmp, all subsequent offsets are relativized, building on the faulty one. FailAmp can exploit ISAs such as ARM to further reduce overheads. We verify correctness properties of FailAmp using an SMT solver and present a thorough evaluation using many high-performance computing benchmarks under a fault injection campaign. FailAmp provides full soft-error detection for address calculation while incurring an average overhead of around 5%.
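The relativization idea in the abstract can be illustrated with a toy model. The sketch below is not the FailAmp LLVM transformation itself; it only contrasts independently computed offsets with offsets that each build on the previous one, under a simulated single bit flip. The function names, the bit-flip fault model, and the final-address check used as a detector are all assumptions made for illustration.

```python
def absolute_addresses(base, stride, n, fault_at=None, flip_bit=0):
    """Compute each offset independently as i * stride (no relativization)."""
    addrs = []
    for i in range(n):
        offset = i * stride
        if i == fault_at:
            offset ^= 1 << flip_bit  # simulated soft error in one offset
        addrs.append(base + offset)
    return addrs


def relativized_addresses(base, stride, n, fault_at=None, flip_bit=0):
    """Derive each offset from the previous one, so a corrupted offset
    is inherited by every later offset."""
    addrs = []
    offset = 0
    for i in range(n):
        if i > 0:
            offset += stride  # builds on the previous, possibly faulty, offset
        if i == fault_at:
            offset ^= 1 << flip_bit  # same simulated soft error
        addrs.append(base + offset)
    return addrs


def final_address_check(addrs, base, stride):
    """Hypothetical detector: compare the last address with its closed form."""
    expected_last = base + (len(addrs) - 1) * stride
    return addrs[-1] != expected_last


if __name__ == "__main__":
    base, stride, n = 0x1000, 8, 16
    clean = absolute_addresses(base, stride, n)
    bad_abs = absolute_addresses(base, stride, n, fault_at=3, flip_bit=6)
    bad_rel = relativized_addresses(base, stride, n, fault_at=3, flip_bit=6)
    print("wrong addresses, absolute:   ", sum(a != b for a, b in zip(bad_abs, clean)))
    print("wrong addresses, relativized:", sum(a != b for a, b in zip(bad_rel, clean)))
    print("detected, absolute:   ", final_address_check(bad_abs, base, stride))
    print("detected, relativized:", final_address_check(bad_rel, base, stride))
```

In this toy model the independently computed scheme corrupts a single address and a check at the end of the sequence sees nothing, whereas the relativized scheme carries the error into every later address, so the same end-of-sequence check observes it.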

Information & Contributors

Information

Published In

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 4
December 2019
572 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3366460
© 2019 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 December 2019
Accepted: 01 October 2019
Revised: 01 September 2019
Received: 01 June 2019
Published in TACO Volume 16, Issue 4

Author Tags

  1. LLVM transformation
  2. Soft error detection
  3. failure amplification
  4. structured address generation

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • DOE
  • U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research
  • NSF

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 574
  • Downloads (Last 12 months): 71
  • Downloads (Last 6 weeks): 9

Reflects downloads up to 25 Dec 2024
