DOI: 10.1145/3225058.3225119
research-article

Modeling Application Resilience in Large-scale Parallel Execution

Published: 13 August 2018

Abstract

Understanding how resilient an application is to hardware and software errors is critical in high-performance computing. Application-level fault injection is the most common method for evaluating application resilience. However, it can be very expensive when the application runs in parallel at large scale, because fault injection then requires substantial hardware resources.
In this paper, we introduce a new methodology for evaluating the resilience of applications running at large scale. Instead of injecting errors into large-scale executions, we inject errors into small-scale and serial executions, and use the results to model and predict the fault injection result for the application running at large scale. Our models are based on a series of empirical observations that characterize error occurrence and propagation across MPI processes in small-scale execution (including serial execution) and in large-scale execution. Our models achieve high prediction accuracy. Evaluating with four NAS parallel benchmarks and two proxy scientific applications, we demonstrate that when the fault injection result is used to predict for 64 MPI processes, the average prediction error is 8%; when it is used to make the same prediction for eight MPI processes, the average prediction error decreases to 7%.
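
For readers unfamiliar with application-level fault injection, the following is a minimal sketch of a serial injection campaign. It is not the authors' tool or model: the single-bit-flip fault model, the stand-in `kernel` function, the relative `tolerance` used to judge a run as successful, and all other names are assumptions made for illustration only.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0..63) in the IEEE-754 double encoding of value."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def kernel(data):
    """Hypothetical stand-in for an application: a sum of squares."""
    return sum(x * x for x in data)


def campaign(data, trials, tolerance=1e-6, seed=0):
    """Run `trials` single-bit-flip injections and return the success rate.

    A trial counts as successful when the corrupted result stays within
    a relative `tolerance` of the fault-free ("golden") result.
    """
    rng = random.Random(seed)
    golden = kernel(data)
    successes = 0
    for _ in range(trials):
        corrupted = list(data)
        index = rng.randrange(len(corrupted))
        corrupted[index] = flip_bit(corrupted[index], rng.randrange(64))
        result = kernel(corrupted)
        # NaN and infinite results fail this comparison and count as failures.
        if abs(result - golden) <= tolerance * abs(golden):
            successes += 1
    return successes / trials


if __name__ == "__main__":
    inputs = [0.5 + 0.01 * i for i in range(100)]
    print(f"success rate: {campaign(inputs, trials=1000):.3f}")
```

Each trial reruns the whole kernel, so the cost of a campaign grows with both the number of trials and the cost of one run. That is the expense the abstract refers to when every run occupies many MPI processes.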



Published In

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
© 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

In-Cooperation

  • University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • U.S. National Science Foundation

Conference

ICPP 2018

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;
Overall Acceptance Rate 91 of 313 submissions, 29%


Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 113
  • Downloads (last 12 months): 9
  • Downloads (last 6 weeks): 2

Reflects downloads up to 11 Dec 2024
