DOI: 10.1145/2751504.2751507

Resilient Matrix Multiplication of Hierarchical Semi-Separable Matrices

Published: 15 June 2015

Abstract

The hierarchical semi-separable (HSS) matrix factorization has useful characteristics for representing low-rank operators on extreme-scale computing systems. To prepare for the higher error rates anticipated on future architectures, this paper introduces new fault-tolerant algorithms for HSS matrix multiplication that remain efficient in the presence of high error rates. The measured runtime overhead of error checking and data preservation with the Containment Domains library is exceptionally small, which encourages frequent, fine-grained error checking when using algorithm-based fault tolerance.
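
For readers unfamiliar with algorithm-based fault tolerance, the following minimal sketch illustrates the general checksum idea on a small dense product. It is not the HSS algorithm introduced in the paper and does not use the Containment Domains library; it assumes the classic row and column checksum scheme for dense matrices, and all function names are hypothetical, chosen only for this illustration.

```python
# Illustrative sketch only: classic checksum-based ABFT for a dense product.
# This is NOT the paper's HSS-specific algorithm; names are hypothetical.

def matmul(a, b):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def find_errors(c_full, tol=1e-9):
    """Return (bad_rows, bad_cols) whose checksums disagree with the data."""
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(m)
                if abs(sum(c_full[i][:n]) - c_full[i][n]) > tol]
    bad_cols = [j for j in range(n)
                if abs(sum(c_full[i][j] for i in range(m)) - c_full[m][j]) > tol]
    return bad_rows, bad_cols

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

# The product of the checksum-augmented inputs carries its own checksums.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean = find_errors(c_full)

c_full[0][1] += 1.0  # simulate a silent data corruption in one entry
corrupted = find_errors(c_full)
```

Because the check costs only a few extra sums per block, it can in principle be run often. In this sketch, a single corrupted entry shows up as one failing row and one failing column, which together locate it.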


Cited By

  • (2022) Resiliency in numerical algorithm design for extreme scale simulations. International Journal of High Performance Computing Applications, 36:2 (251-285). DOI: 10.1177/10943420211055188. Online publication date: 1-Mar-2022.
  • (2021) Doubt and Redundancy Kill Soft Errors: Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software. 2021 IEEE/ACM 11th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), (1-10). DOI: 10.1109/FTXS54580.2021.00005. Online publication date: Nov-2021.
  • (2018) Influence of A-Posteriori Subcell Limiting on Fault Frequency in Higher-Order DG Schemes. 2018 IEEE/ACM 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS), (79-86). DOI: 10.1109/FTXS.2018.00012. Online publication date: Nov-2018.

    Published In

    FTXS '15: Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale
    June 2015
    78 pages
    ISBN:9781450335690
    DOI:10.1145/2751504
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. abft
    2. error detection
    3. hss
    4. numerical methods

    Qualifiers

    • Research-article

    Conference

    HPDC'15

    Acceptance Rates

    FTXS '15 Paper Acceptance Rate 9 of 15 submissions, 60%;
    Overall Acceptance Rate 16 of 25 submissions, 64%
