[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2503210.2503249acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Parallel reduction to hessenberg form with algorithm-based fault tolerance

Published: 17 November 2013 Publication History

Abstract

This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.

References

[1]
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, Third edition, 1999.
[2]
M. Berry and M. Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, Philadelphia, Second edition, 2005.
[3]
C. H. Bischof and C. V. Loan. The WY Representation for Products of Householder Matrices. In Parallel Processing for Scientific Computing, pages 2--13, 1985.
[4]
L. S. Blackford, J. Choi, A. Cleary, E. D'Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997.
[5]
W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. In Proceedings of the 18th international conference on Parallel Processing, Euro-Par'12, pages 477--488, 2012.
[6]
W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI. Technical Report UT-CS-12-702, University of Tennessee, 2012.
[7]
D. Boley, G. H. Golub, S. Makar, N. Saxena, and E. J. McCluskey. Floating point fault tolerance with backward error assertions. IEEE Transactions on Computers, 44(2):302--311, February 1995.
[8]
G. Bosilca, A. Bouteiller, É. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Hérault, Y. Robert, F. Vivien, and D. Zaidouni. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. Technical Report RR-7950, INRIA, October 2012.
[9]
G. Bosilca, R. Delmas, J. Dongarra, and J. Langou. Algorithm-based fault tolerance applied to high performance computing. J. Parallel Distrib. Comput., 69(4):410--416, April 2009.
[10]
M. Bougeret, H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Using Group Replication for Resilience on Exascale Systems. Technical Report RR-7876, INRIA, February 2012.
[11]
K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm. ii. aggressive early deflation. SIAM J. Matrix Anal. Appl., 23:948--973, 2002.
[12]
S. Brin and L. Page. The antaomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 33:107--17, 1998. Also available online at http://infolab.stanford.edu/pub/papers/google.pdf.
[13]
K. Bryan and T. Leise. The $25,000,000,000 eigenvector. the linear algebra behind google. SIAM Review, 48(3):569--81, 2006. Also avaiable at http://www.rose-hulman.edu/~bryan/google.html.
[14]
H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Combining Process Replication and Checkpointing for Resilience on Exascale Systems. Technical Report RR-7951, INRIA, May 2012.
[15]
A. M. Castaldo, R. C. Whaley, and A. T. Chronopoulos. Reducing floating point error in dot product using the superblock family of algorithms. SIAM J. Sci. Comput., 31(2):1156--1174, 2008.
[16]
Z. Chen. Scalable techniques for fault tolerant high performance computing. PhD thesis, University of Tennessee, Knoxville, TN, USA, 2006.
[17]
T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen. High Performance Linpack Benchmark: A Fault Tolerant Implementation Without Checkpointing. In Proceedings of the International Conference on Supercomputing, ICS '11, pages 162--171, New York, NY, USA, 2011. ACM.
[18]
J. Dongarra, P. Beckman, and T. Moore. The international Exascale software project roadmap. Int. J. High Perform. Comput. Appl., 25(1):3--60, February 2011.
[19]
J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience, 15:1--18, 2003.
[20]
J. J. Dongarra and R. A. van de Geijn. Reduction to condensed form for the eigenvalue problem on distributed memory architectures. Parallel Computing, 18(9):973--982, 1992.
[21]
P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. SIGPLAN Not., 47(8):225--234, February 2012.
[22]
J. G. F. Francis. The QR transformation a unitary analogue to the LR transformation. I. Comput. J., 4:265--271, 1961.
[23]
J. G. F. Francis. The QR transformation II. Comput. J., 4:332--345, 1962.
[24]
G. H. Golub and C. F. V. Loan. Matrix Computations. The John Hopkins University Press, 4th edition, December 27 2012. ISBN-10: 1421407949, ISBN-13: 978-1421407944.
[25]
L. A. B. Gomez, N. Maruyama, F. Cappello, and S. Matsuoka. Distributed Diskless Checkpoint for Large Scale Systems. In Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, pages 63--72, 2010.
[26]
R. Granat, B. Kågström, and D. Kressner. A novel parallel QR algorithm for hybrid distributed memory HPC systems. SIAM J. Sci. Comput., 32:2345--2378, 2010.
[27]
R. Granat, B. Kågström, D. Kressner, and M. Shao. Parallel library software for the multishift QR algorithm with aggressive early deflation. Technical Report UMINF-12.06, Dept. of Computing Science, Umeøa University, Sweden, 2012.
[28]
D. Hakkarinen and Z. Chen. Algorithmic Cholesky Factorization Fault Recovery. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1--10, April 2010.
[29]
K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput., 33(6):518--528, June 1984.
[30]
R. M. K. Braman, R. Byers. The multishift QR algorithm. I. maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl., 23:929--947, 2002.
[31]
B. Kågström, D. Kressner, and M. Shao. On aggressive early deflation in parallel variants of the QR algorithm. In PARA 2010, Applied Parallel and Scientific Computing, LNCS, volume 7134, pages 1--10. Springer, 2012.
[32]
Y. Kim, J. S. Plank, and J. Dongarra. Fault Tolerant Matrix Operations Using Checksum and Reverse Computation. In 6th Symposium on the Frontiers of Massively Parallel Computation, pages 70--77, Annapolis, MD, October 1996.
[33]
J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. Recovery patterns for iterative methods in a parallel unstable environment. SIAM J. Sci. Comput., 30(1):102--116, November 2007.
[34]
A. Langville and C. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.
[35]
C.-D. Lu. Scalable Diskless Checkpointing for Large Parallel Systems. PhD thesis, University of Illinois, Urbana, Illinois, USA, 2005.
[36]
F. T. Luk and H. Park. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Trans. Comput., 37(11):1434--1438, November 1988.
[37]
E. Meneses. Clustering Parallel Applications to Enhance Message Logging Protocol. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-emenese.pdf?version=1\&modificationDate=1290466786000.
[38]
A. Petitet, C. Whaley, J. Dongarra, and A. Cleary. HPL - a Portable Implementation of the High-Performance Linpack Benchmark for Distributed-memory Computers, September 2008. http://www.netlib.org/benchmark/hpl/.
[39]
J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Trans. Parallel Distrib. Syst., 9(10):972--986, October 1998.
[40]
R. Schreiber and C. V. Loan. A storage efficient WY representation for products of householder transformations. SIAM Journal on Scientific and Statistical Computing, 10, 1989.
[41]
B. Schroeder and G. A. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78, 2007.
[42]
G. W. Stewart. Matrix Algorithms, Volume II: Eigensystems. SIAM: Society for Industrial and Applied Mathematics, First edition, August 2001.
[43]
U. von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17:395--416, 2007.
[44]
J. H. Wilkinson. The Algebraic Eigenvalue Problem. Oxford University Press, Inc., New York, NY, USA, 1988. ISBN-10: 0198534183, ISBN-13: 978-0198534181.
[45]
J. H. Wilkinson. Rounding Errors in Algebraic Processes. Dover, New York, 1994.
[46]
E. Yao, R. Wang, M. Chen, G. Tan, and N. Sun. A Case Study of Designing Efficient Algorithm-based Fault Tolerant Application for Exascale Parallelism. In Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 438--448, May 2012.

Cited By

View all
  • (2022)Resiliency in numerical algorithm design for extreme scale simulationsInternational Journal of High Performance Computing Applications10.1177/1094342021105518836:2(251-285)Online publication date: 1-Mar-2022
  • (2021)Probabilistic and Temporal Failure Detectors for Solving Distributed ProblemsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2021.07.017Online publication date: Jul-2021
  • (2017)Silent Data Corruption Resilient Two-sided Matrix FactorizationsACM SIGPLAN Notices10.1145/3155284.301875052:8(415-427)Online publication date: 26-Jan-2017
  • Show More Cited By

Index Terms

  1. Parallel reduction to hessenberg form with algorithm-based fault tolerance

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
    November 2013
    1123 pages
    ISBN:9781450323789
    DOI:10.1145/2503210
    • General Chair:
    • William Gropp,
    • Program Chair:
    • Satoshi Matsuoka
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 17 November 2013

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. ScaLAPACK
    2. algorithm-based fault tolerance
    3. dense linear algebra
    4. hessenberg reduction
    5. parallel numerical libraries

    Qualifiers

    • Research-article

    Funding Sources

    Conference

    SC13
    Sponsor:

    Acceptance Rates

    SC '13 Paper Acceptance Rate 91 of 449 submissions, 20%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Upcoming Conference

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)6
    • Downloads (Last 6 weeks)2
    Reflects downloads up to 10 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Resiliency in numerical algorithm design for extreme scale simulationsInternational Journal of High Performance Computing Applications10.1177/1094342021105518836:2(251-285)Online publication date: 1-Mar-2022
    • (2021)Probabilistic and Temporal Failure Detectors for Solving Distributed ProblemsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2021.07.017Online publication date: Jul-2021
    • (2017)Silent Data Corruption Resilient Two-sided Matrix FactorizationsACM SIGPLAN Notices10.1145/3155284.301875052:8(415-427)Online publication date: 26-Jan-2017
    • (2017)Silent Data Corruption Resilient Two-sided Matrix FactorizationsProceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3018743.3018750(415-427)Online publication date: 26-Jan-2017
    • (2017)Fault tolerant communication-optimal 2.5D matrix multiplicationJournal of Parallel and Distributed Computing10.1016/j.jpdc.2017.01.022104:C(179-190)Online publication date: 1-Jun-2017
    • (2017)A general CFD framework for fault-resilient simulations based on multi-resolution information fusionJournal of Computational Physics10.1016/j.jcp.2017.06.044347:C(290-304)Online publication date: 15-Oct-2017
    • (2016)Performance analysis of different matrix decomposition methods on face recognition2016 International Conference on Computer Communication and Informatics (ICCCI)10.1109/ICCCI.2016.7479981(1-6)Online publication date: Jan-2016
    • (2015)Resilient Matrix Multiplication of Hierarchical Semi-Separable MatricesProceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale10.1145/2751504.2751507(19-26)Online publication date: 15-Jun-2015

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media