More Web Proxy on the site http://driver.im/

research-article

Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems

Authors:

Emmanuel Agullo,

Alfredo Buttari,

Abdou Guermouche,

Florent LopezAuthors Info & Claims

ACM Transactions on Mathematical Software (TOMS), Volume 43, Issue 2

Article No.: 13, Pages 1 - 22

https://doi.org/10.1145/2898348

Published: 16 August 2016 Publication History

Abstract

To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This article evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications, namely, sparse matrix multifrontal factorizations that feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse, direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system.

References

[1]

Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche, and Florent Lopez. 2013. Multifrontal QR factorization for multicore architectures over runtime systems. In Proceedings of the 19th international Conference on Parallel Processing (Euro-Par 2013). Springer, Berlin, 521--532. http://dx.doi.org/10.1007/978-3-642-40047-6_53

Digital Library

[2]

Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series 180, 1 (2009), 012037. http://stacks.iop.org/1742-6596/180/i=1/a=012037

[3]

Emmanuel Agullo, Jack Dongarra, Rajib Nath, and Stanimire Tomov. 2011. Fully empirical autotuned QR factorization for multicore architectures. CoRR abs/1102.5328 (2011).

[4]

Randy Allen and Ken Kennedy. 2002. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann.

Digital Library

[5]

Patrick R. Amestoy, Iain S. Duff, and Chiara Puglisi. 1996. Multifrontal QR factorization in a multiprocessor environment. Numerical Linear Algebra with Applications 3(4) (1996), 275--300.

[6]

The OpenMP architecture review board. 2013. OpenMP 4.0 Complete specifications. (2013).

[7]

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. 2001. Thread scheduling for multiprogrammed multiprocessors. Theory of Computing Systems 34, 2 (2001), 115--144.

[8]

Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009 23 (Feb. 2011), 187--198.

Digital Library

[9]

Rosa M. Badia, José R. Herrero, Jesús Labarta, Josep M. Pérez, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí. 2009. Parallelizing dense and banded linear algebra libraries using SMPSs. Concurrency and Computation: Practice and Experience 21, 18 (2009), 2438--2456.

Digital Library

[10]

George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Hérault, Pierre Lemarinier, and Jack Dongarra. 2012. DAGuE: A generic distributed DAG engine for high performance computing. Parallel Computing 38, 1--2 (2012), 37--51.

Digital Library

[11]

George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Piotr Luszczek, and Jack Dongarra. 2013. Dense linear algebra on distributed heterogeneous hardware with a symbolic DAG approach. Scalable Computing and Communications: Theory and Practice (2013), 699--733.

[12]

Henricus Bouwmeester, Mathias Jacquelin, Julien Langou, and Yves Robert. 2011. Tiled QR factorization algorithms. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11). ACM, 7:1--7:11.

Digital Library

[13]

Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, and others. 2010. Concurrent collections. Scientific Programming 18, 3 (2010), 203--217.

Digital Library

[14]

Alfredo Buttari. 2013. Fine-grained multithreading for the multifrontal QR factorization of sparse matrices. SIAM Journal on Scientific Computing 35, 4 (2013), C323--C345. http://epubs.siam.org/doi/abs/10.1137/110846427

Digital Library

[15]

Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. 2009. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing 35 (Jan. 2009), 38--53.

Digital Library

[16]

Michel Cosnard and Emmanuel Jeannot. 2001. Automatic parallelization techniques based on compact DAG extraction and symbolic scheduling. Parallel Processing Letters 11, 01 (2001), 151--168.

[17]

Timothy A. Davis. 2011. Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization. ACM Transactions on Mathematical Software 38, 1 (Dec. 2011), 8:1--8:22.

Digital Library

[18]

Timothy A. Davis. 2014. SuiteSparse 4.4.0. (October 2014). Software package.

[19]

Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software 38, 1 (Dec. 2011), 1:1--1:25. http://doi.acm.org/10.1145/2049662.2049663

Digital Library

[20]

James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. 2012. Communication-optimal parallel and sequential QR and LU factorizations. SIAM Journal of Scientific Computing 34, 1 (Feb. 2012), 206--239. http://dx.doi.org/10.1137/080731992

Digital Library

[21]

Edsger W. Dijkstra. 1965. Een algorithme ter voorkoming van de dodelijke omarming. (1965). http://www.cs.utexas.edu/users/EWD/ewd01xx/EWD108.PDF (circulated privately).

[22]

Edsger W. Dijkstra. 1982. The mathematics behind the banker's algorithm. In Selected Writings on Computing: A Personal Perspective. Springer, New York, 308--312. 4612-5695-3_54

Digital Library

[23]

Jack Dongarra, Mathieu Faverge, Thomas Hérault, Mathias Jacquelin, Julien Langou, and Yves Robert. 2013. Hierarchical QR factorization algorithms for multi-core clusters. Parallel Computing 39, 4--5 (2013), 212--232. http://hal.inria.fr/hal-00809770.

Digital Library

[24]

Iain S. Duff and John K. Reid. 1983. The multifrontal solution of indefinite sparse symmetric linear systems. ACM Transactions On Mathematical Software 9 (1983), 302--325.

Digital Library

[25]

Lionel Eyraud-Dubois, Loris Marchal, Oliver Sinnen, and Frédéric Vivien. 2015. Parallel scheduling of task trees with limited memory. ACM Transactions on Parallel Computing 2, 2 (June 2015), 13:1--13:37.

Digital Library

[26]

Al Geist and Esmond G. Ng. 1989. Task scheduling for parallel sparse Cholesky factorization. International Journal of Parallel Programming 18 (1989), 291--314.

Digital Library

[27]

Abdou Guermouche, Jean-Yves L'Excellent, and Gilles Utard. 2003. Impact of reordering on the memory of a multifrontal solver. Parallel Computing 29, 9 (2003), 1191--1218.

Digital Library

[28]

Everton Hermann, Bruno Raffin, François Faure, Thierry Gautier, and Jérémie Allard. 2010. Multi-GPU and multi-CPU parallelization for interactive physics simulations. In 16th International Euro-Par Conference on Parallel Processing: Part II. 235--246.

Digital Library

[29]

Jonathan Hogg, John K. Reid, and Jennifer A. Scott. 2010. Design of a multicore sparse Cholesky factorization using DAGs. SIAM Journal of Scientific Computing 32, 6 (2010), 3627--3649.

Digital Library

[30]

Kyungjoo Kim and Victor Eijkhout. 2014. A parallel sparse direct solver via hierarchical DAG scheduling. ACM Transactions on Mathematical Software 41, 1 (Oct. 2014), 3:1--3:27.

Digital Library

[31]

Jakub Kurzak and Jack Dongarra. 2009. Fully dynamic scheduler for numerical computing on multicore processors. LAPACK Working Note lawn220 (2009).

[32]

Xavier Lacoste, Mathieu Faverge, Pierre Ramet, Samuel Thibault, and George Bosilca. 2014. Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes. Research Report RR-8446, INRIA.

[33]

Joseph W. H. Liu. 1986. On the storage requirement in the out-of-core multifrontal method for sparse factorization. ACM Transactions on Mathematical Software 12 (1986), 127--148.

Digital Library

[34]

François Pellegrini and Jean Roman. 1996. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In Proceedings of HPCN'96, Lecture Notes in Computer Science, Vol. 1067. Springer, 493--498.

Digital Library

[35]

Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. Van De Geijn, Field G. Van Zee, and Ernie Chan. 2009. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software 36, 3 (2009).

Digital Library

[36]

Franois-Henry Rouet. 2012. Memory and Performance Issues in Parallel Multifrontal Factorizations and Triangular Solutions with Sparse Right-Hand Sides. Thèse de doctorat. Institut National Polytechnique de Toulouse, Toulouse, France. http://tel.archives-ouvertes.fr/tel-00785748

[37]

Robert Schreiber. 1982. A new implementation of sparse Gaussian elimination. ACM Transactions on Mathematical Software 8 (1982), 256--276.

Digital Library

[38]

Haluk Topcuouglu, Salim Hariri, and Min-You Wu. 2002. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems 13, 3 (Mar. 2002), 260--274.

Digital Library

[39]

Sencer Yeralan, Timothy A. Davis, and Sanjay Ranka. 2015. Sparse QR factorization on the GPU. (Jan. 2015). Submitted to ACM Transactions on Mathematical Software.

Cited By

Agullo EButtari AGuermouche AHerrmann JJego A(2023)Task-based Parallel Programming for Scalable Matrix Product AlgorithmsACM Transactions on Mathematical Software10.1145/358356049:2(1-23)Online publication date: 15-Jun-2023
https://dl.acm.org/doi/10.1145/3583560
Gnanasekaran ADarve ERobinson T(2022)Scalable low-rank factorization using a task-based runtime system with distributed memoryProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3539781.3539791(1-11)Online publication date: 27-Jun-2022
https://dl.acm.org/doi/10.1145/3539781.3539791
Al Daas HJolivet PScott J(2022)A Robust Algebraic Domain Decomposition Preconditioner for Sparse Normal EquationsSIAM Journal on Scientific Computing10.1137/21M143489144:3(A1047-A1068)Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1137/21M1434891
Show More Cited By

Index Terms

Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems

Recommendations

Multifrontal QR factorization for multicore architectures over runtime systems
Euro-Par'13: Proceedings of the 19th international conference on Parallel Processing

To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer ...
An Evaluation of Task-Parallel Frameworks for Sparse Solvers on Multicore and Manycore CPU Architectures
ICPP '21: Proceedings of the 50th International Conference on Parallel Processing

Recently, several task-parallel programming models have emerged to address the high synchronization and load imbalance issues as well as data movement overheads in modern shared memory architectures. OpenMP, the most commonly used shared memory ...
High performance sparse multifrontal solvers on modern GPUs
Abstract
We have ported the numerical factorization and triangular solve phases of the sparse direct solver STRUMPACK to GPU. STRUMPACK implements sparse LU factorization using the multifrontal algorithm, which performs most of its operations ...
Highlights
- Ported the numerical factorization and triangular solve phases of the sparse direct multifrontal solver STRUMPACK to GPU.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software

ACM Transactions on Mathematical Software Volume 43, Issue 2

June 2017

200 pages

ISSN:0098-3500

EISSN:1557-7295

DOI:10.1145/2988256

Editor:
Michael A. Heroux
Sandia National Laboratories, USA

Issue’s Table of Contents

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2016

Accepted: 01 February 2016

Revised: 01 November 2015

Received: 01 November 2014

Published in TOMS Volume 43, Issue 2

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

Agence Nationale de la Recherche

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

29
Total Citations
View Citations
264
Total Downloads

Downloads (Last 12 months)15
Downloads (Last 6 weeks)3

Reflects downloads up to 13 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Agullo EButtari AGuermouche AHerrmann JJego A(2023)Task-based Parallel Programming for Scalable Matrix Product AlgorithmsACM Transactions on Mathematical Software10.1145/358356049:2(1-23)Online publication date: 15-Jun-2023
https://dl.acm.org/doi/10.1145/3583560
Gnanasekaran ADarve ERobinson T(2022)Scalable low-rank factorization using a task-based runtime system with distributed memoryProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3539781.3539791(1-11)Online publication date: 27-Jun-2022
https://dl.acm.org/doi/10.1145/3539781.3539791
Al Daas HJolivet PScott J(2022)A Robust Algebraic Domain Decomposition Preconditioner for Sparse Normal EquationsSIAM Journal on Scientific Computing10.1137/21M143489144:3(A1047-A1068)Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1137/21M1434891
Nesi LSchnorr LLegrand A(2022)Multi-Phase Task-Based HPC Applications: Quickly Learning how to Run Fast2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00042(357-367)Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00042
Le Fevre VUsui TCasas M(2022)A Selective Nesting Approach for the Sparse Multi-threaded Cholesky Factorization2022 IEEE/ACM 7th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2)10.1109/ESPM256814.2022.00006(1-9)Online publication date: Nov-2022
https://doi.org/10.1109/ESPM256814.2022.00006
Cogo Miletto MLeandro Nesi LMello Schnorr LLegrand A(2022)Performance analysis of task-based multi-frontal sparse linear solvers: Structure mattersFuture Generation Computer Systems10.1016/j.future.2022.05.013135(409-425)Online publication date: Oct-2022
https://doi.org/10.1016/j.future.2022.05.013
Bathie GMarchal LRobert YThibault S(2021)Dynamic DAG Scheduling Under Memory Constraints for Shared-Memory PlatformsInternational Journal of Networking and Computing10.15803/ijnc.11.1_2711:1(27-49)Online publication date: 2021
https://doi.org/10.15803/ijnc.11.1_27
Cambier LDarve ERobinson T(2021)A task-based distributed parallel sparsified nested dissection algorithmProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3468267.3470619(1-11)Online publication date: 5-Jul-2021
https://dl.acm.org/doi/10.1145/3468267.3470619
Mo TLi R(2020)Iteratively solving sparse linear system based on PaRSEC task schedulingInternational Journal of High Performance Computing Applications10.1177/109434201989999734:3(306-315)Online publication date: 1-May-2020
https://dl.acm.org/doi/10.1177/1094342019899997
Lu ZNiu YLiu W(2020)Efficient Block Algorithms for Parallel Sparse Triangular SolveProceedings of the 49th International Conference on Parallel Processing10.1145/3404397.3404413(1-11)Online publication date: 17-Aug-2020
https://dl.acm.org/doi/10.1145/3404397.3404413
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents