[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Implementing Multifrontal Sparse Solvers for Multicore Architectures with Sequential Task Flow Runtime Systems

Published: 16 August 2016 Publication History

Abstract

To face the advent of multicore processors and the ever increasing complexity of hardware architectures, programming models based on DAG parallelism regained popularity in the high performance, scientific computing community. Modern runtime systems offer a programming interface that complies with this paradigm and powerful engines for scheduling the tasks into which the application is decomposed. These tools have already proved their effectiveness on a number of dense linear algebra applications. This article evaluates the usability and effectiveness of runtime systems based on the Sequential Task Flow model for complex applications, namely, sparse matrix multifrontal factorizations that feature extremely irregular workloads, with tasks of different granularities and characteristics and with a variable memory consumption. Most importantly, it shows how this parallel programming model eases the development of complex features that benefit the performance of sparse, direct solvers as well as their memory consumption. We illustrate our discussion with the multifrontal QR factorization running on top of the StarPU runtime system.

References

[1]
Emmanuel Agullo, Alfredo Buttari, Abdou Guermouche, and Florent Lopez. 2013. Multifrontal QR factorization for multicore architectures over runtime systems. In Proceedings of the 19th international Conference on Parallel Processing (Euro-Par 2013). Springer, Berlin, 521--532. http://dx.doi.org/10.1007/978-3-642-40047-6_53
[2]
Emmanuel Agullo, Jim Demmel, Jack Dongarra, Bilel Hadri, Jakub Kurzak, Julien Langou, Hatem Ltaief, Piotr Luszczek, and Stanimire Tomov. 2009. Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series 180, 1 (2009), 012037. http://stacks.iop.org/1742-6596/180/i=1/a=012037
[3]
Emmanuel Agullo, Jack Dongarra, Rajib Nath, and Stanimire Tomov. 2011. Fully empirical autotuned QR factorization for multicore architectures. CoRR abs/1102.5328 (2011).
[4]
Randy Allen and Ken Kennedy. 2002. Optimizing Compilers for Modern Architectures: A Dependence-Based Approach. Morgan Kaufmann.
[5]
Patrick R. Amestoy, Iain S. Duff, and Chiara Puglisi. 1996. Multifrontal QR factorization in a multiprocessor environment. Numerical Linear Algebra with Applications 3(4) (1996), 275--300.
[6]
The OpenMP architecture review board. 2013. OpenMP 4.0 Complete specifications. (2013).
[7]
Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. 2001. Thread scheduling for multiprogrammed multiprocessors. Theory of Computing Systems 34, 2 (2001), 115--144.
[8]
Cedric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, Special Issue: Euro-Par 2009 23 (Feb. 2011), 187--198.
[9]
Rosa M. Badia, José R. Herrero, Jesús Labarta, Josep M. Pérez, Enrique S. Quintana-Ortí, and Gregorio Quintana-Ortí. 2009. Parallelizing dense and banded linear algebra libraries using SMPSs. Concurrency and Computation: Practice and Experience 21, 18 (2009), 2438--2456.
[10]
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Hérault, Pierre Lemarinier, and Jack Dongarra. 2012. DAGuE: A generic distributed DAG engine for high performance computing. Parallel Computing 38, 1--2 (2012), 37--51.
[11]
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Piotr Luszczek, and Jack Dongarra. 2013. Dense linear algebra on distributed heterogeneous hardware with a symbolic DAG approach. Scalable Computing and Communications: Theory and Practice (2013), 699--733.
[12]
Henricus Bouwmeester, Mathias Jacquelin, Julien Langou, and Yves Robert. 2011. Tiled QR factorization algorithms. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11). ACM, 7:1--7:11.
[13]
Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, and others. 2010. Concurrent collections. Scientific Programming 18, 3 (2010), 203--217.
[14]
Alfredo Buttari. 2013. Fine-grained multithreading for the multifrontal QR factorization of sparse matrices. SIAM Journal on Scientific Computing 35, 4 (2013), C323--C345. http://epubs.siam.org/doi/abs/10.1137/110846427
[15]
Alfredo Buttari, Julien Langou, Jakub Kurzak, and Jack Dongarra. 2009. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing 35 (Jan. 2009), 38--53.
[16]
Michel Cosnard and Emmanuel Jeannot. 2001. Automatic parallelization techniques based on compact DAG extraction and symbolic scheduling. Parallel Processing Letters 11, 01 (2001), 151--168.
[17]
Timothy A. Davis. 2011. Algorithm 915, SuiteSparseQR: Multifrontal multithreaded rank-revealing sparse QR factorization. ACM Transactions on Mathematical Software 38, 1 (Dec. 2011), 8:1--8:22.
[18]
Timothy A. Davis. 2014. SuiteSparse 4.4.0. (October 2014). Software package.
[19]
Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software 38, 1 (Dec. 2011), 1:1--1:25. http://doi.acm.org/10.1145/2049662.2049663
[20]
James Demmel, Laura Grigori, Mark Hoemmen, and Julien Langou. 2012. Communication-optimal parallel and sequential QR and LU factorizations. SIAM Journal of Scientific Computing 34, 1 (Feb. 2012), 206--239. http://dx.doi.org/10.1137/080731992
[21]
Edsger W. Dijkstra. 1965. Een algorithme ter voorkoming van de dodelijke omarming. (1965). http://www.cs.utexas.edu/users/EWD/ewd01xx/EWD108.PDF (circulated privately).
[22]
Edsger W. Dijkstra. 1982. The mathematics behind the banker's algorithm. In Selected Writings on Computing: A Personal Perspective. Springer, New York, 308--312. 4612-5695-3_54
[23]
Jack Dongarra, Mathieu Faverge, Thomas Hérault, Mathias Jacquelin, Julien Langou, and Yves Robert. 2013. Hierarchical QR factorization algorithms for multi-core clusters. Parallel Computing 39, 4--5 (2013), 212--232. http://hal.inria.fr/hal-00809770.
[24]
Iain S. Duff and John K. Reid. 1983. The multifrontal solution of indefinite sparse symmetric linear systems. ACM Transactions On Mathematical Software 9 (1983), 302--325.
[25]
Lionel Eyraud-Dubois, Loris Marchal, Oliver Sinnen, and Frédéric Vivien. 2015. Parallel scheduling of task trees with limited memory. ACM Transactions on Parallel Computing 2, 2 (June 2015), 13:1--13:37.
[26]
Al Geist and Esmond G. Ng. 1989. Task scheduling for parallel sparse Cholesky factorization. International Journal of Parallel Programming 18 (1989), 291--314.
[27]
Abdou Guermouche, Jean-Yves L'Excellent, and Gilles Utard. 2003. Impact of reordering on the memory of a multifrontal solver. Parallel Computing 29, 9 (2003), 1191--1218.
[28]
Everton Hermann, Bruno Raffin, François Faure, Thierry Gautier, and Jérémie Allard. 2010. Multi-GPU and multi-CPU parallelization for interactive physics simulations. In 16th International Euro-Par Conference on Parallel Processing: Part II. 235--246.
[29]
Jonathan Hogg, John K. Reid, and Jennifer A. Scott. 2010. Design of a multicore sparse Cholesky factorization using DAGs. SIAM Journal of Scientific Computing 32, 6 (2010), 3627--3649.
[30]
Kyungjoo Kim and Victor Eijkhout. 2014. A parallel sparse direct solver via hierarchical DAG scheduling. ACM Transactions on Mathematical Software 41, 1 (Oct. 2014), 3:1--3:27.
[31]
Jakub Kurzak and Jack Dongarra. 2009. Fully dynamic scheduler for numerical computing on multicore processors. LAPACK Working Note lawn220 (2009).
[32]
Xavier Lacoste, Mathieu Faverge, Pierre Ramet, Samuel Thibault, and George Bosilca. 2014. Taking Advantage of Hybrid Systems for Sparse Direct Solvers via Task-Based Runtimes. Research Report RR-8446, INRIA.
[33]
Joseph W. H. Liu. 1986. On the storage requirement in the out-of-core multifrontal method for sparse factorization. ACM Transactions on Mathematical Software 12 (1986), 127--148.
[34]
François Pellegrini and Jean Roman. 1996. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In Proceedings of HPCN'96, Lecture Notes in Computer Science, Vol. 1067. Springer, 493--498.
[35]
Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Robert A. Van De Geijn, Field G. Van Zee, and Ernie Chan. 2009. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software 36, 3 (2009).
[36]
Franois-Henry Rouet. 2012. Memory and Performance Issues in Parallel Multifrontal Factorizations and Triangular Solutions with Sparse Right-Hand Sides. Thèse de doctorat. Institut National Polytechnique de Toulouse, Toulouse, France. http://tel.archives-ouvertes.fr/tel-00785748
[37]
Robert Schreiber. 1982. A new implementation of sparse Gaussian elimination. ACM Transactions on Mathematical Software 8 (1982), 256--276.
[38]
Haluk Topcuouglu, Salim Hariri, and Min-You Wu. 2002. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems 13, 3 (Mar. 2002), 260--274.
[39]
Sencer Yeralan, Timothy A. Davis, and Sanjay Ranka. 2015. Sparse QR factorization on the GPU. (Jan. 2015). Submitted to ACM Transactions on Mathematical Software.

Cited By

View all
  • (2023)Task-based Parallel Programming for Scalable Matrix Product AlgorithmsACM Transactions on Mathematical Software10.1145/358356049:2(1-23)Online publication date: 15-Jun-2023
  • (2022)Scalable low-rank factorization using a task-based runtime system with distributed memoryProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3539781.3539791(1-11)Online publication date: 27-Jun-2022
  • (2022)A Robust Algebraic Domain Decomposition Preconditioner for Sparse Normal EquationsSIAM Journal on Scientific Computing10.1137/21M143489144:3(A1047-A1068)Online publication date: 1-Jan-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Mathematical Software
ACM Transactions on Mathematical Software  Volume 43, Issue 2
June 2017
200 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/2988256
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 August 2016
Accepted: 01 February 2016
Revised: 01 November 2015
Received: 01 November 2014
Published in TOMS Volume 43, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Sparse direct solvers
  2. communication-avoiding
  3. memory-aware
  4. multicores
  5. runtime systems

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)15
  • Downloads (Last 6 weeks)3
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Task-based Parallel Programming for Scalable Matrix Product AlgorithmsACM Transactions on Mathematical Software10.1145/358356049:2(1-23)Online publication date: 15-Jun-2023
  • (2022)Scalable low-rank factorization using a task-based runtime system with distributed memoryProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3539781.3539791(1-11)Online publication date: 27-Jun-2022
  • (2022)A Robust Algebraic Domain Decomposition Preconditioner for Sparse Normal EquationsSIAM Journal on Scientific Computing10.1137/21M143489144:3(A1047-A1068)Online publication date: 1-Jan-2022
  • (2022)Multi-Phase Task-Based HPC Applications: Quickly Learning how to Run Fast2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00042(357-367)Online publication date: May-2022
  • (2022)A Selective Nesting Approach for the Sparse Multi-threaded Cholesky Factorization2022 IEEE/ACM 7th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2)10.1109/ESPM256814.2022.00006(1-9)Online publication date: Nov-2022
  • (2022)Performance analysis of task-based multi-frontal sparse linear solvers: Structure mattersFuture Generation Computer Systems10.1016/j.future.2022.05.013135(409-425)Online publication date: Oct-2022
  • (2021)Dynamic DAG Scheduling Under Memory Constraints for Shared-Memory PlatformsInternational Journal of Networking and Computing10.15803/ijnc.11.1_2711:1(27-49)Online publication date: 2021
  • (2021)A task-based distributed parallel sparsified nested dissection algorithmProceedings of the Platform for Advanced Scientific Computing Conference10.1145/3468267.3470619(1-11)Online publication date: 5-Jul-2021
  • (2020)Iteratively solving sparse linear system based on PaRSEC task schedulingInternational Journal of High Performance Computing Applications10.1177/109434201989999734:3(306-315)Online publication date: 1-May-2020
  • (2020)Efficient Block Algorithms for Parallel Sparse Triangular SolveProceedings of the 49th International Conference on Parallel Processing10.1145/3404397.3404413(1-11)Online publication date: 17-Aug-2020
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media