DISTAL: the distributed tensor algebra compiler

Published: 09 June 2022

Abstract

We introduce DISTAL, a compiler for dense tensor algebra that targets modern distributed and heterogeneous systems. DISTAL lets users independently describe how tensors and computation map onto target machines through separate format and scheduling languages. The combination of choices for data and computation distribution creates a large design space that includes many algorithms from both the past (e.g., Cannon’s algorithm) and the present (e.g., COSMA). DISTAL compiles a tensor algebra domain-specific language to a distributed task-based runtime system and supports nodes with multi-core CPUs and multiple GPUs. Code generated by DISTAL is competitive with optimized codes for matrix multiply on 256 nodes of the Lassen supercomputer and outperforms existing systems by 1.8x to 3.7x (with a 45.7x outlier) on higher-order tensor operations.
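The design space the abstract describes — independent choices of data distribution and computation distribution — can be made concrete with one classical point in that space: SUMMA-style matrix multiply over a 2D processor grid. The sketch below is a hypothetical serial simulation, not DISTAL's actual format or scheduling language; all function names (`tile`, `summa`, etc.) are illustrative inventions.

```python
# Hypothetical sketch (not DISTAL's API): a serial simulation of SUMMA,
# one classical point in the data/computation distribution design space.
# Matrices are tiled over a g x g processor grid; in step k, the owner of
# A-tile (i, k) conceptually broadcasts it along grid row i, and the owner
# of B-tile (k, j) broadcasts along grid column j.

def matmul(a, b):
    """Reference dense multiply on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(c, d):
    """Elementwise sum of two equally shaped tiles."""
    return [[x + y for x, y in zip(rc, rd)] for rc, rd in zip(c, d)]

def tile(M, g):
    """Split an n x n matrix into a g x g grid of tiles (g must divide n)."""
    t = len(M) // g
    return [[[row[j * t:(j + 1) * t] for row in M[i * t:(i + 1) * t]]
             for j in range(g)] for i in range(g)]

def untile(T):
    """Reassemble a grid of tiles into one matrix."""
    out = []
    for grid_row in T:
        for r in range(len(grid_row[0])):
            out.append([x for blk in grid_row for x in blk[r]])
    return out

def summa(A, B, g):
    """Simulate SUMMA: g broadcast rounds; processor (i, j) owns C-tile (i, j)."""
    At, Bt = tile(A, g), tile(B, g)
    t = len(At[0][0])
    C = [[[[0] * t for _ in range(t)] for _ in range(g)] for _ in range(g)]
    for k in range(g):            # one row/column broadcast round per k
        for i in range(g):
            for j in range(g):
                # processor (i, j) receives A[i][k] and B[k][j], accumulates locally
                C[i][j] = add(C[i][j], matmul(At[i][k], Bt[k][j]))
    return untile(C)
```

In DISTAL itself, the analogous decisions — where each tile lives, and which processor executes each step of the reduction over k — would be expressed separately through the format and scheduling languages, which is what lets one tensor algebra statement cover algorithms ranging from Cannon's algorithm to COSMA.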


Cited By

  • (2025) PartIR: Composing SPMD Partitioning Strategies for Machine Learning. ASPLOS '25, 794–810. https://doi.org/10.1145/3669940.3707284
  • (2025) Composing Distributed Computations Through Task and Kernel Fusion. ASPLOS '25, 182–197. https://doi.org/10.1145/3669940.3707216
  • (2024) Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10. SOSP '24, 505–521. https://doi.org/10.1145/3694715.3695955
  • (2024) (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM TOPLAS 46(3), 1–74. https://doi.org/10.1145/3665643
  • (2024) Compilation of Modular and General Sparse Workspaces. Proc. ACM Program. Lang. 8(PLDI), 1213–1238. https://doi.org/10.1145/3656426
  • (2024) Minimum Cost Loop Nests for Contraction of a Sparse Tensor with a Tensor Network. SPAA '24, 169–181. https://doi.org/10.1145/3626183.3659985
  • (2024) A Planner for Scalable Tensor Programs. IEEE BigData 2024, 54–63. https://doi.org/10.1109/BigData62323.2024.10825779
  • (2023) FFTX-IRIS: Towards Performance Portability and Heterogeneity for SPIRAL Generated Code. SC '23 Workshops, 1635–1641. https://doi.org/10.1145/3624062.3624242
  • (2023) Mosaic: An Interoperable Compiler for Tensor Algebra. Proc. ACM Program. Lang. 7(PLDI), 394–419. https://doi.org/10.1145/3591236
  • (2023) Automatic Generation of Distributed-Memory Mappings for Tensor Computations. SC '23, 1–13. https://doi.org/10.1145/3581784.3607096


Published In

PLDI 2022: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation
June 2022, 1038 pages
ISBN: 9781450392655
DOI: 10.1145/3519939
General Chair: Ranjit Jhala; Program Chair: Işil Dillig

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Compilers
  2. Distributed Systems
  3. High Performance Computing

Qualifiers

  • Research-article

Conference

PLDI '22

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%


Article Metrics

  • Downloads (last 12 months): 393
  • Downloads (last 6 weeks): 43

Reflects downloads up to 11 Feb 2025

