DISTAL: the distributed tensor algebra compiler

Published: 09 June 2022

Abstract

We introduce DISTAL, a compiler for dense tensor algebra that targets modern distributed and heterogeneous systems. DISTAL lets users independently describe how tensors and computation map onto target machines through separate format and scheduling languages. The combination of choices for data and computation distribution creates a large design space that includes many algorithms from both the past (e.g., Cannon’s algorithm) and the present (e.g., COSMA). DISTAL compiles a tensor algebra domain-specific language to a distributed task-based runtime system and supports nodes with multi-core CPUs and multiple GPUs. Code generated by DISTAL is competitive with optimized codes for matrix multiply on 256 nodes of the Lassen supercomputer and outperforms existing systems by 1.8x to 3.7x (with a 45.7x outlier) on higher-order tensor operations.
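The design space the abstract describes — independent choices of data distribution and computation distribution — can be made concrete with one classical point in that space: SUMMA-style matrix multiply over a 2D processor grid. The sketch below is a hypothetical serial simulation, not DISTAL's actual format or scheduling language; all function names (`tile`, `summa`, etc.) are illustrative inventions.

```python
# Hypothetical sketch (not DISTAL's API): a serial simulation of SUMMA,
# one classical point in the data/computation distribution design space.
# Matrices are tiled over a g x g processor grid; in step k, the owner of
# A-tile (i, k) conceptually broadcasts it along grid row i, and the owner
# of B-tile (k, j) broadcasts along grid column j.

def matmul(a, b):
    """Reference dense multiply on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add(c, d):
    """Elementwise sum of two equally shaped tiles."""
    return [[x + y for x, y in zip(rc, rd)] for rc, rd in zip(c, d)]

def tile(M, g):
    """Split an n x n matrix into a g x g grid of tiles (g must divide n)."""
    t = len(M) // g
    return [[[row[j * t:(j + 1) * t] for row in M[i * t:(i + 1) * t]]
             for j in range(g)] for i in range(g)]

def untile(T):
    """Reassemble a grid of tiles into one matrix."""
    out = []
    for grid_row in T:
        for r in range(len(grid_row[0])):
            out.append([x for blk in grid_row for x in blk[r]])
    return out

def summa(A, B, g):
    """Simulate SUMMA: g broadcast rounds; processor (i, j) owns C-tile (i, j)."""
    At, Bt = tile(A, g), tile(B, g)
    t = len(At[0][0])
    C = [[[[0] * t for _ in range(t)] for _ in range(g)] for _ in range(g)]
    for k in range(g):            # one row/column broadcast round per k
        for i in range(g):
            for j in range(g):
                # processor (i, j) receives A[i][k] and B[k][j], accumulates locally
                C[i][j] = add(C[i][j], matmul(At[i][k], Bt[k][j]))
    return untile(C)
```

In DISTAL itself, the analogous decisions — where each tile lives, and which processor executes each step of the reduction over k — would be expressed separately through the format and scheduling languages, which is what lets one tensor algebra statement cover algorithms ranging from Cannon's algorithm to COSMA.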


Cited By

  • (2025) PartIR: Composing SPMD Partitioning Strategies for Machine Learning. ASPLOS '25, 794–810. https://doi.org/10.1145/3669940.3707284
  • (2025) Composing Distributed Computations Through Task and Kernel Fusion. ASPLOS '25, 182–197. https://doi.org/10.1145/3669940.3707216
  • (2024) Scaling Deep Learning Computation over the Inter-Core Connected Intelligence Processor with T10. SOSP '24, 505–521. https://doi.org/10.1145/3694715.3695955
  • (2024) (De/Re)-Composition of Data-Parallel Computations via Multi-Dimensional Homomorphisms. ACM TOPLAS 46(3), 1–74. https://doi.org/10.1145/3665643
  • (2024) Compilation of Modular and General Sparse Workspaces. Proc. ACM Program. Lang. 8(PLDI), 1213–1238. https://doi.org/10.1145/3656426
  • (2024) Minimum Cost Loop Nests for Contraction of a Sparse Tensor with a Tensor Network. SPAA '24, 169–181. https://doi.org/10.1145/3626183.3659985
  • (2024) A Planner for Scalable Tensor Programs. IEEE BigData 2024, 54–63. https://doi.org/10.1109/BigData62323.2024.10825779
  • (2023) FFTX-IRIS: Towards Performance Portability and Heterogeneity for SPIRAL Generated Code. SC '23 Workshops, 1635–1641. https://doi.org/10.1145/3624062.3624242
  • (2023) Mosaic: An Interoperable Compiler for Tensor Algebra. Proc. ACM Program. Lang. 7(PLDI), 394–419. https://doi.org/10.1145/3591236
  • (2023) Automatic Generation of Distributed-Memory Mappings for Tensor Computations. SC '23, 1–13. https://doi.org/10.1145/3581784.3607096


Published In

PLDI 2022: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation
June 2022, 1038 pages
ISBN: 9781450392655
DOI: 10.1145/3519939
General Chair: Ranjit Jhala; Program Chair: Işil Dillig

This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Compilers
  2. Distributed Systems
  3. High Performance Computing

Qualifiers

  • Research-article

Conference

PLDI '22

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%


Article Metrics

  • Downloads (last 12 months): 393
  • Downloads (last 6 weeks): 43

Reflects downloads up to 11 Feb 2025

