Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory

Published: 17 February 2021

Editorial Notes

The authors have requested minor, non-substantive changes to the VoR and, in accordance with ACM policies, a Corrected VoR was published on March 1, 2021. For reference purposes the VoR may still be accessed via the Supplemental Material section on this page.

Abstract

Sparse tensor contractions appear in many applications. Efficiently computing the product of two sparse tensors is challenging: it not only inherits the challenges of sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory access and an output size that is unknown before computation, but also raises new challenges because of the high dimensionality of tensors, expensive multi-dimensional index search, and massive intermediate and output data. To address these challenges, we introduce three optimization techniques: a multi-dimensional, efficient hashtable representation for the accumulator, the same representation for the larger input tensor, and all-stage parallelization. Evaluating with 15 datasets, we show that Sparta brings a 28-576× speedup over traditional sparse tensor contraction with a sparse accumulator. With our proposed algorithm- and memory heterogeneity-aware data management, Sparta brings extra performance improvement on heterogeneous memory with DRAM and Intel Optane DC Persistent Memory Module (PMM) over a state-of-the-art software-based data management solution, a hardware-based data management solution, and PMM-only by 30.7% (up to 98.5%), 10.7% (up to 28.3%), and 17% (up to 65.1%), respectively.
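To make the hashtable idea in the abstract concrete, here is a minimal, single-threaded sketch of an element-wise sparse tensor contraction. It is not Sparta's implementation: the function name `sparse_contract`, the dict-of-index-tuples tensor format, and the choice to treat `y` as the hashed (larger) input are all assumptions made for illustration, and the parallelization and memory placement described in the abstract are left out.

```python
from collections import defaultdict


def sparse_contract(x, y, x_modes, y_modes):
    """Contract sparse tensors x and y over the given mode pairs.

    x, y: dicts mapping an index tuple to a nonzero value (hypothetical format).
    x_modes, y_modes: positions of the contracted modes in x and in y.
    Returns a dict keyed by (free indices of x) + (free indices of y).
    """
    if not x or not y:
        return {}
    x_order = len(next(iter(x)))
    y_order = len(next(iter(y)))
    x_free = [m for m in range(x_order) if m not in x_modes]
    y_free = [m for m in range(y_order) if m not in y_modes]

    # Hash table over y, keyed by its contracted indices, so that each
    # multi-dimensional index search becomes a single lookup.
    y_table = defaultdict(list)
    for idx, val in y.items():
        key = tuple(idx[m] for m in y_modes)
        y_table[key].append((tuple(idx[m] for m in y_free), val))

    # Hash table accumulator keyed by the output index; it grows on demand
    # because the output size is unknown before the computation.
    acc = defaultdict(float)
    for idx, val in x.items():
        matches = y_table.get(tuple(idx[m] for m in x_modes))
        if matches is None:
            continue  # no nonzero of y shares these contracted indices
        x_part = tuple(idx[m] for m in x_free)
        for y_part, y_val in matches:
            acc[x_part + y_part] += val * y_val
    return dict(acc)


# Z[i, j, l] = sum over k of X[i, j, k] * Y[k, l]
x = {(0, 0, 0): 1.0, (0, 0, 1): 0.5, (1, 0, 1): 3.0}
y = {(0, 0): 4.0, (1, 0): 5.0, (1, 1): 6.0}
z = sparse_contract(x, y, x_modes=[2], y_modes=[0])
print(z)
```

In this toy input, two nonzeros of `x` share the free indices `(0, 0)`, so their products land in the same accumulator slot, which is the case a sparse accumulator has to handle.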

Supplementary Material

3441581-vor (3441581-vor.pdf)
Version of Record for "Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory" by Liu et al., Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '21).


Published In

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2021
507 pages
ISBN: 9781450382946
DOI: 10.1145/3437801
© 2021 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. heterogeneous memory
  2. multicore CPU
  3. non-volatile memory
  4. sparse tensor contraction
  5. tensor product

Qualifiers

  • Research-article

Funding Sources

  • US Department of Energy
  • US National Science Foundation

Conference

PPoPP '21

Acceptance Rates

PPoPP '21 Paper Acceptance Rate 31 of 150 submissions, 21%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%

Article Metrics

  • Downloads (last 12 months): 241
  • Downloads (last 6 weeks): 30
Reflects downloads up to 09 Jan 2025

Cited By

  • (2024) FlexMem. Proceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference, pages 817-833. DOI: 10.5555/3691992.3692042. Online publication date: 10-Jul-2024.
  • (2024) SparseAuto: An Auto-scheduler for Sparse Tensor Computations using Recursive Loop Nest Restructuring. Proceedings of the ACM on Programming Languages, 8:OOPSLA2, pages 527-556. DOI: 10.1145/3689730. Online publication date: 8-Oct-2024.
  • (2024) CoNST: Code Generator for Sparse Tensor Networks. ACM Transactions on Architecture and Code Optimization, 21:4, pages 1-24. DOI: 10.1145/3689342. Online publication date: 20-Nov-2024.
  • (2024) POSTER: Optimizing Sparse Tensor Contraction with Revisiting Hash Table Design. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 457-459. DOI: 10.1145/3627535.3638500. Online publication date: 2-Mar-2024.
  • (2024) Minimum Cost Loop Nests for Contraction of a Sparse Tensor with a Tensor Network. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, pages 169-181. DOI: 10.1145/3626183.3659985. Online publication date: 17-Jun-2024.
  • (2024) BCB-SpTC: An Efficient Sparse High-Dimensional Tensor Contraction Employing Tensor Core Acceleration. IEEE Transactions on Parallel and Distributed Systems, 35:12, pages 2435-2448. DOI: 10.1109/TPDS.2024.3477746. Online publication date: Dec-2024.
  • (2024) Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction. IEEE Transactions on Parallel and Distributed Systems, 35:6, pages 1044-1055. DOI: 10.1109/TPDS.2024.3391254. Online publication date: Jun-2024.
  • (2024) Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-18. DOI: 10.1109/SC41406.2024.00100. Online publication date: 17-Nov-2024.
  • (2023) A Tensor Marshaling Unit for Sparse Tensor Algebra on General-Purpose Processors. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, pages 1332-1346. DOI: 10.1145/3613424.3614284. Online publication date: 28-Oct-2023.
  • (2023) Merchandiser. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 204-217. DOI: 10.1145/3572848.3577497. Online publication date: 25-Feb-2023.