DOI: 10.1145/3447818.3460355
research-article
Open access

Athena: high-performance sparse tensor contraction sequence on heterogeneous memory

Published: 04 June 2021

Abstract

Sparse tensor contraction (SpTC) sequences are widely employed in many fields, such as chemistry and physics. Implementing such sequences efficiently, however, faces multiple challenges: redundant computation and memory operations, massive memory consumption, and inefficient utilization of hardware. To address these challenges, we introduce Athena, a high-performance framework for SpTC sequences. Athena introduces new data structures, leverages the emerging Optane-based heterogeneous memory (HM) architecture, and adopts stage parallelism. In particular, Athena introduces a shared hash-table-represented sparse accumulator to eliminate unnecessary input processing and data migration; it uses a novel data-semantic-guided dynamic migration solution to make the best use of Optane-based HM for high performance; and it co-runs execution phases with different characteristics to achieve high hardware utilization. Evaluated on 12 datasets, Athena delivers 327-7362× speedup over the state-of-the-art SpTC algorithm. With dynamic data placement guided by data semantics, Athena outperforms a state-of-the-art software-based data management solution, a hardware-based data management solution, and a PMM-only configuration on Optane-based HM by 1.58×, 1.82×, and 2.34×, respectively. Athena also demonstrates its effectiveness in quantum chemistry and physics scenarios.
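
To make the core idea concrete, the sketch below implements a hash-table-based sparse accumulator for a single sparse tensor contraction: two COO-format tensors are contracted over one shared mode, and partial products are accumulated in a hash table keyed by the packed output coordinates. This is a minimal illustration of the general technique under simplifying assumptions, not Athena's actual data structure or code; the NnzA/NnzB layouts, the 16-bit index limit, and the 64-bit key packing are hypothetical choices made for brevity.

// Minimal sketch: hash-table sparse accumulator for one SpTC step.
// Computes C(i, j, l, m) = sum_k A(i, j, k) * B(k, l, m).
// Assumes COO input and indices that each fit in 16 bits (illustrative only).
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

struct NnzA { uint16_t i, j, k; double v; };  // one nonzero of A(i, j, k)
struct NnzB { uint16_t k, l, m; double v; };  // one nonzero of B(k, l, m)

// Pack the four output coordinates into a single 64-bit hash key.
static uint64_t key(uint16_t i, uint16_t j, uint16_t l, uint16_t m) {
    return (uint64_t(i) << 48) | (uint64_t(j) << 32) |
           (uint64_t(l) << 16) |  uint64_t(m);
}

std::unordered_map<uint64_t, double>
contract(const std::vector<NnzA>& A, const std::vector<NnzB>& B) {
    // Bucket B's nonzeros by the shared mode k so each A nonzero only
    // visits matching B nonzeros.
    std::unordered_map<uint16_t, std::vector<NnzB>> b_by_k;
    for (const auto& b : B) b_by_k[b.k].push_back(b);

    std::unordered_map<uint64_t, double> acc;  // sparse accumulator
    for (const auto& a : A) {
        auto it = b_by_k.find(a.k);
        if (it == b_by_k.end()) continue;
        for (const auto& b : it->second)
            acc[key(a.i, a.j, b.l, b.m)] += a.v * b.v;  // accumulate partial products
    }
    return acc;
}

int main() {
    std::vector<NnzA> A = {{0, 1, 2, 3.0}, {0, 1, 5, 2.0}};
    std::vector<NnzB> B = {{2, 4, 7, 0.5}, {5, 4, 7, 1.5}};
    for (const auto& [k, v] : contract(A, B))
        std::cout << std::hex << k << " -> " << v << "\n";  // both products hit C(0,1,4,7)
}

In Athena the accumulator is shared, which the abstract credits with eliminating unnecessary input processing and data migration across the sequence; the sketch above covers only one contraction step and uses a plain std::unordered_map rather than a shared, concurrency-aware table.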




Information & Contributors

Published In

ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing
June 2021
506 pages
ISBN:9781450383356
DOI:10.1145/3447818
© 2021 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 June 2021


Author Tags

  1. heterogeneous memory
  2. multi-core CPU
  3. non-volatile memory
  4. sparse tensor contraction sequences
  5. sparse tensor product

Qualifiers

  • Research-article

Funding Sources

  • U.S. Department of Energy, Office of Advanced Scientific Computing Research (ASCR)
  • U.S. National Science Foundation
  • Chameleon Cloud
  • Laboratory Directed Research and Development program at PNNL

Conference

ICS '21

Acceptance Rates

ICS '21 Paper Acceptance Rate: 39 of 157 submissions, 25%
Overall Acceptance Rate: 629 of 2,180 submissions, 29%

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 201
  • Downloads (Last 6 weeks): 39
Reflects downloads up to 09 Jan 2025

Cited By

  • (2024) SparseAuto: An Auto-scheduler for Sparse Tensor Computations using Recursive Loop Nest Restructuring. Proceedings of the ACM on Programming Languages 8(OOPSLA2), 527-556. DOI: 10.1145/3689730. Online publication date: 8-Oct-2024
  • (2024) CoNST: Code Generator for Sparse Tensor Networks. ACM Transactions on Architecture and Code Optimization 21(4), 1-24. DOI: 10.1145/3689342. Online publication date: 20-Nov-2024
  • (2024) POSTER: Optimizing Sparse Tensor Contraction with Revisiting Hash Table Design. Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 457-459. DOI: 10.1145/3627535.3638500. Online publication date: 2-Mar-2024
  • (2024) Efficient Utilization of Multi-Threading Parallelism on Heterogeneous Systems for Sparse Tensor Contraction. IEEE Transactions on Parallel and Distributed Systems 35(6), 1044-1055. DOI: 10.1109/TPDS.2024.3391254. Online publication date: Jun-2024
  • (2023) Data Integration and Harmonisation. Clinical Applications of Artificial Intelligence in Real-World Data, 51-67. DOI: 10.1007/978-3-031-36678-9_4. Online publication date: 5-Nov-2023
  • (2022) SparseLNR. Proceedings of the 36th ACM International Conference on Supercomputing, 1-14. DOI: 10.1145/3524059.3532386. Online publication date: 28-Jun-2022
  • (2022) GSpTC: High-Performance Sparse Tensor Contraction on CPU-GPU Heterogeneous Systems. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 380-387. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00080. Online publication date: Dec-2022
  • (2021) Single-node partitioned-memory for huge graph analytics. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-14. DOI: 10.1145/3458817.3476156. Online publication date: 14-Nov-2021
