
fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms

Published: 11 April 2022

Abstract

Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental operations in many high-performance scientific and engineering applications. Its inherent irregularity and poor data locality pose two main challenges to scaling SpMSpV on high-performance computing (HPC) systems: (i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the irregular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained parallel SpMSpV (fgSpMSpV) framework on the Sunway TaihuLight supercomputer that alleviates these challenges for large-scale real-world applications. First, fgSpMSpV adopts an MPI\(+\)OpenMP\(+X\) parallelization model to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and to accelerate both pre-/post-processing and the main SpMSpV computation. Second, fgSpMSpV utilizes an adaptive parallel execution to reduce the pre-processing cost and adapt to the parallelism and memory hierarchy of the Sunway system, while still taming redundant and random memory accesses in SpMSpV, through a set of techniques including a fine-grained partitioner, a re-collection method, and the Compressed Sparse Column Vector (CSCV) matrix format. Third, fgSpMSpV applies several optimization techniques to further utilize the computing resources. On the Sunway TaihuLight, fgSpMSpV gains a noticeable performance improvement from the key optimization techniques across inputs of various sparsity. Additionally, fgSpMSpV is implemented on an NVIDIA Tesla P100 GPU and applied to the breadth-first search (BFS) application. On the P100 GPU, fgSpMSpV achieves speedups of up to \(134.38\times\) over state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves speedups of up to \(21.68\times\) over state-of-the-art implementations.
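To make the kernel concrete: a generic column-driven SpMSpV iterates only over the columns selected by the nonzeros of the sparse input vector and scatters scaled columns into an accumulator, which is also how BFS frontier expansion maps onto SpMSpV. The sketch below is a minimal illustration of that general idea in plain Python over a CSC matrix, not the paper's fgSpMSpV/CSCV implementation; the function names, the dict-based sparse-vector representation, and the edge encoding are assumptions made for the example.

```python
# Hypothetical sketch: column-driven SpMSpV over a CSC matrix.
# CSC layout: column j's nonzeros live at positions col_ptr[j]..col_ptr[j+1]-1
# of row_idx (row indices) and vals (values).

def spmspv(col_ptr, row_idx, vals, x):
    """Compute y = A * x for sparse x.

    x is a dict {column_index: value}; only columns with a nonzero in x
    are touched, which is the defining trait of SpMSpV versus SpMV.
    Returns y as a dict {row_index: value}.
    """
    y = {}
    for j, xj in x.items():
        for k in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[k]
            y[i] = y.get(i, 0.0) + vals[k] * xj
    return y


def bfs_levels(col_ptr, row_idx, src):
    """BFS expressed as repeated boolean-semiring SpMSpV.

    Assumes A[i, j] = 1 encodes edge j -> i, so column j lists the
    out-neighbors of vertex j. Each iteration expands the sparse
    frontier (the sparse vector) and masks out visited vertices.
    """
    level = {src: 0}
    frontier = {src}
    depth = 0
    while frontier:
        depth += 1
        nxt = set()
        for j in frontier:                      # nonzeros of the frontier vector
            for k in range(col_ptr[j], col_ptr[j + 1]):
                i = row_idx[k]
                if i not in level:              # mask: skip visited vertices
                    level[i] = depth
                    nxt.add(i)
        frontier = nxt
    return level
```

For a 3x3 matrix with columns 0 and 2 selected by `x`, `spmspv` touches only those two columns' nonzeros; the random scatter into `y` is exactly the irregular-access pattern the abstract says fgSpMSpV's partitioning and CSCV format are designed to tame.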

References

[1]
Khalid Ahmad, Hari Sundar, and Mary W. Hall. 2020. Data-driven mixed precision sparse matrix vector multiplication for GPUs. ACM Trans. Archit. Code Optim. 16, 4 (2020), 51:1–51:24.
[2]
Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting locality in sparse matrix-matrix multiplication on many-core architectures. IEEE Trans. Parallel Distrib. Syst. 28, 8 (2017), 2258–2271.
[3]
Kadir Akbudak, R. Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-matrix multiplication. ACM Trans. Parallel Computing 4, 3 (2018), 13:1–13:34.
[4]
Michael J. Anderson, Narayanan Sundaram, Nadathur Satish, Md. Mostofa Ali Patwary, Theodore L. Willke, and Pradeep Dubey. 2016. GraphPad: Optimized graph primitives for parallel and distributed platforms. In Proceedings of the International Parallel and Distributed Processing Symposium. 313–322.
[5]
Ariful Azad and Aydin Buluç. 2016. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the International Parallel and Distributed Processing Symposium. 32–42.
[6]
Ariful Azad and Aydin Buluç. 2017. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. In Proceedings of the International Parallel and Distributed Processing Symposium. 688–697.
[7]
Ariful Azad and Aydin Buluç. 2019. LACC: A linear-algebraic algorithm for finding connected components in distributed memory. In Proceedings of the International Parallel and Distributed Processing Symposium. 2–12.
[8]
Ariful Azad, Aydin Buluç, Xiaoye S. Li, Xinliang Wang, and Johannes Langguth. 2020. A distributed-memory algorithm for computing a heavy-weight perfect matching on bipartite graphs. SIAM J. Sci. Comput. 42, 4 (2020), C143–C168.
[9]
Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2016. Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Trans. Parallel Computing 3, 3 (2016), 18:1–18:34.
[10]
Akrem Benatia, Weixing Ji, Yizhuo Wang, and Feng Shi. 2018. BestSF: A sparse meta-format for optimizing SpMV on GPU. ACM Trans. Archit. Code Optim. 15, 3 (2018), 29:1–29:27.
[11]
Thierry P. Berger, Julien Francq, Marine Minier, and Gaël Thomas. 2016. Extended generalized feistel networks using matrix representation to propose a new lightweight block cipher: Lilliput. IEEE Trans. Computers 65, 7 (2016), 2074–2089.
[12]
Aysenur Bilgin, Hani Hagras, Joy van Helvert, and Daniyal M. Al-Ghazzawi. 2016. A linear general type-2 fuzzy-logic-based computing with words approach for realizing an ambient intelligent platform for cooking recipe recommendation. IEEE Trans. Fuzzy Systems 24, 2 (2016), 306–329.
[13]
Aydin Buluç, Erika Duriakova, Armando Fox, John R. Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, and Samuel Williams. 2013. High-productivity and high-performance analysis of filtered semantic graphs. In Proceedings of the International Parallel and Distributed Processing Symposium. 237–248.
[14]
Aydin Buluç and Kamesh Madduri. 2011. Parallel breadth-first search on distributed memory systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 65:1–65:12.
[15]
Paolo Campigotto, Christian Rudloff, Maximilian Leodolter, and Dietmar Bauer. 2017. Personalized and situation-aware multimodal route recommendations: The FAVOUR algorithm. IEEE Trans. Intelligent Transportation Systems 18, 1 (2017), 92–102.
[16]
Yuedan Chen, Kenli Li, Wangdong Yang, Guoqing Xiao, Xianghui Xie, and Tao Li. 2019. Performance-aware model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer. IEEE Trans. Parallel Distrib. Syst. 30, 4 (2019), 923–938.
[17]
Yuedan Chen, Guoqing Xiao, M. Tamer Özsu, Chubo Liu, Albert Y. Zomaya, and Tao Li. 2020. aeSpTV: An adaptive and efficient framework for sparse tensor-vector product kernel on a high-performance computing platform. IEEE Trans. Parallel Distrib. Syst. 31, 10 (2020), 2329–2345.
[18]
Yuedan Chen, Guoqing Xiao, Fan Wu, and Zhuo Tang. 2019. Towards large-scale sparse matrix-vector multiplication on the SW26010 manycore architecture. In Proceedings of the 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems. IEEE, 1469–1476.
[19]
Steven Dalton, Luke N. Olson, and Nathan Bell. 2015. Optimizing sparse matrix-matrix multiplication for the GPU. ACM Trans. Math. Softw. 41, 4 (2015), 25:1–25:20.
[20]
Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38, 1 (2011), 1:1–1:25.
[21]
Fred G. Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Trans. Math. Softw. 4, 3 (1978), 250–269.
[22]
Lixin He, Hong An, Chao Yang, Fei Wang, Junshi Chen, Chao Wang, Weihao Liang, Shao-Jun Dong, Qiao Sun, Wenting Han, Wenyuan Liu, Yongjian Han, and Wenjun Yao. 2018. PEPS++: Towards extreme-scale simulations of strongly correlated quantum many-particle models on Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 29, 12 (2018), 2838–2848.
[23]
Yong-Yeon Jo, Sang-Wook Kim, and Duck-Ho Bae. 2015. Efficient sparse matrix multiplication on GPU for large social network analysis. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 1261–1270.
[24]
Byeongho Kim, Jongwook Chung, Eojin Lee, Wonkyung Jung, Sunjung Lee, Jaewan Choi, Jaehyun Park, Minbok Wi, Sukhan Lee, and Jung Ho Ahn. 2020. MViD: Sparse matrix-vector multiplication in mobile DRAM for accelerating recurrent neural networks. IEEE Trans. Computers 69, 7 (2020), 955–967.
[25]
Daniel Langr and Pavel Tvrdík. 2016. Evaluation criteria for sparse matrix storage formats. IEEE Trans. Parallel Distrib. Syst. 27, 2 (2016), 428–440.
[26]
Min Li, Yulong Ao, and Chao Yang. 2021. Adaptive SpMV/SpMSpV on GPUs for input vectors of varied sparsity. IEEE Trans. Parallel Distrib. Syst. 32, 7 (2021), 1842–1853.
[27]
Siqiang Luo. 2019. Distributed PageRank computation: An improved theoretical study. In Proceedings of the AAAI Conference on Artificial Intelligence. 4496–4503.
[28]
Xin Luo, Mengchu Zhou, Shuai Li, Lun Hu, and Mingsheng Shang. 2020. Non-negativity constrained missing data estimation for high-dimensional and sparse matrices from industrial applications. IEEE Trans. on Cybernetics 50, 5 (2020), 1844–1855.
[29]
Xin Luo, Mengchu Zhou, Shuai Li, Zhu-Hong You, Yunni Xia, and Qingsheng Zhu. 2016. A nonnegative latent factor model for large-scale sparse matrices in recommender systems via alternating direction method. IEEE Trans. Neural Netw. Learning Syst. 27, 3 (2016), 579–592.
[30]
Asit K. Mishra, Eriko Nurvitadhi, Ganesh Venkatesh, Jonathan Pearce, and Debbie Marr. 2017. Fine-grained accelerators for sparse machine learning workloads. In Proceedings of the 22nd Asia and South Pacific Design Automation Conference. 635–640.
[31]
Muhammet Mustafa Ozdal. 2019. Improving efficiency of parallel vertex-centric algorithms for irregular graphs. IEEE Trans. Parallel Distrib. Syst. 30, 10 (2019), 2265–2282.
[32]
Ryan A. Rossi and Nesreen K. Ahmed. 2015. The network data repository with interactive graph analytics and visualization. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. 4292–4293.
[33]
Liang Sun, Shuiwang Ji, and Jieping Ye. 2009. A least squares formulation for a class of generalized eigenvalue problems in machine learning. In Proceedings of the 26th Annual International Conference on Machine Learning. 977–984.
[34]
Narayanan Sundaram, Nadathur Satish, Md. Mostofa Ali Patwary, Subramanya Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. GraphMat: High performance graph analytics made productive. PVLDB 8, 11 (2015), 1214–1225.
[35]
Guoqing Xiao, Yuedan Chen, Chubo Liu, and Xu Zhou. 2020. ahSpMV: An auto-tuning hybrid computing scheme for SpMV on the Sunway architecture. IEEE Internet of Things Journal 7, 3 (2020), 1736–1744.
[36]
Guoqing Xiao, Kenli Li, Yuedan Chen, Wangquan He, Albert Zomaya, and Tao Li. 2021. CASpMV: A customized and accelerative SpMV framework for the Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 32, 1 (2021), 131–146.
[37]
Carl Yang, Yangzihao Wang, and John D. Owens. 2015. Fast sparse matrix and sparse vector multiplication algorithm on the GPU. In Proceedings of the International Parallel and Distributed Processing Symposium. 841–847.
[38]
Leonid Yavits and Ran Ginosar. 2018. Accelerator for sparse machine learning. IEEE Comput. Archit. Lett. 17, 1 (2018), 21–24.
[39]
Yongzhe Zhang, Ariful Azad, and Zhenjiang Hu. 2020. FastSV: A distributed-memory connected component algorithm with fast convergence. In Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP). 46–57.
[40]
Yunquan Zhang, Shigang Li, Shengen Yan, and Huiyang Zhou. 2016. A cross-platform SpMV framework on many-core architectures. ACM Trans. Archit. Code Optim. 13, 4 (2016), 33:1–33:25.
[41]
Weijie Zhou, Yue Zhao, Xipeng Shen, and Wang Chen. 2020. Enabling runtime SpMV format selection through an overhead conscious method. IEEE Trans. Parallel Distrib. Syst. 31, 1 (2020), 80–93.
[42]
Alwin Zulehner and Robert Wille. 2019. Matrix-vector vs. matrix-matrix multiplication: Potential in DD-based simulation of quantum computations. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition. 90–95.


    Published In

    ACM Transactions on Parallel Computing, Volume 9, Issue 2
    June 2022, 130 pages
    ISSN: 2329-4949
    EISSN: 2329-4957
    DOI: 10.1145/3514177

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 April 2022
    Accepted: 01 December 2021
    Revised: 01 December 2021
    Received: 01 October 2020
    Published in TOPC Volume 9, Issue 2


    Author Tags

    1. Heterogeneous
    2. HPC
    3. manycore
    4. optimization
    5. parallelism
    6. SpMSpV

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key R&D Programs of China
    • Programs of National Natural Science Foundation of China
    • Programs of Hunan Province, China
    • Programs of China Postdoctoral Council
    • Program of Zhejiang Lab
    • General Program of Fundamental Research of Shenzhen

    Article Metrics

    • Downloads (last 12 months): 117
    • Downloads (last 6 weeks): 12
    Reflects downloads up to 15 Jan 2025


    Cited By

    • Machine Learning-Based Kernel Selector for SpMV Optimization in Graph Analysis. ACM Transactions on Parallel Computing 11, 2 (2024), 1–25. DOI: 10.1145/3652579
    • APPQ-CNN: An Adaptive CNNs Inference Accelerator for Synergistically Exploiting Pruning and Quantization Based on FPGA. IEEE Transactions on Sustainable Computing 9, 6 (2024), 874–888. DOI: 10.1109/TSUSC.2024.3382157
    • A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56, 1 (2023), 1–38. DOI: 10.1145/3604606
    • A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN. ACM Transactions on Parallel Computing 10, 2 (2023), 1–23. DOI: 10.1145/3584373
    • Redesign and Accelerate the AIREBO Bond-Order Potential on the New Sunway Supercomputer. IEEE Transactions on Parallel and Distributed Systems 34, 12 (2023), 3117–3132. DOI: 10.1109/TPDS.2023.3321927
    • ESA: An efficient sequence alignment algorithm for biological database search on Sunway TaihuLight. Parallel Computing 117 (2023), 103043. DOI: 10.1016/j.parco.2023.103043
    • Accelerating Electromagnetic Field Simulations Based on Memory-Optimized CPML-FDTD with OpenACC. Applied Sciences 12, 22 (2022), 11430. DOI: 10.3390/app122211430
    • DTSpMV: An Adaptive SpMV Framework for Graph Analysis on GPUs. In Proceedings of the 2022 IEEE 24th International Conference on High Performance Computing and Communications (HPCC/DSS/SmartCity/DependSys). 35–42. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00039
