
fgSpMSpV: A Fine-grained Parallel SpMSpV Framework on HPC Platforms

Published: 11 April 2022

Abstract

Sparse matrix-sparse vector (SpMSpV) multiplication is one of the fundamental operations in many high-performance scientific and engineering applications. Its inherent irregularity and poor data locality pose two main challenges to scaling SpMSpV on high-performance computing (HPC) systems: (i) a large amount of redundant data limits the utilization of bandwidth and parallel resources; (ii) the irregular access pattern limits the exploitation of computing resources. This paper proposes a fine-grained parallel SpMSpV (fgSpMSpV) framework on the Sunway TaihuLight supercomputer that alleviates these challenges for large-scale real-world applications. First, fgSpMSpV adopts an MPI\(+\)OpenMP\(+X\) parallelization model to exploit the multi-stage and hybrid parallelism of heterogeneous HPC architectures and to accelerate both pre-/post-processing and the main SpMSpV computation. Second, fgSpMSpV utilizes an adaptive parallel execution to reduce the pre-processing cost and adapt to the parallelism and memory hierarchy of the Sunway system, while still taming redundant and random memory accesses in SpMSpV, through a set of techniques including a fine-grained partitioner, a re-collection method, and the Compressed Sparse Column Vector (CSCV) matrix format. Third, fgSpMSpV applies several optimization techniques to further utilize the computing resources. On the Sunway TaihuLight, fgSpMSpV gains a noticeable performance improvement from the key optimization techniques across inputs of various sparsity. Additionally, fgSpMSpV is implemented on an NVIDIA Tesla P100 GPU and applied to the breadth-first search (BFS) application. On the P100 GPU, fgSpMSpV achieves speedups of up to \(134.38\times\) over state-of-the-art SpMSpV algorithms, and the BFS application using fgSpMSpV achieves speedups of up to \(21.68\times\) over state-of-the-art implementations.
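To make the kernel concrete: a generic column-driven SpMSpV iterates only over the columns selected by the nonzeros of the sparse input vector and scatters scaled columns into an accumulator, which is also how BFS frontier expansion maps onto SpMSpV. The sketch below is a minimal illustration of that general idea in plain Python over a CSC matrix, not the paper's fgSpMSpV/CSCV implementation; the function names, the dict-based sparse-vector representation, and the edge encoding are assumptions made for the example.

```python
# Hypothetical sketch: column-driven SpMSpV over a CSC matrix.
# CSC layout: column j's nonzeros live at positions col_ptr[j]..col_ptr[j+1]-1
# of row_idx (row indices) and vals (values).

def spmspv(col_ptr, row_idx, vals, x):
    """Compute y = A * x for sparse x.

    x is a dict {column_index: value}; only columns with a nonzero in x
    are touched, which is the defining trait of SpMSpV versus SpMV.
    Returns y as a dict {row_index: value}.
    """
    y = {}
    for j, xj in x.items():
        for k in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[k]
            y[i] = y.get(i, 0.0) + vals[k] * xj
    return y


def bfs_levels(col_ptr, row_idx, src):
    """BFS expressed as repeated boolean-semiring SpMSpV.

    Assumes A[i, j] = 1 encodes edge j -> i, so column j lists the
    out-neighbors of vertex j. Each iteration expands the sparse
    frontier (the sparse vector) and masks out visited vertices.
    """
    level = {src: 0}
    frontier = {src}
    depth = 0
    while frontier:
        depth += 1
        nxt = set()
        for j in frontier:                      # nonzeros of the frontier vector
            for k in range(col_ptr[j], col_ptr[j + 1]):
                i = row_idx[k]
                if i not in level:              # mask: skip visited vertices
                    level[i] = depth
                    nxt.add(i)
        frontier = nxt
    return level
```

For a 3x3 matrix with columns 0 and 2 selected by `x`, `spmspv` touches only those two columns' nonzeros; the random scatter into `y` is exactly the irregular-access pattern the abstract says fgSpMSpV's partitioning and CSCV format are designed to tame.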

References

[1]
Khalid Ahmad, Hari Sundar, and Mary W. Hall. 2020. Data-driven mixed precision sparse matrix vector multiplication for GPUs. ACM Trans. Archit. Code Optim. 16, 4 (2020), 51:1–51:24.
[2]
Kadir Akbudak and Cevdet Aykanat. 2017. Exploiting locality in sparse matrix-matrix multiplication on many-core architectures. IEEE Trans. Parallel Distrib. Syst. 28, 8 (2017), 2258–2271.
[3]
Kadir Akbudak, R. Oguz Selvitopi, and Cevdet Aykanat. 2018. Partitioning models for scaling parallel sparse matrix-matrix multiplication. ACM Trans. Parallel Computing 4, 3 (2018), 13:1–13:34.
[4]
Michael J. Anderson, Narayanan Sundaram, Nadathur Satish, Md. Mostofa Ali Patwary, Theodore L. Willke, and Pradeep Dubey. 2016. GraphPad: Optimized graph primitives for parallel and distributed platforms. In Proceedings of the International Parallel and Distributed Processing Symposium. 313–322.
[5]
Ariful Azad and Aydin Buluç. 2016. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the International Parallel and Distributed Processing Symposium. 32–42.
[6]
Ariful Azad and Aydin Buluç. 2017. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. In Proceedings of the International Parallel and Distributed Processing Symposium. 688–697.
[7]
Ariful Azad and Aydin Buluç. 2019. LACC: A linear-algebraic algorithm for finding connected components in distributed memory. In Proceedings of the International Parallel and Distributed Processing Symposium. 2–12.
[8]
Ariful Azad, Aydin Buluç, Xiaoye S. Li, Xinliang Wang, and Johannes Langguth. 2020. A distributed-memory algorithm for computing a heavy-weight perfect matching on bipartite graphs. SIAM J. Sci. Comput. 42, 4 (2020), C143–C168.
[9]
Grey Ballard, Alex Druinsky, Nicholas Knight, and Oded Schwartz. 2016. Hypergraph partitioning for sparse matrix-matrix multiplication. ACM Trans. Parallel Computing 3, 3 (2016), 18:1–18:34.
[10]
Akrem Benatia, Weixing Ji, Yizhuo Wang, and Feng Shi. 2018. BestSF: A sparse meta-format for optimizing SpMV on GPU. ACM Trans. Archit. Code Optim. 15, 3 (2018), 29:1–29:27.
[11]
Thierry P. Berger, Julien Francq, Marine Minier, and Gaël Thomas. 2016. Extended generalized feistel networks using matrix representation to propose a new lightweight block cipher: Lilliput. IEEE Trans. Computers 65, 7 (2016), 2074–2089.
[12]
Aysenur Bilgin, Hani Hagras, Joy van Helvert, and Daniyal M. Al-Ghazzawi. 2016. A linear general type-2 fuzzy-logic-based computing with words approach for realizing an ambient intelligent platform for cooking recipe recommendation. IEEE Trans. Fuzzy Systems 24, 2 (2016), 306–329.
[13]
Aydin Buluç, Erika Duriakova, Armando Fox, John R. Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, and Samuel Williams. 2013. High-productivity and high-performance analysis of filtered semantic graphs. In Proceedings of the International Parallel and Distributed Processing Symposium. 237–248.
[14]
Aydin Buluç and Kamesh Madduri. 2011. Parallel breadth-first search on distributed memory systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. 65:1–65:12.
[15]
Paolo Campigotto, Christian Rudloff, Maximilian Leodolter, and Dietmar Bauer. 2017. Personalized and situation-aware multimodal route recommendations: The FAVOUR algorithm. IEEE Trans. Intelligent Transportation Systems 18, 1 (2017), 92–102.
[16]
Yuedan Chen, Kenli Li, Wangdong Yang, Guoqing Xiao, Xianghui Xie, and Tao Li. 2019. Performance-aware model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer. IEEE Trans. Parallel Distrib. Syst. 30, 4 (2019), 923–938.
[17]
Yuedan Chen, Guoqing Xiao, M. Tamer Özsu, Chubo Liu, Albert Y. Zomaya, and Tao Li. 2020. aeSpTV: An adaptive and efficient framework for sparse tensor-vector product kernel on a high-performance computing platform. IEEE Trans. Parallel Distrib. Syst. 31, 10 (2020), 2329–2345.
[18]
Yuedan Chen, Guoqing Xiao, Fan Wu, and Zhuo Tang. 2019. Towards large-scale sparse matrix-vector multiplication on the SW26010 manycore architecture. In Proceedings of the 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems. IEEE, 1469–1476.
[19]
Steven Dalton, Luke N. Olson, and Nathan Bell. 2015. Optimizing sparse matrix-matrix multiplication for the GPU. ACM Trans. Math. Softw. 41, 4 (2015), 25:1–25:20.
[20]
Timothy A. Davis and Yifan Hu. 2011. The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38, 1 (2011), 1:1–1:25.
[21]
Fred G. Gustavson. 1978. Two fast algorithms for sparse matrices: Multiplication and permuted transposition. ACM Trans. Math. Softw. 4, 3 (1978), 250–269.
[22]
Lixin He, Hong An, Chao Yang, Fei Wang, Junshi Chen, Chao Wang, Weihao Liang, Shao-Jun Dong, Qiao Sun, Wenting Han, Wenyuan Liu, Yongjian Han, and Wenjun Yao. 2018. PEPS++: Towards extreme-scale simulations of strongly correlated quantum many-particle models on Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 29, 12 (2018), 2838–2848.
[23]
Yong-Yeon Jo, Sang-Wook Kim, and Duck-Ho Bae. 2015. Efficient sparse matrix multiplication on GPU for large social network analysis. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 1261–1270.
[24]
Byeongho Kim, Jongwook Chung, Eojin Lee, Wonkyung Jung, Sunjung Lee, Jaewan Choi, Jaehyun Park, Minbok Wi, Sukhan Lee, and Jung Ho Ahn. 2020. MViD: Sparse matrix-vector multiplication in mobile DRAM for accelerating recurrent neural networks. IEEE Trans. Computers 69, 7 (2020), 955–967.
[25]
Daniel Langr and Pavel Tvrdík. 2016. Evaluation criteria for sparse matrix storage formats. IEEE Trans. Parallel Distrib. Syst. 27, 2 (2016), 428–440.
[26]
Min Li, Yulong Ao, and Chao Yang. 2021. Adaptive SpMV/SpMSpV on GPUs for input vectors of varied sparsity. IEEE Trans. Parallel Distrib. Syst. 32, 7 (2021), 1842–1853.
[27]
Siqiang Luo. 2019. Distributed PageRank computation: An improved theoretical study. In Proceedings of the AAAI Conference on Artificial Intelligence. 4496–4503.
[28]
Xin Luo, Mengchu Zhou, Shuai Li, Lun Hu, and Mingsheng Shang. 2020. Non-negativity constrained missing data estimation for high-dimensional and sparse matrices from industrial applications. IEEE Trans. on Cybernetics 50, 5 (2020), 1844–1855.
[29]
Xin Luo, Mengchu Zhou, Shuai Li, Zhu-Hong You, Yunni Xia, and Qingsheng Zhu. 2016. A nonnegative latent factor model for large-scale sparse matrices in recommender systems via alternating direction method. IEEE Trans. Neural Netw. Learning Syst. 27, 3 (2016), 579–592.
[30]
Asit K. Mishra, Eriko Nurvitadhi, Ganesh Venkatesh, Jonathan Pearce, and Debbie Marr. 2017. Fine-grained accelerators for sparse machine learning workloads. In Proceedings of the 22nd Asia and South Pacific Design Automation Conference. 635–640.
[31]
Muhammet Mustafa Ozdal. 2019. Improving efficiency of parallel vertex-centric algorithms for irregular graphs. IEEE Trans. Parallel Distrib. Syst. 30, 10 (2019), 2265–2282.
[32]
Ryan A. Rossi and Nesreen K. Ahmed. 2015. The network data repository with interactive graph analytics and visualization. In Proceedings of the 29th AAAI Conference on Artificial Intelligence. 4292–4293.
[33]
Liang Sun, Shuiwang Ji, and Jieping Ye. 2009. A least squares formulation for a class of generalized eigenvalue problems in machine learning. In Proceedings of the 26th Annual International Conference on Machine Learning. 977–984.
[34]
Narayanan Sundaram, Nadathur Satish, Md. Mostofa Ali Patwary, Subramanya Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. GraphMat: High performance graph analytics made productive. PVLDB 8, 11 (2015), 1214–1225.
[35]
Guoqing Xiao, Yuedan Chen, Chubo Liu, and Xu Zhou. 2020. ahSpMV: An auto-tuning hybrid computing scheme for SpMV on the Sunway architecture. IEEE Internet of Things Journal 7, 3 (2020), 1736–1744.
[36]
Guoqing Xiao, Kenli Li, Yuedan Chen, Wangquan He, Albert Zomaya, and Tao Li. 2021. CASpMV: A customized and accelerative SpMV framework for the Sunway TaihuLight. IEEE Trans. Parallel Distrib. Syst. 32, 1 (2021), 131–146.
[37]
Carl Yang, Yangzihao Wang, and John D. Owens. 2015. Fast sparse matrix and sparse vector multiplication algorithm on the GPU. In Proceedings of the International Parallel and Distributed Processing Symposium. 841–847.
[38]
Leonid Yavits and Ran Ginosar. 2018. Accelerator for sparse machine learning. IEEE Comput. Archit. Lett. 17, 1 (2018), 21–24.
[39]
Yongzhe Zhang, Ariful Azad, and Zhenjiang Hu. 2020. FastSV: A distributed-memory connected component algorithm with fast convergence. In Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing (PP). 46–57.
[40]
Yunquan Zhang, Shigang Li, Shengen Yan, and Huiyang Zhou. 2016. A cross-platform SpMV framework on many-core architectures. ACM Trans. Archit. Code Optim. 13, 4 (2016), 33:1–33:25.
[41]
Weijie Zhou, Yue Zhao, Xipeng Shen, and Wang Chen. 2020. Enabling runtime SpMV format selection through an overhead conscious method. IEEE Trans. Parallel Distrib. Syst. 31, 1 (2020), 80–93.
[42]
Alwin Zulehner and Robert Wille. 2019. Matrix-vector vs. matrix-matrix multiplication: Potential in DD-based simulation of quantum computations. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition. 90–95.


    Published In

    ACM Transactions on Parallel Computing, Volume 9, Issue 2
    June 2022, 130 pages
    ISSN: 2329-4949
    EISSN: 2329-4957
    DOI: 10.1145/3514177

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 April 2022
    Accepted: 01 December 2021
    Revised: 01 December 2021
    Received: 01 October 2020
    Published in TOPC Volume 9, Issue 2


    Author Tags

    1. Heterogeneous
    2. HPC
    3. manycore
    4. optimization
    5. parallelism
    6. SpMSpV

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key R&D Programs of China
    • Programs of National Natural Science Foundation of China
    • Programs of Hunan Province, China
    • Programs of China Postdoctoral Council
    • Program of Zhejiang Lab
    • General Program of Fundamental Research of Shenzhen

    Article Metrics

    • Downloads (last 12 months): 117
    • Downloads (last 6 weeks): 12
    Reflects downloads up to 15 Jan 2025


    Cited By

    • Machine Learning-Based Kernel Selector for SpMV Optimization in Graph Analysis. ACM Transactions on Parallel Computing 11, 2 (2024), 1–25. DOI: 10.1145/3652579
    • APPQ-CNN: An Adaptive CNNs Inference Accelerator for Synergistically Exploiting Pruning and Quantization Based on FPGA. IEEE Transactions on Sustainable Computing 9, 6 (2024), 874–888. DOI: 10.1109/TSUSC.2024.3382157
    • A Survey of Accelerating Parallel Sparse Linear Algebra. ACM Computing Surveys 56, 1 (2023), 1–38. DOI: 10.1145/3604606
    • A Heterogeneous Parallel Computing Approach Optimizing SpTTM on CPU-GPU via GCN. ACM Transactions on Parallel Computing 10, 2 (2023), 1–23. DOI: 10.1145/3584373
    • Redesign and Accelerate the AIREBO Bond-Order Potential on the New Sunway Supercomputer. IEEE Transactions on Parallel and Distributed Systems 34, 12 (2023), 3117–3132. DOI: 10.1109/TPDS.2023.3321927
    • ESA: An efficient sequence alignment algorithm for biological database search on Sunway TaihuLight. Parallel Computing 117 (2023), 103043. DOI: 10.1016/j.parco.2023.103043
    • Accelerating Electromagnetic Field Simulations Based on Memory-Optimized CPML-FDTD with OpenACC. Applied Sciences 12, 22 (2022), 11430. DOI: 10.3390/app122211430
    • DTSpMV: An Adaptive SpMV Framework for Graph Analysis on GPUs. In Proceedings of the 2022 IEEE 24th International Conference on High Performance Computing and Communications (HPCC/DSS/SmartCity/DependSys). 35–42. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00039
