Detailed Analysis and Optimization of Irregular-Shaped Matrix Multiplication on Multi-Core DSPs

Published: 12 August 2024

Abstract

Irregular-shaped General Matrix Multiplication (GEMM) is used extensively in diverse workloads, such as scientific simulation and deep learning. Under energy-efficiency constraints, low-power multi-core digital signal processors (DSPs) have emerged as a viable alternative architecture for HPC systems. This study examines how existing GEMM implementations perform on irregular-shaped matrices and finds sub-optimal results, caused by deficiencies in memory optimization and in the extraction of core-level parallelism. For the multi-core DSPs in FT-M7032, a CPU-DSP heterogeneous processor for HPC, we introduce dspIMM, a new multi-core parallel implementation of irregular-shaped matrix multiplication. dspIMM incorporates a new loop ordering, stack-space optimization, multi-dimensional core-level parallelization, a communication-computation overlap scheme with data prefetching across loops, and blocking optimization. Together, these optimizations improve memory access and core-level parallelism in irregular-shaped GEMMs. Experimental results show that the proposed communication-computation overlap contributes the largest share of dspIMM's performance gain, and that dspIMM achieves an average speedup of up to 3.34x over previous implementations.
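The full paper is behind the download links below, but the overlap idea the abstract highlights is easy to illustrate. The following is a minimal, hypothetical sketch (not the authors' code) of communication-computation overlap via double buffering for a blocked GEMM: while the cores compute on the K-panels currently resident in fast on-chip memory, the next panels are fetched into a second pair of buffers, and the buffers are swapped each iteration. The dma_get/dma_wait helpers are invented stand-ins for a DSP's asynchronous DMA interface; plain memcpy is used so the sketch compiles and runs on any host.

```c
/* Minimal double-buffering sketch of communication-computation overlap
 * for blocked GEMM. Not from the paper; dma_get()/dma_wait() are
 * hypothetical stand-ins for an asynchronous DMA interface. */
#include <stddef.h>
#include <string.h>

enum { M = 64, N = 64, K = 1024, KC = 128 }; /* KC: K-panel depth, K % KC == 0 */

/* On a real DSP, dma_get() would enqueue a DMA descriptor and return
 * immediately; dma_wait() would block on its completion flag. */
static void dma_get(double *dst, const double *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst); /* synchronous host stand-in */
}
static void dma_wait(void) { /* completion would be polled here */ }

/* Multiply one M x KC panel of A by one KC x N panel of B into C.
 * The i-p-j loop order streams rows of b and c with unit stride. */
static void panel_gemm(const double *a, const double *b, double *c)
{
    for (int i = 0; i < M; ++i)
        for (int p = 0; p < KC; ++p)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a[i * KC + p] * b[p * N + j];
}

/* A is M x K, B is K x N, both row-major; C (M x N) must be
 * zero-initialized by the caller. */
void gemm_overlap(const double *A, const double *B, double *C)
{
    /* Two panel buffers each; on a DSP these would live in on-chip
     * SRAM, with KC sized to fit. */
    static double abuf[2][M * KC], bbuf[2][KC * N];
    int cur = 0;

    /* Prefetch the first K-panels before entering the loop. A's panel
     * rows are strided in memory, so they are copied row by row. */
    for (int i = 0; i < M; ++i)
        dma_get(&abuf[cur][i * KC], &A[i * K], KC);
    dma_get(bbuf[cur], B, (size_t)KC * N);

    for (int k = 0; k < K; k += KC) {
        int nxt = cur ^ 1;
        dma_wait(); /* current panels are now resident */

        /* Issue the prefetch of the next panels, then compute on the
         * current ones; on real hardware the two proceed in parallel. */
        if (k + KC < K) {
            for (int i = 0; i < M; ++i)
                dma_get(&abuf[nxt][i * KC], &A[i * K + k + KC], KC);
            dma_get(bbuf[nxt], &B[(k + KC) * N], (size_t)KC * N);
        }
        panel_gemm(abuf[cur], bbuf[cur], C);
        cur = nxt;
    }
}
```

On real hardware the dma_get calls return immediately and the transfer proceeds concurrently with panel_gemm, so panel-transfer latency is hidden whenever the compute time per panel exceeds the transfer time, which is exactly the regime that blocking aims for.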


Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024
1279 pages
ISBN:9798400717932
DOI:10.1145/3673038
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DSPs
  2. Irregular-shaped matrix
  3. Matrix-matrix multiplication
  4. Parallel algorithm
  5. Performance optimization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Key Research and Development Program of China
  • National Natural Science Foundation of China

Conference

ICPP '24

Acceptance Rates

Overall acceptance rate: 91 of 313 submissions (29%)

Article Metrics

  • Total Citations: 0
  • Total Downloads: 270
  • Downloads (Last 12 months): 270
  • Downloads (Last 6 weeks): 96

Reflects downloads up to 25 Dec 2024
