Detailed Analysis and Optimization of Irregular-Shaped Matrix Multiplication on Multi-Core DSPs

Published: 12 August 2024

Abstract

Irregular-shaped General Matrix Multiplication (GEMM) is used extensively in diverse workloads, such as scientific simulation and deep learning. Under energy-efficiency constraints, low-power multi-core digital signal processors (DSPs) have emerged as a viable alternative architecture for HPC systems. This study examines how existing GEMM implementations perform on irregular-shaped matrices and finds sub-optimal results, caused by deficiencies in memory optimization and in the extraction of core-level parallelism. For the multi-core DSPs in FT-M7032, a CPU-DSP heterogeneous processor for HPC, we introduce dspIMM, a new multi-core parallel implementation of irregular-shaped matrix multiplication. dspIMM incorporates a new loop ordering, stack-space optimization, multi-dimensional core-level parallelization, a communication-computation overlap scheme with data prefetching across loops, and blocking optimization. Together, these optimizations improve memory access and core-level parallelism in irregular-shaped GEMMs. Experimental results show that the proposed communication-computation overlap contributes the largest share of dspIMM's performance gain, and that dspIMM achieves an average speedup of up to 3.34x over previous implementations.
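The full paper is behind the download links below, but the overlap idea the abstract highlights is easy to illustrate. The following is a minimal, hypothetical sketch (not the authors' code) of communication-computation overlap via double buffering for a blocked GEMM: while the cores compute on the K-panels currently resident in fast on-chip memory, the next panels are fetched into a second pair of buffers, and the buffers are swapped each iteration. The dma_get/dma_wait helpers are invented stand-ins for a DSP's asynchronous DMA interface; plain memcpy is used so the sketch compiles and runs on any host.

```c
/* Minimal double-buffering sketch of communication-computation overlap
 * for blocked GEMM. Not from the paper; dma_get()/dma_wait() are
 * hypothetical stand-ins for an asynchronous DMA interface. */
#include <stddef.h>
#include <string.h>

enum { M = 64, N = 64, K = 1024, KC = 128 }; /* KC: K-panel depth, K % KC == 0 */

/* On a real DSP, dma_get() would enqueue a DMA descriptor and return
 * immediately; dma_wait() would block on its completion flag. */
static void dma_get(double *dst, const double *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst); /* synchronous host stand-in */
}
static void dma_wait(void) { /* completion would be polled here */ }

/* Multiply one M x KC panel of A by one KC x N panel of B into C.
 * The i-p-j loop order streams rows of b and c with unit stride. */
static void panel_gemm(const double *a, const double *b, double *c)
{
    for (int i = 0; i < M; ++i)
        for (int p = 0; p < KC; ++p)
            for (int j = 0; j < N; ++j)
                c[i * N + j] += a[i * KC + p] * b[p * N + j];
}

/* A is M x K, B is K x N, both row-major; C (M x N) must be
 * zero-initialized by the caller. */
void gemm_overlap(const double *A, const double *B, double *C)
{
    /* Two panel buffers each; on a DSP these would live in on-chip
     * SRAM, with KC sized to fit. */
    static double abuf[2][M * KC], bbuf[2][KC * N];
    int cur = 0;

    /* Prefetch the first K-panels before entering the loop. A's panel
     * rows are strided in memory, so they are copied row by row. */
    for (int i = 0; i < M; ++i)
        dma_get(&abuf[cur][i * KC], &A[i * K], KC);
    dma_get(bbuf[cur], B, (size_t)KC * N);

    for (int k = 0; k < K; k += KC) {
        int nxt = cur ^ 1;
        dma_wait(); /* current panels are now resident */

        /* Issue the prefetch of the next panels, then compute on the
         * current ones; on real hardware the two proceed in parallel. */
        if (k + KC < K) {
            for (int i = 0; i < M; ++i)
                dma_get(&abuf[nxt][i * KC], &A[i * K + k + KC], KC);
            dma_get(bbuf[nxt], &B[(k + KC) * N], (size_t)KC * N);
        }
        panel_gemm(abuf[cur], bbuf[cur], C);
        cur = nxt;
    }
}
```

On real hardware the dma_get calls return immediately and the transfer proceeds concurrently with panel_gemm, so panel-transfer latency is hidden whenever the compute time per panel exceeds the transfer time, which is exactly the regime that blocking aims for.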


Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024
1279 pages
ISBN:9798400717932
DOI:10.1145/3673038
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DSPs
  2. Irregular-shaped matrix
  3. Matrix-matrix multiplication
  4. Parallel algorithm
  5. Performance optimization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Key Research and Development Program of China
  • National Natural Science Foundation of China

Conference

ICPP '24

Acceptance Rates

Overall acceptance rate: 91 of 313 submissions (29%)

Article Metrics

  • Total Citations: 0
  • Total Downloads: 270
  • Downloads (Last 12 months): 270
  • Downloads (Last 6 weeks): 96

Reflects downloads up to 25 Dec 2024
