
Characterizing and Optimizing Transformer Inference on ARM Many-core Processor

Published: 13 January 2023

Abstract

The Transformer has achieved tremendous success and revolutionized the field of natural language processing (NLP). While the GPU has become the de facto standard for deep learning computation in many settings, there are still many scenarios where the CPU remains the prevalent choice. In particular, ARM many-core processors are emerging as competitive candidates for HPC systems, making them a promising platform for deploying Transformer inference.
In this paper, we first identify three performance bottlenecks of Transformer inference on many-core CPUs: isolated thread scheduling and configuration, inappropriate GEMM implementations, and redundant computation for variable-length inputs. To tackle these problems, we propose cross-layer optimizations spanning the operator and runtime layers. To improve parallel efficiency, we design NUMA-aware thread scheduling and a look-up table of optimal parallel configurations. The GEMM implementation is tailored to several critical modules to suit the characteristics of the Transformer workload. To eliminate redundant computation, we design and implement a novel storage format that packs the sparse data produced by variable-length inputs, together with a load-balancing distribution strategy for tasks of different sparsity. Our experimental results show that our implementation outperforms existing solutions by 1.1x to 6x for fixed-length inputs and 1.9x to 6x for variable-length inputs, depending on sequence length and batch size.
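
The NUMA-aware thread scheduling mentioned above is not detailed in this abstract. As a minimal sketch of the general idea, assuming a Linux system and a hypothetical topology of 4 NUMA nodes with 16 cores each, worker threads can be pinned to fixed cores so that first-touch memory allocations stay on the local node. None of the names or constants below come from the paper.

// numa_pin.cpp -- illustrative sketch of NUMA-aware thread pinning.
// Topology constants are hypothetical; on a real system, query them
// via libnuma or /sys instead of hard-coding.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <vector>

constexpr int kNumaNodes    = 4;   // assumed NUMA nodes per socket
constexpr int kCoresPerNode = 16;  // assumed cores per NUMA node

struct WorkerArg {
    int node;  // NUMA node this worker is bound to
    int core;  // absolute core id on that node
};

void* worker(void* p) {
    auto* arg = static_cast<WorkerArg*>(p);
    // Pin the calling thread to its assigned core; with first-touch
    // allocation, buffers it initializes land on the local NUMA node.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(arg->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    // ... run the operator tile assigned to this node here ...
    std::printf("worker pinned: node %d core %d\n", arg->node, arg->core);
    return nullptr;
}

int main() {
    std::vector<pthread_t> threads(kNumaNodes * kCoresPerNode);
    std::vector<WorkerArg> args(threads.size());
    for (int n = 0; n < kNumaNodes; ++n) {
        for (int c = 0; c < kCoresPerNode; ++c) {
            int i = n * kCoresPerNode + c;
            args[i] = {n, i};
            pthread_create(&threads[i], nullptr, worker, &args[i]);
        }
    }
    for (auto& t : threads) pthread_join(t, nullptr);
    return 0;
}

A look-up table of parallel configurations, as the abstract describes, would then map an operator and input shape to a thread count and placement chosen offline, rather than hard-coding them as above.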

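The novel storage format for variable-length inputs is likewise not described here; the sketch below only illustrates the underlying padding-removal idea it builds on: valid tokens from every sequence in a batch are packed contiguously, with per-sequence offsets recorded, so downstream GEMMs never touch pad positions. The layout and names are illustrative assumptions, not the paper's format.

// pack_varlen.cpp -- illustrative sketch of packing a padded batch
// [batch x max_len x hidden] into a dense, padding-free buffer.
#include <cstdio>
#include <vector>

struct PackedBatch {
    std::vector<float> tokens;   // packed [total_tokens x hidden] values
    std::vector<int>   offsets;  // start row of each sequence, plus a sentinel
};

PackedBatch pack(const std::vector<float>& padded,
                 const std::vector<int>& lengths,
                 int max_len, int hidden) {
    PackedBatch out;
    int pos = 0;
    for (size_t b = 0; b < lengths.size(); ++b) {
        out.offsets.push_back(pos);
        // Copy only the valid rows of sequence b; pad rows are skipped.
        const float* src = &padded[b * max_len * hidden];
        out.tokens.insert(out.tokens.end(), src, src + lengths[b] * hidden);
        pos += lengths[b];
    }
    out.offsets.push_back(pos);  // sentinel: total token count
    return out;
}

int main() {
    // Two sequences of lengths 2 and 1, padded to max_len = 3, hidden = 2.
    std::vector<float> padded = {1, 1, 2, 2, 0, 0,   3, 3, 0, 0, 0, 0};
    PackedBatch p = pack(padded, {2, 1}, /*max_len=*/3, /*hidden=*/2);
    std::printf("packed %d tokens (was %d with padding)\n",
                p.offsets.back(), 2 * 3);
    return 0;
}

On top of such a packed layout, per-sequence work of different lengths can be distributed across threads by token count rather than by sequence, which is the kind of sparsity-aware load balancing the abstract alludes to.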


Information

Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
August 2022
976 pages
ISBN: 9781450397339
DOI: 10.1145/3545008

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 January 2023


Author Tags

  1. ARM
  2. Transformer
  3. inference
  4. many-core CPU

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '22
ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%



Article Metrics

  • Downloads (Last 12 months): 164
  • Downloads (Last 6 weeks): 25
Reflects downloads up to 13 Dec 2024


Cited By

  • (2024) Performance Analysis and Optimizations of Matrix Multiplications on ARMv8 Processors. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). DOI: 10.23919/DATE58400.2024.10546786, pp. 1-6. Online publication date: 25-Mar-2024.
  • (2024) YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction. DOI: 10.1145/3640537.3641566, pp. 212-226. Online publication date: 17-Feb-2024.
  • (2023) Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs. ACM Transactions on Architecture and Code Optimization 20:4. DOI: 10.1145/3617689, pp. 1-22. Online publication date: 26-Oct-2023.
  • (2023) Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure. ACM Transactions on Architecture and Code Optimization 20:3. DOI: 10.1145/3605149, pp. 1-21. Online publication date: 19-Jul-2023.
  • (2023) Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU. IEEE Transactions on Parallel and Distributed Systems 34:7. DOI: 10.1109/TPDS.2023.3280805, pp. 2221-2235. Online publication date: 1-Jul-2023.
