
Characterizing and Optimizing Transformer Inference on ARM Many-core Processor

Published: 13 January 2023

Abstract

The Transformer has achieved tremendous success and revolutionized the field of natural language processing (NLP). While the GPU has become the de facto standard for deep learning computation in many settings, there are still many scenarios where the CPU remains the prevalent choice. In particular, ARM many-core processors are emerging as competitive candidates for HPC systems, making them a promising platform for deploying Transformer inference.
In this paper, we first identify three performance bottlenecks of Transformer inference on many-core CPUs: isolated thread scheduling and configuration, inappropriate GEMM implementations, and redundant computation for variable-length inputs. To tackle these problems, we propose cross-layer optimizations spanning the operator and runtime layers. To improve parallel efficiency, we design NUMA-aware thread scheduling and a look-up table of optimal parallel configurations. The GEMM implementation is tailored to several critical modules to suit the characteristics of the Transformer workload. To eliminate redundant computation, we design and implement a novel storage format that packs the sparse data produced by variable-length inputs, together with a load-balancing distribution strategy for tasks of different sparsity. Our experimental results show that our implementation outperforms existing solutions by 1.1x to 6x for fixed-length inputs and 1.9x to 6x for variable-length inputs, depending on sequence length and batch size.
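
The NUMA-aware thread scheduling mentioned above is not detailed in this abstract. As a minimal sketch of the general idea, assuming a Linux system and a hypothetical topology of 4 NUMA nodes with 16 cores each, worker threads can be pinned to fixed cores so that first-touch memory allocations stay on the local node. None of the names or constants below come from the paper.

// numa_pin.cpp -- illustrative sketch of NUMA-aware thread pinning.
// Topology constants are hypothetical; on a real system, query them
// via libnuma or /sys instead of hard-coding.
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <vector>

constexpr int kNumaNodes    = 4;   // assumed NUMA nodes per socket
constexpr int kCoresPerNode = 16;  // assumed cores per NUMA node

struct WorkerArg {
    int node;  // NUMA node this worker is bound to
    int core;  // absolute core id on that node
};

void* worker(void* p) {
    auto* arg = static_cast<WorkerArg*>(p);
    // Pin the calling thread to its assigned core; with first-touch
    // allocation, buffers it initializes land on the local NUMA node.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(arg->core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    // ... run the operator tile assigned to this node here ...
    std::printf("worker pinned: node %d core %d\n", arg->node, arg->core);
    return nullptr;
}

int main() {
    std::vector<pthread_t> threads(kNumaNodes * kCoresPerNode);
    std::vector<WorkerArg> args(threads.size());
    for (int n = 0; n < kNumaNodes; ++n) {
        for (int c = 0; c < kCoresPerNode; ++c) {
            int i = n * kCoresPerNode + c;
            args[i] = {n, i};
            pthread_create(&threads[i], nullptr, worker, &args[i]);
        }
    }
    for (auto& t : threads) pthread_join(t, nullptr);
    return 0;
}

A look-up table of parallel configurations, as the abstract describes, would then map an operator and input shape to a thread count and placement chosen offline, rather than hard-coding them as above.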

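The novel storage format for variable-length inputs is likewise not described here; the sketch below only illustrates the underlying padding-removal idea it builds on: valid tokens from every sequence in a batch are packed contiguously, with per-sequence offsets recorded, so downstream GEMMs never touch pad positions. The layout and names are illustrative assumptions, not the paper's format.

// pack_varlen.cpp -- illustrative sketch of packing a padded batch
// [batch x max_len x hidden] into a dense, padding-free buffer.
#include <cstdio>
#include <vector>

struct PackedBatch {
    std::vector<float> tokens;   // packed [total_tokens x hidden] values
    std::vector<int>   offsets;  // start row of each sequence, plus a sentinel
};

PackedBatch pack(const std::vector<float>& padded,
                 const std::vector<int>& lengths,
                 int max_len, int hidden) {
    PackedBatch out;
    int pos = 0;
    for (size_t b = 0; b < lengths.size(); ++b) {
        out.offsets.push_back(pos);
        // Copy only the valid rows of sequence b; pad rows are skipped.
        const float* src = &padded[b * max_len * hidden];
        out.tokens.insert(out.tokens.end(), src, src + lengths[b] * hidden);
        pos += lengths[b];
    }
    out.offsets.push_back(pos);  // sentinel: total token count
    return out;
}

int main() {
    // Two sequences of lengths 2 and 1, padded to max_len = 3, hidden = 2.
    std::vector<float> padded = {1, 1, 2, 2, 0, 0,   3, 3, 0, 0, 0, 0};
    PackedBatch p = pack(padded, {2, 1}, /*max_len=*/3, /*hidden=*/2);
    std::printf("packed %d tokens (was %d with padding)\n",
                p.offsets.back(), 2 * 3);
    return 0;
}

On top of such a packed layout, per-sequence work of different lengths can be distributed across threads by token count rather than by sequence, which is the kind of sparsity-aware load balancing the abstract alludes to.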


Information

Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
August 2022
976 pages
ISBN: 9781450397339
DOI: 10.1145/3545008

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 January 2023


Author Tags

  1. ARM
  2. Transformer
  3. inference
  4. many-core CPU

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '22
ICPP '22: 51st International Conference on Parallel Processing
August 29 - September 1, 2022
Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%



Article Metrics

  • Downloads (Last 12 months): 164
  • Downloads (Last 6 weeks): 25
Reflects downloads up to 13 Dec 2024


Cited By

  • (2024) Performance Analysis and Optimizations of Matrix Multiplications on ARMv8 Processors. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). DOI: 10.23919/DATE58400.2024.10546786, pp. 1-6. Online publication date: 25-Mar-2024.
  • (2024) YFlows: Systematic Dataflow Exploration and Code Generation for Efficient Neural Network Inference using SIMD Architectures on CPUs. Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction. DOI: 10.1145/3640537.3641566, pp. 212-226. Online publication date: 17-Feb-2024.
  • (2023) Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs. ACM Transactions on Architecture and Code Optimization 20:4. DOI: 10.1145/3617689, pp. 1-22. Online publication date: 26-Oct-2023.
  • (2023) Hierarchical Model Parallelism for Optimizing Inference on Many-core Processor via Decoupled 3D-CNN Structure. ACM Transactions on Architecture and Code Optimization 20:3. DOI: 10.1145/3605149, pp. 1-21. Online publication date: 19-Jul-2023.
  • (2023) Full-Stack Optimizing Transformer Inference on ARM Many-Core CPU. IEEE Transactions on Parallel and Distributed Systems 34:7. DOI: 10.1109/TPDS.2023.3280805, pp. 2221-2235. Online publication date: 1-Jul-2023.
