DOI: 10.1145/3587135.3592200
research-article

DistSim: A performance model of large-scale hybrid distributed DNN training

Published: 04 August 2023

Abstract

With the ever-increasing computational demand of DNN training workloads, distributed training has been widely adopted. A combination of data, model, and pipeline parallelism, called hybrid-parallel distributed training, has been introduced to tackle the problem of deploying large-scale models. However, evaluating a hybrid strategy and the utilization of each device remains a challenge, since existing works either profile on a real large-scale cluster at high time and monetary cost, or analyze only a single type of parallelism without considering hybrid parallelism. In this work, we propose DistSim, an event-based performance model that accurately analyzes each device's computation and communication activities at low profiling cost. DistSim breaks the model down into events according to the given distributed strategy, which can be profiled on just two nodes. DistSim then leverages the hierarchy of the different parallel strategies to generate the computation and communication event flow from the layer level to the model level, and finally the activity timeline of each device participating in training. Experiments show that DistSim achieves less than 4% error when predicting distributed-training batch time and less than 5% error when predicting a single device's activity time across various hybrid-strategy settings. We also present a use case of DistSim that automatically evaluates and searches for the best distributed training strategy, finding a hybrid strategy with up to 7.37× higher throughput.
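
To make the event-based idea concrete, the following is a minimal, hypothetical Python sketch (not the authors' code) of how per-layer compute and communication events, profiled once on a small number of devices, could be replayed onto a per-device timeline to predict batch time under a given parallel strategy. All class and function names, and the example costs, are illustrative assumptions rather than DistSim's actual implementation.

```python
# Hypothetical sketch of an event-based performance model in the spirit of the
# abstract above: per-layer compute/communication events (profiled once on a
# couple of devices) are replayed onto a per-device timeline to predict batch
# time under a given parallel strategy. Names and numbers are illustrative only.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One profiled activity on a device (duration in milliseconds)."""
    name: str
    kind: str            # "compute" or "comm"
    duration_ms: float


@dataclass
class DeviceTimeline:
    device_id: int
    cursor_ms: float = 0.0
    events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> float:
        """Schedule an event back-to-back on this device; return its finish time."""
        self.cursor_ms += event.duration_ms
        self.events.append(event)
        return self.cursor_ms


def predict_batch_time(stage_layers: List[Event],
                       grad_sync: Event,
                       timeline: DeviceTimeline) -> float:
    """Replay one training step for a single pipeline stage: all layer events
    in order, then the data-parallel gradient all-reduce."""
    for event in stage_layers:
        timeline.append(event)
    return timeline.append(grad_sync)


if __name__ == "__main__":
    # In practice these per-event costs would come from a short profiling run.
    layers = [Event(f"fwd+bwd_layer{i}", "compute", 1.8) for i in range(4)]
    allreduce = Event("allreduce_gradients", "comm", 2.5)

    timeline = DeviceTimeline(device_id=0)
    batch_ms = predict_batch_time(layers, allreduce, timeline)
    print(f"predicted per-device batch time: {batch_ms:.1f} ms")
```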


Cited By

  • (2024) A Survey on Performance Modeling and Prediction for Distributed DNN Training. IEEE Transactions on Parallel and Distributed Systems 35(12), 2463-2478. https://doi.org/10.1109/TPDS.2024.3476390. Online publication date: Dec 2024.
  • (2024) Simulating LLM Training in CXL-Based Heterogeneous Computing Cluster. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 1-6. https://doi.org/10.1109/INFOCOMWKSHPS61880.2024.10620705. Online publication date: 20 May 2024.

Index Terms

  1. DistSim: A performance model of large-scale hybrid distributed DNN training

        Recommendations

        Comments

        Please enable JavaScript to view thecomments powered by Disqus.

        Information & Contributors

        Information

        Published In

        CF '23: Proceedings of the 20th ACM International Conference on Computing Frontiers
        May 2023
        419 pages
        ISBN: 9798400701405
        DOI: 10.1145/3587135
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery, New York, NY, United States

        Publication History

        Published: 04 August 2023

        Author Tags

        1. Distributed DNN training
        2. performance model

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Funding Sources

        • The National Key R&D Program of China under Grant
        • the National Natural Science Foundation of China (NSFC)

        Conference

        CF '23

        Acceptance Rates

        CF '23 paper acceptance rate: 24 of 66 submissions (36%)
        Overall acceptance rate: 273 of 785 submissions (35%)

        Article Metrics

        • Downloads (last 12 months): 115
        • Downloads (last 6 weeks): 14
        Reflects downloads up to 11 Dec 2024
