
SDPipe: A Semi-Decentralized Framework for Heterogeneity-Aware Pipeline-parallel Training

Published: 01 May 2023, Proceedings of the VLDB Endowment (PVLDB), Volume 16, Issue 9

Abstract

The increasing size of both deep learning models and training data necessitates scaling out model training through pipeline-parallel training, which combines pipelined model parallelism with data parallelism. However, most existing approaches assume an ideal homogeneous, dedicated cluster. In real cloud clusters, they suffer from intensive model synchronization overheads caused by dynamic environment heterogeneity. This challenge leaves system designs in a dilemma: either the central parameter server (PS) becomes a performance bottleneck, or decentralized synchronization (e.g., All-Reduce) degrades severely due to stragglers. This paper presents SDPipe, a new semi-decentralized framework that gets the best of both worlds, achieving both high heterogeneity tolerance and convergence efficiency in pipeline-parallel training. To provide high performance, we decentralize the communication-intensive model synchronization, which accounts for the largest proportion of synchronization overhead. In contrast, we centralize group scheduling, which is lightweight but requires a global view to improve performance and convergence speed under heterogeneity. Via a prototype implementation, we show that SDPipe offers significant advantages in performance and scalability across different environments.
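
To make the semi-decentralized split above concrete, here is a minimal, self-contained Python sketch, not SDPipe's actual implementation; the names Worker, schedule_groups, and decentralized_sync are hypothetical. It illustrates the division of labor described in the abstract: a lightweight centralized step groups workers by their observed iteration time so that stragglers synchronize with each other rather than holding everyone back, while model synchronization itself is decentralized, averaging parameters within each group (what an intra-group All-Reduce would compute).

# Illustrative toy sketch only (assumed names, not the SDPipe codebase).
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    """One data-parallel pipeline replica with toy parameters and a measured speed."""
    wid: int
    params: List[float]      # stand-in for model parameters
    iter_time: float         # recent per-iteration time in seconds


def schedule_groups(workers: List[Worker], group_size: int) -> List[List[Worker]]:
    """Centralized, lightweight group scheduling: rank workers by speed so fast
    workers synchronize with fast ones and slow ones with slow ones, bounding
    how long any group waits on its slowest member."""
    ranked = sorted(workers, key=lambda w: w.iter_time)
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]


def decentralized_sync(group: List[Worker]) -> None:
    """Decentralized synchronization within a group: average parameters among
    group members only, keeping a central parameter server off the critical path."""
    n = len(group)
    dim = len(group[0].params)
    avg = [sum(w.params[i] for w in group) / n for i in range(dim)]
    for w in group:
        w.params = list(avg)


if __name__ == "__main__":
    # Four workers with heterogeneous speeds and slightly diverged parameters.
    workers = [
        Worker(0, [1.0, 2.0], iter_time=0.10),
        Worker(1, [1.2, 1.8], iter_time=0.11),
        Worker(2, [0.8, 2.2], iter_time=0.30),  # straggler
        Worker(3, [1.1, 2.1], iter_time=0.28),  # straggler
    ]
    for group in schedule_groups(workers, group_size=2):
        decentralized_sync(group)
    for w in workers:
        print(w.wid, [round(p, 2) for p in w.params], w.iter_time)

In a real deployment the grouping decision would be issued by the coordinator each round and the intra-group averaging carried out with collective communication (e.g., NCCL All-Reduce) across GPU workers; the sketch only mimics the resulting parameter state.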

