DOI: 10.1145/3419394.3423637

Dissecting the Communication Latency in Distributed Deep Sparse Learning

Published: 27 October 2020

Abstract

Distributed deep learning (DDL) uses a cluster of servers to train models in parallel. It has been applied to a wide range of problems, e.g., online advertising and friend recommendation. However, distributing the training means that the communication network becomes a key component in system performance. In this paper, we measure Alibaba's DDL system, with a focus on understanding the bottlenecks introduced by the network. Our key finding is that the communication overhead has a surprisingly large impact on performance. To explore this, we analyse latency logs of 1.38M Remote Procedure Calls (RPCs) between servers during model training for two real applications of high-dimensional sparse data. We reveal the major contributors to this latency, including concurrent write/read operations across different connections and network connection management. We further observe a skewed distribution of update frequency across individual parameters, motivating us to propose using in-network computation capacity to offload server tasks.
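The skew in per-parameter update frequency is what motivates the in-network offloading proposal. As a rough illustration (not the authors' measurement pipeline), the sketch below aggregates update counts per parameter key from a parameter-server RPC log and reports how concentrated the updates are; the log path, the CSV format, and the column name param_id are assumptions made purely for illustration.

```python
import csv
from collections import Counter

def update_frequency_skew(log_path, top_fraction=0.01):
    """Share of all parameter updates that touch the most frequently
    updated `top_fraction` of parameter keys (hypothetical log format)."""
    counts = Counter()
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["param_id"]] += 1          # one push RPC = one update

    if not counts:
        return 0.0
    total = sum(counts.values())
    k = max(1, int(len(counts) * top_fraction))    # size of the "hot" key set
    hot = sum(c for _, c in counts.most_common(k))
    return hot / total
```

A result close to 1.0 for a small top_fraction would indicate that a few hot keys dominate the update traffic; such keys are natural candidates for offloading to in-network computation.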

Supplementary Material

WMV File (imc2020-131.wmv)
Dissecting the Communication Latency in Distributed Deep Sparse Learning





    Published In

    IMC '20: Proceedings of the ACM Internet Measurement Conference
    October 2020
    751 pages
    ISBN:9781450381383
    DOI:10.1145/3419394
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2020


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Youth Innovation Promotion Association CAS
    • National Science Fund of China
    • National Key R&D Program of China

    Conference

    IMC '20
    IMC '20: ACM Internet Measurement Conference
    October 27 - 29, 2020
    Virtual Event, USA

    Acceptance Rates

    IMC '20 paper acceptance rate: 53 of 216 submissions (25%)
    Overall acceptance rate: 277 of 1,083 submissions (26%)


    Article Metrics

    • Downloads (Last 12 months): 49
    • Downloads (Last 6 weeks): 6
    Reflects downloads up to 19 Dec 2024


    Citations

    Cited By

    • (2024) Efficient Cross-Cloud Partial Reduce With CREW. IEEE Transactions on Parallel and Distributed Systems 35(11), 2224-2238. DOI: 10.1109/TPDS.2024.3460185. Online publication date: Nov-2024.
    • (2024) XAgg: Accelerating Heterogeneous Distributed Training Through XDP-Based Gradient Aggregation. IEEE/ACM Transactions on Networking 32(3), 2174-2188. DOI: 10.1109/TNET.2023.3339524. Online publication date: Jun-2024.
    • (2022) Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training. Proceedings of the ACM Web Conference 2022, 1764-1773. DOI: 10.1145/3485447.3511981. Online publication date: 25-Apr-2022.
    • (2022) Efficient Communication Scheduling for Parameter Synchronization of DML in Data Center Networks. IEEE Transactions on Network Science and Engineering 9(4), 1970-1985. DOI: 10.1109/TNSE.2021.3068155. Online publication date: 1-Jul-2022.
    • (2022) HiveMind: Towards Cellular Native Machine Learning Model Splitting. IEEE Journal on Selected Areas in Communications 40(2), 626-640. DOI: 10.1109/JSAC.2021.3118403. Online publication date: Feb-2022.
    • (2021) Examination of WAN traffic characteristics in a large-scale data center network. Proceedings of the 21st ACM Internet Measurement Conference, 1-14. DOI: 10.1145/3487552.3487860. Online publication date: 2-Nov-2021.
