DOI: 10.1145/3419394.3423637

Dissecting the Communication Latency in Distributed Deep Sparse Learning

Published: 27 October 2020

Abstract

Distributed deep learning (DDL) uses a cluster of servers to train models in parallel. It has been applied to a wide range of problems, e.g., online advertising and friend recommendation. However, distributing the training means that the communication network becomes a key component in system performance. In this paper, we measure Alibaba's DDL system, with a focus on understanding the bottlenecks introduced by the network. Our key finding is that the communication overhead has a surprisingly large impact on performance. To explore this, we analyse latency logs of 1.38M Remote Procedure Calls (RPCs) between servers during model training for two real applications of high-dimensional sparse data. We reveal the major contributors to this latency, including concurrent write/read operations across different connections and network connection management. We further observe a skewed distribution of update frequency across individual parameters, motivating us to propose using in-network computation capacity to offload server tasks.
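The skew in per-parameter update frequency is what motivates the in-network offloading proposal. As a rough illustration (not the authors' measurement pipeline), the sketch below aggregates update counts per parameter key from a parameter-server RPC log and reports how concentrated the updates are; the log path, the CSV format, and the column name param_id are assumptions made purely for illustration.

```python
import csv
from collections import Counter

def update_frequency_skew(log_path, top_fraction=0.01):
    """Share of all parameter updates that touch the most frequently
    updated `top_fraction` of parameter keys (hypothetical log format)."""
    counts = Counter()
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["param_id"]] += 1          # one push RPC = one update

    if not counts:
        return 0.0
    total = sum(counts.values())
    k = max(1, int(len(counts) * top_fraction))    # size of the "hot" key set
    hot = sum(c for _, c in counts.most_common(k))
    return hot / total
```

A result close to 1.0 for a small top_fraction would indicate that a few hot keys dominate the update traffic; such keys are natural candidates for offloading to in-network computation.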

Supplementary Material

WMV File (imc2020-131.wmv)
Dissecting the Communication Latency in Distributed Deep Sparse Learning





    Published In

    IMC '20: Proceedings of the ACM Internet Measurement Conference
    October 2020
    751 pages
    ISBN:9781450381383
    DOI:10.1145/3419394
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2020


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    • Youth Innovation Promotion Association CAS
    • National Science Fund of China
    • National Key R&D Program of China

    Conference

    IMC '20
    IMC '20: ACM Internet Measurement Conference
    October 27 - 29, 2020
    Virtual Event, USA

    Acceptance Rates

    IMC '20 paper acceptance rate: 53 of 216 submissions (25%)
    Overall acceptance rate: 277 of 1,083 submissions (26%)


    Article Metrics

    • Downloads (Last 12 months): 49
    • Downloads (Last 6 weeks): 6
    Reflects downloads up to 19 Dec 2024


    Citations

    Cited By

    • (2024) Efficient Cross-Cloud Partial Reduce With CREW. IEEE Transactions on Parallel and Distributed Systems 35(11), 2224-2238. DOI: 10.1109/TPDS.2024.3460185. Online publication date: Nov-2024.
    • (2024) XAgg: Accelerating Heterogeneous Distributed Training Through XDP-Based Gradient Aggregation. IEEE/ACM Transactions on Networking 32(3), 2174-2188. DOI: 10.1109/TNET.2023.3339524. Online publication date: Jun-2024.
    • (2022) Modeling and Optimizing the Scaling Performance in Distributed Deep Learning Training. Proceedings of the ACM Web Conference 2022, 1764-1773. DOI: 10.1145/3485447.3511981. Online publication date: 25-Apr-2022.
    • (2022) Efficient Communication Scheduling for Parameter Synchronization of DML in Data Center Networks. IEEE Transactions on Network Science and Engineering 9(4), 1970-1985. DOI: 10.1109/TNSE.2021.3068155. Online publication date: 1-Jul-2022.
    • (2022) HiveMind: Towards Cellular Native Machine Learning Model Splitting. IEEE Journal on Selected Areas in Communications 40(2), 626-640. DOI: 10.1109/JSAC.2021.3118403. Online publication date: Feb-2022.
    • (2021) Examination of WAN traffic characteristics in a large-scale data center network. Proceedings of the 21st ACM Internet Measurement Conference, 1-14. DOI: 10.1145/3487552.3487860. Online publication date: 2-Nov-2021.
