research-article

A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning

Published: 17 August 2020

Abstract

In large-scale distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper, we propose BML, a scalable, high-performance, and fault-tolerant DML network architecture built on top of Ethernet and commodity devices. BML builds on the BCube topology and runs a fully distributed gradient synchronization algorithm. Compared to a Fat-Tree network of the same size, a BML network is expected to take much less time for gradient synchronization, owing both to its lower theoretical synchronization time and to the benefit it brings to RDMA transport. Under server or link failures, the performance of BML degrades gracefully. Experiments with the MNIST and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4% compared with a Fat-Tree network running the state-of-the-art gradient synchronization algorithm.
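For readers unfamiliar with the synchronization step the abstract refers to, the sketch below shows a generic synchronous data-parallel training step in which every worker averages its gradients with an allreduce before applying the update. It illustrates only the communication pattern whose network cost BML targets; it is not the BML/BCube synchronization algorithm from the paper. It assumes PyTorch with torch.distributed already initialized, and the names model, loss_fn, batch, and optimizer are hypothetical placeholders.

    import torch
    import torch.distributed as dist

    def synchronized_step(model, loss_fn, batch, optimizer):
        """One synchronous data-parallel step: local backward pass, then
        gradient averaging across all workers via allreduce. Illustrative
        sketch only; not the BML algorithm described in the paper."""
        inputs, targets = batch
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()                        # compute local gradients

        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                # Sum this gradient tensor across all workers, then average,
                # so every worker applies an identical update.
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size

        optimizer.step()
        return loss.item()

The training logic is the same on any fabric; BML's contribution is in how the underlying topology and synchronization algorithm reduce the time this allreduce-style step spends in the network.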


Cited By

  • (2023) Roar: A Router Microarchitecture for In-network Allreduce. Proceedings of the 37th International Conference on Supercomputing, pp. 423–436. DOI: 10.1145/3577193.3593711. Online publication date: 21-Jun-2023.
  • (2022) HammingMesh. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–18. DOI: 10.5555/3571885.3571899. Online publication date: 13-Nov-2022.


Published In

IEEE/ACM Transactions on Networking, Volume 28, Issue 4 (Aug. 2020), 477 pages

Publisher

IEEE Press

