research-article

A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning

Published: 17 August 2020

Abstract

In large-scale distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper, we propose BML, a scalable, high-performance, and fault-tolerant DML network architecture built on top of Ethernet and commodity devices. BML builds on the BCube topology and runs a fully distributed gradient synchronization algorithm. Compared to a Fat-Tree network of the same size, a BML network is expected to take much less time for gradient synchronization, owing both to its lower theoretical synchronization time and to the benefit it brings to RDMA transport. Under server or link failures, the performance of BML degrades gracefully. Experiments with the MNIST and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4% compared with a Fat-Tree network running the state-of-the-art gradient synchronization algorithm.
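For readers unfamiliar with the synchronization step the abstract refers to, the sketch below shows a generic synchronous data-parallel training step in which every worker averages its gradients with an allreduce before applying the update. It illustrates only the communication pattern whose network cost BML targets; it is not the BML/BCube synchronization algorithm from the paper. It assumes PyTorch with torch.distributed already initialized, and the names model, loss_fn, batch, and optimizer are hypothetical placeholders.

    import torch
    import torch.distributed as dist

    def synchronized_step(model, loss_fn, batch, optimizer):
        """One synchronous data-parallel step: local backward pass, then
        gradient averaging across all workers via allreduce. Illustrative
        sketch only; not the BML algorithm described in the paper."""
        inputs, targets = batch
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()                        # compute local gradients

        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                # Sum this gradient tensor across all workers, then average,
                # so every worker applies an identical update.
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size

        optimizer.step()
        return loss.item()

The training logic is the same on any fabric; BML's contribution is in how the underlying topology and synchronization algorithm reduce the time this allreduce-style step spends in the network.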


Cited By

  • (2023) Roar: A Router Microarchitecture for In-network Allreduce. Proceedings of the 37th International Conference on Supercomputing, pp. 423–436. DOI: 10.1145/3577193.3593711. Online publication date: 21-Jun-2023.
  • (2022) HammingMesh. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pp. 1–18. DOI: 10.5555/3571885.3571899. Online publication date: 13-Nov-2022.


Published In

IEEE/ACM Transactions on Networking, Volume 28, Issue 4 (Aug. 2020), 477 pages

Publisher

IEEE Press

