Tutorial
DOI: 10.1145/3293883.3302260

High performance distributed deep learning: a beginner's guide

Published: 16 February 2019

Abstract

The current wave of advances in Deep Learning (DL) has led to many exciting challenges and opportunities for Computer Science and Artificial Intelligence researchers alike. Modern DL frameworks like Caffe2, TensorFlow, Cognitive Toolkit (CNTK), PyTorch, and several others have emerged that offer ease of use and flexibility to describe, train, and deploy various types of Deep Neural Networks (DNNs). In this tutorial, we will provide an overview of interesting trends in DNN design and how cutting-edge hardware architectures are playing a key role in moving the field forward. We will also present an overview of different DNN architectures and DL frameworks. Most DL frameworks started with a single-node/single-GPU design; however, approaches to parallelize DNN training are being actively explored, and the DL community has pursued several distributed training designs that exploit communication runtimes like gRPC, MPI, and NCCL. In this context, we will highlight new challenges and opportunities for communication runtimes to efficiently support distributed DNN training. We will also present some of our co-design efforts to utilize CUDA-Aware MPI for large-scale DNN training on modern GPU clusters. Finally, the tutorial includes hands-on exercises that give attendees first-hand experience running distributed DNN training experiments on a modern GPU cluster.
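To make the data-parallel, multi-GPU training design mentioned above concrete, the following is a minimal sketch (not part of the tutorial material) that wraps a toy model in PyTorch's DistributedDataParallel over the NCCL backend. The model, the synthetic batches, and the torchrun launch command are illustrative assumptions.

    # Minimal data-parallel training sketch using PyTorch DDP with the NCCL backend.
    # Assumptions: PyTorch built with CUDA/NCCL support, launched via
    #   torchrun --nproc_per_node=<num_gpus> train_ddp.py
    # The model and synthetic data below are hypothetical placeholders.
    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")        # torchrun sets RANK/WORLD_SIZE
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        device = f"cuda:{local_rank}"

        model = torch.nn.Linear(1024, 10).to(device)   # each rank holds a full replica
        ddp_model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
        loss_fn = torch.nn.CrossEntropyLoss()

        for step in range(100):
            inputs = torch.randn(32, 1024, device=device)        # synthetic batch
            labels = torch.randint(0, 10, (32,), device=device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), labels)
            loss.backward()       # gradients are all-reduced across ranks via NCCL
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

In a real run, a DistributedSampler (or equivalent sharding) would ensure each rank trains on a distinct partition of the dataset.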
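The MPI-based designs can be sketched even more compactly: each rank computes a local gradient and the ranks average it with an Allreduce. The snippet below uses mpi4py with a hypothetical flattened gradient vector; with a CUDA-aware MPI library, the same call pattern can be issued on GPU-resident buffers, which is the idea behind the co-design efforts mentioned above.

    # Manual gradient averaging across ranks with MPI Allreduce (mpi4py).
    # Assumptions: mpi4py and an MPI library are installed; run with, e.g.,
    #   mpirun -np 4 python allreduce_grads.py
    # The random vector stands in for a flattened local gradient.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local_grad = np.random.rand(1 << 20).astype(np.float32)  # placeholder gradient

    comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)      # sum across all ranks
    local_grad /= size                                        # average

    if rank == 0:
        print("averaged gradient norm:", float(np.linalg.norm(local_grad)))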

    Published In

    PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
    February 2019
    472 pages
    ISBN: 9781450362252
    DOI: 10.1145/3293883

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. DNN training
    2. HPC
    3. MPI
    4. high-performance deep learning
    5. machine learning

    Qualifiers

    • Tutorial

    Conference

    PPoPP '19

    Acceptance Rates

    PPoPP '19 Paper Acceptance Rate: 29 of 152 submissions, 19%
    Overall Acceptance Rate: 230 of 1,014 submissions, 23%
