DOI: 10.1145/3295500.3356222

SparCML: high-performance sparse communication for machine learning

Published: 17 November 2019

Abstract

Applying machine learning techniques to the quickly growing data in science and industry requires highly scalable algorithms. Large datasets are most commonly processed "data parallel," distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication step, and thus the scalability bottleneck, for most machine learning workloads. We observe that, frequently, many gradient values are (close to) zero, leading to sparse or sparsifiable communications. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations by allowing processes to contribute arbitrary sparse input data vectors. Our generic communication library, SparCML, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML and its techniques will form the basis of future highly scalable machine learning frameworks.
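
To make the idea of a sparse collective concrete, the sketch below shows a naive index-value sparse allreduce built from standard MPI collectives: each process contributes an arbitrary list of (index, value) pairs, the pairs are exchanged with MPI_Allgatherv, and every process sums them into a dense result. This is only an illustrative baseline under assumed names and sizes (DENSE_LEN and the per-rank nonzeros are invented for the example); it is not SparCML's actual algorithm, which uses more communication-efficient schedules and representation switching.

/* Hedged sketch, not the SparCML implementation: a naive sparse allreduce
 * that exchanges (index, value) pairs with MPI_Allgatherv and sums them
 * locally into a dense vector. Compile with: mpicc sparse_allreduce_sketch.c */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DENSE_LEN 16   /* length of the (conceptually dense) gradient vector */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a small sparse vector: two nonzeros whose
     * positions depend on the rank (purely illustrative values). */
    int nnz = 2;
    int    idx[2] = { rank % DENSE_LEN, (rank * 3 + 1) % DENSE_LEN };
    double val[2] = { 1.0, 0.5 };

    /* Exchange nonzero counts so every rank can size its receive buffers. */
    int *counts = malloc(size * sizeof(int));
    MPI_Allgather(&nnz, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (int i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }

    /* Gather all (index, value) pairs; a real implementation would pack the
     * pairs together and use a tree or recursive-doubling schedule instead. */
    int    *all_idx = malloc(total * sizeof(int));
    double *all_val = malloc(total * sizeof(double));
    MPI_Allgatherv(idx, nnz, MPI_INT,    all_idx, counts, displs, MPI_INT,    MPI_COMM_WORLD);
    MPI_Allgatherv(val, nnz, MPI_DOUBLE, all_val, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Reduce locally into a dense result: every rank ends up with the sum. */
    double dense[DENSE_LEN];
    memset(dense, 0, sizeof(dense));
    for (int i = 0; i < total; i++) dense[all_idx[i]] += all_val[i];

    if (rank == 0)
        for (int i = 0; i < DENSE_LEN; i++) printf("dense[%d] = %.1f\n", i, dense[i]);

    free(counts); free(displs); free(all_idx); free(all_val);
    MPI_Finalize();
    return 0;
}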

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2019

Author Tags

  1. sparse allgather
  2. sparse allreduce
  3. sparse input vectors

Qualifiers

  • Research-article

Funding Sources

  • Swiss National Supercomputing Centre
  • European Research Council (ERC)

Conference

SC '19

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 105
  • Downloads (last 6 weeks): 20

Reflects downloads up to 11 Dec 2024

Citations

Cited By

  • (2024) Optimizing cloud resource allocation. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 46:1, 2311-2330. DOI: 10.3233/JIFS-234054. Online publication date: 1-Jan-2024.
  • (2024) Configurable Algorithms for All-to-All Collectives. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 1-12. DOI: 10.23919/ISC.2024.10528936. Online publication date: May-2024.
  • (2024) Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning. Proceedings of the 53rd International Conference on Parallel Processing, 148-157. DOI: 10.1145/3673038.3673140. Online publication date: 12-Aug-2024.
  • (2024) OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, 75-83. DOI: 10.1145/3672198.3673804. Online publication date: 4-Aug-2024.
  • (2024) Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 127-133. DOI: 10.1145/3635035.3635047. Online publication date: 18-Jan-2024.
  • (2024) ADTopk: All-Dimension Top-k Compression for High-Performance Data-Parallel DNN Training. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 135-147. DOI: 10.1145/3625549.3658678. Online publication date: 3-Jun-2024.
  • (2024) HammingMesh: A Network Topology for Large-Scale Deep Learning. Communications of the ACM. DOI: 10.1145/3623490. Online publication date: 21-Nov-2024.
  • (2024) Enhancing Distributed Neural Network Training Through Node-Based Communications. IEEE Transactions on Neural Networks and Learning Systems 35:12, 17893-17907. DOI: 10.1109/TNNLS.2023.3309735. Online publication date: Dec-2024.
  • (2024) Parallel Successive Learning for Dynamic Distributed Model Training Over Heterogeneous Wireless Networks. IEEE/ACM Transactions on Networking 32:1, 222-237. DOI: 10.1109/TNET.2023.3286987. Online publication date: Feb-2024.
  • (2024) Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, 1880-1889. DOI: 10.1109/INFOCOM52122.2024.10621327. Online publication date: 20-May-2024.
