DOI: 10.1145/3295500.3356222

SparCML: high-performance sparse communication for machine learning

Published: 17 November 2019

Abstract

Applying machine learning techniques to the quickly growing data in science and industry requires highly scalable algorithms. Large datasets are most commonly processed "data parallel," distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication step, and thus the scalability bottleneck, for most machine learning workloads. We observe that, frequently, many gradient values are (close to) zero, leading to sparse or sparsifiable communications. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations by allowing processes to contribute arbitrary sparse input data vectors. Our generic communication library, SparCML, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML and its techniques will form the basis of future highly scalable machine learning frameworks.
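
To make the idea of a sparse collective concrete, the sketch below shows a naive index-value sparse allreduce built from standard MPI collectives: each process contributes an arbitrary list of (index, value) pairs, the pairs are exchanged with MPI_Allgatherv, and every process sums them into a dense result. This is only an illustrative baseline under assumed names and sizes (DENSE_LEN and the per-rank nonzeros are invented for the example); it is not SparCML's actual algorithm, which uses more communication-efficient schedules and representation switching.

/* Hedged sketch, not the SparCML implementation: a naive sparse allreduce
 * that exchanges (index, value) pairs with MPI_Allgatherv and sums them
 * locally into a dense vector. Compile with: mpicc sparse_allreduce_sketch.c */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DENSE_LEN 16   /* length of the (conceptually dense) gradient vector */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a small sparse vector: two nonzeros whose
     * positions depend on the rank (purely illustrative values). */
    int nnz = 2;
    int    idx[2] = { rank % DENSE_LEN, (rank * 3 + 1) % DENSE_LEN };
    double val[2] = { 1.0, 0.5 };

    /* Exchange nonzero counts so every rank can size its receive buffers. */
    int *counts = malloc(size * sizeof(int));
    MPI_Allgather(&nnz, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (int i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }

    /* Gather all (index, value) pairs; a real implementation would pack the
     * pairs together and use a tree or recursive-doubling schedule instead. */
    int    *all_idx = malloc(total * sizeof(int));
    double *all_val = malloc(total * sizeof(double));
    MPI_Allgatherv(idx, nnz, MPI_INT,    all_idx, counts, displs, MPI_INT,    MPI_COMM_WORLD);
    MPI_Allgatherv(val, nnz, MPI_DOUBLE, all_val, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Reduce locally into a dense result: every rank ends up with the sum. */
    double dense[DENSE_LEN];
    memset(dense, 0, sizeof(dense));
    for (int i = 0; i < total; i++) dense[all_idx[i]] += all_val[i];

    if (rank == 0)
        for (int i = 0; i < DENSE_LEN; i++) printf("dense[%d] = %.1f\n", i, dense[i]);

    free(counts); free(displs); free(all_idx); free(all_val);
    MPI_Finalize();
    return 0;
}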

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 November 2019

Author Tags

  1. sparse allgather
  2. sparse allreduce
  3. sparse input vectors

Qualifiers

  • Research-article

Funding Sources

  • Swiss National Supercomputing Centre
  • European Research Council (ERC)

Conference

SC '19

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 105
  • Downloads (last 6 weeks): 20

Reflects downloads up to 11 Dec 2024

Citations

Cited By

  • (2024) Optimizing cloud resource allocation. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 46:1, 2311-2330. DOI: 10.3233/JIFS-234054. Online publication date: 1-Jan-2024.
  • (2024) Configurable Algorithms for All-to-All Collectives. ISC High Performance 2024 Research Paper Proceedings (39th International Conference), 1-12. DOI: 10.23919/ISC.2024.10528936. Online publication date: May-2024.
  • (2024) Sparse Gradient Communication with AlltoAll for Accelerating Distributed Deep Learning. Proceedings of the 53rd International Conference on Parallel Processing, 148-157. DOI: 10.1145/3673038.3673140. Online publication date: 12-Aug-2024.
  • (2024) OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs. Proceedings of the 2024 SIGCOMM Workshop on Networks for AI Computing, 75-83. DOI: 10.1145/3672198.3673804. Online publication date: 4-Aug-2024.
  • (2024) Bruck Algorithm Performance Analysis for Multi-GPU All-to-All Communication. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 127-133. DOI: 10.1145/3635035.3635047. Online publication date: 18-Jan-2024.
  • (2024) ADTopk: All-Dimension Top-k Compression for High-Performance Data-Parallel DNN Training. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 135-147. DOI: 10.1145/3625549.3658678. Online publication date: 3-Jun-2024.
  • (2024) HammingMesh: A Network Topology for Large-Scale Deep Learning. Communications of the ACM. DOI: 10.1145/3623490. Online publication date: 21-Nov-2024.
  • (2024) Enhancing Distributed Neural Network Training Through Node-Based Communications. IEEE Transactions on Neural Networks and Learning Systems 35:12, 17893-17907. DOI: 10.1109/TNNLS.2023.3309735. Online publication date: Dec-2024.
  • (2024) Parallel Successive Learning for Dynamic Distributed Model Training Over Heterogeneous Wireless Networks. IEEE/ACM Transactions on Networking 32:1, 222-237. DOI: 10.1109/TNET.2023.3286987. Online publication date: Feb-2024.
  • (2024) Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules. IEEE INFOCOM 2024 - IEEE Conference on Computer Communications, 1880-1889. DOI: 10.1109/INFOCOM52122.2024.10621327. Online publication date: 20-May-2024.
