DOI: 10.1145/3447818.3460372
research-article

Partitioning sparse deep neural networks for scalable training and inference

Published: 04 June 2021

Abstract

State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements. The sizes of both training data and models continue to increase. Sparsification and pruning methods have been shown to be effective in removing a large fraction of connections in DNNs. The resulting sparse networks present unique challenges for further improving the computational efficiency of training and inference in deep learning. Both the feedforward (inference) and backpropagation steps of the stochastic gradient descent (SGD) algorithm for training sparse DNNs involve consecutive sparse matrix-vector multiplications (SpMVs). We first introduce a distributed-memory parallel SpMV-based solution for the SGD algorithm to improve its scalability. The parallelization approach is based on row-wise partitioning of the weight matrices that represent neuron connections between consecutive layers. We then propose a novel hypergraph model for partitioning weight matrices to reduce the total communication volume and ensure computational load balance among processors. Experiments performed on sparse DNNs demonstrate that the proposed solution is highly efficient and scalable. Utilizing the proposed matrix partitioning scheme further improves the performance of our solution significantly.
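
To make the two partitioning objectives named in the abstract concrete, the following is a minimal single-process sketch, not the authors' implementation. It simulates one row-wise partitioned SpMV and reports the communication volume and the per-processor load. The function name `rowwise_spmv`, the dictionary-based sparse format, and the `x_owner` array are assumptions made for illustration. The volume count used here (distinct remote input entries each processor must receive) follows the standard accounting for row-wise SpMV and is not necessarily the exact metric of the paper's hypergraph model.

```python
def rowwise_spmv(rows, x, row_owner, x_owner, num_procs):
    """Simulate y = W x with the rows of W distributed across processors.

    rows:      dict mapping row index -> list of (column index, weight) nonzeros
    x:         dense input vector (activations of the previous layer)
    row_owner: row_owner[i] is the processor that owns row i (and computes y[i])
    x_owner:   x_owner[j] is the processor that holds x[j] (hypothetical input layout)
    Returns (y, volume, loads).
    """
    # needed[p] collects the distinct x entries processor p reads but does not own.
    needed = [set() for _ in range(num_procs)]
    # loads[p] counts the multiply-adds performed by processor p.
    loads = [0] * num_procs
    y = {}
    for i, entries in rows.items():
        p = row_owner[i]
        acc = 0.0
        for j, w in entries:
            if x_owner[j] != p:
                needed[p].add(j)  # x[j] must be sent to p once, however often it is reused
            acc += w * x[j]
            loads[p] += 1
        y[i] = acc
    # Total communication volume: number of vector entries sent over the network.
    volume = sum(len(s) for s in needed)
    return y, volume, loads


# Toy 4x4 sparse weight matrix (illustrative data, not from the paper).
W = {
    0: [(0, 1.0), (1, 2.0)],
    1: [(0, 3.0), (1, 4.0)],
    2: [(2, 5.0), (3, 6.0)],
    3: [(0, 7.0), (3, 8.0)],
}
x = [1.0, 1.0, 1.0, 1.0]
x_owner = [0, 0, 1, 1]

# Two candidate row-wise partitions over two processors.
blocked = [0, 0, 1, 1]
interleaved = [0, 1, 0, 1]

y_b, vol_b, loads_b = rowwise_spmv(W, x, blocked, x_owner, 2)
y_i, vol_i, loads_i = rowwise_spmv(W, x, interleaved, x_owner, 2)
```

Both partitions give each processor four multiply-adds, so they are equally balanced, yet the blocked partition needs one vector entry to be communicated while the interleaved one needs four. A partitioner of the kind the abstract describes searches for row assignments that keep the load even while lowering this volume.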



Information & Contributors

Information

Published In

ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing
June 2021
506 pages
ISBN: 9781450383356
DOI: 10.1145/3447818
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. distributed stochastic gradient descent
  2. hypergraph partitioning
  3. scalable deep learning
  4. sparse deep neural networks
  5. sparse matrix vector multiplication

Qualifiers

  • Research-article

Conference

ICS '21

Acceptance Rates

ICS '21 Paper Acceptance Rate 39 of 157 submissions, 25%;
Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 2
Reflects downloads up to 05 Jan 2025.

Citations

Cited By

  • (2024) D3-GNN: Dynamic Distributed Dataflow for Streaming Graph Neural Networks. Proceedings of the VLDB Endowment 17:11 (2764-2777). DOI: 10.14778/3681954.3681961. Online publication date: 1-Jul-2024
  • (2024) FSD-Inference: Fully Serverless Distributed Inference with Scalable Cloud Communication. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (2109-2122). DOI: 10.1109/ICDE60146.2024.00168. Online publication date: 13-May-2024
  • (2023) Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs. Electronics 12:17 (3687). DOI: 10.3390/electronics12173687. Online publication date: 31-Aug-2023
  • (2023) Low-Latency Federated Learning With DNN Partition in Distributed Industrial IoT Networks. IEEE Journal on Selected Areas in Communications 41:3 (755-775). DOI: 10.1109/JSAC.2022.3229436. Online publication date: Mar-2023
  • (2023) Dynamic layer-wise sparsification for distributed deep learning. Future Generation Computer Systems 147:C (1-15). DOI: 10.1016/j.future.2023.04.022. Online publication date: 1-Oct-2023
  • (2022) A Lightweight Self-Supervised Representation Learning Algorithm for Scene Classification in Spaceborne SAR and Optical Images. Remote Sensing 14:13 (2956). DOI: 10.3390/rs14132956. Online publication date: 21-Jun-2022
  • (2022) Mapping and Optimization Method of SpMV on Multi-DSP Accelerator. Electronics 11:22 (3699). DOI: 10.3390/electronics11223699. Online publication date: 11-Nov-2022
  • (2022) Scalable Graph Convolutional Network Training on Distributed-Memory Systems. Proceedings of the VLDB Endowment 16:4 (711-724). DOI: 10.14778/3574245.3574256. Online publication date: 1-Dec-2022
