
Layer-Centric Memory Reuse and Data Migration for Extreme-Scale Deep Learning on Many-Core Architectures

Published: 17 September 2018

Abstract

Due to the popularity of Deep Neural Network (DNN) models, we have witnessed extreme-scale DNN models whose depth and width continue to grow. However, their extremely high memory requirements make it difficult to run the training process on a single many-core architecture such as a Graphics Processing Unit (GPU), which compels researchers to resort to model parallelism over multiple GPUs. Model parallelism, in turn, incurs heavy additional overhead, so running an extreme-scale model on a single GPU is highly desirable, yet several challenges remain in reducing the memory footprint of extreme-scale deep learning. To address this problem, we first identify the memory usage characteristics of deep and wide convolutional networks and demonstrate opportunities for memory reuse at both the intra-layer and inter-layer levels. We then present Layrub, a runtime data placement strategy that orchestrates the execution of the training process. It achieves layer-centric reuse to reduce memory consumption for extreme-scale deep learning models that previously could not be run on a single GPU. Experiments show that, compared to the original Caffe, Layrub cuts memory usage by an average of 58.2% and by up to 98.9%, at the moderate cost of a 24.1% increase in training execution time on average. Results also show that Layrub outperforms popular deep learning systems such as GeePS, vDNN, MXNet, and TensorFlow. More importantly, Layrub can tackle extreme-scale deep learning tasks; for example, it allows an extra-deep ResNet with 1,517 layers to be trained successfully on a single GPU with 12GB of memory, which other existing deep learning systems cannot do.
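To make the data-migration idea concrete, the sketch below (plain CUDA C++, not taken from the paper) shows one way inter-layer reuse can be realized: a layer's activations are copied to pinned host memory on a side stream during the forward pass so the device buffer can be recycled by later layers, and are prefetched back to the GPU just before the backward pass needs them. All identifiers here (LayerBuffer, the compute/copy streams, the buffer size) are illustrative assumptions rather than Layrub's actual implementation.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define CHECK(call)                                                  \
  do {                                                               \
    cudaError_t err = (call);                                        \
    if (err != cudaSuccess) {                                        \
      std::fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                   cudaGetErrorString(err), __LINE__);               \
      std::exit(EXIT_FAILURE);                                       \
    }                                                                \
  } while (0)

// Hypothetical per-layer bookkeeping: one device buffer that holds the
// layer's activations and one pinned host buffer used as the staging area.
struct LayerBuffer {
  float* device = nullptr;
  float* host = nullptr;
  size_t bytes = 0;
  cudaEvent_t copied;  // signals completion of the device-to-host copy
};

int main() {
  const size_t bytes = (1 << 20) * sizeof(float);  // example activation size

  cudaStream_t compute, copy;  // separate streams let copies overlap kernels
  CHECK(cudaStreamCreate(&compute));
  CHECK(cudaStreamCreate(&copy));

  LayerBuffer layer;
  layer.bytes = bytes;
  CHECK(cudaMalloc(&layer.device, layer.bytes));
  CHECK(cudaMallocHost(&layer.host, layer.bytes));  // pinned memory enables async copies
  CHECK(cudaEventCreate(&layer.copied));

  // Forward pass (conceptually): once this layer's outputs are produced,
  // stash them in host memory so the device buffer can serve a later layer.
  CHECK(cudaMemcpyAsync(layer.host, layer.device, layer.bytes,
                        cudaMemcpyDeviceToHost, copy));
  CHECK(cudaEventRecord(layer.copied, copy));

  // The compute stream may only recycle layer.device after the copy finishes.
  CHECK(cudaStreamWaitEvent(compute, layer.copied, 0));
  // ... subsequent layers run on `compute`, reusing layer.device ...

  // Backward pass (conceptually): bring the activations back to the GPU
  // just before the gradient computation for this layer needs them.
  CHECK(cudaMemcpyAsync(layer.device, layer.host, layer.bytes,
                        cudaMemcpyHostToDevice, copy));
  CHECK(cudaStreamSynchronize(copy));
  // ... backward kernels for this layer run on `compute` ...

  CHECK(cudaFree(layer.device));
  CHECK(cudaFreeHost(layer.host));
  CHECK(cudaEventDestroy(layer.copied));
  CHECK(cudaStreamDestroy(compute));
  CHECK(cudaStreamDestroy(copy));
  std::printf("Offload/prefetch round trip of %zu bytes completed.\n", bytes);
  return 0;
}

In a full training loop, such an offload/prefetch pair would be issued per layer and interleaved with the computation of neighboring layers so that copies hide behind kernel execution; orchestrating that schedule is the kind of runtime data placement the abstract describes.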

References

[1]
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). USENIX Association, Savannah, GA, 265--283.
[2]
James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: A CPU and GPU math compiler in Python. In Proceedings of the 9th Python in Science Conference (SciPy’10). Austin, Texas, 1--7.
[3]
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. 2015. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In Proceedings of Workshop on Machine Learning Systems at the 28th Annual Conference on Neural Information Processing Systems (LearningSys’15). Montreal, Canada, 1--6.
[4]
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv:1604.06174 (2016).
[5]
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient primitives for deep learning. arXiv:1410.0759 (2014).
[6]
Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep learning training system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI’14). USENIX Association, Broomfield, CO, USA, 571--582.
[7]
Ronan Collobert, Samy Bengio, and Johnny Mariéthoz. 2002. Torch: A Modular Machine Learning Software Library. Technical Report EPFL-REPORT-82802. Idiap, Martigny, Valais, Switzerland.
[8]
Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing. 2016. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the 11th European Conference on Computer Systems (EuroSys’16). ACM, London, UK, 1--16.
[9]
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS’12). Curran Associates, Inc., Lake Tahoe, Nevada, 1223--1231.
[10]
Jia Deng, Wei Dong, Richard Socher, Lijia Li, Kai Li, and Feifei Li. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the 22nd IEEE Conference on Computer Vision and Pattern Recognition (CVPR’09). IEEE, Miami, FL, 248--255.
[11]
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). IEEE, Boston, MA, 2625--2634.
[12]
Facebook. 2017. Caffe2. Retrieved December 13, 2017 from https://caffe2.ai.
[13]
Qichuan Geng, Zhong Zhou, and Xiaochun Cao. 2017. Survey of recent progress in semantic image segmentation with CNNs. Science China Information Sciences 61, 5 (2017), 1--18.
[14]
Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. 2014. Compressing deep convolutional networks using vector quantization. arXiv:1412.6115 (2014).
[15]
Audrunas Gruslys, Rémi Munos, Ivo Danihelka, Marc Lanctot, and Alex Graves. 2016. Memory-efficient backpropagation through time. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS’16). Curran Associates Inc., Barcelona, Spain, 4125--4133.
[16]
Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A. Horowitz, and William J. Dally. 2016. EIE: Efficient inference engine on compressed deep neural network. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA’16). ACM, Seoul, Korea, 243--254.
[17]
Song Han, Huizi Mao, and William J. Dally. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the 4th International Conference on Learning Representations (ICLR’16). San Juan, Puerto Rico, 1--14.
[18]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). IEEE, Las Vegas, NV, 770--778.
[19]
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. 2016. Deep networks with stochastic depth. In Proceedings of the 14th European Conference on Computer Vision (ECCV’16). Springer, Amsterdam, Netherlands, 646--661.
[20]
Forrest Iandola. 2016. Exploring the Design Space of Deep Convolutional Neural Networks at Large Scale. Ph.D. Dissertation. University of California, Berkeley.
[21]
Inspur. 2017. Caffe-MPI. Retrieved August 24, 2018 from https://github.com/Caffe-MPI/Caffe-MPI.github.io.
[22]
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (MM’14). ACM, Orlando, Florida, USA, 675--678.
[23]
Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Feifei Li. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the 27th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’14). IEEE, Washington, DC, 1725--1732.
[24]
Jan Koutník, Klaus Greff, Faustino Gomez, and Jürgen Schmidhuber. 2014. A clockwork RNN. In Proceedings of the 31st International Conference on International Conference on Machine Learning (ICML’14). JMLR, Beijing, China, 1863--1871.
[25]
Alex Krizhevsky and Geoffrey E. Hinton. 2009. Learning Multiple Layers of Features from Tiny Images. Technical Report. University of Toronto.
[26]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS’12). Curran Associates, Inc., Lake Tahoe, Nevada, 1097--1105.
[27]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278--2324.
[28]
Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, Oldrich Plchot, David Martinez, Joaquin Gonzalez-Rodriguez, and Pedro Moreno. 2014. Automatic language identification using deep neural networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’14). IEEE, Florence, Italy, 5337--5341.
[29]
Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, and Trevor Cohn. 2017. DyNet: The dynamic neural network toolkit. arXiv:1701.03980 (2017).
[30]
NVIDIA. 2017. CUDA toolkit 8.0 documentation: profiler. Retrieved December 13, 2017 from http://docs.nvidia.com/cuda/profiler-users-guide/index.html#axzz4p7v1jezC.
[31]
NVIDIA. 2017. Introduction to cuBLAS. Retrieved December 13, 2017 from http://docs.nvidia.com/cuda/cublas/index.html#axzz4oxZMgelu.
[32]
Zhaofan Qiu, Ting Yao, and Tao Mei. 2017. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV’17). IEEE, Venice, Italy, 5534--5542.
[33]
Minsoo Rhu, Natalia Gimelshein, Jason Clemons, Arslan Zulfiqar, and Stephen W. Keckler. 2016. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’16). IEEE, Taipei, Taiwan, 1--13.
[34]
Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (InterSpeech’14). Singapore, 338--342.
[35]
Frank Seide and Amit Agarwal. 2016. CNTK: Microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16). ACM, San Francisco, California, USA, 2135--2135.
[36]
Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15). San Diego, CA, 1--14.
[37]
Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A Dataset of 101 Human Actions Classes from Videos in the Wild. Technical Report CRCV-TR-12-01. University of Central Florida, Orlando, FL.
[38]
Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. 2017. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE 105, 12 (2017), 2295--2329.
[39]
Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. 2017. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI’17). AAAI Press, San Francisco, California, 4278--4284.
[40]
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15). IEEE, Boston, MA, 1--9.
[41]
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’16). IEEE, Las Vegas, NV, 2818--2826.
[42]
Marc Gonzalez Tallada. 2016. Coarse grain parallelization of deep neural networks. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’16). ACM, Barcelona, Spain, 1--12.
[43]
Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. 2018. Superneurons: Dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’18). ACM, Vienna, Austria, 41--53.
[44]
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR’17). IEEE, Honolulu, Hawaii, 5987--5995.
[45]
Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. In Proceedings of the 27th British Machine Vision Conference (BMVC’16). BMVA Press, York, UK, 87.1--87.12.





      Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3
September 2018, 322 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3274266

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 17 September 2018
      Accepted: 01 July 2018
      Revised: 01 May 2018
      Received: 01 January 2018
      Published in TACO Volume 15, Issue 3


      Author Tags

      1. DNN
      2. Data placement
      3. GPU
      4. memory efficiency

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • National Natural Science Foundation of China
      • National Key Research and Development Program of China
      • NVIDIA Corporation


      Cited By

• (2023) InterGrad: Energy-Efficient Training of Convolutional Neural Networks via Interleaved Gradient Scheduling. IEEE Transactions on Circuits and Systems I: Regular Papers 70, 5, 1949--1962. DOI: 10.1109/TCSI.2023.3246468. Online publication date: May 2023.
• (2023) EagerReuse: An Efficient Memory Reuse Approach for Complex Computational Graph. In 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 223--229. DOI: 10.1109/ICPADS60453.2023.00041. Online publication date: 17 Dec 2023.
• (2023) MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 556--569. DOI: 10.1109/HPCA56546.2023.10071077. Online publication date: Feb 2023.
• (2023) LayCO: Achieving Least Lossy Accuracy for Most Efficient RRAM-Based Deep Neural Network Accelerator via Layer-Centric Co-Optimization. Journal of Computer Science and Technology 38, 2, 328--347. DOI: 10.1007/s11390-023-2545-y. Online publication date: 30 Mar 2023.
• (2023) An Empirical Study of Memory Pool Based Allocation and Reuse in CUDA Graph. In Algorithms and Architectures for Parallel Processing, 394--406. DOI: 10.1007/978-981-97-0808-6_23. Online publication date: 20 Oct 2023.
• (2022) StrongHold. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1--17. DOI: 10.5555/3571885.3571979. Online publication date: 13 Nov 2022.
• (2022) Memory-Throughput Trade-off for CNN-Based Applications at the Edge. ACM Transactions on Design Automation of Electronic Systems 28, 1, 1--26. DOI: 10.1145/3527457. Online publication date: 10 Dec 2022.
• (2022) STRONGHOLD: Fast and Affordable Billion-Scale Deep Learning Model Training. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1--17. DOI: 10.1109/SC41404.2022.00076. Online publication date: Nov 2022.
• (2021) COMET. Proceedings of the VLDB Endowment 15, 4, 886--899. DOI: 10.14778/3503585.3503597. Online publication date: 1 Dec 2021.
• (2021) Understanding and optimizing packed neural network training for hyper-parameter tuning. In Proceedings of the Fifth Workshop on Data Management for End-To-End Machine Learning, 1--11. DOI: 10.1145/3462462.3468880. Online publication date: 20 Jun 2021.
