research-article
Public Access

Bridging the Semantic Gaps of GPU Acceleration for Scale-out CNN-based Big Data Processing: Think Big, See Small

Published: 11 September 2016

Abstract

Convolutional Neural Networks (CNNs) have substantially advanced the state-of-the-art accuracy of object recognition, which is the core function of a myriad of modern data processing techniques such as image/video processing, speech recognition, and natural language processing. GPU-based accelerators have gained increasing attention because the large number of highly parallel neurons in a CNN naturally matches the GPU computation pattern. In this work, we perform comprehensive experiments to investigate the performance bottlenecks and overheads of current GPU acceleration platforms for scale-out CNN-based big data processing.
In our characterization, we observe two significant semantic gaps: the framework gap, which lies between the CNN-based data processing workflow and the data processing manner of the distributed framework; and the standalone gap, which lies between the uneven computation loads at different CNN layers and the fixed computing capacity provisioning of current GPU acceleration libraries. To bridge these gaps, we propose D3NN, a Distributed, Decoupled, and Dynamically tuned GPU acceleration framework for modern CNN architectures. In particular, D3NN features a novel analytical model that enables accurate time estimation of GPU-accelerated CNN processing with only 5-10% error. Our evaluation results show that the throughput of a standalone processing node using D3NN gains up to 3.7X performance improvement over the current standalone GPU acceleration platform. Our CNN-oriented GPU acceleration library with a built-in dynamic batching scheme achieves up to 1.5X performance improvement over the non-batching scheme and outperforms the state-of-the-art deep learning library by up to 28% (performance mode) and up to 67% (memory-efficient mode).
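
The abstract refers to an analytical model for per-layer time estimation and a dynamic batching scheme that adapts to uneven computation loads across CNN layers, but the paper's actual equations and implementation are not reproduced on this page. The sketch below is a hypothetical illustration of the general idea only: it pairs a simple roofline-style per-layer cost estimate with a per-layer batch-size search. All device parameters, layer shapes, and names (`estimated_time`, `choose_batch`) are invented for the example and do not correspond to D3NN's actual model or library.

```python
# Illustrative sketch only: a toy roofline-style per-layer latency model and a
# dynamic batch-size chooser in the spirit of the ideas described in the
# abstract. All coefficients, layer shapes, and the selection policy below are
# hypothetical and are NOT taken from the paper.

from dataclasses import dataclass


@dataclass
class ConvLayer:
    name: str
    flops_per_image: float   # arithmetic work per input image (FLOPs)
    bytes_per_image: float   # memory traffic per input image (bytes)


# Hypothetical device parameters; a real deployment would calibrate these
# against profiler measurements.
PEAK_FLOPS = 4.3e12          # FLOP/s
PEAK_BW = 290e9              # bytes/s
KERNEL_OVERHEAD = 50e-6      # fixed launch/setup cost per kernel (seconds)


def estimated_time(layer: ConvLayer, batch: int) -> float:
    """Estimate kernel time as the larger of compute time and memory-traffic
    time, plus a fixed per-kernel overhead."""
    compute = batch * layer.flops_per_image / PEAK_FLOPS
    traffic = batch * layer.bytes_per_image / PEAK_BW
    return KERNEL_OVERHEAD + max(compute, traffic)


def choose_batch(layer: ConvLayer, max_batch: int, latency_budget: float) -> int:
    """Pick the largest batch size whose estimated time fits the per-layer
    latency budget; larger batches amortize the fixed kernel overhead."""
    best = 1
    for b in range(1, max_batch + 1):
        if estimated_time(layer, b) <= latency_budget:
            best = b
    return best


if __name__ == "__main__":
    # Toy layers with deliberately uneven per-image work, mimicking the uneven
    # computation loads across CNN layers that the abstract points out.
    layers = [
        ConvLayer("conv1", flops_per_image=2.1e8, bytes_per_image=1.2e6),
        ConvLayer("conv3", flops_per_image=6.0e8, bytes_per_image=4.0e5),
        ConvLayer("fc6",   flops_per_image=7.0e7, bytes_per_image=3.0e7),
    ]
    for layer in layers:
        b = choose_batch(layer, max_batch=256, latency_budget=2e-3)
        print(f"{layer.name}: batch={b}, est={estimated_time(layer, b) * 1e3:.2f} ms")
```

In a real system, such an estimate would be calibrated against profiler measurements and the chosen batch size would be fed to the acceleration library on a per-layer basis; the constants above are placeholders.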



      Published In

      PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation
      September 2016
      474 pages
      ISBN:9781450341219
      DOI:10.1145/2967938

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 September 2016


      Author Tags

      1. big data
      2. deep learning
      3. distributed system
      4. gpu

      Qualifiers

      • Research-article

      Conference

      PACT '16
      Sponsor:
      • IFIP WG 10.3
      • IEEE TCCA
      • SIGARCH
      • IEEE CS TCPP

      Acceptance Rates

      PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;
      Overall Acceptance Rate 121 of 471 submissions, 26%

      Cited By

• (2022) Out-of-order backprop. Proceedings of the Seventeenth European Conference on Computer Systems, 10.1145/3492321.3519563, 435-452. Online publication date: 28-Mar-2022.
• (2022) FPGA Implementation of Convolutional Neural Networks Based on Resource Sharing Techniques. 2022 14th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), 10.1109/ICMTMA54903.2022.00026, 96-100. Online publication date: Jan-2022.
• (2022) Hybrid convolutional neural network based segmentation of visceral and subcutaneous adipose tissue from abdominal magnetic resonance images. Journal of Ambient Intelligence and Humanized Computing, 10.1007/s12652-022-03787-z, 14:10, 13333-13347. Online publication date: 22-Apr-2022.
• (2021) Grus. ACM Transactions on Architecture and Code Optimization, 10.1145/3444844, 18:2, 1-25. Online publication date: 9-Feb-2021.
• (2021) TurboDL: Improving the CNN Training on GPU With Fine-Grained Multi-Streaming Scheduling. IEEE Transactions on Computers, 10.1109/TC.2020.2990321, 70:4, 552-565. Online publication date: 1-Apr-2021.
• (2020) Co-Optimizing Performance and Memory Footprint Via Integrated CPU/GPU Memory Management, an Implementation on Autonomous Driving Platform. 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 10.1109/RTAS48715.2020.00007, 310-323. Online publication date: Apr-2020.
• (2020) Efficient ResNet Model to Predict Protein-Protein Interactions With GPU Computing. IEEE Access, 10.1109/ACCESS.2020.3005444, 8, 127834-127844. Online publication date: 2020.
• (2019) MiC. ACM Journal on Emerging Technologies in Computing Systems, 10.1145/3304108, 15:3, 1-24. Online publication date: 29-Apr-2019.
• (2018) Prediction based execution on deep neural networks. Proceedings of the 45th Annual International Symposium on Computer Architecture, 10.1109/ISCA.2018.00068, 752-763. Online publication date: 2-Jun-2018.
• (2018) Multiple CNN-based Tasks Scheduling across Shared GPU Platform in Research and Development Scenarios. 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 10.1109/HPCC/SmartCity/DSS.2018.00107, 578-585. Online publication date: Jun-2018.
