
Zero and data reuse-aware fast convolution for deep neural networks on GPU

Published: 01 October 2016
DOI: 10.1145/2968456.2968476

Abstract

Convolution operations dominate the total execution time of deep convolutional neural networks (CNNs). In this paper, we aim to enhance the performance of the state-of-the-art convolution algorithm (called Winograd convolution) on the GPU. Our work is based on two observations: (1) CNNs often have abundant zero weights, and (2) the performance benefit of Winograd convolution is limited mainly by the extra additions incurred during data transformation. To exploit the abundant zero weights, we propose a low-overhead and efficient hardware mechanism, called ZeroSkip, that skips multiplications which always give zero results regardless of the input data. To leverage the second observation, we present a data reuse optimization for the addition operations in Winograd convolution, called AddOpt, which improves the utilization of local registers and thereby reduces on-chip cache accesses. Our experiments with a real-world deep CNN, VGG-16, on GPGPU-Sim and Titan X show that the proposed methods, ZeroSkip and AddOpt, achieve 51.8% higher convolution performance than the baseline Winograd convolution. Moreover, even without any hardware modification, AddOpt alone gives 35.6% higher performance on a real hardware platform, Titan X.
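The two observations above are easiest to see in the arithmetic of the minimal filtering form F(2,3) from which Winograd convolution is built. The sketch below (plain C, illustrative only; the function name and the software zero test are assumptions for exposition, not the paper's implementation) computes two outputs of a 3-tap convolution with four multiplications instead of six. The data-transform additions are the overhead that AddOpt attacks by keeping their operands in registers, and the guard on each multiply is a software stand-in for what the proposed ZeroSkip hardware does when a transformed weight is known to be zero.

```c
/*
 * A minimal sketch of 1D Winograd minimal filtering, F(2,3): two
 * outputs of a 3-tap convolution computed with 4 multiplications
 * instead of 6. Illustrative only; names and the software zero test
 * are assumptions for exposition, not the paper's implementation.
 */
#include <stdio.h>

static void winograd_f23(const float d[4], const float g[3], float y[2])
{
    /* Weight transform (G g): in practice precomputed once per filter. */
    const float u0 = g[0];
    const float u1 = 0.5f * (g[0] + g[1] + g[2]);
    const float u2 = 0.5f * (g[0] - g[1] + g[2]);
    const float u3 = g[2];

    /* Data transform (B^T d): these extra additions are the overhead
     * that AddOpt reduces by keeping operands in local registers. */
    const float t0 = d[0] - d[2];
    const float t1 = d[1] + d[2];
    const float t2 = d[2] - d[1];
    const float t3 = d[1] - d[3];

    /* Element-wise multiplies. The guards stand in, in software, for
     * the ZeroSkip idea: a transformed weight that is exactly zero
     * yields a zero product for any input, so the multiply is skipped. */
    const float m0 = (u0 != 0.0f) ? u0 * t0 : 0.0f;
    const float m1 = (u1 != 0.0f) ? u1 * t1 : 0.0f;
    const float m2 = (u2 != 0.0f) ? u2 * t2 : 0.0f;
    const float m3 = (u3 != 0.0f) ? u3 * t3 : 0.0f;

    /* Inverse transform (A^T m) produces the two outputs. */
    y[0] = m0 + m1 + m2;
    y[1] = m1 - m2 - m3;
}

int main(void)
{
    const float d[4] = {1.0f, 2.0f, 3.0f, 4.0f}; /* input tile */
    const float g[3] = {1.0f, 0.0f, -1.0f};      /* pruned 3-tap filter */
    float y[2];

    winograd_f23(d, g, y);
    /* Direct check: y0 = 1*1 + 0*2 + (-1)*3 = -2,
     *               y1 = 1*2 + 0*3 + (-1)*4 = -2. */
    printf("y = [%f, %f]\n", y[0], y[1]);
    return 0;
}
```

With this pruned filter, two of the four transformed weights (u1 and u2) come out exactly zero, so half of the multiplications can be skipped regardless of the input; these always-zero products are what ZeroSkip detects in hardware, while the t0..t3 additions are where AddOpt's register reuse pays off.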





    Published In

    CODES '16: Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis
    October 2016
    294 pages
    ISBN: 9781450344838
    DOI: 10.1145/2968456

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. convolutional neural networks
    2. data reuse
    3. zero data

    Qualifiers

    • Research-article

    Conference

    ESWEEK'16: Twelfth Embedded System Week
    October 1 - 7, 2016
    Pittsburgh, Pennsylvania

    Acceptance Rates

    Overall Acceptance Rate 280 of 864 submissions, 32%


