DOI: 10.5555/3018874.3018876
Research article

Performance-portable autotuning of OpenCL kernels for convolutional layers of deep neural networks

Published: 13 November 2016

Abstract

We present a portable and highly optimized Deep Neural Network (DNN) algorithm and its implementation techniques. Our approach is a novel combination of existing HPC techniques that methodically applies autotuning together with data-layout and low-level optimizations, achieving performance that matches or exceeds what is possible with either reverse engineering and manual assembly coding or proprietary vendor libraries; the former was done in the maxDNN implementation, and the latter is represented by cuDNN. Our work applies directly to the most time-consuming part of the DNN workflow, namely the training process, which often needs a restart when it stagnates due to, for example, diminishing gradients or getting stuck in local minima. Performance tests on a consumer-grade GPU with the latest High Bandwidth Memory (HBM) stack show that our methodology can match server-grade hardware at a fraction of the price. A further tuning sweep on a new GPU architecture from a different vendor attests to the portability of our approach and the quality of our implementation.
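The abstract describes autotuning kernel configurations over a pruned search space. The following is a minimal illustrative sketch of that idea, not the paper's actual tuner: it enumerates hypothetical tile-size and vector-width parameters for a convolution kernel, prunes candidates that violate made-up device limits (stand-ins for values an OpenCL host would query via clGetDeviceInfo), and picks the fastest under a caller-supplied benchmark.

```python
import itertools

# Hypothetical device limits; stand-ins for values an OpenCL host
# program would query via clGetDeviceInfo.
MAX_WORK_GROUP_SIZE = 256
LOCAL_MEM_BYTES = 48 * 1024

def candidate_configs():
    """Enumerate tile-size / vector-width configurations to try."""
    tiles = [4, 8, 16, 32]
    vector_widths = [1, 2, 4]
    for tx, ty, vw in itertools.product(tiles, tiles, vector_widths):
        yield {"tile_x": tx, "tile_y": ty, "vector_width": vw}

def is_feasible(cfg):
    """Prune configurations that exceed device limits before any
    kernel variant is compiled or timed."""
    work_group = cfg["tile_x"] * cfg["tile_y"]
    # Assume each work-item stages vector_width floats (4 bytes each)
    # into local memory.
    local_mem = work_group * cfg["vector_width"] * 4
    return work_group <= MAX_WORK_GROUP_SIZE and local_mem <= LOCAL_MEM_BYTES

def autotune(benchmark):
    """Return the feasible configuration with the lowest measured time.

    `benchmark` maps a configuration dict to a runtime; in a real tuner
    it would build and time the corresponding OpenCL kernel.
    """
    feasible = [c for c in candidate_configs() if is_feasible(c)]
    return min(feasible, key=benchmark)
```

In a real tuner the pruning step matters because compiling and timing every point in the cross-product of parameters is far more expensive than rejecting infeasible ones analytically, which is the role search-space pruning plays in autotuning frameworks.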

References

[1]
J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," in NIPS 2012: Neural Information Processing Systems, December 2012.
[2]
Y. Bengio, A. C. Courville, and P. Vincent, "Unsupervised feature learning and deep learning: A review and new perspectives," CoRR, vol. abs/1206.5538, 2012.
[3]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[4]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "Imagenet large scale visual recognition challenge," 2014.
[5]
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[6]
S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," CoRR, vol. abs/1410.0759, 2014. [Online]. Available: http://arxiv.org/abs/1410.0759
[7]
A. Lavin, "maxDNN: An efficient convolution kernel for deep learning with Maxwell GPUs," CoRR, vol. abs/1501.06633, 2015. [Online]. Available: http://arxiv.org/abs/1501.06633
[8]
D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young, "Machine learning: The high interest credit card of technical debt," in SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.
[9]
A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," CoRR, vol. abs/1404.5997, 2014. [Online]. Available: http://arxiv.org/abs/1404.5997
[10]
L. Bottou, "Stochastic gradient tricks," in Neural Networks, Tricks of the Trade, Reloaded, ser. Lecture Notes in Computer Science (LNCS 7700), G. Montavon, G. B. Orr, and K.-R. Müller, Eds. Springer, 2012, pp. 430--445. [Online]. Available: http://leon.bottou.org/papers/bottou-tricks-2012
[11]
R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun, "Deep image: Scaling up image recognition," CoRR, vol. abs/1501.02876, 2015. [Online]. Available: http://arxiv.org/abs/1501.02876
[12]
J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," in NIPS 2012: Neural Information Processing Systems, December 2012.
[13]
A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, "Deep learning with COTS HPC systems," in Proceedings of the 30th International Conference on Machine Learning, ser. JMLR: W&CP, vol. 28, 2013.
[14]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278--2324, 1998.
[15]
L. Bottou and Y. LeCun, "On-line learning for very large data sets," Appl. Stoch. Model. Bus. Ind., vol. 21, no. 2, 2005.
[16]
D. Hubel and T. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," Journal of Physiology, vol. 195, pp. 215--243, 1968.
[17]
F. Rosenblatt, "The perceptron: A perceiving and recognizing automaton," Cornell Aeronautical Lab, Project PARA 85-460-1, 1957.
[18]
P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming," Parallel Comput., vol. 38, no. 8, pp. 391--407, Aug. 2012.
[19]
M. J. Harvey and G. D. Fabritiis, "Swan: A tool for porting CUDA programs to OpenCL," Computer Physics Communications, vol. 182, no. 4, pp. 1093--1099, 2011.
[20]
"cuBLAS," 2016. [Online]. Available: http://docs.nvidia.com/cuda/cublas/
[21]
"Assembler for NVIDIA Maxwell architecture," 2015. [Online]. Available: https://github.com/NervanaSystems/maxas
[22]
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," CoRR, vol. abs/1312.6229, 2013. [Online]. Available: http://arxiv.org/abs/1312.6229
[23]
"Fast, scalable, easy-to-use Python-based Deep Learning framework by Nervana," 2015. [Online]. Available: http://neon.nervanasys.com/, https://github.com/NervanaSystems/neon
[24]
V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[25]
M. Courbariaux, Y. Bengio, and J. David, "Low precision arithmetic for deep learning," CoRR, vol. abs/1412.7024, 2014. [Online]. Available: http://arxiv.org/abs/1412.7024
[26]
S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," CoRR, vol. abs/1502.02551, 2015. [Online]. Available: http://arxiv.org/abs/1502.02551
[27]
M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," CoRR, vol. abs/1312.5851, 2013. [Online]. Available: http://arxiv.org/abs/1312.5851
[28]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[29]
J. Cong and B. Xiao, "Minimizing computation in convolutional neural networks," in Artificial Neural Networks and Machine Learning, ser. ICANN 2014. Springer, 2014, pp. 281--290.
[30]
A. Lavin, "Fast algorithms for convolutional neural networks," CoRR, vol. abs/1509.09308, 2015. [Online]. Available: http://arxiv.org/abs/1509.09308
[31]
R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A MATLAB-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.
[32]
F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, "Theano: new features and speed improvements," Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[33]
"Intel Math Kernel Library." [Online]. Available: https://software.intel.com/en-us/intel-mkl/
[34]
"clBLAS," 2016. [Online]. Available: https://github.com/clMathLibraries/clBLAS
[35]
P. Luszczek, M. Gates, J. Kurzak, A. Danalis, and J. Dongarra, "Search space generation and pruning system for autotuners," in The Eleventh International Workshop on Automatic Performance Tuning (iWAPT 2016), IPDPS Workshops. Chicago, IL, USA: IEEE, May 2016.
[36]
J. Kurzak, S. Tomov, and J. Dongarra, "Autotuning GEMM kernels for the Fermi GPU," Parallel and Distributed Systems, IEEE Transactions on, vol. 23, no. 11, pp. 2045--2057, 2012.
[37]
K. Matsumoto, N. Nakasato, and S. G. Sedukhin, "Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs," in High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion. IEEE, 2012, pp. 396--405.
[38]
"convnet-benchmarks," 2016. [Online]. Available: https://github.com/soumith/convnet-benchmarks

Cited By

(2021) "Analytical characterization and design space exploration for optimization of CNNs," in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 928--942. DOI: 10.1145/3445814.3446759. Online publication date: 19-Apr-2021.

Publication Information

Published In
MLHPC '16: Proceedings of the Workshop on Machine Learning in High Performance Computing Environments
November 2016, 66 pages
ISBN: 9781509038824

Publisher
IEEE Press

Conference
SC16

Acceptance Rates
Overall acceptance rate: 5 of 7 submissions, 71%
