DOI: 10.5555/3018874.3018876
Research article

Performance-portable autotuning of OpenCL kernels for convolutional layers of deep neural networks

Published: 13 November 2016

Abstract

We present a portable and highly optimized Deep Neural Network (DNN) algorithm and its implementation techniques. Our approach is a novel combination of existing HPC techniques that methodically applies autotuning together with data-layout and low-level optimizations, achieving performance that matches or exceeds what is possible with either reverse engineering and manual assembly coding or proprietary vendor libraries; the former was done in the maxDNN implementation, and the latter is represented by cuDNN. Our work applies directly to the most time-consuming part of the DNN workflow, namely the training process, which often needs a restart when it stagnates due to, for example, diminishing gradients or getting stuck in local minima. Performance tests on a consumer-grade GPU with the latest High Bandwidth Memory (HBM) stack show that our methodology can match server-grade hardware at a fraction of the price. A further tuning sweep on a new GPU architecture from a different vendor attests to the portability of our approach and the quality of our implementation.
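The abstract describes autotuning kernel configurations over a pruned search space. The following is a minimal illustrative sketch of that idea, not the paper's actual tuner: it enumerates hypothetical tile-size and vector-width parameters for a convolution kernel, prunes candidates that violate made-up device limits (stand-ins for values an OpenCL host would query via clGetDeviceInfo), and picks the fastest under a caller-supplied benchmark.

```python
import itertools

# Hypothetical device limits; stand-ins for values an OpenCL host
# program would query via clGetDeviceInfo.
MAX_WORK_GROUP_SIZE = 256
LOCAL_MEM_BYTES = 48 * 1024

def candidate_configs():
    """Enumerate tile-size / vector-width configurations to try."""
    tiles = [4, 8, 16, 32]
    vector_widths = [1, 2, 4]
    for tx, ty, vw in itertools.product(tiles, tiles, vector_widths):
        yield {"tile_x": tx, "tile_y": ty, "vector_width": vw}

def is_feasible(cfg):
    """Prune configurations that exceed device limits before any
    kernel variant is compiled or timed."""
    work_group = cfg["tile_x"] * cfg["tile_y"]
    # Assume each work-item stages vector_width floats (4 bytes each)
    # into local memory.
    local_mem = work_group * cfg["vector_width"] * 4
    return work_group <= MAX_WORK_GROUP_SIZE and local_mem <= LOCAL_MEM_BYTES

def autotune(benchmark):
    """Return the feasible configuration with the lowest measured time.

    `benchmark` maps a configuration dict to a runtime; in a real tuner
    it would build and time the corresponding OpenCL kernel.
    """
    feasible = [c for c in candidate_configs() if is_feasible(c)]
    return min(feasible, key=benchmark)
```

In a real tuner the pruning step matters because compiling and timing every point in the cross-product of parameters is far more expensive than rejecting infeasible ones analytically, which is the role search-space pruning plays in autotuning frameworks.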

References

[1]
J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," in NIPS 2012: Neural Information Processing Systems, December 2012.
[2]
Y. Bengio, A. C. Courville, and P. Vincent, "Unsupervised feature learning and deep learning: A review and new perspectives," CoRR, vol. abs/1206.5538, 2012.
[3]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[4]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "Imagenet large scale visual recognition challenge," 2014.
[5]
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.
[6]
S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient primitives for deep learning," CoRR, vol. abs/1410.0759, 2014. [Online]. Available: http://arxiv.org/abs/1410.0759
[7]
A. Lavin, "maxDNN: An efficient convolution kernel for deep learning with Maxwell GPUs," CoRR, vol. abs/1501.06633, 2015. [Online]. Available: http://arxiv.org/abs/1501.06633
[8]
D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young, "Machine learning: The high interest credit card of technical debt," in SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.
[9]
A. Krizhevsky, "One weird trick for parallelizing convolutional neural networks," CoRR, vol. abs/1404.5997, 2014. [Online]. Available: http://arxiv.org/abs/1404.5997
[10]
L. Bottou, "Stochastic gradient tricks," in Neural Networks, Tricks of the Trade, Reloaded, ser. Lecture Notes in Computer Science (LNCS 7700), G. Montavon, G. B. Orr, and K.-R. Müller, Eds. Springer, 2012, pp. 430--445. [Online]. Available: http://leon.bottou.org/papers/bottou-tricks-2012
[11]
R. Wu, S. Yan, Y. Shan, Q. Dang, and G. Sun, "Deep image: Scaling up image recognition," CoRR, vol. abs/1501.02876, 2015. [Online]. Available: http://arxiv.org/abs/1501.02876
[12]
J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large scale distributed deep networks," in NIPS 2012: Neural Information Processing Systems, December 2012.
[13]
A. Coates, B. Huval, T. Wang, D. J. Wu, A. Y. Ng, and B. Catanzaro, "Deep learning with COTS HPC systems," in Proceedings of the 30th International Conference on Machine Learning, ser. JMLR: W&CP, vol. 28, 2013.
[14]
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278--2324, 1998.
[15]
L. Bottou and Y. LeCun, "On-line learning for very large data sets," Appl. Stoch. Model. Bus. Ind., vol. 21, no. 2, 2005.
[16]
D. Hubel and T. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," Journal of Physiology, vol. 195, pp. 215--243, 1968.
[17]
F. Rosenblatt, "The perceptron: A perceiving and recognizing automaton," Cornell Aeronautical Lab, Project PARA 85-460-1, 1957.
[18]
P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming," Parallel Comput., vol. 38, no. 8, pp. 391--407, Aug. 2012.
[19]
M. J. Harvey and G. D. Fabritiis, "Swan: A tool for porting CUDA programs to OpenCL," Computer Physics Communications, vol. 182, no. 4, pp. 1093--1099, 2011.
[20]
"cuBLAS," 2016. [Online]. Available: http://docs.nvidia.com/cuda/cublas/
[21]
"Assembler for NVIDIA Maxwell architecture," 2015. [Online]. Available: https://github.com/NervanaSystems/maxas
[22]
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," CoRR, vol. abs/1312.6229, 2013. [Online]. Available: http://arxiv.org/abs/1312.6229
[23]
"Fast, scalable, easy-to-use Python-based Deep Learning framework by Nervana," 2015. [Online]. Available: http://neon.nervanasys.com/, https://github.com/NervanaSystems/neon
[24]
V. Vanhoucke, A. Senior, and M. Z. Mao, "Improving the speed of neural networks on CPUs," in Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011, 2011.
[25]
M. Courbariaux, Y. Bengio, and J. David, "Low precision arithmetic for deep learning," CoRR, vol. abs/1412.7024, 2014. [Online]. Available: http://arxiv.org/abs/1412.7024
[26]
S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," CoRR, vol. abs/1502.02551, 2015. [Online]. Available: http://arxiv.org/abs/1502.02551
[27]
M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," CoRR, vol. abs/1312.5851, 2013. [Online]. Available: http://arxiv.org/abs/1312.5851
[28]
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[29]
J. Cong and B. Xiao, "Minimizing computation in convolutional neural networks," in Artificial Neural Networks and Machine Learning, ser. ICANN 2014. Springer, 2014, pp. 281--290.
[30]
A. Lavin, "Fast algorithms for convolutional neural networks," CoRR, vol. abs/1509.09308, 2015. [Online]. Available: http://arxiv.org/abs/1509.09308
[31]
R. Collobert, K. Kavukcuoglu, and C. Farabet, "Torch7: A MATLAB-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.
[32]
F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio, "Theano: new features and speed improvements," Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[33]
"Intel Math Kernel Library." [Online]. Available: https://software.intel.com/en-us/intel-mkl/
[34]
"clBLAS," 2016. [Online]. Available: https://github.com/clMathLibraries/clBLAS
[35]
P. Luszczek, M. Gates, J. Kurzak, A. Danalis, and J. Dongarra, "Search space generation and pruning system for autotuners," in The Eleventh International Workshop on Automatic Performance Tuning (iWAPT 2016), IPDPS Workshops. Chicago, IL, USA: IEEE, May 2016.
[36]
J. Kurzak, S. Tomov, and J. Dongarra, "Autotuning GEMM kernels for the Fermi GPU," Parallel and Distributed Systems, IEEE Transactions on, vol. 23, no. 11, pp. 2045--2057, 2012.
[37]
K. Matsumoto, N. Nakasato, and S. G. Sedukhin, "Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs," in High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion. IEEE, 2012, pp. 396--405.
[38]
"convnet-benchmarks," 2016. [Online]. Available: https://github.com/soumith/convnet-benchmarks

Cited By

(2021) "Analytical characterization and design space exploration for optimization of CNNs," in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 928--942. DOI: 10.1145/3445814.3446759. Online publication date: 19-Apr-2021.

Publication Information

Published In
MLHPC '16: Proceedings of the Workshop on Machine Learning in High Performance Computing Environments
November 2016, 66 pages
ISBN: 9781509038824

Publisher
IEEE Press

Conference
SC16

Acceptance Rates
Overall acceptance rate: 5 of 7 submissions, 71%
