
Batch normalization: accelerating deep network training by reducing internal covariate shift

Published: 06 July 2015

Abstract

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization, and in some cases eliminates the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.82% top-5 test error, exceeding the accuracy of human raters.
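
To make the normalizing step described above concrete, below is a minimal sketch of the batch-normalizing transform applied to one layer's inputs over a training mini-batch: each feature is standardized using the mini-batch mean and variance, then rescaled and shifted by the learned parameters gamma and beta so the layer retains its representational power. This is not the authors' reference implementation; the function name, the epsilon value, and the input shapes are illustrative assumptions.

# Minimal sketch of the batch-normalizing transform from the abstract.
# Not the paper's code; names, eps, and shapes are assumptions.
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (batch_size, num_features)."""
    mean = x.mean(axis=0)                      # per-feature mini-batch mean
    var = x.var(axis=0)                        # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta                # learned scale and shift

# Example: normalize the pre-activations of a 4-example, 3-unit layer.
x = np.random.randn(4, 3) * 5.0 + 2.0          # inputs with a shifted, scaled distribution
gamma, beta = np.ones(3), np.zeros(3)
y = batch_norm_forward(x, gamma, beta)
print(y.mean(axis=0), y.std(axis=0))           # approximately 0 and 1

At inference time the paper replaces the mini-batch statistics with population estimates accumulated during training; that detail is omitted from this sketch.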



Published In

ICML'15: Proceedings of the 32nd International Conference on Machine Learning - Volume 37
July 2015
2558 pages

Publisher

JMLR.org


Cited By

• (2024) MetaStore: Analyzing Deep Learning Meta-Data at Scale. Proceedings of the VLDB Endowment, 17(6), 1446-1459. DOI: 10.14778/3648160.3648182. Online publication date: 1-Feb-2024.
• (2024) Pose-and-shear-based tactile servoing. International Journal of Robotics Research, 43(7), 1024-1055. DOI: 10.1177/02783649231225811. Online publication date: 21-Jun-2024.
• (2024) ActSonic: Recognizing Everyday Activities from Inaudible Acoustic Wave Around the Body. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 8(4), 1-32. DOI: 10.1145/3699752. Online publication date: 21-Nov-2024.
• (2024) Early-Exit Deep Neural Network - A Comprehensive Survey. ACM Computing Surveys, 57(3), 1-37. DOI: 10.1145/3698767. Online publication date: 22-Nov-2024.
• (2024) Performance Analysis of CNN Inference/Training with Convolution and Non-Convolution Operations on ASIC Accelerators. ACM Transactions on Design Automation of Electronic Systems, 30(1), 1-34. DOI: 10.1145/3696665. Online publication date: 26-Sep-2024.
• (2024) A Survey of Mix-based Data Augmentation: Taxonomy, Methods, Applications, and Explainability. ACM Computing Surveys, 57(2), 1-38. DOI: 10.1145/3696206. Online publication date: 17-Sep-2024.
• (2024) AI-driven Java Performance Testing: Balancing Result Quality with Testing Time. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 443-454. DOI: 10.1145/3691620.3695017. Online publication date: 27-Oct-2024.
• (2024) DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping. ACM Transactions on Architecture and Code Optimization, 21(4), 1-25. DOI: 10.1145/3689338. Online publication date: 20-Aug-2024.
• (2024) NASM: Neural Anisotropic Surface Meshing. SIGGRAPH Asia 2024 Conference Papers, 1-12. DOI: 10.1145/3680528.3687700. Online publication date: 3-Dec-2024.
• (2024) Towards Smartphone-based 3D Hand Pose Reconstruction Using Acoustic Signals. ACM Transactions on Sensor Networks, 20(5), 1-32. DOI: 10.1145/3677122. Online publication date: 26-Aug-2024.
