Abstract
In this paper, we propose a method for learning representation layers with squashing activation functions in deep artificial neural networks that directly addresses the vanishing gradient problem. The proposed solution is derived by solving the maximum likelihood estimator for the components of the posterior representation, which are approximately Beta-distributed, formulated in the context of variational inference. This approach not only improves the performance of deep neural networks with squashing activation functions on some of the hidden layers, including in discriminative learning, but can also be employed to produce sparse codes.
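The central quantity mentioned in the abstract, a maximum likelihood fit of a Beta distribution to squashed (sigmoid) activations, can be illustrated with a short self-contained sketch. The snippet below is only an illustration of that building block, not the paper's training procedure: the Gaussian pre-activation model and the use of scipy.stats.beta.fit are assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Illustrative sketch only: fit a Beta distribution by maximum likelihood
# to the sigmoid activations of a single hidden unit. The Gaussian
# pre-activation model and the use of scipy.stats.beta.fit are assumptions
# made for this example, not the procedure derived in the paper.
rng = np.random.default_rng(seed=0)
pre_activations = rng.normal(loc=0.0, scale=2.0, size=10_000)
activations = 1.0 / (1.0 + np.exp(-pre_activations))  # squashing non-linearity

# Maximum likelihood estimate of the Beta shape parameters; location and
# scale are fixed so the support stays on (0, 1).
alpha_hat, beta_hat, _, _ = stats.beta.fit(activations, floc=0.0, fscale=1.0)
print(f"MLE Beta parameters: alpha={alpha_hat:.3f}, beta={beta_hat:.3f}")
```

In a trained network, such fitted shape parameters could serve as a diagnostic of how closely a layer's activation distribution follows a Beta law; the paper derives its learning rule from this estimator, whereas the snippet only shows the estimation step.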
Acknowledgements
This work was supported by the National Authority for Scientific Research and Innovation, and by the Ministry of European Funds, through the Competitiveness Operational Programme 2014-2020, POC-A.1-A.1.1.4-E-2015 [Grant number: 40/02.09.2016, ID: P_37_778, to RT]. We also gratefully acknowledge the NVIDIA Corporation for the donation of a Titan Xp GPU and the Microsoft Corporation for a one-year Azure Research Sponsorship. We thank Dmitri Toren for his help in preparing Figure 1.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Constantinescu, V., Chiru, C., Boloni, T. et al. Learning flat representations with artificial neural networks. Appl Intell 51, 2456–2470 (2021). https://doi.org/10.1007/s10489-020-02032-4