Abstract
The Minimum Description Length (MDL) principle is a formalization of Occam’s razor for model selection: a good model is one that losslessly compresses the data, the cost of describing the model itself included. While MDL naturally expresses the behavior of models such as autoencoders, which inherently compress data, most representation learning techniques do not rely on such models; instead, they learn representations by training on general or, in self-supervised learning, pretext tasks. In this paper, we propose a new formulation of the MDL principle that relies on the concepts of signal and noise, which are implicitly defined by the learning task at hand. We also introduce ways to empirically measure the complexity of the learned representations by analyzing the spectra of the point Jacobians. Under certain assumptions, we show that the singular values of the point Jacobians of neural networks driven by the MDL principle should follow either a power law or a lognormal distribution. Finally, we conduct experiments evaluating the behavior of the proposed measure on deep neural networks, on different datasets and with respect to several types of noise. The observed spectral distributions agree with those predicted by our MDL principle, which suggests that neural networks trained with gradient descent on noisy data implicitly abide by the MDL principle.
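To make the proposed spectral measure concrete, the following is a minimal sketch of how the singular values of a point Jacobian could be computed and checked for power-law behavior. It assumes a PyTorch model mapping a 1-D input tensor to a 1-D output tensor; the function names are illustrative and are not taken from the authors’ released code (linked in the notes below).

    import torch

    def point_jacobian_spectrum(model, x):
        # Jacobian of the network output with respect to one input point x;
        # shape (output_dim, input_dim) after flattening the input dimensions.
        J = torch.autograd.functional.jacobian(model, x)
        return torch.linalg.svdvals(J.reshape(J.shape[0], -1))

    def powerlaw_slope(sv, eps=1e-12):
        # Crude proxy for power-law behavior: the least-squares slope of
        # log(singular value) against log(rank), i.e. a Zipf-style plot.
        sv = sv[sv > eps]
        logx = torch.arange(1, len(sv) + 1, dtype=sv.dtype).log()
        logy = sv.log()
        dx, dy = logx - logx.mean(), logy - logy.mean()
        return (dx * dy).sum() / (dx * dx).sum()

A roughly linear log-log plot (constant slope) is consistent with a power law, whereas systematic curvature is what one would expect from, e.g., a lognormal spectrum.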
This work has been funded by a public grant from the French National Research Agency (ANR) under the “France 2030” investment plan, reference EUR MANUTECH SLEIGHT - ANR-17-EURE-0026. It has also been partly funded by a PhD grant from the French Ministry of Higher Education and Research.
Notes
1.
2. Repository: https://github.com/brandao-eduardo/ismynndrivenbymdl.
3. The Shannon-Fano code is competitive, meaning that the probability that its codelength exceeds another prefix code’s by \(c\) bits or more does not exceed \(2^{1-c}\) [8] (see the toy check after these notes).
4. For classification, we work on an intermediate representation, which explains the use of integrals in calculating the expectation.
5. A similar argument can be found in [1] in the discussion of noise sensitivity.
6. Since \(p(k\sigma ) = a \left( k\sigma \right) ^{\alpha }=a k^{\alpha }\sigma ^{\alpha }\), the factor \(k^{\alpha }\) is absorbed into the normalization constant, so that after renormalization \(p(k\sigma )=p(\sigma )\): power laws are scale invariant.
7. Because the number of training epochs is relatively small, we did not observe a power-law behavior.
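The bound in note 3 can be illustrated with a self-contained toy check (the source distribution and the competing fixed-length code below are assumptions chosen for illustration, not an excerpt from the paper):

    import math

    p = [0.5, 0.25, 0.125, 0.125]              # assumed source distribution
    l = [math.ceil(-math.log2(q)) for q in p]  # Shannon-Fano codelengths: [1, 2, 3, 3]
    lp = [2, 2, 2, 2]                          # assumed competing fixed-length prefix code
    c = 3
    # Probability mass of outcomes where the Shannon-Fano codelength is at
    # least c bits longer than the competitor's.
    prob = sum(q for q, a, b in zip(p, l, lp) if a >= b + c)
    assert prob <= 2 ** (1 - c)                # here 0.0 <= 0.25, as the bound predicts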
References
Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a compression approach. CoRR (2018). http://arxiv.org/abs/1802.05296v4
Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 44(6), 2743–2760 (1998)
Barron, A.R.: Complexity regularization with application to artificial neural networks. In: Nonparametric Functional Estimation and Related Topics, pp. 561–576 (1991)
Blier, L., Ollivier, Y.: The description length of deep learning models. Adv. Neural Inf. Process. Syst. 31, 1–11 (2018)
Blum, A., Langford, J.: PAC-MDL bounds. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT-Kernel 2003. LNCS (LNAI), vol. 2777, pp. 344–357. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45167-9_26
Brandao, E., Duffner, S., Emonet, R., Habrard, A., Jacquenet, F., Sebban, M.: Appendixes to: Is my neural net driven by the MDL principle? (2023). https://github.com/brandao-eduardo/ismynndrivenbymdl/blob/main/Appendixes.pdf
Cayton, L.: Algorithms for manifold learning. Univ. of California at San Diego Technical Report 12(1–17), 1 (2005)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley & Sons, Hoboken (2012)
Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
Fefferman, C., Mitter, S., Narayanan, H.: Testing the manifold hypothesis. J. Am. Math. Soc. 29(4), 983–1049 (2016)
Galbrun, E.: The minimum description length principle for pattern mining: a survey. Data Mining Knowl. Disc. 36(5), 1679–1727 (2022)
Gao, T., Jojic, V.: Degrees of freedom in deep neural networks. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI 2016, pp. 232–241. AUAI Press, Arlington (2016)
Gheorghiu, S., Coppens, M.O.: Heterogeneity explains features of “anomalous” thermodynamics and statistics. Proc. Natl. Acad. Sci. 101(45), 15852–15856 (2004)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016). http://www.deeplearningbook.org
Grünwald, P.: Minimum description length tutorial. Adv. Minimum Descript. Length: Theory Appl. 5, 1–80 (2005)
Grünwald, P., Langford, J.: Suboptimal behavior of Bayes and MDL in classification under misspecification. Mach. Learn. 66, 119–149 (2007)
Grünwald, P., Roos, T.: Minimum description length revisited. Int. J. Math. Ind. 11(01), 1930001 (2019). https://doi.org/10.1142/S2661335219300018
Hansen, M.H., Yu, B.: Minimum description length model selection criteria for generalized linear models. Lect. Notes-Monograph Ser., pp. 145–163 (2003)
Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: stability of stochastic gradient descent. In: International Conference on Machine Learning, pp. 1225–1234. PMLR (2016)
He, H., Su, W.J.: The local elasticity of neural networks. arXiv preprint arXiv:1910.06943 (2019)
Helmbold, D.P., Long, P.M.: On the inductive bias of dropout. J. Mach. Learn. Res. 16(1), 3403–3454 (2015)
Hinton, G.E., Van Camp, D.: Keeping the neural networks simple by minimizing the description length of the weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13 (1993)
Hu, B., Rakthanmanon, T., Hao, Y., Evans, S., Lonardi, S., Keogh, E.: Using the minimum description length to discover the intrinsic cardinality and dimensionality of time series. Data Mining Knowl. Disc. 29, 358–399 (2015)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Jaynes, E.: Where do we stand on maximum entropy? In: The Maximum Entropy Formalism, pp. 15–118 (1978)
Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: Advances in Neural Information Processing Systems, pp. 950–957 (1992)
Li, C., Farkhoor, H., Liu, R., Yosinski, J.: Measuring the intrinsic dimension of objective landscapes. In: International Conference on Learning Representations (2018)
Luo, P., Wang, X., Shao, W., Peng, Z.: Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846 (2018)
MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Martin, C.H., Mahoney, M.W.: Implicit self-regularization in deep neural networks: evidence from random matrix theory and implications for learning. arXiv preprint arXiv:1810.01075 (2018)
Martin, C.H., Mahoney, M.W.: Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 505–513. SIAM (2020)
Morcos, A.S., Barrett, D.G.T., Rabinowitz, N.C., Botvinick, M.: On the importance of single directions for generalization. CoRR (2018). http://arxiv.org/abs/1803.06959v4
Myung, J.I., Navarro, D.J., Pitt, M.A.: Model selection by normalized maximum likelihood. J. Math. Psychol. 50(2), 167–179 (2006)
Neyshabur, B., Tomioka, R., Srebro, N.: In search of the real inductive bias: on the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614 (2014)
Pennebaker, W.B., Mitchell, J.L.: JPEG: Still Image Data Compression Standard. Springer, Heidelberg (1992)
Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
Rissanen, J.: A universal prior for integers and estimation by minimum description length. Ann. Stat. 11(2), 416–431 (1983)
Rissanen, J.: MDL denoising. IEEE Trans. Inf. Theory 46(7), 2537–2543 (2000)
Rissanen, J.: Strong optimality of the normalized ml models as universal codes and information in data. IEEE Trans. Inf. Theory 47(5), 1712–1717 (2001)
Santurkar, S., Tsipras, D., Ilyas, A., Madry, A.: How does batch normalization help optimization? arXiv preprint arXiv:1805.11604 (2018)
Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst. (2022)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Thompson, R.C.: Principal submatrices ix: interlacing inequalities for singular values of submatrices. Linear Algebra Appl. 5(1), 1–12 (1972)
Tishby, N., Zaslavsky, N.: Deep learning and the information bottleneck principle. In: 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE (2015)
Vitányi, P.M., Li, M.: Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Theory 46(2), 446–464 (2000)
Wei, J., Zhu, Z., Cheng, H., Liu, T., Niu, G., Liu, Y.: Learning with noisy labels revisited: a study using real-world human annotations. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=TBWA6PLJZQm
Yamanishi, K.: A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Trans. Inf. Theory 44(4), 1424–1439 (1998)
Yoshida, Y., Miyato, T.: Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941 (2017)
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Learning Representations (2017). https://openreview.net/forum?id=Sy8gdB9xx
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64(3), 107–115 (2021)
Zhao, K., Musolesi, M., Hui, P., Rao, W., Tarkoma, S.: Explaining the power-law distribution of human mobility through transportation modality decomposition. Sci. Rep. 5(1), 1–7 (2015)
Ethics declarations
Ethical statement
This paper presents a contribution that is essentially fundamental, theoretical and methodological, and we do not see any immediate ethical or societal issues. Our experimental evaluation considers classic benchmarks from the literature, and our analysis focuses on particular mathematical properties of the point Jacobian spectra of trained neural networks. Our work follows ethical guidelines in modern machine learning research in general and in representation learning in particular. Applications of the methodology presented in this paper should consider the ethical implications that can arise from the datasets used or the applications targeted.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Brandao, E., Duffner, S., Emonet, R., Habrard, A., Jacquenet, F., Sebban, M. (2023). Is My Neural Net Driven by the MDL Principle? In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14170. Springer, Cham. https://doi.org/10.1007/978-3-031-43415-0_11
DOI: https://doi.org/10.1007/978-3-031-43415-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43414-3
Online ISBN: 978-3-031-43415-0