Abstract
The Minimum Description Length (MDL) principle is a formalization of Occam’s razor for model selection: a good model is one that losslessly compresses the data, the cost of describing the model itself included. While MDL naturally expresses the behavior of models such as autoencoders, which inherently compress data, most representation learning techniques do not rely on such models; instead, they learn representations by training on general or, in self-supervised learning, pretext tasks. In this paper, we propose a new formulation of the MDL principle that relies on the concepts of signal and noise, which are implicitly defined by the learning task at hand. We also introduce ways to empirically measure the complexity of the learned representations by analyzing the spectra of the point Jacobians. Under certain assumptions, we show that the singular values of the point Jacobians of neural networks driven by the MDL principle should follow either a power law or a lognormal distribution. Finally, we conduct experiments evaluating the behavior of the proposed measure on deep neural networks, on different datasets and with respect to several types of noise. The observed spectral distributions agree with those predicted by our MDL principle, which suggests that neural networks trained with gradient descent on noisy data implicitly abide by the MDL principle.
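To make the proposed spectral measure concrete, the following is a minimal sketch of how the singular values of a point Jacobian could be computed and checked for power-law behavior. It assumes a PyTorch model mapping a 1-D input tensor to a 1-D output tensor; the function names are illustrative and are not taken from the authors’ released code (linked in the notes below).

    import torch

    def point_jacobian_spectrum(model, x):
        # Jacobian of the network output with respect to one input point x;
        # shape (output_dim, input_dim) after flattening the input dimensions.
        J = torch.autograd.functional.jacobian(model, x)
        return torch.linalg.svdvals(J.reshape(J.shape[0], -1))

    def powerlaw_slope(sv, eps=1e-12):
        # Crude proxy for power-law behavior: the least-squares slope of
        # log(singular value) against log(rank), i.e. a Zipf-style plot.
        sv = sv[sv > eps]
        logx = torch.arange(1, len(sv) + 1, dtype=sv.dtype).log()
        logy = sv.log()
        dx, dy = logx - logx.mean(), logy - logy.mean()
        return (dx * dy).sum() / (dx * dx).sum()

A roughly linear log-log plot (constant slope) is consistent with a power law, whereas systematic curvature is what one would expect from, e.g., a lognormal spectrum.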
This work has been funded by a public grant from the French National Research Agency (ANR) under the “France 2030” investment plan, reference EUR MANUTECH SLEIGHT - ANR-17-EURE-0026. It has also been partly funded by a PhD grant from the French Ministry of Higher Education and Research.
Notes
1.
2. Repository: https://github.com/brandao-eduardo/ismynndrivenbymdl.
3. The Shannon-Fano code is competitive, meaning that the probability that its codelength exceeds another prefix code’s by \(c\) bits or more does not exceed \(2^{1-c}\) [8] (see the toy check after these notes).
4. For classification, we work on an intermediate representation, which explains the use of integrals in calculating the expectation.
5. A similar argument can be found in [1] in the discussion of noise sensitivity.
6. Since \(p(k\sigma ) = a \left( k\sigma \right) ^{\alpha }=a k^{\alpha }\sigma ^{\alpha }\), the factor \(k^{\alpha }\) is absorbed into the normalization constant, so that after renormalization \(p(k\sigma )=p(\sigma )\): power laws are scale invariant.
7. Because the number of training epochs is relatively small, we did not observe a power-law behavior.
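The bound in note 3 can be illustrated with a self-contained toy check (the source distribution and the competing fixed-length code below are assumptions chosen for illustration, not an excerpt from the paper):

    import math

    p = [0.5, 0.25, 0.125, 0.125]              # assumed source distribution
    l = [math.ceil(-math.log2(q)) for q in p]  # Shannon-Fano codelengths: [1, 2, 3, 3]
    lp = [2, 2, 2, 2]                          # assumed competing fixed-length prefix code
    c = 3
    # Probability mass of outcomes where the Shannon-Fano codelength is at
    # least c bits longer than the competitor's.
    prob = sum(q for q, a, b in zip(p, l, lp) if a >= b + c)
    assert prob <= 2 ** (1 - c)                # here 0.0 <= 0.25, as the bound predicts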
References
Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a compression approach. CoRR (2018). http://arxiv.org/abs/1802.05296v4
Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 44(6), 2743–2760 (1998)
Barron, A.R.: Complexity regularization with application to artificial neural networks. In: Nonparametric Functional Estimation and Related Topics, pp. 561–576 (1991)
Blier, L., Ollivier, Y.: The description length of deep learning models. Adv. Neural Inf. Process. Syst. 31, 1–11 (2018)
Blum, A., Langford, J.: PAC-MDL bounds. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT-Kernel 2003. LNCS (LNAI), vol. 2777, pp. 344–357. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45167-9_26
Brandao, E., Duffner, S., Emonet, R., Habrard, A., Jacquenet, F., Sebban, M.: Appendixes to: Is my neural net driven by the MDL principle? (2023). https://github.com/brandao-eduardo/ismynndrivenbymdl/blob/main/Appendixes.pdf
Cayton, L.: Algorithms for manifold learning. Univ. of California at San Diego Technical Report 12(1–17), 1 (2005)
Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley & Sons, Hoboken (2012)
Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
Fefferman, C., Mitter, S., Narayanan, H.: Testing the manifold hypothesis. J. Am. Math. Soc. 29(4), 983–1049 (2016)
Galbrun, E.: The minimum description length principle for pattern mining: a survey. Data Mining Knowl. Disc. 36(5), 1679–1727 (2022)
Gao, T., Jojic, V.: Degrees of freedom in deep neural networks. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI 2016, pp. 232–241. AUAI Press, Arlington (2016)
Gheorghiu, S., Coppens, M.O.: Heterogeneity explains features of “anomalous” thermodynamics and statistics. Proc. Natl. Acad. Sci. 101(45), 15852–15856 (2004)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016). http://www.deeplearningbook.org
Grünwald, P.: Minimum description length tutorial. Adv. Minimum Descript. Length: Theory Appl. 5, 1–80 (2005)
Grünwald, P., Langford, J.: Suboptimal behavior of Bayes and MDL in classification under misspecification. Mach. Learn. 66, 119–149 (2007)
Grünwald, P., Roos, T.: Minimum description length revisited. Int. J. Math. Ind. 11(01), 1930001 (2019). https://doi.org/10.1142/S2661335219300018
Hansen, M.H., Yu, B.: Minimum description length model selection criteria for generalized linear models. Lect. Notes-Monograph Ser., pp. 145–163 (2003)
Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: stability of stochastic gradient descent. In: International Conference on Machine Learning, pp. 1225–1234. PMLR (2016)
He, H., Su, W.J.: The local elasticity of neural networks. arXiv preprint arXiv:1910.06943 (2019)
Helmbold, D.P., Long, P.M.: On the inductive bias of dropout. J. Mach. Learn. Res. 16(1), 3403–3454 (2015)
Hinton, G.E., Van Camp, D.: Keeping the neural networks simple by minimizing the description length of the weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13 (1993)
Hu, B., Rakthanmanon, T., Hao, Y., Evans, S., Lonardi, S., Keogh, E.: Using the minimum description length to discover the intrinsic cardinality and dimensionality of time series. Data Mining Knowl. Disc. 29, 358–399 (2015)
Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)
Jaynes, E.: Where do we stand on maximum entropy? In: The Maximum Entropy Formalism, pp. 15–118 (1978)
Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: Advances in Neural Information Processing Systems, pp. 950–957 (1992)
Li, C., Farkhoor, H., Liu, R., Yosinski, J.: Measuring the intrinsic dimension of objective landscapes. In: International Conference on Learning Representations (2018)
Luo, P., Wang, X., Shao, W., Peng, Z.: Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846 (2018)
MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Martin, C.H., Mahoney, M.W.: Implicit self-regularization in deep neural networks: evidence from random matrix theory and implications for learning. arXiv preprint arXiv:1810.01075 (2018)
Martin, C.H., Mahoney, M.W.: Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 505–513. SIAM (2020)
Morcos, A.S., Barrett, D.G.T., Rabinowitz, N.C., Botvinick, M.: On the importance of single directions for generalization. CoRR (2018). http://arxiv.org/abs/1803.06959v4
Myung, J.I., Navarro, D.J., Pitt, M.A.: Model selection by normalized maximum likelihood. J. Math. Psychol. 50(2), 167–179 (2006)
Neyshabur, B., Tomioka, R., Srebro, N.: In search of the real inductive bias: on the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614 (2014)
Pennebaker, W.B., Mitchell, J.L.: JPEG: Still Image Data Compression Standard. Springer, Heidelberg (1992)
Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
Rissanen, J.: A universal prior for integers and estimation by minimum description length. Ann. Stat. 11(2), 416–431 (1983)
Rissanen, J.: MDL denoising. IEEE Trans. Inf. Theory 46(7), 2537–2543 (2000)
Rissanen, J.: Strong optimality of the normalized ml models as universal codes and information in data. IEEE Trans. Inf. Theory 47(5), 1712–1717 (2001)
Santurkar, S., Tsipras, D., Ilyas, A., Madry, A.: How does batch normalization help optimization? arXiv preprint arXiv:1805.11604 (2018)
Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst. (2022)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Thompson, R.C.: Principal submatrices ix: interlacing inequalities for singular values of submatrices. Linear Algebra Appl. 5(1), 1–12 (1972)
Tishby, N., Zaslavsky, N.: Deep learning and the information bottleneck principle. In: 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE (2015)
Vitányi, P.M., Li, M.: Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Theory 46(2), 446–464 (2000)
Wei, J., Zhu, Z., Cheng, H., Liu, T., Niu, G., Liu, Y.: Learning with noisy labels revisited: a study using real-world human annotations. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=TBWA6PLJZQm
Yamanishi, K.: A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Trans. Inf. Theory 44(4), 1424–1439 (1998)
Yoshida, Y., Miyato, T.: Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941 (2017)
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Learning Representations (2017). https://openreview.net/forum?id=Sy8gdB9xx
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64(3), 107–115 (2021)
Zhao, K., Musolesi, M., Hui, P., Rao, W., Tarkoma, S.: Explaining the power-law distribution of human mobility through transportation modality decomposition. Sci. Rep. 5(1), 1–7 (2015)
Ethics declarations
Ethical statement
This paper presents a contribution that is essentially fundamental, theoretical and methodological, and we do not see any immediate ethical or societal issues. Our experimental evaluation considers classic benchmarks from the literature, and our analysis focuses on particular mathematical properties of the point Jacobian spectra of trained neural networks. Our work follows ethical guidelines in modern machine learning research in general and in representation learning in particular. Applications of the methodology presented in this paper should consider the ethical implications that can arise from the datasets used or the applications targeted.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Brandao, E., Duffner, S., Emonet, R., Habrard, A., Jacquenet, F., Sebban, M. (2023). Is My Neural Net Driven by the MDL Principle? In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol. 14170. Springer, Cham. https://doi.org/10.1007/978-3-031-43415-0_11
DOI: https://doi.org/10.1007/978-3-031-43415-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43414-3
Online ISBN: 978-3-031-43415-0