Is My Neural Net Driven by the MDL Principle?

  • Conference paper
Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Abstract

The Minimum Description Length (MDL) principle is a formalization of Occam’s razor for model selection: a good model is one that can losslessly compress the data, including the cost of describing the model itself. While MDL naturally expresses the behavior of models that inherently compress data, such as autoencoders, most representation learning techniques do not rely on such models; instead, they learn representations by training on general tasks or, in self-supervised learning, on pretext tasks. In this paper, we propose a new formulation of the MDL principle that relies on the concepts of signal and noise, which are implicitly defined by the learning task at hand. We also introduce ways to empirically measure the complexity of the learned representations by analyzing the spectra of the point Jacobians. Under certain assumptions, we show that the singular values of the point Jacobians of neural networks driven by the MDL principle should follow either a power law or a lognormal distribution. Finally, we conduct experiments on deep neural networks, across several datasets and types of noise, to evaluate the behavior of the proposed measure. The experimental spectral distributions agree with the distributions predicted by our MDL principle, which suggests that neural networks trained with gradient descent on noisy data implicitly abide by the MDL principle.
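To make the measured object concrete, the sketch below computes the point Jacobian of a small network at a single input and extracts its singular values, the spectrum whose shape the paper analyzes. This is a hypothetical illustration assuming PyTorch and a toy model; the authors' actual implementation is in the repository linked in the notes.

```python
# Minimal sketch (assumed PyTorch; not the authors' implementation):
# compute the point Jacobian J(x) = d f(x) / d x of a toy network f at
# one input x, then take its singular values. The distribution of these
# values is the spectrum compared against power-law / lognormal shapes.
import torch

# Toy stand-in for a trained network; any torch.nn.Module would do.
model = torch.nn.Sequential(
    torch.nn.Linear(784, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 64),
)

x = torch.randn(784)  # a single (hypothetical) input point

# Point Jacobian, shape (64, 784): the local linearization of the network at x.
J = torch.autograd.functional.jacobian(model, x)

# The singular values of J form the per-point spectrum analyzed in the paper.
singular_values = torch.linalg.svdvals(J)
print(singular_values[:10])
```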

This work has been funded by a public grant from the French National Research Agency (ANR) under the “France 2030” investment plan, reference EUR MANUTECH SLEIGHT - ANR-17-EURE-0026. It has also been partly funded by a PhD grant from the French Ministry of Higher Education and Research.


Notes

  1. This formulation is known as two-part MDL which, depending on the author, is seen either as “traditional” (as opposed to “modern” MDL, which uses a one-part encoding based on universal codes [15]) or as “pure” [46].

  2. Repository: https://github.com/brandao-eduardo/ismynndrivenbymdl.

  3. The Shannon-Fano code is competitive, meaning that the probability that its codeword length exceeds that of any other code by c bits does not exceed \(2^{1-c}\) [8].

  4. For classification, we work on an intermediate representation, which explains the use of integrals in calculating the expectation.

  5. A similar argument can be found in [1] in the discussion of noise sensitivity.

  6. Since \(p(k\sigma) = a(k\sigma)^{\alpha} = k^{\alpha} a\sigma^{\alpha} = k^{\alpha} p(\sigma)\), the factor \(k^{\alpha}\) is absorbed by normalization, so the rescaled spectrum follows the same power law as \(p(\sigma)\).

  7. Since the number of training epochs is relatively small, we did not find power-law behavior here; a simple numerical test for such behavior is sketched after these notes.
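The abstract predicts that the point-Jacobian singular values follow either a power law or a lognormal distribution, and notes 6 and 7 reason about power-law behavior. The following is a minimal sketch, assuming NumPy and SciPy (neither appears on this page; the authors' actual evaluation protocol is in their repository), of one crude way to compare the two shapes on an empirical spectrum:

```python
# Minimal sketch (assumed NumPy/SciPy; not the authors' protocol):
# crude diagnostics for the two spectral shapes predicted in the paper,
# applied to a 1-D array of singular values.
import numpy as np
from scipy import stats

def compare_spectral_shapes(s):
    """Return (log-log rank-plot slope, lognormal KS p-value) for s."""
    s = np.sort(np.asarray(s, dtype=float))[::-1]
    s = s[s > 0]  # the log transforms below need positive values

    # Power law: a Pareto-type density yields an approximately straight
    # rank-value plot in log-log coordinates; the slope relates to the
    # exponent alpha of p(sigma) = a * sigma^alpha.
    ranks = np.arange(1, len(s) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(s), 1)

    # Lognormal: s is lognormal iff log(s) is normal, so fit a normal to
    # log(s) and run a Kolmogorov-Smirnov goodness-of-fit test.
    mu, sigma = stats.norm.fit(np.log(s))
    ks = stats.kstest(np.log(s), "norm", args=(mu, sigma))
    return slope, ks.pvalue

# Example on a synthetic lognormal spectrum:
slope, pvalue = compare_spectral_shapes(np.random.lognormal(0.0, 1.0, 500))
print(f"rank-plot slope: {slope:.2f}, lognormal KS p-value: {pvalue:.3f}")
```

A straight rank plot points toward the power law; a high p-value for \(\log s\) points toward the lognormal. Both are heuristic screens rather than formal model selection (in particular, the KS p-value is optimistic when the parameters are fitted from the same data).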

References

  1. Arora, S., Ge, R., Neyshabur, B., Zhang, Y.: Stronger generalization bounds for deep nets via a compression approach. CoRR (2018). http://arxiv.org/abs/1802.05296v4

  2. Barron, A., Rissanen, J., Yu, B.: The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 44(6), 2743–2760 (1998)

  3. Barron, A.R.: Complexity regularization with application to artificial neural networks. In: Nonparametric Functional Estimation and Related Topics, pp. 561–576 (1991)

  4. Blier, L., Ollivier, Y.: The description length of deep learning models. Adv. Neural Inf. Process. Syst. 31, 1–11 (2018)

  5. Blum, A., Langford, J.: PAC-MDL bounds. In: Schölkopf, B., Warmuth, M.K. (eds.) COLT-Kernel 2003. LNCS (LNAI), vol. 2777, pp. 344–357. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45167-9_26

  6. Brandao, E., Duffner, S., Emonet, R., Habrard, A., Jacquenet, F., Sebban, M.: Appendixes to: Is my neural net driven by the MDL principle? (2023). https://github.com/brandao-eduardo/ismynndrivenbymdl/blob/main/Appendixes.pdf

  7. Cayton, L.: Algorithms for manifold learning. Univ. of California at San Diego Technical Report 12(1–17), 1 (2005)

  8. Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley & Sons, Hoboken (2012)

  9. Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29(6), 141–142 (2012)

  10. Fefferman, C., Mitter, S., Narayanan, H.: Testing the manifold hypothesis. J. Am. Math. Soc. 29(4), 983–1049 (2016)

  11. Galbrun, E.: The minimum description length principle for pattern mining: a survey. Data Min. Knowl. Disc. 36(5), 1679–1727 (2022)

  12. Gao, T., Jojic, V.: Degrees of freedom in deep neural networks. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI 2016, pp. 232–241. AUAI Press, Arlington (2016)

  13. Gheorghiu, S., Coppens, M.O.: Heterogeneity explains features of “anomalous” thermodynamics and statistics. Proc. Natl. Acad. Sci. 101(45), 15852–15856 (2004)

  14. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016). http://www.deeplearningbook.org

  15. Grünwald, P.: Minimum description length tutorial. Adv. Minimum Descript. Length: Theory Appl. 5, 1–80 (2005)

  16. Grünwald, P., Langford, J.: Suboptimal behavior of Bayes and MDL in classification under misspecification. Mach. Learn. 66, 119–149 (2007)

  17. Grünwald, P., Roos, T.: Minimum description length revisited. Int. J. Math. Ind. 11(01), 1930001 (2019). https://doi.org/10.1142/S2661335219300018

  18. Hansen, M.H., Yu, B.: Minimum description length model selection criteria for generalized linear models. Lect. Notes-Monograph Ser., 145–163 (2003)

  19. Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: stability of stochastic gradient descent. In: International Conference on Machine Learning, pp. 1225–1234. PMLR (2016)

  20. He, H., Su, W.J.: The local elasticity of neural networks. arXiv preprint arXiv:1910.06943 (2019)

  21. Helmbold, D.P., Long, P.M.: On the inductive bias of dropout. J. Mach. Learn. Res. 16(1), 3403–3454 (2015)

  22. Hinton, G.E., Van Camp, D.: Keeping the neural networks simple by minimizing the description length of the weights. In: Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13 (1993)

  23. Hu, B., Rakthanmanon, T., Hao, Y., Evans, S., Lonardi, S., Keogh, E.: Using the minimum description length to discover the intrinsic cardinality and dimensionality of time series. Data Min. Knowl. Disc. 29, 358–399 (2015)

  24. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456. PMLR (2015)

  25. Jaynes, E.: Where do we stand on maximum entropy? In: The Maximum Entropy Formalism, pp. 15–118 (1978)

  26. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)

  27. Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: Advances in Neural Information Processing Systems, pp. 950–957 (1992)

  28. Li, C., Farkhoor, H., Liu, R., Yosinski, J.: Measuring the intrinsic dimension of objective landscapes. In: International Conference on Learning Representations (2018)

  29. Luo, P., Wang, X., Shao, W., Peng, Z.: Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846 (2018)

  30. MacKay, D.J.C.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)

  31. Martin, C.H., Mahoney, M.W.: Implicit self-regularization in deep neural networks: evidence from random matrix theory and implications for learning. arXiv preprint arXiv:1810.01075 (2018)

  32. Martin, C.H., Mahoney, M.W.: Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 505–513. SIAM (2020)

  33. Morcos, A.S., Barrett, D.G.T., Rabinowitz, N.C., Botvinick, M.: On the importance of single directions for generalization. CoRR (2018). http://arxiv.org/abs/1803.06959v4

  34. Myung, J.I., Navarro, D.J., Pitt, M.A.: Model selection by normalized maximum likelihood. J. Math. Psychol. 50(2), 167–179 (2006)

  35. Neyshabur, B., Tomioka, R., Srebro, N.: In search of the real inductive bias: on the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614 (2014)

  36. Pennebaker, W.B., Mitchell, J.L.: JPEG: Still Image Data Compression Standard. Springer, Heidelberg (1992)

  37. Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)

  38. Rissanen, J.: A universal prior for integers and estimation by minimum description length. Ann. Stat. 11(2), 416–431 (1983)

  39. Rissanen, J.: MDL denoising. IEEE Trans. Inf. Theory 46(7), 2537–2543 (2000)

  40. Rissanen, J.: Strong optimality of the normalized ML models as universal codes and information in data. IEEE Trans. Inf. Theory 47(5), 1712–1717 (2001)

  41. Santurkar, S., Tsipras, D., Ilyas, A., Madry, A.: How does batch normalization help optimization? arXiv preprint arXiv:1805.11604 (2018)

  42. Song, H., Kim, M., Park, D., Shin, Y., Lee, J.G.: Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst. (2022)

  43. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)

  44. Thompson, R.C.: Principal submatrices IX: interlacing inequalities for singular values of submatrices. Linear Algebra Appl. 5(1), 1–12 (1972)

  45. Tishby, N., Zaslavsky, N.: Deep learning and the information bottleneck principle. In: 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. IEEE (2015)

  46. Vitányi, P.M., Li, M.: Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Inf. Theory 46(2), 446–464 (2000)

  47. Wei, J., Zhu, Z., Cheng, H., Liu, T., Niu, G., Liu, Y.: Learning with noisy labels revisited: a study using real-world human annotations. In: International Conference on Learning Representations (2022). https://openreview.net/forum?id=TBWA6PLJZQm

  48. Yamanishi, K.: A decision-theoretic extension of stochastic complexity and its applications to learning. IEEE Trans. Inf. Theory 44(4), 1424–1439 (1998)

  49. Yoshida, Y., Miyato, T.: Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941 (2017)

  50. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: International Conference on Learning Representations (2017). https://openreview.net/forum?id=Sy8gdB9xx

  51. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning (still) requires rethinking generalization. Commun. ACM 64(3), 107–115 (2021)

  52. Zhao, K., Musolesi, M., Hui, P., Rao, W., Tarkoma, S.: Explaining the power-law distribution of human mobility through transportation modality decomposition. Sci. Rep. 5(1), 1–7 (2015)


Author information

Corresponding author

Correspondence to Eduardo Brandao.


Ethics declarations

Ethical statement

This paper presents a contribution that is essentially fundamental, theoretical, and methodological. We do not see any immediate ethical or societal issues. Our experimental evaluation considers classic benchmarks from the literature, and our analysis focuses on particular mathematical properties of the point Jacobian spectra of trained neural networks. Our work follows ethical guidelines in modern machine learning research in general, and in representation learning in particular. Applications of the methodology presented in this paper should consider the ethical implications that can arise from the datasets used or the applications targeted.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Brandao, E., Duffner, S., Emonet, R., Habrard, A., Jacquenet, F., Sebban, M. (2023). Is My Neural Net Driven by the MDL Principle? In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14170. Springer, Cham. https://doi.org/10.1007/978-3-031-43415-0_11


  • DOI: https://doi.org/10.1007/978-3-031-43415-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43414-3

  • Online ISBN: 978-3-031-43415-0

  • eBook Packages: Computer Science (R0)
