Abstract
We are motivated by the problem of identifying potentially nonlinear regression relationships between high-dimensional outputs and high-dimensional inputs of heterogeneous data. This requires performing regression, clustering, and model selection simultaneously. In this framework, we apply mixture of experts (MoE) models, which are among the most popular ensemble learning techniques developed in the field of neural networks. In particular, we consider a general class of MoE models characterized by multiple Gaussian experts whose means are polynomials of the input variables and whose covariance matrices have block-diagonal structures. Moreover, each expert is weighted by a gating network defined as a softmax function of a polynomial of the input variables. These models involve several hyper-parameters, including the number of mixture components, the complexity of the softmax gating networks and Gaussian mean experts, and the hidden block-diagonal structures of the covariance matrices. We provide a non-asymptotic theory for the selection of such complex hyper-parameters via the slope heuristic approach in a penalized maximum likelihood estimation framework. Specifically, we establish a non-asymptotic risk bound for the penalized maximum likelihood estimator, which takes the form of an oracle inequality, under lower-bound assumptions on the penalty function.
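For concreteness, the following is a minimal sketch, in generic notation (the symbols \(K\), \(w_k\), \(\boldsymbol{\upsilon}_k\), \(\boldsymbol{\Sigma}_k\), \(S_m\), and \(\mathrm{pen}\) are illustrative and need not match the paper's own), of the softmax-gated Gaussian mixture of experts density and of the penalized model selection criterion referred to in the abstract:
\[
  s_{\boldsymbol{\psi}}(\mathbf{y}\mid\mathbf{x})
  = \sum_{k=1}^{K} g_k(\mathbf{x};\boldsymbol{w})\,
    \phi\!\bigl(\mathbf{y};\,\boldsymbol{\upsilon}_k(\mathbf{x}),\,\boldsymbol{\Sigma}_k\bigr),
  \qquad
  g_k(\mathbf{x};\boldsymbol{w})
  = \frac{\exp\{w_k(\mathbf{x})\}}{\sum_{l=1}^{K}\exp\{w_l(\mathbf{x})\}},
\]
where each gating score \(w_k\) and each mean \(\boldsymbol{\upsilon}_k\) is a polynomial of the input \(\mathbf{x}\), \(\phi\) denotes the multivariate Gaussian density, and each \(\boldsymbol{\Sigma}_k\) is constrained to be block-diagonal. Given a collection of candidate models \(\{S_m\}_{m\in\mathcal{M}}\) indexed by these hyper-parameters, the selected model is the penalized maximum likelihood minimizer
\[
  \widehat{m} \in \operatorname*{arg\,min}_{m\in\mathcal{M}}
  \Bigl\{ -\tfrac{1}{n}\sum_{i=1}^{n}\log \widehat{s}_m(\mathbf{y}_i\mid\mathbf{x}_i) + \mathrm{pen}(m) \Bigr\}.
\]
The announced oracle inequality then bounds the risk of the selected estimator, up to a multiplicative constant, by the best trade-off between approximation error and penalty over the collection, provided \(\mathrm{pen}(m)\) exceeds a stated lower bound.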