Abstract
The representer theorem for kernel methods states that the solution of the associated variational problem can be expressed as a linear combination of a finite number of kernel functions. However, for non-smooth loss functions, the analytic characterization of the coefficients poses nontrivial problems. Standard approaches resort to constrained optimization reformulations which, in general, lack a closed-form solution. Herein, by a proper change of variable, it is shown that, for any convex loss function, the coefficients satisfy a system of algebraic equations in fixed-point form, which may be obtained directly from the primal formulation. The algebraic characterization is specialized to regression and classification methods, and the fixed-point equations are explicitly characterized for many loss functions of practical interest. The consequences of the main result are then investigated along two directions. First, the existence of an unconstrained smooth reformulation of the original non-smooth problem is proven. Second, in the context of SURE (Stein's Unbiased Risk Estimation), a general formula for the degrees of freedom of kernel regression methods is derived.
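To fix ideas, the following is a minimal sketch of the kind of fixed-point condition referred to above; the notation (loss V, regularization parameter λ, kernel matrix K, coefficient vector c) is introduced here only for illustration and need not match the paper's, and the sketch covers the differentiable case, whereas the paper's result extends to arbitrary convex, possibly non-smooth, losses. For the regularized problem

\[
\min_{f \in \mathcal{H}_K} \; \sum_{i=1}^{n} V\bigl(y_i, f(x_i)\bigr) + \frac{\lambda}{2}\,\|f\|_{\mathcal{H}_K}^{2},
\]

the representer theorem gives \(f^{*}(x) = \sum_{i=1}^{n} c_i K(x, x_i)\). Writing \(\hat{y} = Kc\) with \(K_{ij} = K(x_i, x_j)\), the objective becomes \(\sum_i V(y_i, \hat{y}_i) + \tfrac{\lambda}{2} c^{\top} K c\), and for a differentiable loss the stationarity condition in \(c\) reads

\[
K\bigl(\nabla_{\hat{y}} V(y, \hat{y}) + \lambda c\bigr) = 0,
\]

which is satisfied by the fixed-point system

\[
c_i = -\frac{1}{\lambda}\,\frac{\partial V(y_i, \hat{y}_i)}{\partial \hat{y}_i},
\qquad \hat{y} = K c, \qquad i = 1, \dots, n.
\]

For a non-smooth convex loss the derivative is replaced by a subgradient, \(-\lambda c_i \in \partial_{\hat{y}_i} V(y_i, \hat{y}_i)\); turning this inclusion into an explicit algebraic system valid for any convex loss is the object of the paper. As for the degrees of freedom, the sketch assumes only the standard SURE setting: for Gaussian observations \(y_i = \mu_i + \varepsilon_i\) with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\), the degrees of freedom of an estimate \(\hat{y}(y)\) are \(\mathrm{df} = \sigma^{-2}\sum_i \mathrm{Cov}(\hat{y}_i, y_i) = \mathbb{E}\bigl[\sum_i \partial \hat{y}_i / \partial y_i\bigr]\) by Stein's lemma; the paper derives this quantity for kernel regression methods.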
Cite this article
Dinuzzo, F., De Nicolao, G. An algebraic characterization of the optimum of regularized kernel methods. Mach Learn 74, 315–345 (2009). https://doi.org/10.1007/s10994-008-5095-1