
Sparse Multinomial Logistic Regression: Fast Algorithms and Generalization Bounds

Published: 01 June 2005

Abstract

Recently developed methods for learning sparse classifiers are among the state-of-the-art in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsity-promoting priors encouraging the weight estimates to be either significantly large or exactly zero. From a learning-theoretic perspective, these methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. This paper presents three contributions related to learning sparse classifiers. First, we introduce a true multiclass formulation based on multinomial logistic regression. Second, by combining a bound optimization approach with a component-wise update procedure, we derive fast exact algorithms for learning sparse multiclass classifiers that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in high-dimensional feature spaces. To the best of our knowledge, these are the first algorithms to perform exact multinomial logistic regression with a sparsity-promoting prior. Third, we show how nontrivial generalization bounds can be derived for our classifier in the binary case. Experimental results on standard benchmark data sets attest to the accuracy, sparsity, and efficiency of the proposed methods.
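The bound-optimization strategy with component-wise updates that the abstract describes can be illustrated in the binary case: the curvature of the logistic loss along coordinate j is bounded by B_j = Σ_i x_ij²/4, so each coordinate update minimizes a quadratic surrogate that majorizes the true objective, and a Laplacian (l1) prior turns that minimization into a soft-threshold step. The sketch below is illustrative only (the function names and data are hypothetical, and it is not the authors' implementation of the multiclass algorithm):

```python
import numpy as np

def soft_threshold(z, t):
    # Soft-thresholding operator arising from the Laplacian (l1) prior.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_logistic_cd(X, y, lam=1.0, n_sweeps=100):
    """Component-wise bound optimization for l1-regularized binary
    logistic regression. Labels y must be in {-1, +1}. Because the
    logistic loss has coordinate-wise curvature at most
    B_j = sum_i x_ij^2 / 4, each update minimizes a quadratic surrogate
    that upper-bounds the objective, guaranteeing monotone descent."""
    n, d = X.shape
    w = np.zeros(d)
    margins = np.zeros(n)           # y_i * (x_i . w), maintained incrementally
    B = (X ** 2).sum(axis=0) / 4.0  # per-coordinate curvature bounds
    for _ in range(n_sweeps):
        for j in range(d):
            if B[j] == 0.0:
                continue
            p = 1.0 / (1.0 + np.exp(margins))     # sigma(-margin_i)
            g = -(y * X[:, j] * p).sum()          # dLoss/dw_j
            w_new = soft_threshold(w[j] - g / B[j], lam / B[j])
            if w_new != w[j]:
                margins += y * X[:, j] * (w_new - w[j])
                w[j] = w_new
    return w
```

Because the threshold is applied exactly, irrelevant coordinates are driven to exact zeros rather than merely small values, which is the source of the sparsity the paper exploits.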


Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence  Volume 27, Issue 6
June 2005
176 pages

Publisher

IEEE Computer Society

United States

Author Tags

  1. Bayesian inference
  2. supervised learning
  3. bound optimization
  4. classification
  5. expectation maximization (EM)
  6. generalization bounds
  7. learning theory
  8. multinomial logistic regression
  9. sparsity

Qualifiers

  • Research-article

Cited By

  • (2024) "Group Penalized Multinomial Logit Models and Stock Return Direction Prediction," IEEE Transactions on Information Theory, vol. 70, no. 6, pp. 4297-4318, DOI: 10.1109/TIT.2024.3376751. Online publication date: 13-Mar-2024.
  • (2024) "MTLSC-Diff," Knowledge-Based Systems, vol. 303, DOI: 10.1016/j.knosys.2024.112415. Online publication date: 4-Nov-2024.
  • (2024) "G-LASSO/G-SCAD/G-MCP penalized trinomial logit dynamic models predict up trends, sideways trends and down trends for stock returns," Expert Systems with Applications, vol. 249, DOI: 10.1016/j.eswa.2024.123476. Online publication date: 1-Sep-2024.
  • (2024) "An embedded feature selection method based on generalized classifier neural network for cancer classification," Computers in Biology and Medicine, vol. 168, DOI: 10.1016/j.compbiomed.2023.107677. Online publication date: 12-Apr-2024.
  • (2024) "The appeals of quadratic majorization-minimization," Journal of Global Optimization, vol. 89, no. 3, pp. 509-558, DOI: 10.1007/s10898-023-01361-1. Online publication date: 1-Jul-2024.
  • (2023) "Demystifying softmax gating function in Gaussian mixture of experts," Proc. 37th Int'l Conf. Neural Information Processing Systems, pp. 4624-4652, DOI: 10.5555/3666122.3666328. Online publication date: 10-Dec-2023.
  • (2023) "Simultaneous Dimension Reduction and Variable Selection for Multinomial Logistic Regression," INFORMS Journal on Computing, vol. 35, no. 5, pp. 1044-1060, DOI: 10.1287/ijoc.2022.0132. Online publication date: 1-Sep-2023.
  • (2022) "abess," J. Machine Learning Research, vol. 23, no. 1, pp. 9206-9212, DOI: 10.5555/3586589.3586791. Online publication date: 1-Jan-2022.
  • (2022) "Compressed-Domain ECG-based Biometric User Identification Using Task-Driven Dictionary Learning," ACM Transactions on Computing for Healthcare, vol. 3, no. 3, pp. 1-15, DOI: 10.1145/3461701. Online publication date: 7-Apr-2022.
  • (2022) "Activity-Based Model using GPS Data and Google APIs," Proc. 2022 IEEE 25th Int'l Conf. Intelligent Transportation Systems (ITSC), pp. 1723-1729, DOI: 10.1109/ITSC55140.2022.9922042. Online publication date: 8-Oct-2022.
