[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

A New Smooth Approximation to the Zero One Loss with a Probabilistic Interpretation

Published: 13 December 2019 Publication History

Abstract

We examine a new form of smooth approximation to the zero one loss in which learning is performed using a reformulation of the widely used logistic function. Our approach is based on using the posterior mean of a novel generalized Beta-Bernoulli formulation. This leads to a generalized logistic function that approximates the zero one loss, but retains a probabilistic formulation conferring a number of useful properties. The approach is easily generalized to kernel logistic regression and easily integrated into methods for structured prediction. We present experiments in which we learn such models using an optimization method consisting of a combination of gradient descent and coordinate descent using localized grid search so as to escape from local minima. Our experiments indicate that optimization quality is improved when learning metaparameters are themselves optimized using a validation set. Our experiments show improved performance relative to widely used logistic and hinge loss methods on a wide variety of problems ranging from standard UC Irvine and libSVM evaluation datasets to product review predictions and a visual information extraction task. We observe that the approach is as follows: (1) more robust to outliers compared to the logistic and hinge losses; (2) outperforms comparable logistic and max margin models on larger scale benchmark problems; (3) when combined with Gaussian–Laplacian mixture prior on parameters the kernelized version of our formulation yields sparser solutions than Support Vector Machine classifiers; and (4) when integrated into a probabilistic structured prediction technique our approach provides more accurate probabilities yielding improved inference and increasing information extraction performance.

References

[1]
K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. Retrieved from http://archive.ics.uci.edu/ml.
[2]
Ronan Collobert, Fabian Sinz, Jason Weston, and Léon Bottou. 2006. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 201--208.
[3]
Andrew Cotter, Shai Shalev-Shwartz, and Nati Srebro. 2013. Learning optimally sparse support vector machines. In Proceedings of the 30th International Conference on Machine Learning (ICML’13). 266--274.
[4]
Cheng Deng, Jie Xu, Kaibing Zhang, Dacheng Tao, Xinbo Gao, and Xuelong Li. 2016. Similarity constraints-based structured output regression machine: An approach to image super-resolution. IEEE Transactions on Neural Networks and Learning Systems 27, 12 (2016), 2472--2485.
[5]
Chuong B. Do, Quoc Le, Choon Hui Teo, Olivier Chapelle, and Alex Smola. 2008. Tighter bounds for structured estimation. In Proceedings of Neural Information Processing Systems (NIPS’08).
[6]
M. Dredze, K. Crammer, and F. Pereira. 2008. Confidence-weighted linear classification. In Proceedings of the 25th International Conference on Machine Learning. ACM, New York, NY, 264--271.
[7]
Seyda Ertekin, Leon Bottou, and C. Lee Giles. 2011. Nonconvex online support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 2 (2011), 368--381.
[8]
Vitaly Feldman, Venkatesan Guruswami, Prasad Raghavendra, and Yi Wu. 2012. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing 41, 6 (2012), 1558--1590.
[9]
Kevin Gimpel and Noah A. Smith. 2012. Structured ramp loss minimization for machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 221--231.
[10]
Md. Kamrul Hasan and Chris Pal. 2014. Experiments on visual information extraction with the faces of Wikipedia. In Proceedings of the AAAI Conference on Artificial Intelligence.
[11]
Romain Hérault and Yves Grandvalet. 2007. Sparse probabilistic classifiers. In Proceedings of the 24th International Conference on Machine Learning. ACM, 337--344.
[12]
Ailsa H. Land and Alison G. Doig. 1960. An automatic method of solving discrete programming problems. Econometrica 28, 3 (1960), 497--520.
[13]
Tongliang Liu and Dacheng Tao. 2016. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 3 (2016), 447--461.
[14]
Weiwei Liu and Ivor W. Tsang. 2017. Making decision trees feasible in ultrahigh feature and label dimensions. The Journal of Machine Learning Research 18, 1 (2017), 2814--2849.
[15]
Weiwei Liu, Ivor W. Tsang, and Klaus-Robert Müller. 2017. An easy-to-hard learning paradigm for multiple classes and multiple labels. The Journal of Machine Learning Research 18, 1 (2017), 3300--3337.
[16]
Gábor Lugosi, Nicolas Vayatis, et al. 2004. On the Bayes-risk consistency of regularized boosting methods. The Annals of statistics 32, 1 (2004), 30--55.
[17]
Olvi L. Mangasarian, W. Nick Street, and William H. Wolberg. 1995. Breast cancer diagnosis and prognosis via linear programming. Operations Research 43, 4 (1995), 570--577.
[18]
Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. 1999. Boosting algorithms as gradient descent in function space. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS’99).
[19]
Nasser M. Nasrabadi. 2007. Pattern recognition and machine learning. Journal of Electronic Imaging 16, 4 (2007), 049901.
[20]
Tan Nguyen and Scott Sanner. 2013. Algorithms for direct 0--1 loss optimization in binary classification. In Proceedings of the 30th International Conference on Machine Learning (ICML’13). 1085--1093.
[21]
Fernando Pérez-Cruz, Angel Navia-Vázquez, Aníbal R. Figueiras-Vidal, and Antonio Artes-Rodriguez. 2003. Empirical risk minimization for support vector classifiers. IEEE Transactions on Neural Networks 14, 2 (2003), 296--303.
[22]
Franz Pernkopf, Michael Wohlmayr, and Sebastian Tschiatschek. 2012. Maximum margin Bayesian network classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, 3 (2012), 521--532.
[23]
Xiaotong Shen, George C. Tseng, Xuegong Zhang, and Wing Hung Wong. 2003. On ψ-learning. Journal of the American Statistical Association 98, 463 (2003), 724--734.
[24]
Ingo Steinwart. 2003. Sparseness of support vector machines. The Journal of Machine Learning Research 4 (2003), 1071--1105.
[25]
Brendan Van Rooyen, Aditya Menon, and Robert C. Williamson. 2015. Learning with symmetric label noise: The importance of being unhinged. In Proceedings of the Advances in Neural Information Processing Systems. 10--18.
[26]
Vladimir Vapnik. 2000. The Nature of Statistical Learning Theory. Springer.
[27]
Pascal Vincent. 2004. Modèles à noyaux à structure locale. Citeseer.
[28]
Pascal Vincent and Yoshua Bengio. 2002. Kernel matching pursuit. Machine Learning 48, 1--3 (2002), 165--187.
[29]
De Wang, Danesh Irani, and Calton Pu. 2012. Evolutionary study of web spam: Webb spam corpus 2011 versus webb spam corpus 2006. In Proceedings of the 8th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom’12). IEEE, 40--49.
[30]
Yichao Wu and Yufeng Liu. 2007. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association 102, 479 (2007), 974--983.
[31]
Liping Xie, Dacheng Tao, and Haikun Wei. 2017. Joint structured sparsity regularized multiview dimension reduction for video-based facial expression recognition. ACM Transactions on Intelligent Systems and Technology 8, 2 (2017), 28.
[32]
Alan L. Yuille and Anand Rangarajan. 2003. The concave-convex procedure. Neural Computation 15, 4 (2003), 915--936.
[33]
Jingwei Zhang, Tongliang Liu, and Dacheng Tao. 2018. On the rates of convergence from surrogate risk minimizers to the Bayes optimal classifier. arXiv:1802.03688.
[34]
Tong Zhang and Frank J. Oles. 2001. Text categorization based on regularized linear classification methods. Information Retrieval 4, 1 (2001), 5--31.
[35]
Xinhua Zhang, Ankan Saha, and S. V. N. Vishwanathan. 2011. Smoothing multivariate performance measures. Journal of Machine Learning Research 10 (2011), 1--55.
[36]
Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 2 (2005), 301--320.

Cited By

View all

Index Terms

  1. A New Smooth Approximation to the Zero One Loss with a Probabilistic Interpretation

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Transactions on Knowledge Discovery from Data
      ACM Transactions on Knowledge Discovery from Data  Volume 14, Issue 1
      February 2020
      325 pages
      ISSN:1556-4681
      EISSN:1556-472X
      DOI:10.1145/3375789
      Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 December 2019
      Accepted: 01 July 2019
      Revised: 01 May 2019
      Received: 01 October 2017
      Published in TKDD Volume 14, Issue 1

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. Machine learning
      2. classification
      3. logistic regression
      4. probabilistic models
      5. supervised learning

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)22
      • Downloads (Last 6 weeks)10
      Reflects downloads up to 20 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all

      View Options

      Login options

      Full Access

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format.

      HTML Format

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media