
A New Smooth Approximation to the Zero One Loss with a Probabilistic Interpretation

Published: 13 December 2019

Abstract

We examine a new smooth approximation to the zero-one loss in which learning is performed using a reformulation of the widely used logistic function. Our approach is based on the posterior mean of a novel generalized Beta-Bernoulli formulation. This leads to a generalized logistic function that approximates the zero-one loss while retaining a probabilistic formulation that confers a number of useful properties. The approach generalizes easily to kernel logistic regression and integrates easily into methods for structured prediction. We present experiments in which we learn such models using an optimization method that combines gradient descent with coordinate descent based on localized grid search, so as to escape from local minima. Our experiments indicate that optimization quality improves when the learning metaparameters are themselves optimized using a validation set. They also show improved performance relative to widely used logistic and hinge loss methods on a wide variety of problems, ranging from standard UC Irvine and libSVM evaluation datasets to product review prediction and a visual information extraction task. We observe that the approach: (1) is more robust to outliers than the logistic and hinge losses; (2) outperforms comparable logistic and max-margin models on larger-scale benchmark problems; (3) yields sparser solutions than support vector machine classifiers when its kernelized version is combined with a Gaussian–Laplacian mixture prior on the parameters; and (4) provides more accurate probabilities when integrated into a probabilistic structured prediction technique, yielding improved inference and increased information extraction performance.
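The abstract does not specify the exact form of the generalized Beta-Bernoulli surrogate, so the following is a minimal Python sketch under the common assumption that the smooth approximation behaves like a steepness-parameterized sigmoid of the classification margin, approaching the zero-one loss as the steepness grows. All names and parameters here (smoothed_zero_one_loss, grad_smoothed_loss, coordinate_grid_refine, gamma, radius, steps) are illustrative stand-ins, not the paper's.

    import numpy as np

    def smoothed_zero_one_loss(w, X, y, gamma=1.0):
        # Margin m = y * <w, x> with y in {-1, +1}; the surrogate
        # sigmoid(-gamma * m) = 1 / (1 + exp(gamma * m)) tends to the
        # zero-one loss as gamma -> infinity (illustrative form only).
        margins = y * (X @ w)
        return np.mean(1.0 / (1.0 + np.exp(gamma * margins)))

    def grad_smoothed_loss(w, X, y, gamma=1.0):
        # Gradient of the surrogate above with respect to w.
        margins = y * (X @ w)
        s = 1.0 / (1.0 + np.exp(gamma * margins))   # s = sigmoid(-gamma * m)
        coeff = -gamma * s * (1.0 - s) * y          # chain rule through m = y * <w, x>
        return (X.T @ coeff) / len(y)

    def coordinate_grid_refine(w, X, y, gamma=1.0, radius=1.0, steps=11):
        # Coordinate descent via localized grid search: for each weight,
        # try a small grid around its current value and keep the best.
        # This non-gradient step can hop out of shallow local minima.
        w = w.copy()
        for j in range(len(w)):
            best, best_loss = w[j], smoothed_zero_one_loss(w, X, y, gamma)
            for v in w[j] + np.linspace(-radius, radius, steps):
                w[j] = v
                loss = smoothed_zero_one_loss(w, X, y, gamma)
                if loss < best_loss:
                    best, best_loss = v, loss
            w[j] = best
        return w

    # Toy usage: a smooth gradient phase followed by a grid-refine phase,
    # mirroring the gradient-descent-plus-coordinate-search scheme the
    # abstract describes (hyperparameter values here are arbitrary).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = np.sign(X @ rng.normal(size=5) + 0.1 * rng.normal(size=200))
    w = np.zeros(5)
    for _ in range(300):
        w -= 0.5 * grad_smoothed_loss(w, X, y, gamma=2.0)
    w = coordinate_grid_refine(w, X, y, gamma=2.0)

In this sketch, the gradient phase handles smooth descent on the surrogate, while the coordinate-wise grid search plays the role the abstract assigns to localized grid search, namely escaping local minima; metaparameters such as gamma would, per the abstract, be tuned on a validation set.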



Published In

ACM Transactions on Knowledge Discovery from Data, Volume 14, Issue 1
February 2020, 325 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3375789

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 13 December 2019
Accepted: 01 July 2019
Revised: 01 May 2019
Received: 01 October 2017
Published in TKDD Volume 14, Issue 1


Author Tags

1. Machine learning
2. classification
3. logistic regression
4. probabilistic models
5. supervised learning

Qualifiers

• Research-article
• Research
• Refereed

