research-article

Positive and unlabeled learning in categorical data

Authors:

Ruggero G. Pensa

Neurocomputing, Volume 196, Issue C

Pages 113 - 124

https://doi.org/10.1016/j.neucom.2016.01.089

Published: 05 July 2016

Abstract

In common binary classification scenarios, both positive and negative examples must be present in the training data to build an efficient classifier. Unfortunately, in many domains this requirement is not satisfied and only one class of examples is available. To cope with this setting, classification algorithms have been introduced that learn from Positive and Unlabeled (PU) data. Originally, these approaches were applied to document classification. Only a few works address the PU problem for categorical datasets, and the available algorithms are mainly based on Naive Bayes classifiers. In this work we present Pulce, a new distance-based PU learning approach for categorical data. Our framework takes advantage of the intrinsic relationships between attribute values and goes beyond the independence assumption made by Naive Bayes. Specifically, Pulce leverages the statistical properties of the data to learn a distance metric that is then employed during the classification task. We extensively validate our approach on real-world datasets and demonstrate that our strategy obtains statistically significant improvements with respect to state-of-the-art competitors.

Highlights

We propose a distance learning framework for PU learning with categorical data.
Unlike existing methods for categorical data, our method is not based on the independence assumption.
Our distance learning approach leverages attribute context.
The experiments show statistically significant improvements in terms of prediction rate.
The results are stable and robust with respect to parameter variations.
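The general idea of a context-based distance for categorical values can be illustrated with a small Python sketch. It is not the Pulce algorithm, whose details are not given on this page. The function names (`value_distance_tables`, `record_distance`, `pu_classify`), the toy records, and the simple "farthest unlabeled records become reliable negatives" heuristic are all assumptions made for illustration. Two values of an attribute are treated as close when they co-occur with similar values of the other attributes, so the distance does not rely on attribute independence.

```python
from collections import Counter, defaultdict
from math import sqrt


def value_distance_tables(rows):
    """For each attribute, build a table of distances between its values.

    Two values of attribute i are close when they co-occur with similar
    distributions of values on the other attributes (their "context").
    """
    n_attrs = len(rows[0])
    tables = []
    for i in range(n_attrs):
        counts = defaultdict(Counter)          # counts[v][(j, w)] = co-occurrences
        support = Counter(row[i] for row in rows)
        for row in rows:
            for j in range(n_attrs):
                if j != i:
                    counts[row[i]][(j, row[j])] += 1
        # profile[v][(j, w)] estimates P(attribute j == w | attribute i == v)
        profile = {v: {k: c / support[v] for k, c in cnt.items()}
                   for v, cnt in counts.items()}
        table = {}
        for a in profile:
            for b in profile:
                keys = set(profile[a]) | set(profile[b])
                sq = sum((profile[a].get(k, 0.0) - profile[b].get(k, 0.0)) ** 2
                         for k in keys)
                table[(a, b)] = sqrt(sq)
        tables.append(table)
    return tables


def record_distance(x, y, tables, unseen=1.0):
    """Euclidean combination of the per-attribute value distances."""
    total = 0.0
    for i, table in enumerate(tables):
        d = 0.0 if x[i] == y[i] else table.get((x[i], y[i]), unseen)
        total += d * d
    return sqrt(total)


def pu_classify(positives, unlabeled, n_negatives):
    """Hypothetical two-step PU scheme: pick the unlabeled records farthest
    from the positives as reliable negatives, then label by nearest neighbor."""
    tables = value_distance_tables(positives + unlabeled)

    def nearest(x, pool):
        return min(record_distance(x, p, tables) for p in pool)

    ranked = sorted(unlabeled, key=lambda u: nearest(u, positives), reverse=True)
    negatives = ranked[:n_negatives]

    def predict(x):
        return nearest(x, positives) <= nearest(x, negatives)

    return predict


# Toy data (invented): each record is (color, shape).
positives = [("red", "round"), ("red", "oval"), ("pink", "round")]
unlabeled = [("pink", "oval"), ("green", "long"), ("blue", "long"), ("green", "flat")]

predict = pu_classify(positives, unlabeled, n_negatives=2)
print(predict(("pink", "oval")), predict(("blue", "flat")))
```

In this toy data, "red" and "pink" always appear with the same shapes, so their learned distance is zero even though the symbols differ. A plain overlap (match/mismatch) measure could not capture this.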

Positive and unlabeled learning in categorical data
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
    2. Machine learning approaches

Published In

Neurocomputing Volume 196, Issue C

July 2016

214 pages

ISSN:0925-2312

Copyright © Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands

Cited By

Wang Y, Zhao S, Zhao S, Wu R, Xu Y, Tao J, Lv T, Li S, Hu Z, Pan G (2023) PU-Detector: A PU Learning-based Framework for Real Money Trading Detection in MMORPG. ACM Transactions on Knowledge Discovery from Data 18:4 (1-26). Online publication date: 29-Dec-2023
https://dl.acm.org/doi/10.1145/3638561
Saunders J, Freitas A (2022) Evaluating the Predictive Performance of Positive-Unlabelled Classifiers. ACM SIGKDD Explorations Newsletter 24:2 (5-11). Online publication date: 8-Dec-2022
https://dl.acm.org/doi/10.1145/3575637.3575642
Wu Z, He J (2022) Fairness-aware Model-agnostic Positive and Unlabeled Learning. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (1698-1708). Online publication date: 21-Jun-2022
https://dl.acm.org/doi/10.1145/3531146.3533225
Battaglia E, Celano S, Pensa R (2021) Differentially Private Distance Learning in Categorical Data. Data Mining and Knowledge Discovery 35:5 (2050-2088). Online publication date: 1-Sep-2021
https://dl.acm.org/doi/10.1007/s10618-021-00778-0
Bekker J, Davis J (2020) Learning from positive and unlabeled data: a survey. Machine Learning 109:4 (719-760). Online publication date: 1-Apr-2020
https://dl.acm.org/doi/10.1007/s10994-020-05877-5
Basile T, Di Mauro N, Esposito F, Ferilli S, Vergari A (2019) Ensembles of density estimators for positive-unlabeled learning. Journal of Intelligent Information Systems 53:2 (199-217). Online publication date: 1-Oct-2019
https://dl.acm.org/doi/10.1007/s10844-019-00549-w
Ke T, Jing L, Lv H, Zhang L, Hu Y (2018) Global and local learning from positive and unlabeled examples. Applied Intelligence 48:8 (2373-2392). Online publication date: 1-Aug-2018
https://dl.acm.org/doi/10.1007/s10489-017-1076-z
Gan H, Zhang Y, Song Q (2017) Bayesian belief network for positive unlabeled learning with uncertainty. Pattern Recognition Letters 90:C (28-35). Online publication date: 15-Apr-2017
https://dl.acm.org/doi/10.1016/j.patrec.2017.03.007
Xie Y, Chen Y, Peng L (2017) An immune-inspired political boycotts action prediction paradigm. Cluster Computing 20:2 (1379-1386). Online publication date: 1-Jun-2017
https://dl.acm.org/doi/10.1007/s10586-017-0830-7
