
Positive and unlabeled learning in categorical data

Published: 05 July 2016

Abstract

In common binary classification scenarios, both positive and negative examples must be present in the training data to build an efficient classifier. Unfortunately, in many domains this requirement is not satisfied and only one class of examples is available. To cope with this setting, classification algorithms have been introduced that learn from Positive and Unlabeled (PU) data. Originally, these approaches were exploited in the context of document classification. Only a few works address the PU problem for categorical datasets, and the available algorithms are mainly based on Naive Bayes classifiers. In this work we present a new distance-based PU learning approach for categorical data: Pulce. Our framework takes advantage of the intrinsic relationships between attribute values and goes beyond the independence assumption made by Naive Bayes. Pulce leverages the statistical properties of the data to learn a distance metric that is employed during the classification task. We extensively validate our approach on real-world datasets and demonstrate that our strategy obtains statistically significant improvements w.r.t. state-of-the-art competitors.

Highlights

  • We propose a distance learning framework for PU learning with categorical data.
  • Unlike existing methods for categorical data, our method is not based on the independence assumption.
  • Our distance learning approach leverages attribute context.
  • The experiments show statistically significant improvements in terms of prediction rate.
  • The results are stable and robust with respect to parameter variations.
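To make the idea of a distance learned from relationships between attribute values more concrete, the following is a minimal sketch. It is not Pulce itself: it only illustrates one simple way to compare two values of a categorical attribute by the distribution of a second "context" attribute that co-occurs with them. The function names, the toy records, and the choice of Euclidean distance between conditional distributions are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def conditional_distribution(rows, target, context, value):
    """Estimate P(context attribute = c | target attribute = value) by counting."""
    matching = [r[context] for r in rows if r[target] == value]
    total = len(matching)
    if total == 0:
        return {}  # value never observed: empty distribution
    counts = Counter(matching)
    return {c: n / total for c, n in counts.items()}


def value_distance(rows, target, context, v1, v2):
    """Euclidean distance between the context distributions of two values."""
    p = conditional_distribution(rows, target, context, v1)
    q = conditional_distribution(rows, target, context, v2)
    keys = set(p) | set(q)
    return sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


# Hypothetical toy records (not from the paper's datasets).
rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "orange", "shape": "round"},
    {"color": "orange", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]

# "red" and "orange" always co-occur with the same shape, so they end up close;
# "blue" co-occurs with a different shape, so it ends up far from "red".
d_red_orange = value_distance(rows, "color", "shape", "red", "orange")
d_red_blue = value_distance(rows, "color", "shape", "red", "blue")
print(d_red_orange, d_red_blue)
```

A plain overlap measure would treat "red", "orange", and "blue" as equally different from one another. A context-based measure of this kind can separate them because it uses the dependence between attributes rather than assuming they are independent.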


Published In

Neurocomputing, Volume 196, Issue C
July 2016
214 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Categorical data
  2. Distance learning
  3. Partially supervised learning
  4. Positive unlabeled learning

Qualifiers

  • Research-article

Cited By

  • (2023) PU-Detector: A PU Learning-based Framework for Real Money Trading Detection in MMORPG. ACM Transactions on Knowledge Discovery from Data 18:4 (1-26). DOI: 10.1145/3638561. Online publication date: 29-Dec-2023
  • (2022) Evaluating the Predictive Performance of Positive-Unlabelled Classifiers. ACM SIGKDD Explorations Newsletter 24:2 (5-11). DOI: 10.1145/3575637.3575642. Online publication date: 8-Dec-2022
  • (2022) Fairness-aware Model-agnostic Positive and Unlabeled Learning. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (1698-1708). DOI: 10.1145/3531146.3533225. Online publication date: 21-Jun-2022
  • (2021) Differentially Private Distance Learning in Categorical Data. Data Mining and Knowledge Discovery 35:5 (2050-2088). DOI: 10.1007/s10618-021-00778-0. Online publication date: 1-Sep-2021
  • (2020) Learning from positive and unlabeled data: a survey. Machine Learning 109:4 (719-760). DOI: 10.1007/s10994-020-05877-5. Online publication date: 1-Apr-2020
  • (2019) Ensembles of density estimators for positive-unlabeled learning. Journal of Intelligent Information Systems 53:2 (199-217). DOI: 10.1007/s10844-019-00549-w. Online publication date: 1-Oct-2019
  • (2018) Global and local learning from positive and unlabeled examples. Applied Intelligence 48:8 (2373-2392). DOI: 10.1007/s10489-017-1076-z. Online publication date: 1-Aug-2018
  • (2017) Bayesian belief network for positive unlabeled learning with uncertainty. Pattern Recognition Letters 90:C (28-35). DOI: 10.1016/j.patrec.2017.03.007. Online publication date: 15-Apr-2017
  • (2017) An immune-inspired political boycotts action prediction paradigm. Cluster Computing 20:2 (1379-1386). DOI: 10.1007/s10586-017-0830-7. Online publication date: 1-Jun-2017
