Article

Text classification from positive and unlabeled documents

Authors:

ChengXiang Zhai,

Jiawei Han

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management

Pages 232 - 239

https://doi.org/10.1145/956863.956909

Published: 03 November 2003

Abstract

Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems can be more accurately described as learning a binary classifier from a set of incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study the problem of performing Text Classification WithOut labeled Negative data (TC-WON). We explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) [17], for TC-WON tasks. Our analyses show that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods because it exploits the natural "gap" between positive and negative documents in the feature space, which in turn improves generalization performance. In the text domain, many such gaps are likely to exist because a document is usually mapped to a sparse, high-dimensional feature space. However, as the number of positive training examples decreases, the SVMC boundary starts overfitting at some point and ends up generating very poor results. This is because when there are too few positive training examples, the boundary over-iterates and trespasses the natural gaps between the positive and negative classes in the feature space, and thus ends up fitting tightly around the few positive training examples.
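
The gap and over-iteration behavior described in the abstract can be illustrated with a small 1-D toy. This is a minimal sketch under stated assumptions, not the authors' SVMC implementation: the loop structure (seed a few "strong negatives", retrain, move newly rejected unlabeled points to the negative set, repeat until nothing changes) is an assumed mapping-convergence style procedure, and the names `max_margin_boundary`, `iterate_boundary`, `UNLABELED`, and `seed_threshold` are hypothetical. In one dimension with separable classes, a hard-margin linear boundary reduces to the midpoint between the closest negative and the closest positive, which stands in for an actual SVM here.

```python
# Toy data (assumption): negatives live around 0..5, positives around 7..10,
# so the natural "gap" lies roughly between 4.8 and 7.1.
UNLABELED = [0.5, 1.0, 2.0, 3.0, 4.8, 7.1, 7.5, 8.5, 9.0]


def max_margin_boundary(positives, negatives):
    """1-D hard-margin boundary: midpoint between the closest opposing points.

    Assumes every negative lies to the left of every positive.
    """
    return (max(negatives) + min(positives)) / 2.0


def iterate_boundary(positives, unlabeled, seed_threshold):
    """Iteratively grow the negative set from unlabeled data.

    Returns the final boundary and the list of boundaries per iteration.
    """
    # Seed step (assumption): unlabeled points far from the positives
    # are treated as strong negatives.
    negatives = [x for x in unlabeled if x < seed_threshold]
    remaining = [x for x in unlabeled if x >= seed_threshold]
    if not negatives:
        raise ValueError("seed_threshold selected no strong negatives")

    history = []
    while True:
        boundary = max_margin_boundary(positives, negatives)
        history.append(boundary)
        # Unlabeled points on the negative side join the negative set.
        newly_negative = [x for x in remaining if x < boundary]
        if not newly_negative:
            return boundary, history  # converged: nothing left to absorb
        negatives.extend(newly_negative)
        remaining = [x for x in remaining if x >= boundary]


if __name__ == "__main__":
    # Well-sampled positives: the boundary stops inside the gap.
    print(iterate_boundary([7.2, 8.0, 9.5], UNLABELED, seed_threshold=1.5))
    # A single positive far from the gap: the boundary keeps advancing
    # past the hidden positives in the unlabeled set.
    print(iterate_boundary([9.8], UNLABELED, seed_threshold=1.5))
```

With three labeled positives the loop halts between 4.8 and 7.1, inside the gap. With a single labeled positive at 9.8 the midpoint lands beyond the gap, so each iteration absorbs another hidden positive and the final boundary sits just below the lone labeled example, which mirrors the overfitting the abstract attributes to under-sampled positive data.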

References

[1] K. M. A. Chai, H. T. Ng, and H. L. Chieu. Bayesian online classifiers for text classification and filtering. In Proc. 25th ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR'02), pages 97--104, Tampere, Finland, 2002.

[2] C.-C. Chang and C.-J. Lin. Training nu-support vector classifiers: theory and algorithms. Neural Computation, 13:2119--2147, 2001.

[3] T. Joachims. Text categorization with support vector machines. In Proc. 10th European Conference on Machine Learning (ECML'98), pages 137--142, Chemnitz, Germany, 1998.

[4] T. Joachims. Transductive inference for text classification using support vector machines. In Proc. 16th Int. Conf. Machine Learning (ICML'99), pages 200--209, Bled, Slovenia, 1999.

[5] T. Joachims. A statistical learning model of text classification with support vector machines. In Proceedings of SIGIR-01, 24th ACM Int. Conf. on Research and Development in Information Retrieval, pages 128--136, New Orleans, US, 2001.

[6] D. D. Lewis. Representation and Learning in Information Retrieval. PhD thesis, Department of Computer and Information Science, University of Massachusetts, 1992.

[7] B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In Proc. 19th Int. Conf. Machine Learning (ICML'02), pages 387--394, Sydney, Australia, 2002.

[8] L. M. Manevitz and M. Yousef. One-class SVMs for document classification. Journal of Machine Learning Research, 2:139--154, 2001.

[9] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. A machine learning approach to building domain-specific search engines. In The Sixteenth Int. Joint Conf. on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, 1999.

[10] K. Nigam. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39:103--134, 2000.

[11] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083--1121, 2000.

[12] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002.

[13] D. M. J. Tax and R. P. W. Duin. Uniform object generation for optimizing one-class classifiers. Journal of Machine Learning Research, 2:155--173, 2001.

[14] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.

[15] Y. Yang. A study on thresholding strategies for text categorization. In Proc. 24th ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR'01), pages 137--145, New Orleans, Louisiana, 2001.

[16] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proc. 22nd ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR'99), pages 42--49, Berkeley, CA, 1999.

[17] H. Yu. SVMC: Single-class classification with support vector machines. In Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, 2003.

[18] H. Yu, J. Han, and K. C. Chang. PEBL: Positive-example based learning for Web page classification using SVM. In Proc. 8th Int. Conf. Knowledge Discovery and Data Mining (KDD'02), pages 239--248, Edmonton, Canada, 2002.

Cited By

Carnevali J, Geraldeli Rossi R, Milios E, de Andrade Lopes A (2021). A graph-based approach for positive and unlabeled learning. Information Sciences: an International Journal, 580:C (655-672). DOI: 10.1016/j.ins.2021.08.099. Online publication date: 1-Nov-2021
https://dl.acm.org/doi/10.1016/j.ins.2021.08.099
Park S, Laxpati N, Gutekunst C, Connolly M, Tung J, Berglund K, Mahmoudi B, Gross R (2019). A Machine Learning Approach to Characterize the Modulation of the Hippocampal Rhythms Via Optogenetic Stimulation of the Medial Septum. International Journal of Neural Systems, 29:10 (1950020). DOI: 10.1142/S0129065719500205. Online publication date: 17-Dec-2019
https://doi.org/10.1142/S0129065719500205
Jaskie K, Spanias A (2019). Positive And Unlabeled Learning Algorithms And Applications: A Survey. 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA) (1-8). DOI: 10.1109/IISA.2019.8900698. Online publication date: Jul-2019
https://doi.org/10.1109/IISA.2019.8900698
Index Terms

Text classification from positive and unlabeled documents
1. Applied computing
  1. Document management and text processing

Information & Contributors

Information

Published In

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management

November 2003

592 pages

ISBN:1581137230

DOI:10.1145/956863

General Chair:
Donald Kraft (Louisiana State University)

Program Chairs:
Ophir Frieder (Illinois Institute of Technology)
Joachim Hammer (University of Florida)
Sajda Qureshi (University of Nebraska, Omaha)
Len Seligman (The MITRE Corporation)

Copyright © 2003 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

CIKM03: 12th International Conference on Information and Knowledge Management

November 3 - 8, 2003

New Orleans, LA, USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Article Metrics

Total Citations: 33
Total Downloads: 1,479
Downloads (Last 12 months): 15
Downloads (Last 6 weeks): 0

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Du C, Zhuang F, He J, He Q, Long G (2016). Learning Beyond Predefined Label Space via Bayesian Nonparametric Topic Modelling. European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 9851 (148-164). DOI: 10.1007/978-3-319-46128-1_10. Online publication date: 19-Sep-2016
https://dl.acm.org/doi/10.1007/978-3-319-46128-1_10
Ko Y (2015). A new term-weighting scheme for text classification using the odds of positive and negative class probabilities. Journal of the Association for Information Science and Technology, 66:12 (2553-2565). DOI: 10.1002/asi.23338. Online publication date: 1-Dec-2015
https://dl.acm.org/doi/10.1002/asi.23338
Zhang C, Chen Y (2014). New Words Identification Based on Ensemble Methods. Applied Mechanics and Materials, 602-605 (1626-1629). DOI: 10.4028/www.scientific.net/AMM.602-605.1626. Online publication date: Aug-2014
https://doi.org/10.4028/www.scientific.net/AMM.602-605.1626
Khan S, Madden M (2014). One-class classification: taxonomy of study and review of techniques. The Knowledge Engineering Review, 29:3 (345-374). DOI: 10.1017/S026988891300043X. Online publication date: 24-Jan-2014
https://doi.org/10.1017/S026988891300043X
Day W, Chi C, Chen R, Cheng P (2012). Sampling the Web as Training Data for Text Classification. Multimedia Storage and Retrieval Innovations for Digital Library Systems (293-310). DOI: 10.4018/978-1-4666-0900-6.ch015. Online publication date: 2012
https://doi.org/10.4018/978-1-4666-0900-6.ch015
Sellamanickam S, Garg P, Selvaraj S (2011). A pairwise ranking based approach to learning with positive and unlabeled examples. Proceedings of the 20th ACM international conference on Information and knowledge management (663-672). DOI: 10.1145/2063576.2063675. Online publication date: 24-Oct-2011
https://dl.acm.org/doi/10.1145/2063576.2063675
Day W, Chi C, Chen R, Cheng P (2010). Sampling the Web as Training Data for Text Classification. International Journal of Digital Library Systems, 1:4 (24-42). DOI: 10.4018/jdls.2010100102. Online publication date: 1-Oct-2010
https://dl.acm.org/doi/10.4018/jdls.2010100102