DOI: 10.1145/956863.956909
Article

Text classification from positive and unlabeled documents

Published: 03 November 2003

Abstract

Most existing studies of text classification assume that the training data are completely labeled. In reality, however, many information retrieval problems are more accurately described as learning a binary classifier from a set of incompletely labeled examples, where we typically have a small number of labeled positive examples and a very large number of unlabeled examples. In this paper, we study this problem of performing Text Classification WithOut labeled Negative data (TC-WON). We explore an efficient extension of the standard Support Vector Machine (SVM) approach, called SVMC (Support Vector Mapping Convergence) [17], for TC-WON tasks. Our analyses show that when the positive training data is not too under-sampled, SVMC significantly outperforms other methods because SVMC exploits the natural "gap" between positive and negative documents in the feature space, which in turn improves generalization performance. In the text domain, many such gaps are likely to exist because a document is usually mapped to a sparse, high-dimensional feature space. However, as the number of positive training examples decreases, the boundary of SVMC starts overfitting at some point and ends up generating very poor results. This is because when there are too few positive training examples, the boundary over-iterates and trespasses the natural gaps between the positive and negative classes in the feature space, and thus ends up fitting tightly around the few positive training examples.
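
The abstract does not spell out the SVMC procedure, so the following is only a minimal sketch of a mapping-convergence style loop under stated assumptions. A nearest-centroid rule stands in for the SVM, the seed negatives are simply the unlabeled points farthest from the positive centroid, and every name here (`mapping_convergence`, `n_seed`, the toy 2-D data) is hypothetical. It shows how negatives might be accumulated from the unlabeled set until no remaining unlabeled point falls on the negative side.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def mapping_convergence(positives, unlabeled, n_seed=2):
    """Toy mapping-convergence loop (illustrative, not the paper's SVMC).

    Assumptions: a nearest-centroid rule replaces the SVM, and the
    mapping stage picks the n_seed unlabeled points farthest from the
    positive centroid as initial negatives.
    Returns (negatives, remaining), where `remaining` holds the
    unlabeled points that were never pulled to the negative side.
    """
    if n_seed < 1:
        raise ValueError("n_seed must be at least 1")
    pos_center = centroid(positives)

    # Mapping stage: seed the negative set with the farthest unlabeled points.
    ranked = sorted(unlabeled, key=lambda x: dist(x, pos_center), reverse=True)
    negatives, remaining = ranked[:n_seed], ranked[n_seed:]

    # Convergence stage: retrain, move newly predicted negatives, repeat.
    while remaining and negatives:
        neg_center = centroid(negatives)
        new_neg = [x for x in remaining
                   if dist(x, neg_center) < dist(x, pos_center)]
        if not new_neg:  # boundary stopped moving
            break
        negatives += new_neg
        remaining = [x for x in remaining if x not in new_neg]
    return negatives, remaining


# Toy data: labeled positives near the origin, unlabeled is a mixture.
positives = [(0, 0), (1, 0), (0, 1), (1, 1)]
unlabeled = [(0.5, 0.5), (1, 0.5), (0.2, 0.8),
             (9, 9), (10, 9), (9, 10), (10, 10), (6, 6)]

negs, rest = mapping_convergence(positives, unlabeled)
print("treated as negative:", sorted(negs))
print("left as positive-like:", sorted(rest))
```

The centroid rule keeps the sketch free of dependencies. Swapping in an SVM would change only the step that decides which remaining points fall on the negative side.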

References

[1] K. M. A. Chai, H. T. Ng, and H. L. Chieu. Bayesian online classifiers for text classification and filtering. In Proc. 25th ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR'02), pages 97--104, Tampere, Finland, 2002.
[2] C.-C. Chang and C.-J. Lin. Training nu-support vector classifiers: theory and algorithms. Neural Computation, 13:2119--2147, 2001.
[3] T. Joachims. Text categorization with support vector machines. In Proc. 10th European Conference on Machine Learning (ECML'98), pages 137--142, Chemnitz, Germany, 1998.
[4] T. Joachims. Transductive inference for text classification using support vector machines. In Proc. 16th Int. Conf. Machine Learning (ICML'99), pages 200--209, Bled, Slovenia, 1999.
[5] T. Joachims. A statistical learning model of text classification with support vector machines. In Proc. 24th ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR'01), pages 128--136, New Orleans, US, 2001.
[6] D. D. Lewis. Representation and Learning in Information Retrieval. PhD thesis, Department of Computer and Information Science, University of Massachusetts, 1992.
[7] B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In Proc. 19th Int. Conf. Machine Learning (ICML'02), pages 387--394, Sydney, Australia, 2002.
[8] L. M. Manevitz and M. Yousef. One-class SVMs for document classification. Journal of Machine Learning Research, 2:139--154, 2001.
[9] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. A machine learning approach to building domain-specific search engines. In The Sixteenth Int. Joint Conf. on Artificial Intelligence (IJCAI-99), Stockholm, Sweden, 1999.
[10] K. Nigam. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39:103--134, 2000.
[11] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1083--1121, 2000.
[12] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002.
[13] D. M. J. Tax and R. P. W. Duin. Uniform object generation for optimizing one-class classifiers. Journal of Machine Learning Research, 2:155--173, 2001.
[14] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
[15] Y. Yang. A study on thresholding strategies for text categorization. In Proc. 24th ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR'01), pages 137--145, New Orleans, Louisiana, 2001.
[16] Y. Yang and X. Liu. A re-examination of text categorization methods. In Proc. 22nd ACM Int. Conf. on Research and Development in Information Retrieval (SIGIR'99), pages 42--49, Berkeley, CA, 1999.
[17] H. Yu. SVMC: Single-class classification with support vector machines. In Proc. Int. Joint Conf. on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, 2003.
[18] H. Yu, J. Han, and K. C. Chang. PEBL: Positive-example based learning for Web page classification using SVM. In Proc. 8th Int. Conf. Knowledge Discovery and Data Mining (KDD'02), pages 239--248, Edmonton, Canada, 2002.

    Published In

    CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management
    November 2003
    592 pages
    ISBN:1581137230
    DOI:10.1145/956863
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. SVM
    2. machine learning
    3. text classification
    4. text filtering

    Qualifiers

    • Article

    Conference

    CIKM03

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 16
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 15 Jan 2025

    Cited By

    • (2021) A graph-based approach for positive and unlabeled learning. Information Sciences: an International Journal, 580:C (655-672). DOI: 10.1016/j.ins.2021.08.099. Online publication date: 1-Nov-2021
    • (2019) A Machine Learning Approach to Characterize the Modulation of the Hippocampal Rhythms Via Optogenetic Stimulation of the Medial Septum. International Journal of Neural Systems, 29:10 (1950020). DOI: 10.1142/S0129065719500205. Online publication date: 17-Dec-2019
    • (2019) Positive And Unlabeled Learning Algorithms And Applications: A Survey. 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA) (1-8). DOI: 10.1109/IISA.2019.8900698. Online publication date: Jul-2019
    • (2016) Learning Beyond Predefined Label Space via Bayesian Nonparametric Topic Modelling. European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 9851 (148-164). DOI: 10.1007/978-3-319-46128-1_10. Online publication date: 19-Sep-2016
    • (2015) A new term-weighting scheme for text classification using the odds of positive and negative class probabilities. Journal of the Association for Information Science and Technology, 66:12 (2553-2565). DOI: 10.1002/asi.23338. Online publication date: 1-Dec-2015
    • (2014) New Words Identification Based on Ensemble Methods. Applied Mechanics and Materials, 602-605 (1626-1629). DOI: 10.4028/www.scientific.net/AMM.602-605.1626. Online publication date: Aug-2014
    • (2014) One-class classification: taxonomy of study and review of techniques. The Knowledge Engineering Review, 29:3 (345-374). DOI: 10.1017/S026988891300043X. Online publication date: 24-Jan-2014
    • (2012) Sampling the Web as Training Data for Text Classification. Multimedia Storage and Retrieval Innovations for Digital Library Systems (293-310). DOI: 10.4018/978-1-4666-0900-6.ch015. Online publication date: 2012
    • (2011) A pairwise ranking based approach to learning with positive and unlabeled examples. Proceedings of the 20th ACM international conference on Information and knowledge management (663-672). DOI: 10.1145/2063576.2063675. Online publication date: 24-Oct-2011
    • (2010) Sampling the Web as Training Data for Text Classification. International Journal of Digital Library Systems, 1:4 (24-42). DOI: 10.4018/jdls.2010100102. Online publication date: 1-Oct-2010
