DOI:10.1145/1935826.1935919
poster

Large-scale hierarchical text classification without labelled data

Published: 09 February 2011

Abstract

Traditional machine learning approaches to text classification often require labelled data for training classifiers. However, in large-scale classification involving thousands of categories, creating such labelled data is extremely expensive, since the data is typically labelled manually. Motivated by this, we propose a novel approach to large-scale hierarchical text classification that does not require any labelled data. We explore a perspective in which the meaning of a category is defined not by human-labelled documents but by its description and, more importantly, its relationships with other categories (e.g. its ancestors and descendants). Specifically, we take advantage of this ontological knowledge in all phases of the process: when retrieving pseudo-labelled documents, when iteratively training the category models, and when categorizing test documents. Our experiments, based on a taxonomy of 1131 categories that is widely adopted in the news industry as a standard for the NewsML framework, demonstrate the effectiveness of our approach in these phases both qualitatively and quantitatively. In particular, we emphasize that using only the simple ontological knowledge defined in the category hierarchy, we could automatically build a large-scale hierarchical classifier with a reasonable performance of 67% in terms of the hierarchy-based F-1 measure.
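The abstract reports its result as a hierarchy-based F-1 score but does not define the measure. As a rough illustration only, the sketch below uses one common formulation from the hierarchical classification literature: each true and predicted category is expanded with its ancestors, and precision and recall are computed over the expanded sets. Whether the paper uses exactly this variant is an assumption; the toy taxonomy, the category names, and the function names are hypothetical and not taken from the paper.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical toy taxonomy: child -> parent (None marks a top-level category).
PARENT: Dict[str, Optional[str]] = {
    "sport": None,
    "football": "sport",
    "tennis": "sport",
    "politics": None,
    "election": "politics",
}


def with_ancestors(category: str, parent: Dict[str, Optional[str]]) -> Set[str]:
    """Return the category together with all of its ancestors.

    Assumes the taxonomy is a tree (no cycles).
    """
    expanded: Set[str] = set()
    node: Optional[str] = category
    while node is not None:
        expanded.add(node)
        node = parent.get(node)
    return expanded


def hierarchical_f1(
    pairs: Iterable[Tuple[str, str]], parent: Dict[str, Optional[str]]
) -> float:
    """Micro-averaged hierarchical F1 over (true, predicted) category pairs."""
    overlap = predicted_total = true_total = 0
    for true_cat, pred_cat in pairs:
        true_set = with_ancestors(true_cat, parent)
        pred_set = with_ancestors(pred_cat, parent)
        overlap += len(true_set & pred_set)
        predicted_total += len(pred_set)
        true_total += len(true_set)
    if overlap == 0:
        return 0.0
    precision = overlap / predicted_total
    recall = overlap / true_total
    return 2 * precision * recall / (precision + recall)


# (true category, predicted category) for three hypothetical test documents.
PAIRS = [
    ("football", "football"),  # exact match
    ("tennis", "football"),    # wrong leaf, but correct parent "sport"
    ("election", "sport"),     # wrong branch entirely
]

score = hierarchical_f1(PAIRS, PARENT)
print(f"hierarchical F1 = {score:.3f}")
```

In this formulation, a prediction that lands in the right branch but on the wrong leaf still earns partial credit, which is why such measures are often preferred over flat accuracy for large taxonomies.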



Published In

WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
February 2011
870 pages
ISBN:9781450304931
DOI:10.1145/1935826
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. classification with no labelled data
  2. hierarchical text classification
  3. topic models
  4. weakly supervised classification

Qualifiers

  • Poster

Acceptance Rates

WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 23
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 21 Dec 2024

Cited By

  • (2024) Extracting key topics from massive COVID-19 information on social networks: An integrated deep learning and LDA framework. High-Confidence Computing, DOI:10.1016/j.hcc.2024.100213 (100213). Online publication date: Feb-2024
  • (2022) VisIRML: Visualization with an Interactive Information Retrieval and Machine Learning Classifier. Integrating Artificial Intelligence and Visualization for Visual Knowledge Discovery, DOI:10.1007/978-3-030-93119-3_13 (337-357). Online publication date: 5-Jun-2022
  • (2021) Handling imbalance in hierarchical classification problems using local classifiers approaches. Data Mining and Knowledge Discovery 35:4, DOI:10.1007/s10618-021-00762-8 (1564-1621). Online publication date: 1-Jul-2021
  • (2019) Visual Analytic System for Subject Matter Expert Document Tagging using Information Retrieval and Semi-Supervised Machine Learning. 2019 23rd International Conference Information Visualisation (IV), DOI:10.1109/IV.2019.00047 (234-240). Online publication date: Jul-2019
  • (2018) Doc2Cube: Allocating Documents to Text Cube Without Labeled Data. 2018 IEEE International Conference on Data Mining (ICDM), DOI:10.1109/ICDM.2018.00169 (1260-1265). Online publication date: Nov-2018
  • (2018) Hierarchy construction and text classification based on the relaxation strategy and least information model. Expert Systems with Applications: An International Journal 100:C, DOI:10.1016/j.eswa.2018.02.003 (157-164). Online publication date: 15-Jun-2018
  • (2017) WikiLDA. Proceedings of the 9th Knowledge Capture Conference, DOI:10.1145/3148011.3154465 (1-4). Online publication date: 4-Dec-2017
  • (2016) On Horizontal and Vertical Separation in Hierarchical Text Classification. Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, DOI:10.1145/2970398.2970408 (185-194). Online publication date: 12-Sep-2016
  • (2015) A new term-weighting scheme for text classification using the odds of positive and negative class probabilities. Journal of the Association for Information Science and Technology 66:12, DOI:10.1002/asi.23338 (2553-2565). Online publication date: 1-Dec-2015
  • (2013) Structured summarization for news events. Proceedings of the 22nd International Conference on World Wide Web, DOI:10.1145/2487788.2487940 (343-348). Online publication date: 13-May-2013
