A study of thresholding strategies for text categorization

Published: 01 September 2001

Abstract

Thresholding strategies in automated text categorization are an underexplored area of research. This paper examines the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the testbeds, three common thresholding methods were investigated: rank-based thresholding (RCut), proportion-based assignment (PCut), and score-based local optimization (SCut). In addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the "optimal" strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off between recall and precision, but is not suitable for online decision making. RCut is the most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strengths of category ranking and scoring, significantly outperforms both PCut and RCut.
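The three baseline strategies named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not give the formulas, so several details here are assumptions: the PCut quota is taken as round(x * prior * number of documents), SCut is tuned for per-category F1 on a validation set, and all function and parameter names (rcut, pcut, scut_fit, scut_apply, t, x) are hypothetical. RTCut is left out because the abstract does not define it.

```python
from typing import List

# scores[d][c] is the classifier's score for document d and category c.
Scores = List[List[float]]
Decisions = List[List[bool]]


def rcut(scores: Scores, t: int) -> Decisions:
    """Rank-based: each document receives its t highest-scoring categories."""
    decisions = []
    for row in scores:
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:t]
        decisions.append([c in top for c in range(len(row))])
    return decisions


def pcut(scores: Scores, priors: List[float], x: float) -> Decisions:
    """Proportion-based: category c takes its k best-scoring documents,
    with k proportional to the category's training-set prior.
    The whole batch must be available, so this cannot run online."""
    n_docs = len(scores)
    decisions = [[False] * len(priors) for _ in range(n_docs)]
    for c, prior in enumerate(priors):
        k = min(n_docs, round(x * prior * n_docs))  # assumed form of the quota
        ranked = sorted(range(n_docs), key=lambda d: scores[d][c], reverse=True)
        for d in ranked[:k]:
            decisions[d][c] = True
    return decisions


def f1(pred: List[bool], gold: List[bool]) -> float:
    """F1 for one category; defined as 0.0 when there are no true positives."""
    tp = sum(p and g for p, g in zip(pred, gold))
    fp = sum(p and not g for p, g in zip(pred, gold))
    fn = sum(g and not p for p, g in zip(pred, gold))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def scut_fit(scores: Scores, labels: Decisions) -> List[float]:
    """Score-based local optimization: for each category, pick the score
    threshold that maximizes F1 on validation data (assumed criterion)."""
    thresholds = []
    for c in range(len(scores[0])):
        col = [row[c] for row in scores]
        gold = [row[c] for row in labels]
        candidates = sorted(set(col))  # ties resolve to the lowest threshold
        thresholds.append(max(candidates, key=lambda th: f1([s >= th for s in col], gold)))
    return thresholds


def scut_apply(scores: Scores, thresholds: List[float]) -> Decisions:
    """Assign category c to a document when its score reaches the threshold."""
    return [[s >= th for s, th in zip(row, thresholds)] for row in scores]


# Toy data: 4 documents, 3 categories (made up for illustration).
scores = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.7, 0.6, 0.1], [0.2, 0.1, 0.5]]
labels = [[True, False, False], [False, True, False],
          [True, True, False], [False, False, True]]

rcut_out = rcut(scores, t=1)
pcut_out = pcut(scores, priors=[0.5, 0.5, 0.25], x=1.0)
thresholds = scut_fit(scores, labels)
scut_out = scut_apply(scores, thresholds)
```

On this toy data, RCut with t=1 misses the second label of the third document, which is one way to read the abstract's remark that rank-based cut-offs are coarse-grained. SCut fits the validation labels exactly here, but fitting and applying on the same data is precisely the setting in which the overfitting risk mentioned above would go unnoticed.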


Published In

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
September 2001
454 pages
ISBN: 1581133316
DOI: 10.1145/383952

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Article

Conference

SIGIR01

Acceptance Rates

SIGIR '01 paper acceptance rate: 47 of 201 submissions (23%)
Overall acceptance rate: 792 of 3,983 submissions (20%)
