
Article

A study of thresholding strategies for text categorization

Author:

Yiming Yang

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 137 - 145

https://doi.org/10.1145/383952.383975

Published: 01 September 2001

Abstract

Thresholding strategies in automated text categorization are an underexplored area of research. This paper presents an examination of the effect of thresholding strategies on the performance of a classifier under various conditions. Using k-Nearest Neighbor (kNN) as the classifier and five evaluation benchmark collections as the testbeds, three common thresholding methods were investigated: rank-based thresholding (RCut), proportion-based assignments (PCut) and score-based local optimization (SCut). In addition, new variants of these methods are proposed to overcome significant problems in the existing approaches. Experimental results show that the choice of thresholding strategy can significantly influence the performance of kNN, and that the "optimal" strategy may vary by application. SCut is potentially better for fine-tuning but risks overfitting. PCut copes better with rare categories and exhibits a smoother trade-off in recall versus precision, but is not suitable for online decision making. RCut is most natural for online response but is too coarse-grained for global or local optimization. RTCut, a new method combining the strength of category ranking and scoring, outperforms both PCut and RCut significantly.
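
The abstract names the three baseline strategies without spelling them out, so the following is only a minimal sketch of how they are commonly described, not the paper's implementation. It assumes a document-by-category score matrix from some classifier. The function names, the simplified PCut quota `k_c = round(x * priors[c] * n_docs)`, and the use of F1 as the SCut tuning criterion are illustrative assumptions. RTCut is not sketched, since the abstract does not define it in enough detail.

```python
from typing import List

Scores = List[List[float]]   # scores[d][c]: classifier score of document d for category c
Labels = List[List[bool]]    # labels[d][c]: True if document d belongs to category c


def rcut(scores: Scores, t: int) -> Labels:
    """Rank-based (RCut): each document receives its t top-scoring categories."""
    decisions = []
    for row in scores:
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:t]
        decisions.append([c in top for c in range(len(row))])
    return decisions


def pcut(scores: Scores, priors: List[float], x: float) -> Labels:
    """Proportion-based (PCut): category c receives its k_c top-scoring documents.

    Assumption: k_c = round(x * priors[c] * n_docs), a simplified quota in
    which priors[c] is the training-set prior of category c and x is a
    tunable proportionality constant.
    """
    n_docs, n_cats = len(scores), len(priors)
    decisions = [[False] * n_cats for _ in range(n_docs)]
    for c in range(n_cats):
        k = round(x * priors[c] * n_docs)
        ranked = sorted(range(n_docs), key=lambda d: scores[d][c], reverse=True)
        for d in ranked[:k]:
            decisions[d][c] = True
    return decisions


def f1(predicted: List[bool], actual: List[bool]) -> float:
    """F1 for one category; defined as 0.0 when there are no true positives."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def scut(val_scores: Scores, val_labels: Labels) -> List[float]:
    """Score-based (SCut): one threshold per category, tuned on validation data.

    Assumption: candidate thresholds are the observed scores, and the lowest
    candidate that maximizes F1 is kept.
    """
    thresholds = []
    for c in range(len(val_scores[0])):
        column = [row[c] for row in val_scores]
        truth = [row[c] for row in val_labels]
        best = max(sorted(set(column)),
                   key=lambda th: f1([s >= th for s in column], truth))
        thresholds.append(best)
    return thresholds
```

The sketch makes the trade-offs in the abstract concrete: `rcut` needs only one document's scores, so it can run online; `pcut` ranks all documents for a category at once, so it needs a batch; and `scut` fits a separate threshold per category on held-out data, which is where the overfitting risk for rare categories comes from.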

Published In

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

September 2001

454 pages

ISBN:1581133316

DOI:10.1145/383952

Chairmen:
Donald H. Kraft, Louisiana State Univ.
W. Bruce Croft, University of Massachusetts (For the Americas)
David J. Harper, The Robert Gordon University (For Europe and Africa)
Justin Zobel, RMIT University (For Asia and Australasia)

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SIGIR01: 24th ACM/SIGIR International Conference on Research and Development in Information Retrieval
New Orleans, Louisiana, USA
Sponsor: SIGIR

Acceptance Rates

SIGIR '01 Paper Acceptance Rate 47 of 201 submissions, 23%;

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

Total Citations: 250
Total Downloads: 1,737
Downloads (Last 12 months): 39
Downloads (Last 6 weeks): 1

Reflects downloads up to 28 Jan 2025
