DOI:10.1145/2348283.2348411
A utility-theoretic ranking method for semi-automated text classification

Published: 12 August 2012

Abstract

In Semi-Automated Text Classification (SATC) an automatic classifier F labels a set of unlabelled documents D, following which a human annotator inspects (and corrects when appropriate) the labels attributed by F to a subset D' of D, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents in a way that maximizes the expected increase in effectiveness that derives from inspecting D'. An obvious strategy is to rank D so that the documents that F has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop a new utility-theoretic ranking method based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive from inspecting and correcting a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a ranked list generated by a given ranking method. We report the results of experiments showing that, with respect to the confidence-based baseline above, and according to the proposed measure, our ranking method can achieve substantially higher expected reductions in classification error.
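The contrast between the confidence baseline and a utility-theoretic ranking can be sketched in a few lines. The following is a hypothetical illustration, not the paper's exact formulation: the document dictionaries, the `impact` weights, and both function names are assumptions, with expected inspection gain approximated as the probability that a label is wrong (one minus the classifier's confidence) multiplied by a class-dependent estimate of how much correcting that kind of error would improve effectiveness.

```python
def rank_by_confidence(docs):
    """Baseline: inspect the least-confident predictions first."""
    return sorted(docs, key=lambda d: d["confidence"])

def rank_by_expected_gain(docs, impact):
    """Utility-theoretic ranking (illustrative): rank by expected
    inspection gain, i.e. the probability the assigned label is wrong
    (1 - confidence) times the class-dependent effectiveness improvement
    that correcting that kind of error would bring."""
    def gain(d):
        return (1.0 - d["confidence"]) * impact[d["predicted"]]
    return sorted(docs, key=gain, reverse=True)

docs = [
    {"id": 1, "predicted": "pos", "confidence": 0.55},
    {"id": 2, "predicted": "neg", "confidence": 0.60},
    {"id": 3, "predicted": "pos", "confidence": 0.95},
]
# Assumed weights: correcting a wrongly predicted negative is taken to
# improve effectiveness twice as much as correcting a wrong positive.
impact = {"pos": 1.0, "neg": 2.0}

print([d["id"] for d in rank_by_confidence(docs)])            # [1, 2, 3]
print([d["id"] for d in rank_by_expected_gain(docs, impact)])  # [2, 1, 3]
```

Under these assumed weights the two rankings disagree: the baseline inspects document 1 first (lowest confidence), while the gain-based ranking promotes document 2, whose correction is expected to reduce classification error more.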



Published In

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
August 2012
1236 pages
ISBN:9781450314725
DOI:10.1145/2348283

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cost-sensitive learning
  2. ranking
  3. semi-automated text classification
  4. supervised learning
  5. text classification

Qualifiers

  • Research-article

Conference

SIGIR '12

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%


Cited By

  • (2020) How the Accuracy and Confidence of Sensitivity Classification Affects Digital Sensitivity Review. ACM Transactions on Information Systems, 39(1):1-34. DOI:10.1145/3417334. Online publication date: 12-Oct-2020
  • (2019) How Sensitivity Classification Effectiveness Impacts Reviewers in Technology-Assisted Sensitivity Review. Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pages 337-341. DOI:10.1145/3295750.3298962. Online publication date: 8-Mar-2019
  • (2018) Towards Maximising Openness in Digital Sensitivity Review Using Reviewing Time Predictions. Advances in Information Retrieval, pages 699-706. DOI:10.1007/978-3-319-76941-7_65. Online publication date: 1-Mar-2018
  • (2018) Active Learning Strategies for Technology Assisted Sensitivity Review. Advances in Information Retrieval, pages 439-453. DOI:10.1007/978-3-319-76941-7_33. Online publication date: 1-Mar-2018
  • (2017) Enhancing Sensitivity Classification with Semantic Features Using Word Embeddings. Advances in Information Retrieval, pages 450-463. DOI:10.1007/978-3-319-56608-5_35. Online publication date: 8-Apr-2017
  • (2015) Semi-Automated Text Classification for Sensitivity Identification. Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1711-1714. DOI:10.1145/2806416.2806597. Online publication date: 17-Oct-2015
  • (2015) Utility-Theoretic Ranking for Semiautomated Text Classification. ACM Transactions on Knowledge Discovery from Data, 10(1):1-32. DOI:10.1145/2742548. Online publication date: 22-Jul-2015
  • (2014) Optimising human inspection work in automated verbatim coding. International Journal of Market Research, 56(4):489-512. DOI:10.2501/IJMR-2014-032. Online publication date: 1-Jul-2014
  • (2013) Document Difficulty Framework for Semi-automatic Text Classification. Proceedings of the 15th International Conference on Data Warehousing and Knowledge Discovery, pages 110-121. DOI:10.1007/978-3-642-40131-2_10. Online publication date: 26-Aug-2013
