DOI:10.1145/2348283.2348411
A utility-theoretic ranking method for semi-automated text classification

Published: 12 August 2012

Abstract

In Semi-Automated Text Classification (SATC) an automatic classifier F labels a set of unlabelled documents D, following which a human annotator inspects (and corrects when appropriate) the labels attributed by F to a subset D' of D, with the aim of improving the overall quality of the labelling. An automated system can support this process by ranking the automatically labelled documents in a way that maximizes the expected increase in effectiveness that derives from inspecting D'. An obvious strategy is to rank D so that the documents that F has classified with the lowest confidence are top-ranked. In this work we show that this strategy is suboptimal. We develop a new utility-theoretic ranking method based on the notion of inspection gain, defined as the improvement in classification effectiveness that would derive from inspecting and correcting a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially inspecting a ranked list generated by a given ranking method. We report the results of experiments showing that, with respect to the confidence-based baseline above, and according to the proposed measure, our ranking method can achieve substantially higher expected reductions in classification error.
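The contrast between the confidence baseline and a utility-theoretic ranking can be sketched in a few lines. The following is a hypothetical illustration, not the paper's exact formulation: the document dictionaries, the `impact` weights, and both function names are assumptions, with expected inspection gain approximated as the probability that a label is wrong (one minus the classifier's confidence) multiplied by a class-dependent estimate of how much correcting that kind of error would improve effectiveness.

```python
def rank_by_confidence(docs):
    """Baseline: inspect the least-confident predictions first."""
    return sorted(docs, key=lambda d: d["confidence"])

def rank_by_expected_gain(docs, impact):
    """Utility-theoretic ranking (illustrative): rank by expected
    inspection gain, i.e. the probability the assigned label is wrong
    (1 - confidence) times the class-dependent effectiveness improvement
    that correcting that kind of error would bring."""
    def gain(d):
        return (1.0 - d["confidence"]) * impact[d["predicted"]]
    return sorted(docs, key=gain, reverse=True)

docs = [
    {"id": 1, "predicted": "pos", "confidence": 0.55},
    {"id": 2, "predicted": "neg", "confidence": 0.60},
    {"id": 3, "predicted": "pos", "confidence": 0.95},
]
# Assumed weights: correcting a wrongly predicted negative is taken to
# improve effectiveness twice as much as correcting a wrong positive.
impact = {"pos": 1.0, "neg": 2.0}

print([d["id"] for d in rank_by_confidence(docs)])            # [1, 2, 3]
print([d["id"] for d in rank_by_expected_gain(docs, impact)])  # [2, 1, 3]
```

Under these assumed weights the two rankings disagree: the baseline inspects document 1 first (lowest confidence), while the gain-based ranking promotes document 2, whose correction is expected to reduce classification error more.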



Published In

SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
August 2012
1236 pages
ISBN:9781450314725
DOI:10.1145/2348283

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cost-sensitive learning
  2. ranking
  3. semi-automated text classification
  4. supervised learning
  5. text classification

Qualifiers

  • Research-article

Conference

SIGIR '12

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%


Cited By

  • (2020) How the Accuracy and Confidence of Sensitivity Classification Affects Digital Sensitivity Review. ACM Transactions on Information Systems, 39(1):1-34. DOI:10.1145/3417334. Online publication date: 12-Oct-2020
  • (2019) How Sensitivity Classification Effectiveness Impacts Reviewers in Technology-Assisted Sensitivity Review. Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pages 337-341. DOI:10.1145/3295750.3298962. Online publication date: 8-Mar-2019
  • (2018) Towards Maximising Openness in Digital Sensitivity Review Using Reviewing Time Predictions. Advances in Information Retrieval, pages 699-706. DOI:10.1007/978-3-319-76941-7_65. Online publication date: 1-Mar-2018
  • (2018) Active Learning Strategies for Technology Assisted Sensitivity Review. Advances in Information Retrieval, pages 439-453. DOI:10.1007/978-3-319-76941-7_33. Online publication date: 1-Mar-2018
  • (2017) Enhancing Sensitivity Classification with Semantic Features Using Word Embeddings. Advances in Information Retrieval, pages 450-463. DOI:10.1007/978-3-319-56608-5_35. Online publication date: 8-Apr-2017
  • (2015) Semi-Automated Text Classification for Sensitivity Identification. Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 1711-1714. DOI:10.1145/2806416.2806597. Online publication date: 17-Oct-2015
  • (2015) Utility-Theoretic Ranking for Semiautomated Text Classification. ACM Transactions on Knowledge Discovery from Data, 10(1):1-32. DOI:10.1145/2742548. Online publication date: 22-Jul-2015
  • (2014) Optimising human inspection work in automated verbatim coding. International Journal of Market Research, 56(4):489-512. DOI:10.2501/IJMR-2014-032. Online publication date: 1-Jul-2014
  • (2013) Document Difficulty Framework for Semi-automatic Text Classification. Proceedings of the 15th International Conference on Data Warehousing and Knowledge Discovery, pages 110-121. DOI:10.1007/978-3-642-40131-2_10. Online publication date: 26-Aug-2013
