Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts Under Precision and Recall Constraints

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11057)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

Digital libraries strive to integrate automatic subject indexing methods into operative information retrieval systems, yet integration is prevented by misleading and incomplete semantic annotations. For this reason, we investigate approaches to detect documents where quality criteria are met. In contrast to mainstream methods, our approach, named Qualle, estimates quality at the document level rather than the concept level. Qualle combines different machine learning models in a deep, multi-layered regression architecture that comprises a variety of content-based indicators, in particular label set size calibration. We evaluated the approach on very short texts from law and economics, investigating the impact of different feature groups on recall estimation. Our results show that Qualle effectively determined subsets of previously unseen data where considerable gains in document-level recall can be achieved while precision is upheld. Such filtering can therefore be used to control compliance with data quality standards in practice. Qualle allows trade-offs to be made between indexing quality and collection coverage, and it can complement semi-automatic indexing to process large datasets more efficiently.
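The filtering idea in the abstract can be illustrated with a minimal sketch. It assumes that some regression model (not shown here, and not the Qualle architecture itself) has already produced one estimated recall value per document. The function names, the threshold, and the toy numbers are hypothetical and only show how a threshold on the estimate trades collection coverage against the quality of the retained subset.

```python
from typing import List, Tuple


def filter_by_estimated_recall(
    estimated_recall: List[float], threshold: float
) -> List[int]:
    """Return indices of documents whose estimated recall meets the threshold."""
    return [i for i, r in enumerate(estimated_recall) if r >= threshold]


def coverage_and_quality(
    estimated_recall: List[float],
    true_recall: List[float],
    true_precision: List[float],
    threshold: float,
) -> Tuple[float, float, float]:
    """Return (coverage, mean true recall, mean true precision) of the kept subset."""
    kept = filter_by_estimated_recall(estimated_recall, threshold)
    if not kept:
        # Nothing passes the filter: zero coverage, no quality to report.
        return 0.0, 0.0, 0.0
    coverage = len(kept) / len(estimated_recall)
    mean_recall = sum(true_recall[i] for i in kept) / len(kept)
    mean_precision = sum(true_precision[i] for i in kept) / len(kept)
    return coverage, mean_recall, mean_precision


# Toy values, made up purely for illustration (not results from the paper).
estimated = [0.9, 0.2, 0.6, 0.4]         # hypothetical model output per document
actual_recall = [0.8, 0.3, 0.7, 0.2]     # document-level recall against gold labels
actual_precision = [0.7, 0.6, 0.8, 0.5]  # document-level precision against gold labels

for t in (0.0, 0.5):
    cov, rec, prec = coverage_and_quality(
        estimated, actual_recall, actual_precision, t
    )
    print(f"threshold={t:.1f} coverage={cov:.2f} recall={rec:.2f} precision={prec:.2f}")
```

Raising the threshold keeps fewer documents; whether the kept subset actually has higher recall depends on how well the estimates track the true values, which is why such a filter would be checked on held-out data.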

Notes

1. For brevity, the term subject may be omitted in subject indexing, subject indexer, ..., respectively.
2. The concept identifier 29638-6 refers to the concept “Low-interest-rate policy”.
3. If only ranking is relevant, rank-based correlation coefficients should be considered.
4. http://eurovoc.europa.eu/, accessed: 31.12.2017.
5. http://www.ke.tu-darmstadt.de/resources/eurlex, accessed: 31.12.2017.
6. http://zbw.eu/stw/version/latest/about.en.html, accessed: 09.01.2018.

Author information

Authors and Affiliations

ZBW – Leibniz Information Centre for Economics, Kiel, Germany
Martin Toepfer
University of Twente, Enschede, The Netherlands
Christin Seifert

Corresponding author

Correspondence to Martin Toepfer.

Editor information

Editors and Affiliations

University Carlos III, Madrid, Spain
Eva Méndez
USI, Università della Svizzera Italiana, Lugano, Switzerland
Fabio Crestani
INESC TEC, Faculty of Engineering, University of Porto, Porto, Portugal
Cristina Ribeiro
INESC TEC, Faculty of Engineering, University of Porto, Porto, Portugal
Gabriel David
INESC TEC, Faculty of Engineering, University of Porto, Porto, Portugal
João Correia Lopes

About this paper

Cite this paper

Toepfer, M., Seifert, C. (2018). Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts Under Precision and Recall Constraints. In: Méndez, E., Crestani, F., Ribeiro, C., David, G., Lopes, J. (eds) Digital Libraries for Open Knowledge. TPDL 2018. Lecture Notes in Computer Science, vol 11057. Springer, Cham. https://doi.org/10.1007/978-3-030-00066-0_1

DOI: https://doi.org/10.1007/978-3-030-00066-0_1
Published: 05 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00065-3
Online ISBN: 978-3-030-00066-0
eBook Packages: Computer Science, Computer Science (R0)
