Abstract
Manual document annotation is an essential technique for knowledge acquisition and capture. Creating high-quality annotations is difficult because of inter-annotator discrepancy: annotators can never agree completely on what to annotate and exactly how to annotate it. To address this, traditional document annotation involves multiple domain experts working on the same annotation task in an iterative and collaborative manner to identify and resolve discrepancies progressively. However, such a detailed process is often ineffective despite taking significant time and effort, and discrepancies remain high in many cases. This paper proposes an alternative approach to document annotation. The approach tackles the problem by first studying annotators’ suitability based on the types of information to be annotated; then identifying and isolating the most inconsistent annotators, who tend to cause the majority of discrepancies in a task; and finally distributing the annotation workload among the most suitable annotators. In a named entity annotation task in the domain of archaeology, we show that, compared to the traditional approach to document annotation, the proposed approach produces larger amounts of better-quality annotations, which result in higher machine learning accuracy, while requiring significantly less time and effort.
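The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how suitability or inconsistency is measured, so the sketch assumes mean pairwise F1 agreement per entity type as the suitability score, and simply keeps the highest-scoring annotators for each type. The annotator names, entity types, span tuples, and the `keep` parameter are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_f1(a, b):
    """Symmetric F1 agreement between two sets of annotations."""
    if not a and not b:
        return 1.0  # two empty sets agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def mean_agreement_by_type(annotations):
    """Score each annotator per entity type (assumed suitability measure).

    annotations: {annotator: set of (doc_id, start, end, entity_type)}
    returns:     {entity_type: {annotator: mean pairwise F1 with the others}}
    """
    types = {span[3] for spans in annotations.values() for span in spans}
    scores = {}
    for entity_type in types:
        # Restrict every annotator's output to the current entity type.
        per_type = {
            name: {s for s in spans if s[3] == entity_type}
            for name, spans in annotations.items()
        }
        collected = defaultdict(list)
        for x, y in combinations(per_type, 2):
            f1 = pairwise_f1(per_type[x], per_type[y])
            collected[x].append(f1)
            collected[y].append(f1)
        scores[entity_type] = {n: sum(v) / len(v) for n, v in collected.items()}
    return scores


def select_annotators(scores, keep=2):
    """For each entity type, keep the `keep` annotators with highest agreement."""
    return {
        entity_type: sorted(by_name, key=by_name.get, reverse=True)[:keep]
        for entity_type, by_name in scores.items()
    }


# Hypothetical pilot annotations from three annotators on one document.
example = {
    "A": {(1, 0, 5, "PERIOD"), (1, 10, 15, "PERIOD"), (1, 20, 25, "SITE")},
    "B": {(1, 0, 5, "PERIOD"), (1, 10, 15, "PERIOD"), (1, 20, 25, "SITE"),
          (1, 30, 35, "SITE")},
    "C": {(1, 0, 5, "PERIOD"), (1, 40, 45, "PERIOD"), (1, 50, 55, "SITE")},
}
scores = mean_agreement_by_type(example)
chosen = select_annotators(scores, keep=2)
```

In this toy data, annotator C agrees least with the others on both types, so C would be isolated and the workload for each type would go to A and B. A real study would compute the scores on a shared pilot subset before distributing the remaining documents.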
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, Z., Chapman, S., Ciravegna, F. (2010). A Methodology towards Effective and Efficient Manual Document Annotation: Addressing Annotator Discrepancy and Annotation Quality. In: Cimiano, P., Pinto, H.S. (eds) Knowledge Engineering and Management by the Masses. EKAW 2010. Lecture Notes in Computer Science, vol 6317. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16438-5_21
DOI: https://doi.org/10.1007/978-3-642-16438-5_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16437-8
Online ISBN: 978-3-642-16438-5
eBook Packages: Computer Science (R0)