Abstract
Automatic morphosyntactic tagging of corpora is usually imperfect, and wrong or strange taggings may recur systematically, following certain patterns. Detecting all of these errors manually is usually hard, because corpora may contain millions of tags. This paper presents an approach for detecting sequences of part-of-speech tags that exhibit internal cohesiveness in corpora. Some of these sequences correspond to syntactic chunks or otherwise correct sequences, but others are strange or incorrect, usually as a result of systematically wrong tagging. Separating incorrect bigrams and trigrams from correct ones takes very little time, yet it allows us to detect 70% of all tagging errors in the corpus.
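The idea of ranking tag sequences by cohesiveness can be illustrated with a minimal sketch. The abstract does not state which cohesiveness measure is used, so the code below assumes one common association measure, the symmetric conditional probability SCP(x, y) = p(x, y)² / (p(x) p(y)), estimated from bigram counts. It covers tag bigrams only, and the function names and the toy corpus are hypothetical.

```python
from collections import Counter


def tag_bigram_cohesion(tagged_sentences):
    """Score every tag bigram with SCP = count(x, y)^2 / (left(x) * right(y)).

    Assumption: SCP stands in for the cohesiveness measure, which the
    abstract does not specify. left(x) counts bigrams starting with x,
    right(y) counts bigrams ending with y, so scores fall in (0, 1].
    """
    bigrams = Counter()
    for tags in tagged_sentences:
        bigrams.update(zip(tags, tags[1:]))

    left = Counter()
    right = Counter()
    for (x, y), count in bigrams.items():
        left[x] += count
        right[y] += count

    return {
        (x, y): count ** 2 / (left[x] * right[y])
        for (x, y), count in bigrams.items()
    }


def rank_for_review(scores, top=10):
    """Return the most cohesive tag bigrams, highest score first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical toy corpus: each sentence is a list of part-of-speech tags.
corpus = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN"],
    ["DET", "NOUN"],
]
scores = tag_bigram_cohesion(corpus)
for bigram, score in rank_for_review(scores):
    print(bigram, round(score, 3))
```

A reviewer would then inspect the top of the ranked list by hand and mark each sequence as a legitimate pattern or as a symptom of systematic mistagging. Trigrams would need a measure generalized beyond two elements, which this sketch does not attempt.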
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rocio, V., Silva, J., Lopes, G. (2007). Detection of Strange and Wrong Automatic Part-of-Speech Tagging. In: Neves, J., Santos, M.F., Machado, J.M. (eds) Progress in Artificial Intelligence. EPIA 2007. Lecture Notes in Computer Science, vol 4874. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77002-2_57
DOI: https://doi.org/10.1007/978-3-540-77002-2_57
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77000-8
Online ISBN: 978-3-540-77002-2
eBook Packages: Computer Science, Computer Science (R0)