Abstract
This paper presents a study of the best strategy for developing a fast and accurate HMM tagger when only a limited amount of training material is available. This is a crucial factor when dealing with languages for which annotated material is not easily available.
First, we run experiments on English, using the WSJ corpus as a test bench to establish the differences caused by using a large or a small training set. Then, we port the results to develop an accurate Spanish PoS tagger using a limited amount of training data.
Different configurations of an HMM tagger are studied: trigram and 4-gram models are tested, as well as different smoothing techniques. The performance of each configuration is measured as a function of the size of the training corpus, in order to determine the most appropriate setting for developing HMM PoS taggers for languages with little annotated corpus available.
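As a rough illustration of why smoothing matters when the training corpus is small, the sketch below estimates trigram tag-transition probabilities with add-one (Laplace) smoothing. The abstract does not say which smoothing techniques the paper compares, so the choice of add-one smoothing, the function names, and the toy corpus are all assumptions made for this example only.

```python
from collections import Counter

def train_trigram_counts(tagged_sentences):
    """Count tag trigrams and their two-tag contexts.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    The "<s>" padding symbol is a hypothetical choice for this sketch.
    """
    trigram_counts = Counter()
    context_counts = Counter()
    tagset = set()
    for sentence in tagged_sentences:
        sentence_tags = [tag for _, tag in sentence]
        tagset.update(sentence_tags)
        padded = ["<s>", "<s>"] + sentence_tags
        for i in range(2, len(padded)):
            trigram_counts[(padded[i - 2], padded[i - 1], padded[i])] += 1
            context_counts[(padded[i - 2], padded[i - 1])] += 1
    return trigram_counts, context_counts, tagset

def laplace_transition(t1, t2, t3, trigram_counts, context_counts, tagset):
    """Estimate P(t3 | t1, t2) with add-one smoothing over the tag set."""
    numerator = trigram_counts[(t1, t2, t3)] + 1
    denominator = context_counts[(t1, t2)] + len(tagset)
    return numerator / denominator

# Toy corpus (invented for illustration, not taken from the paper).
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
tri, ctx, tags = train_trigram_counts(toy_corpus)

# A trigram seen in training keeps most of the probability mass...
seen = laplace_transition("DT", "NN", "VBZ", tri, ctx, tags)
# ...while an unseen trigram still receives a small non-zero probability.
unseen = laplace_transition("DT", "NN", "DT", tri, ctx, tags)
print(seen, unseen)
```

With a small corpus, many valid tag trigrams never occur in training. Without smoothing they would get probability zero and be ruled out during decoding, which is one reason the choice of smoothing technique can matter more as the training set shrinks.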
This research has been partially supported by the European Commission (Meaning, IST-2001-34460) and by the Catalan Government Research Department (DURSI).
© 2004 Springer-Verlag Berlin Heidelberg
Padró, M., Padró, L. (2004). Developing Competitive HMM PoS Taggers Using Small Training Corpora. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_12
DOI: https://doi.org/10.1007/978-3-540-30228-5_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23498-2
Online ISBN: 978-3-540-30228-5
eBook Packages: Springer Book Archive