Developing Competitive HMM PoS Taggers Using Small Training Corpora

Muntsa Padró &
Lluís Padró

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3230)

Included in the following conference series:

International Conference on Natural Language Processing (in Spain)

Abstract

This paper presents a study of the best strategy for developing a fast and accurate HMM tagger when only a limited amount of training material is available. This is a crucial factor when dealing with languages for which annotated material is not easily available.

First, we run experiments on English, using the WSJ corpus as a test bench to establish the differences caused by using a large or a small training set. Then, we port the results to develop an accurate Spanish PoS tagger using a limited amount of training data.

Different configurations of an HMM tagger are studied. Namely, trigram and 4-gram models are tested, as well as different smoothing techniques. The performance of each configuration is measured as a function of the training corpus size, in order to determine the most appropriate setting for developing HMM PoS taggers for languages with few annotated corpora available.
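To make the role of model order and smoothing concrete, the following minimal sketch estimates smoothed tag-transition probabilities for an n-gram HMM. It is an illustration only, not the configuration evaluated in the paper: the abstract does not name the smoothing techniques compared, so additive (Laplace-style) smoothing is assumed here, and the function names, the `<s>` padding symbol, and the toy corpus are all hypothetical.

```python
from collections import Counter

def train_transitions(tag_sequences, order=3):
    """Count tag n-grams and their (n-1)-gram contexts, padding each sentence with <s>."""
    ngrams, contexts, tagset = Counter(), Counter(), set()
    for tags in tag_sequences:
        padded = ["<s>"] * (order - 1) + list(tags)
        tagset.update(tags)
        for i in range(order - 1, len(padded)):
            context = tuple(padded[i - order + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts, tagset

def transition_prob(tag, context, ngrams, contexts, tagset, alpha=1.0):
    """Additive-smoothed P(tag | context); alpha=1.0 gives add-one (Laplace) smoothing."""
    context = tuple(context)
    numerator = ngrams[context + (tag,)] + alpha
    denominator = contexts[context] + alpha * len(tagset)
    return numerator / denominator

# Toy training data (hypothetical): two tagged sentences, tags only.
corpus = [["DT", "NN", "VB"], ["DT", "JJ", "NN", "VB"]]
ngrams, contexts, tagset = train_transitions(corpus, order=3)

# A trigram seen in training versus one that was never observed.
p_seen = transition_prob("VB", ("DT", "NN"), ngrams, contexts, tagset)
p_unseen = transition_prob("JJ", ("DT", "NN"), ngrams, contexts, tagset)
```

Passing `order=4` yields the 4-gram variant. With a small corpus most higher-order contexts are rare or unseen, so the smoothing term dominates the estimate, which is why the choice of smoothing matters more as the training set shrinks.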

This research has been partially supported by the European Commission (Meaning, IST-2001-34460) and by the Catalan Government Research Department (DURSI).

Author information

Authors and Affiliations

TALP Research Center, Universitat Politècnica de Catalunya, Spain
Muntsa Padró & Lluís Padró

Editor information

Editors and Affiliations

Department of Software and Computing Systems, University of Alicante, Spain
José Luis Vicedo
Natural Language Processing and Information Systems Group, Department of Software and Computing Systems, University of Alicante, Spain
Patricio Martínez-Barco
Grupo de investigación del Procesamiento del Lenguaje y Sistemas de Información, Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
Rafael Muñoz
Departamento de Lenguajes y Sistemas Informáticos, Carretera de San Vicente del Raspeig, Universidad de Alicante, 03690 San Vicente del Raspeig, Alicante, Spain
Maximiliano Saiz Noeda

About this paper

Cite this paper

Padró, M., Padró, L. (2004). Developing Competitive HMM PoS Taggers Using Small Training Corpora. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_12

DOI: https://doi.org/10.1007/978-3-540-30228-5_12
Published: 20 October 2004
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23498-2
Online ISBN: 978-3-540-30228-5
eBook Packages: Springer Book Archive
