Abstract
In this paper we tackle lemmatization in inflectional languages. We introduce a new algorithm that uses vector models of words. Current approaches require either the full grammar rules or a translation matrix between each word and its base form. However, this information is already encoded in natural text. Our solution uses text corpora to build vector models of words, together with a small amount of user input, to infer lemmas. We have evaluated our approach on the Slovak language and report findings on its feasibility for real-world use.
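The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way word vectors and a few user-supplied reference pairs could be combined to infer a lemma. It assumes the well-known vector-offset (analogy) idea: average the offset from inflected form to lemma over the reference pairs, apply it to the query word, and pick the nearest vocabulary word by cosine similarity. The toy two-dimensional vectors, the `TOY_VECTORS` table, and the `infer_lemma` helper are hypothetical and are not taken from the paper.

```python
import math

# Toy 2-D word vectors, invented for illustration only. Real vectors would be
# trained on a large text corpus.
TOY_VECTORS = {
    "mesto": (1.0, 0.0),   # lemma ("city")
    "mesta": (1.0, 1.0),   # inflected form
    "slovo": (3.0, 0.2),   # lemma ("word")
    "slova": (3.0, 1.2),   # inflected form
    "okno":  (5.0, -0.1),  # lemma ("window")
    "okna":  (5.0, 0.9),   # inflected form
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def infer_lemma(word, reference_pairs, vectors=TOY_VECTORS):
    """Hypothetical helper: guess the lemma of `word` by vector offset.

    `reference_pairs` is a small, non-empty list of (inflected, lemma) pairs
    supplied by the user; every word involved must be present in `vectors`.
    """
    dims = len(vectors[word])
    # Mean offset that moves an inflected form onto its lemma.
    offset = [0.0] * dims
    for inflected, lemma in reference_pairs:
        for i in range(dims):
            diff = vectors[lemma][i] - vectors[inflected][i]
            offset[i] += diff / len(reference_pairs)
    # Shift the query word by that offset and look for the closest other word.
    target = [v + o for v, o in zip(vectors[word], offset)]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(vectors[w], target))


if __name__ == "__main__":
    pairs = [("mesta", "mesto"), ("slova", "slovo")]
    print(infer_lemma("okna", pairs))  # prints "okno" for these toy vectors
```

With vectors trained on a real corpus, the nearest neighbour is not guaranteed to be the correct lemma; how often it is depends on the corpus and on the chosen reference pairs.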
Acknowledgments
This work was partially supported by the Scientific Grant Agency of the Slovak Republic, grant No. VG 1/0646/15, and the Cultural and Educational Grant Agency of the Slovak Republic, grant No. KEGA 009STU-4/2014.
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gallay, L., Šimko, M. (2016). Utilizing Vector Models for Automatic Text Lemmatization. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science, vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_43
DOI: https://doi.org/10.1007/978-3-662-49192-8_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49191-1
Online ISBN: 978-3-662-49192-8
eBook Packages: Computer Science, Computer Science (R0)