Utilizing Vector Models for Automatic Text Lemmatization

Ladislav Gallay &
Marián Šimko

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9587)

Included in the following conference series:

International Conference on Current Trends in Theory and Practice of Informatics

Abstract

In this paper we tackle the problem of lemmatization of inflectional languages. We introduce a new algorithm that utilizes vector models of words. Current approaches in this area depend on knowing either the full grammar rules or the translation matrix between a word and its basic form. However, this information is already encoded in natural text. Our solution uses text corpora to build vector models of words and a small amount of user input to infer lemmas. We have evaluated our approach on the Slovak language and present findings on its feasibility for real-world use.
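
To make the idea of inferring lemmas from word vectors plus a little user input more concrete, here is a minimal sketch. It is not the authors' algorithm, which the abstract does not specify. It assumes a simple vector-offset approach: average the difference between a few user-supplied (inflected form, lemma) pairs, add that offset to the query word's vector, and pick the nearest word by cosine similarity. The three-dimensional vectors, the `VECTORS` table, and the function names `cosine` and `infer_lemma` are all invented for illustration; a real model would be trained on a large corpus.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# Second coordinate loosely plays the role of an "inflected" feature.
VECTORS = {
    "mesto": [1.0, 0.0, 0.0],  # lemma ("town")
    "mesta": [1.0, 1.0, 0.0],  # inflected form of "mesto"
    "okno":  [0.5, 0.0, 0.5],  # lemma ("window")
    "okna":  [0.5, 1.0, 0.5],  # inflected form of "okno"
    "auto":  [0.0, 0.0, 1.0],  # lemma ("car")
    "auta":  [0.0, 1.0, 1.0],  # inflected form of "auto"
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def infer_lemma(word, seed_pairs, vectors):
    """Guess the lemma of `word` from user-supplied (form, lemma) seed pairs.

    Assumption: the average offset lemma - form over the seed pairs
    approximates the direction from an inflected form to its lemma.
    """
    dim = len(vectors[word])
    offset = [0.0] * dim
    for form, lemma in seed_pairs:
        for i in range(dim):
            offset[i] += (vectors[lemma][i] - vectors[form][i]) / len(seed_pairs)

    # Shift the query word by the average offset and find the closest
    # vocabulary entry other than the query word itself.
    target = [w + o for w, o in zip(vectors[word], offset)]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine(vectors[w], target))


if __name__ == "__main__":
    seeds = [("mesta", "mesto"), ("okna", "okno")]  # the "user input"
    print(infer_lemma("auta", seeds, VECTORS))  # auto
```

With only two seed pairs the toy model recovers "auto" for "auta" because the vectors were constructed to make the offset exact; with vectors learned from real text the nearest neighbour would be one candidate among several and would need ranking or filtering.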

Acknowledgments

This work was partially supported by the Scientific Grant Agency of the Slovak Republic, grant No. VG 1/0646/15, and the Cultural and Educational Grant Agency of the Slovak Republic, grant No. KEGA 009STU-4/2014.

Author information

Authors and Affiliations

Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, Ilkovičova 2, 842 16, Bratislava, Slovakia
Ladislav Gallay & Marián Šimko

Corresponding author

Correspondence to Marián Šimko.

Editor information

Editors and Affiliations

University of Latvia, Riga, Latvia
Rūsiņš Mārtiņš Freivalds
University of Paderborn, Paderborn, Germany
Gregor Engels
University of Genoa, Genoa, Italy
Barbara Catania

About this paper

Cite this paper

Gallay, L., Šimko, M. (2016). Utilizing Vector Models for Automatic Text Lemmatization. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science, vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_43

DOI: https://doi.org/10.1007/978-3-662-49192-8_43
Published: 08 January 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49191-1
Online ISBN: 978-3-662-49192-8
eBook Packages: Computer Science, Computer Science (R0)
