Neural Named Entity Recognition for Kazakh

Gulmira Tolegen, Alymzhan Toleu, Orken Mamyrbayev & Rustam Mussabayev

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13452)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

We present several neural networks for named entity recognition (NER) in morphologically complex languages (MCLs). Kazakh is one such language: each root or stem can produce hundreds or thousands of variant word forms. This can lead to a serious data sparsity problem, which may prevent deep learning models from being trained well for under-resourced MCLs. To model MCL words effectively, we introduce root and entity tag embeddings plus a tensor layer into the neural networks. These additions significantly improve NER performance for MCLs. The proposed models outperform state-of-the-art approaches, including character-based ones, and can potentially be applied to other morphologically complex languages.

Notes

1. The dictionary is extracted from the training data after pre-processing, namely lowercasing and word stemming. Words outside this dictionary are replaced by a single special symbol.
2. Words that fall outside the sentence boundaries are mapped to one of two special symbols, namely the “start” and “end” symbols.
3. www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt
4. It often appears when the organization name is given after someone’s name.
5. https://code.google.com/p/word2vec/
6. To reduce the dictionary size for roots and surface words, we applied pre-processing, namely lowercasing and word stemming with a morphological analyzer and disambiguator.
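
The pre-processing described in notes 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the symbol names `<unk>`, `<start>` and `<end>`, the function names, the window size, and the identity `stem` placeholder (standing in for the morphological analyzer mentioned in note 6) are all assumptions made for the example.

```python
# Hypothetical names for the special symbols; the paper does not specify them.
UNK, START, END = "<unk>", "<start>", "<end>"


def normalize(word, stem=lambda w: w):
    """Lowercase a word, then stem it (identity stemmer by default)."""
    return stem(word.lower())


def build_dictionary(train_sentences, stem=lambda w: w):
    """Collect the set of normalized word forms seen in the training data."""
    return {normalize(w, stem) for sent in train_sentences for w in sent}


def map_to_dictionary(sentence, dictionary, stem=lambda w: w):
    """Normalize each word and replace out-of-dictionary words with UNK."""
    normalized = (normalize(w, stem) for w in sentence)
    return [w if w in dictionary else UNK for w in normalized]


def context_windows(tokens, size=1):
    """Return one window of 2*size+1 tokens per position, padding positions
    beyond the sentence boundaries with the START and END symbols."""
    padded = [START] * size + list(tokens) + [END] * size
    return [padded[i:i + 2 * size + 1] for i in range(len(tokens))]


train = [["Almaty", "qalasy"], ["Astana", "qalasy"]]
dictionary = build_dictionary(train)
mapped = map_to_dictionary(["Almaty", "universiteti"], dictionary)
windows = context_windows(mapped, size=1)
```

Here `mapped` keeps the known word in lowercase and replaces the unseen one with the unknown symbol, and each window in `windows` is centred on one token of the sentence. A real stemmer would be passed in through the `stem` argument.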

Acknowledgments

This work was funded by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under grant AP09259324.

Author information

Authors and Affiliations

Institute of Information and Computational Technologies, Almaty, Kazakhstan
Gulmira Tolegen, Alymzhan Toleu, Orken Mamyrbayev & Rustam Mussabayev

Corresponding author

Correspondence to Alymzhan Toleu.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Tolegen, G., Toleu, A., Mamyrbayev, O., Mussabayev, R. (2023). Neural Named Entity Recognition for Kazakh. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_1

DOI: https://doi.org/10.1007/978-3-031-24340-0_1
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24339-4
Online ISBN: 978-3-031-24340-0
eBook Packages: Computer Science, Computer Science (R0)
