Abstract
We present several neural networks for named entity recognition (NER) in morphologically complex languages (MCLs). Kazakh is one such language: each root or stem can produce hundreds or thousands of variant word forms. This property can cause a serious data sparsity problem, which may prevent deep learning models from being trained well for under-resourced MCLs. To model the words of MCLs effectively, we introduce root and entity tag embeddings plus a tensor layer into the neural networks. These additions significantly improve NER performance for MCLs. The proposed models outperform state-of-the-art approaches, including character-based ones, and can potentially be applied to other morphologically complex languages.
Notes
- 1.
The dictionary is extracted from the training data after some pre-processing, namely lowercasing and word stemming. Words outside this dictionary are replaced by a single special symbol.
- 2.
Words that fall outside the sentence boundaries are mapped to one of two special symbols, namely the “start” and “end” symbols.
- 3.
- 4.
It often appears when the organization name is given after someone’s name.
- 5.
- 6.
To reduce the dictionary size for roots and surface words, we applied some pre-processing, namely lowercasing and word stemming with a morphological analyzer and disambiguator.
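The pre-processing described in notes 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the symbol names `<unk>`, `<start>`, and `<end>`, the function names, and the window size are all assumptions, and the stemming step is omitted because it depends on a morphological analyzer that is not specified here.

```python
from collections import Counter

# Hypothetical names for the special symbols mentioned in the notes.
UNK, START, END = "<unk>", "<start>", "<end>"


def normalize(word):
    # Lowercasing only. The notes also mention stemming with a
    # morphological analyzer, which is left out of this sketch.
    return word.lower()


def build_vocab(train_sentences, min_count=1):
    """Build a word-to-id dictionary from training sentences only."""
    counts = Counter(normalize(w) for sent in train_sentences for w in sent)
    vocab = {UNK: 0, START: 1, END: 2}
    for word, count in sorted(counts.items()):
        if count >= min_count and word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def context_windows(sentence, vocab, size=2):
    """Return one window of ids per token, padded at the sentence edges."""
    padded = [START] * size + [normalize(w) for w in sentence] + [END] * size
    # Words outside the dictionary map to the single UNK symbol.
    ids = [vocab.get(tok, vocab[UNK]) for tok in padded]
    return [ids[i:i + 2 * size + 1] for i in range(len(sentence))]


if __name__ == "__main__":
    vocab = build_vocab([["Almaty", "is", "big"]])
    for window in context_windows(["Astana", "is"], vocab, size=1):
        print(window)
```

Here "Astana" does not occur in the toy training data, so it is replaced by the unknown-word id, while positions before the first token and after the last one are filled with the start and end ids.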
Acknowledgments
The work was funded by the Committee of Science of the Ministry of Education and Science of the Republic of Kazakhstan under grant AP09259324.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Tolegen, G., Toleu, A., Mamyrbayev, O., Mussabayev, R. (2023). Neural Named Entity Recognition for Kazakh. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_1
DOI: https://doi.org/10.1007/978-3-031-24340-0_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24339-4
Online ISBN: 978-3-031-24340-0
eBook Packages: Computer Science, Computer Science (R0)