Abstract
Modern automatic speech recognition (ASR) systems based on end-to-end (E2E) models achieve high recognition accuracy for languages that have large training corpora of several thousand hours of speech. Such models require a very large amount of training data, which is problematic for low-resource languages such as Kazakh. However, many studies have shown that combining connectionist temporal classification (CTC) with other E2E models improves system performance even with limited training data. To this end, a speech corpus of the Kazakh language was assembled and then expanded using data augmentation. Our work presents a joint CTC and attention-mechanism model for Kazakh speech recognition, which addresses the problem of fast decoding and training of the system. The results demonstrate that the proposed E2E model, combined with language models, improved system performance and achieved the best result on our Kazakh dataset, yielding competitive results in Kazakh speech recognition.
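Hybrid models of this kind are typically trained with a multi-task objective that interpolates the CTC loss and the attention decoder's cross-entropy loss. Below is a minimal PyTorch sketch of such an objective; the class name, tensor shapes, and the interpolation weight lambda are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Sketch of the joint CTC-attention multi-task loss used to train
# hybrid E2E ASR models. All names and shapes here are illustrative
# assumptions, not the paper's actual code.
class JointCTCAttentionLoss(nn.Module):
    def __init__(self, blank_id: int, pad_id: int, lam: float = 0.3):
        super().__init__()
        self.lam = lam  # interpolation weight lambda for the CTC branch
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths):
        # ctc_log_probs: (T, B, V) log-probabilities from the shared encoder
        # att_logits:    (B, U, V) attention-decoder outputs per target step
        # targets:       (B, U) gold token ids, padded with pad_id
        loss_ctc = self.ctc(ctc_log_probs, targets,
                            input_lengths, target_lengths)
        loss_att = self.ce(att_logits.reshape(-1, att_logits.size(-1)),
                           targets.reshape(-1))
        # L = lambda * L_ctc + (1 - lambda) * L_att (multi-task objective)
        return self.lam * loss_ctc + (1.0 - self.lam) * loss_att
```

At inference time, the same weight can be reused to combine CTC and attention scores in a one-pass beam search, and an external language-model score can be added via shallow fusion, which is one common way "using language models" improves such systems.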
Acknowledgements
This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP08855743).
Cite this article
Mamyrbayev, O.Z., Oralbekova, D.O., Alimhan, K. et al. Hybrid end-to-end model for Kazakh speech recognition. Int J Speech Technol 26, 261–270 (2023). https://doi.org/10.1007/s10772-022-09983-8