Abstract
Modern automatic speech recognition (ASR) systems based on end-to-end (E2E) models achieve high recognition accuracy for languages that have large training corpora of several thousand hours of speech. Such models require a very large amount of training data, which is problematic for low-resource languages such as Kazakh. However, many studies have shown that combining connectionist temporal classification (CTC) with other E2E models improves system performance even with limited training data. To this end, a speech corpus of the Kazakh language was assembled and then expanded using data augmentation. Our work presents a joint CTC and attention-mechanism model for Kazakh speech recognition, which addresses the problem of fast decoding and training of the system. The results demonstrate that the proposed E2E model, combined with language models, improved system performance and achieved the best result on our Kazakh dataset, yielding competitive results in Kazakh speech recognition.
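Hybrid models of this kind are typically trained with a multi-task objective that interpolates the CTC loss and the attention decoder's cross-entropy loss. Below is a minimal PyTorch sketch of such an objective; the class name, tensor shapes, and the interpolation weight lambda are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Sketch of the joint CTC-attention multi-task loss used to train
# hybrid E2E ASR models. All names and shapes here are illustrative
# assumptions, not the paper's actual code.
class JointCTCAttentionLoss(nn.Module):
    def __init__(self, blank_id: int, pad_id: int, lam: float = 0.3):
        super().__init__()
        self.lam = lam  # interpolation weight lambda for the CTC branch
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)
        self.ce = nn.CrossEntropyLoss(ignore_index=pad_id)

    def forward(self, ctc_log_probs, att_logits, targets,
                input_lengths, target_lengths):
        # ctc_log_probs: (T, B, V) log-probabilities from the shared encoder
        # att_logits:    (B, U, V) attention-decoder outputs per target step
        # targets:       (B, U) gold token ids, padded with pad_id
        loss_ctc = self.ctc(ctc_log_probs, targets,
                            input_lengths, target_lengths)
        loss_att = self.ce(att_logits.reshape(-1, att_logits.size(-1)),
                           targets.reshape(-1))
        # L = lambda * L_ctc + (1 - lambda) * L_att (multi-task objective)
        return self.lam * loss_ctc + (1.0 - self.lam) * loss_att
```

At inference time, the same weight can be reused to combine CTC and attention scores in a one-pass beam search, and an external language-model score can be added via shallow fusion, which is one common way "using language models" improves such systems.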
Acknowledgements
This research has been funded by the Science Committee of the Ministry of Education and Science of the Republic of Kazakhstan (Grant No. AP08855743).
Cite this article
Mamyrbayev, O.Z., Oralbekova, D.O., Alimhan, K. et al. Hybrid end-to-end model for Kazakh speech recognition. Int J Speech Technol 26, 261–270 (2023). https://doi.org/10.1007/s10772-022-09983-8