Abstract
This paper presents the results obtained with several variants of trigram language models in a large vocabulary continuous speech recognition (LVCSR) system for Serbian, based on the deep neural network (DNN) framework implemented within the Kaldi speech recognition toolkit. This training approach allows parallelization over several threads on either multiple GPUs or multiple CPUs, and provides a natural-gradient modification of the stochastic gradient descent (SGD) optimization method. Acoustic models are trained over a fixed number of epochs, with parameter averaging at the end. The paper discusses recognition using language models trained with either the Kneser-Ney or the Good-Turing smoothing method, at several values of the pruning parameter. Results on a test set containing more than 120,000 words across different utterance types are examined and compared to baseline results obtained with speaker-adapted GMM-HMM models on the same speech database. Online and offline recognition results are also compared. Finally, the effect of an additional discriminative training step that uses a language model, applied prior to the DNN stage, is explored.
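For reference, the interpolated Kneser-Ney smoothing mentioned in the abstract can be written, in its bigram form, as

$$ P_{KN}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1}w_i) - D,\, 0\bigr)}{\sum_{w'} c(w_{i-1}w')} + \lambda(w_{i-1})\, P_{cont}(w_i), \qquad P_{cont}(w_i) = \frac{\bigl|\{w' : c(w'w_i) > 0\}\bigr|}{\bigl|\{(w', w'') : c(w'w'') > 0\}\bigr|} $$

where $D$ is the discount, $\lambda(w_{i-1})$ normalizes the mass redistributed to the lower order, and the continuation probability $P_{cont}$ is what distinguishes Kneser-Ney from Good-Turing-based backoff.

The page does not list the exact commands used to build the language model variants. Below is a minimal sketch, assuming the trigram models are trained with SRILM (the toolkit commonly paired with Kaldi recipes); the corpus path, output file names, and pruning threshold are illustrative placeholders, not values from the paper.

```python
# Hypothetical sketch of the LM grid described in the abstract,
# assuming SRILM's ngram-count and ngram tools are on PATH.
import subprocess

def build_trigram_lm(corpus: str, out_lm: str, smoothing: str = "kn") -> None:
    """Train a trigram LM with Kneser-Ney or Good-Turing smoothing."""
    cmd = ["ngram-count", "-text", corpus, "-order", "3", "-lm", out_lm]
    if smoothing == "kn":
        # Modified Kneser-Ney discounting, interpolated with lower orders.
        cmd += ["-kndiscount", "-interpolate"]
    # Without explicit discounting options, SRILM falls back to its
    # default Good-Turing smoothing.
    subprocess.run(cmd, check=True)

def prune_lm(in_lm: str, out_lm: str, threshold: float = 1e-7) -> None:
    """Entropy-based pruning: drop n-grams whose removal changes the
    model's perplexity by less than the given relative threshold."""
    subprocess.run(
        ["ngram", "-lm", in_lm, "-order", "3",
         "-prune", str(threshold), "-write-lm", out_lm],
        check=True,
    )

if __name__ == "__main__":
    # One Kneser-Ney and one Good-Turing variant, each pruned.
    build_trigram_lm("train_text.txt", "tri_kn.lm", smoothing="kn")
    build_trigram_lm("train_text.txt", "tri_gt.lm", smoothing="gt")
    for lm in ("tri_kn.lm", "tri_gt.lm"):
        prune_lm(lm, lm.replace(".lm", "_pruned.lm"), threshold=1e-7)
```

Varying the `smoothing` argument and the pruning `threshold` reproduces the kind of language model grid the abstract describes: Kneser-Ney versus Good-Turing discounting, each at several pruning levels.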
Acknowledgments
The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project “Development of Dialogue Systems for Serbian and Other South Slavic Languages”; by the EUREKA project DANSPLAT, “A Platform for the Applications of Speech Technologies on Smartphones for the Languages of the Danube Region”, ID E! 9944; and by the Provincial Secretariat for Higher Education and Scientific Research, within the project “Central Audio-Library of the University of Novi Sad”, No. 114-451-2570/2016-02.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pakoci, E., Popović, B., Pekar, D. (2017). Language Model Optimization for a Deep Neural Network Based Speech Recognition System for Serbian. In: Karpov, A., Potapova, R., Mporas, I. (eds.) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol. 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_48
DOI: https://doi.org/10.1007/978-3-319-66429-3_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3