Abstract
Recently, it has been demonstrated that speech recognition systems can achieve human parity. But while much research has been conducted on resource-rich languages like English, there is a long tail of languages for which no speech recognition systems exist yet. The major obstacle to building systems for new languages is the lack of available resources. In the past, several methods have been proposed for building systems under low-resource conditions by using data from additional source languages during training. It has been shown that DNN/HMM hybrid setups trained under low-resource conditions benefit from such additional data; here, we apply a similar technique to sequence-based neural network acoustic models trained with the Connectionist Temporal Classification (CTC) loss function. We demonstrate that setups with multilingual phone sets benefit from the addition of Language Feature Vectors (LFVs).
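To make the setup concrete, below is a minimal sketch (not the authors' implementation) of the core idea: a recurrent CTC acoustic model over a multilingual phone set whose per-frame acoustic features are augmented with an utterance-level Language Feature Vector. All names, layer sizes, and dimensions (LFVCTCModel, feat_dim, lfv_dim, etc.) are illustrative assumptions, and PyTorch's built-in nn.CTCLoss stands in for a dedicated CTC implementation such as warp-ctc.

```python
# Minimal sketch, assuming illustrative dimensions: a bidirectional LSTM
# acoustic model with frame-wise LFV augmentation, trained with CTC loss.
import torch
import torch.nn as nn

class LFVCTCModel(nn.Module):
    def __init__(self, feat_dim=40, lfv_dim=32, hidden=320, num_phones=100):
        super().__init__()
        # The LFV is appended to every acoustic frame, in the spirit of
        # language-adaptive network inputs.
        self.rnn = nn.LSTM(feat_dim + lfv_dim, hidden,
                           num_layers=3, bidirectional=True, batch_first=True)
        # One extra output unit for the CTC blank symbol (index 0 here).
        self.out = nn.Linear(2 * hidden, num_phones + 1)

    def forward(self, feats, lfv):
        # feats: (batch, time, feat_dim); lfv: (batch, lfv_dim)
        lfv = lfv.unsqueeze(1).expand(-1, feats.size(1), -1)
        x = torch.cat([feats, lfv], dim=-1)   # frame-wise LFV augmentation
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)

model = LFVCTCModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 40)           # 4 utterances, 200 frames each
lfv = torch.randn(4, 32)                   # one LFV per utterance/language
targets = torch.randint(1, 101, (4, 30))   # multilingual phone labels (1..100)
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)

log_probs = model(feats, lfv).transpose(0, 1)  # nn.CTCLoss expects (T, N, C)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
```

Appending the same LFV to every input frame lets a single multilingual network condition its phone posteriors on the language identity, which is the adaptation effect the abstract describes.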
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Müller, M., Stüker, S., Waibel, A. (2017). Language Adaptive Multilingual CTC Speech Recognition. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol. 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_47
DOI: https://doi.org/10.1007/978-3-319-66429-3_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3