Abstract
Recently, it has been demonstrated that speech recognition systems can achieve human parity. But while much research has been conducted on resource-rich languages like English, there is a long tail of languages for which no speech recognition systems exist yet. The major obstacle to building systems for new languages is the lack of available resources. In the past, several methods have been proposed for building systems under low-resource conditions by using data from additional source languages during training. It has been shown that DNN/HMM hybrid setups trained under low-resource conditions benefit from such additional data; here, we apply a similar technique to sequence-based neural network acoustic models trained with the Connectionist Temporal Classification (CTC) loss function. We demonstrate that setups with multilingual phone sets benefit from the addition of Language Feature Vectors (LFVs).
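To make the setup concrete, below is a minimal sketch (not the authors' implementation) of the core idea: a recurrent CTC acoustic model over a multilingual phone set whose per-frame acoustic features are augmented with an utterance-level Language Feature Vector. All names, layer sizes, and dimensions (LFVCTCModel, feat_dim, lfv_dim, etc.) are illustrative assumptions, and PyTorch's built-in nn.CTCLoss stands in for a dedicated CTC implementation such as warp-ctc.

```python
# Minimal sketch, assuming illustrative dimensions: a bidirectional LSTM
# acoustic model with frame-wise LFV augmentation, trained with CTC loss.
import torch
import torch.nn as nn

class LFVCTCModel(nn.Module):
    def __init__(self, feat_dim=40, lfv_dim=32, hidden=320, num_phones=100):
        super().__init__()
        # The LFV is appended to every acoustic frame, in the spirit of
        # language-adaptive network inputs.
        self.rnn = nn.LSTM(feat_dim + lfv_dim, hidden,
                           num_layers=3, bidirectional=True, batch_first=True)
        # One extra output unit for the CTC blank symbol (index 0 here).
        self.out = nn.Linear(2 * hidden, num_phones + 1)

    def forward(self, feats, lfv):
        # feats: (batch, time, feat_dim); lfv: (batch, lfv_dim)
        lfv = lfv.unsqueeze(1).expand(-1, feats.size(1), -1)
        x = torch.cat([feats, lfv], dim=-1)   # frame-wise LFV augmentation
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)

model = LFVCTCModel()
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

feats = torch.randn(4, 200, 40)           # 4 utterances, 200 frames each
lfv = torch.randn(4, 32)                   # one LFV per utterance/language
targets = torch.randint(1, 101, (4, 30))   # multilingual phone labels (1..100)
input_lens = torch.full((4,), 200, dtype=torch.long)
target_lens = torch.full((4,), 30, dtype=torch.long)

log_probs = model(feats, lfv).transpose(0, 1)  # nn.CTCLoss expects (T, N, C)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()
```

Appending the same LFV to every input frame lets a single multilingual network condition its phone posteriors on the language identity, which is the adaptation effect the abstract describes.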
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Müller, M., Stüker, S., Waibel, A. (2017). Language Adaptive Multilingual CTC Speech Recognition. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science, vol. 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_47
DOI: https://doi.org/10.1007/978-3-319-66429-3_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3