
Deep neural network training for whispered speech recognition using small databases and generative model sampling

Published: 01 December 2017

Abstract

State-of-the-art speech recognition solutions currently employ hidden Markov models (HMMs) to capture the time variability in a speech signal and deep neural networks (DNNs) to model the HMM state distributions. It has been shown that DNN-HMM hybrid systems outperform traditional HMM and Gaussian mixture model (GMM) hybrids in many applications. This improvement is mainly attributed to the ability of DNNs to model more complex data structures. However, the availability of sufficient data samples is a key requirement for training a high-accuracy DNN as a discriminative model, which makes DNNs unsuitable for many applications with limited amounts of data. In this study, we introduce a method that produces a large amount of pseudo-samples while requiring only a small amount of transcribed data from the target domain. In this method, a universal background model (UBM) is trained to capture a parametric estimate of the data distributions. Next, random sampling is used to generate a large number of pseudo-samples from the UBM. Frame-shuffling is then applied to smooth the temporal cepstral trajectories in the generated pseudo-sample sequences so that they better resemble the temporal characteristics of a natural speech signal. Finally, the pseudo-sample sequences are combined with the original training data to train the DNN-HMM acoustic model of a speech recognizer. The proposed method is evaluated on small sets of neutral and whispered speech drawn from the UT-Vocal Effort II corpus. Phoneme error rates (PERs) of a DNN-HMM based speech recognizer are considerably reduced when the generated pseudo-samples are incorporated in the training process, with relative PER improvements of 79.0% for the neutral-neutral and 45.6% for the whisper-whisper training/test scenarios.
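The pipeline described in the abstract — fit a GMM-based UBM on a small set of cepstral frames, draw random pseudo-frames from it, smooth the temporal trajectories, and pool the result with the original data — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, component count, and the moving-average smoother (standing in for the paper's frame-shuffling step, whose exact algorithm is not given in the abstract) are all assumptions.

```python
# Sketch of the pseudo-sample generation pipeline, assuming 13-dimensional
# MFCC-like frames and a diagonal-covariance GMM as the UBM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for a small amount of transcribed target-domain feature data.
real_frames = rng.normal(size=(500, 13))

# (1) Train the UBM: a GMM capturing a parametric estimate of the data distribution.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(real_frames)

# (2) Randomly sample a large number of pseudo-frames from the UBM.
pseudo_frames, _ = ubm.sample(n_samples=5000)

# (3) Smooth each cepstral trajectory so consecutive pseudo-frames vary
# gradually, as in natural speech (simple 5-frame moving average here,
# in place of the paper's frame-shuffling).
kernel = np.ones(5) / 5.0
smoothed = np.apply_along_axis(
    lambda traj: np.convolve(traj, kernel, mode="same"), axis=0, arr=pseudo_frames
)

# (4) Combine pseudo-samples with the original data; the augmented set would
# then be used to train the DNN-HMM acoustic model (outside this sketch).
augmented = np.vstack([real_frames, smoothed])
print(augmented.shape)
```

In practice the UBM would be trained on real cepstral features rather than synthetic data, and the smoothed pseudo-sample sequences would be aligned with HMM states before DNN training.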


Cited By

View all
  • (2021) "Acceptability of Speech and Silent Speech Input Methods in Private and Public." Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-13. DOI: 10.1145/3411764.3445430. Online publication date: 6 May 2021.
  • (2019) "ProxiTalk." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(3), pp. 1-25. DOI: 10.1145/3351276. Online publication date: 9 September 2019.

      Published In

      International Journal of Speech Technology  Volume 20, Issue 4
      December 2017
      307 pages

      Publisher

      Springer-Verlag

      Berlin, Heidelberg


      Author Tags

      1. Deep neural networks
      2. Gaussian mixture models
      3. Random sampling
      4. Small datasets
      5. Speech recognition
