
Deep neural network training for whispered speech recognition using small databases and generative model sampling

Published: 01 December 2017

Abstract

State-of-the-art speech recognition solutions currently employ hidden Markov models (HMMs) to capture the time variability in a speech signal and deep neural networks (DNNs) to model the HMM state distributions. It has been shown that DNN-HMM hybrid systems outperform traditional HMM and Gaussian mixture model (GMM) hybrids in many applications. This improvement is mainly attributed to the ability of DNNs to model more complex data structures. However, the availability of sufficient data samples is a key requirement for training a high-accuracy DNN as a discriminative model, which makes DNNs unsuitable for many applications with limited amounts of data. In this study, we introduce a method that produces a large amount of pseudo-samples while requiring only a small amount of transcribed data from the target domain. In this method, a universal background model (UBM) is trained to capture a parametric estimate of the data distributions. Next, random sampling is used to generate a large number of pseudo-samples from the UBM. Frame-shuffling is then applied to smooth the temporal cepstral trajectories in the generated pseudo-sample sequences so that they better resemble the temporal characteristics of a natural speech signal. Finally, the pseudo-sample sequences are combined with the original training data to train the DNN-HMM acoustic model of a speech recognizer. The proposed method is evaluated on small sets of neutral and whispered speech drawn from the UT-Vocal Effort II corpus. Phoneme error rates (PERs) of a DNN-HMM based speech recognizer are considerably reduced when the generated pseudo-samples are incorporated in the training process, with relative PER improvements of 79.0% for the neutral-neutral and 45.6% for the whisper-whisper training/test scenarios.
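The pipeline described in the abstract — fit a GMM-based UBM on a small set of cepstral frames, draw random pseudo-frames from it, smooth the temporal trajectories, and pool the result with the original data — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dimensions, component count, and the moving-average smoother (standing in for the paper's frame-shuffling step, whose exact algorithm is not given in the abstract) are all assumptions.

```python
# Sketch of the pseudo-sample generation pipeline, assuming 13-dimensional
# MFCC-like frames and a diagonal-covariance GMM as the UBM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in for a small amount of transcribed target-domain feature data.
real_frames = rng.normal(size=(500, 13))

# (1) Train the UBM: a GMM capturing a parametric estimate of the data distribution.
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(real_frames)

# (2) Randomly sample a large number of pseudo-frames from the UBM.
pseudo_frames, _ = ubm.sample(n_samples=5000)

# (3) Smooth each cepstral trajectory so consecutive pseudo-frames vary
# gradually, as in natural speech (simple 5-frame moving average here,
# in place of the paper's frame-shuffling).
kernel = np.ones(5) / 5.0
smoothed = np.apply_along_axis(
    lambda traj: np.convolve(traj, kernel, mode="same"), axis=0, arr=pseudo_frames
)

# (4) Combine pseudo-samples with the original data; the augmented set would
# then be used to train the DNN-HMM acoustic model (outside this sketch).
augmented = np.vstack([real_frames, smoothed])
print(augmented.shape)
```

In practice the UBM would be trained on real cepstral features rather than synthetic data, and the smoothed pseudo-sample sequences would be aligned with HMM states before DNN training.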


Cited By

View all
  • (2021) "Acceptability of Speech and Silent Speech Input Methods in Private and Public." Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-13. DOI: 10.1145/3411764.3445430. Online publication date: 6 May 2021.
  • (2019) "ProxiTalk." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 3(3), pp. 1-25. DOI: 10.1145/3351276. Online publication date: 9 September 2019.

      Published In

      International Journal of Speech Technology  Volume 20, Issue 4
      December 2017
      307 pages

      Publisher

      Springer-Verlag

      Berlin, Heidelberg


      Author Tags

      1. Deep neural networks
      2. Gaussian mixture models
      3. Random sampling
      4. Small datasets
      5. Speech recognition
