
Toward Robust Speech Recognition and Understanding

Published: 01 November 2005

Abstract

The principal cause of speech recognition errors is a mismatch between trained acoustic/language models and input speech due to the limited amount of training data in comparison with the vast variation of speech. It is crucial to establish methods that are robust against voice variation due to individuality, the physical and psychological condition of the speaker, telephone sets, microphones, network characteristics, additive background noise, speaking styles, and other aspects. This paper overviews robust architectures and modeling techniques for speech recognition and understanding. The topics include acoustic and language modeling for spontaneous speech recognition, unsupervised adaptation of acoustic and language models, robust architectures for spoken dialogue systems, multi-modal speech recognition, and speech summarization. This paper also discusses the most important research problems to be solved in order to achieve ultimately robust speech recognition and understanding systems.
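As background for the unsupervised adaptation topic surveyed in the abstract, the basic idea can be illustrated with a minimal sketch that is not taken from the paper itself: a background unigram language model is linearly interpolated with a model estimated from first-pass recognition hypotheses, so the system adapts to the current speaker or topic without any transcribed adaptation data. The function name, the toy vocabulary, and the interpolation weight below are all illustrative assumptions.

```python
from collections import Counter


def adapt_unigram(background, hypotheses, lam=0.8):
    """Unsupervised unigram LM adaptation (illustrative sketch).

    background: dict mapping word -> probability (the general-purpose LM).
    hypotheses: list of first-pass recognition output strings.
    lam: weight kept on the background model (1 - lam goes to the
         in-domain estimate from the hypotheses).
    """
    # Estimate an in-domain unigram model from the recognizer's own output.
    counts = Counter(w for sent in hypotheses for w in sent.split())
    total = sum(counts.values())
    vocab = set(background) | set(counts)
    # Linear interpolation of the two distributions.
    return {
        w: lam * background.get(w, 0.0) + (1 - lam) * counts[w] / total
        for w in vocab
    }


# Toy example: the background model is uniform over two words, but the
# first-pass hypotheses suggest "hello" is more frequent in this session.
background = {"hello": 0.5, "world": 0.5}
adapted = adapt_unigram(background, ["hello hello world"], lam=0.5)
```

Because both inputs are probability distributions, the interpolated model still sums to one; in practice the same interpolation idea is applied to n-gram or class-based models rather than unigrams.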

References

[1] B.-H. Juang and S. Furui, "Automatic Recognition and Understanding of Spoken Language--A First Step Towards Natural Human-Machine Communication," Proc. IEEE, vol. 88, no. 8, 2000, pp. 1142-1165.
[2] L.R. Rabiner and B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
[3] S. Furui, Digital Speech Processing, Synthesis, and Recognition, 2nd edition, Marcel Dekker, 2000.
[4] H. Ney, "Corpus-Based Statistical Methods in Speech and Language Processing," in Corpus-Based Methods in Language and Speech Processing, S. Young and G. Bloothooft (Eds.), Kluwer, 1997, pp. 1-26.
[5] S. Furui, "Recent Advances in Spontaneous Speech Recognition and Understanding," in Proc. IEEE-ISCA Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, 2003, pp. 1-6.
[6] S. Furui, "Steps Toward Natural Human-Machine Communication in the 21st Century," in Proc. ISCA Workshop on Voice Operated Telecom Services, Ghent, 2000, pp. 17-24.
[7] E. Levin et al., "The AT&T-DARPA Communicator Mixed-Initiative Spoken Dialogue System," in Proc. ICSLP, Beijing, 2000, pp. II-122-125.
[8] S. Basu et al., "Audio-Visual Large Vocabulary Continuous Speech Recognition in the Broadcast Domain," in Proc. IEEE Multimedia Signal Processing (MMSP), Copenhagen, 1999, pp. 475-481.
[9] S. Furui, "Toward Spontaneous Speech Recognition and Understanding," in Pattern Recognition in Speech and Language Processing, W. Chou and B.-H. Juang (Eds.), CRC Press, 2003, pp. 191-227.
[10] T. Shinozaki et al., "Towards Automatic Transcription of Spontaneous Presentations," in Proc. Eurospeech, Aalborg, vol. 1, 2001, pp. 491-494.
[11] T. Shinozaki and S. Furui, "Analysis on Individual Differences in Automatic Transcription of Spontaneous Presentations," in Proc. ICASSP, Orlando, 2002, pp. I-729-732.
[12] Z. Zhang et al., "On-Line Incremental Speaker Adaptation for Broadcast News Transcription," Speech Communication, vol. 37, 2002, pp. 271-281.
[13] Z. Zhang et al., "An Online Incremental Speaker Adaptation Method Using Speaker-Clustered Initial Models," in Proc. ICSLP, Beijing, 2000, pp. III-694-697.
[14] M.J.F. Gales et al., "An Improved Approach to the Hidden Markov Model Decomposition of Speech and Noise," in Proc. ICASSP, San Francisco, 1992, pp. 233-236.
[15] F. Martin et al., "Recognition of Noisy Speech by Composition of Hidden Markov Models," in Proc. Eurospeech, Berlin, 1993, pp. 1031-1034.
[16] S. Furui et al., "Noise Adaptation of HMMs Using Neural Networks," in Proc. ISCA Workshop on Automatic Speech Recognition, Paris, 2000, pp. 160-167.
[17] Z. Zhang et al., "Tree-Structured Noise-Adapted HMM Modeling for Piecewise Linear-Transformation-Based Adaptation," in Proc. Eurospeech, Geneva, 2003.
[18] T. Shinozaki and S. Furui, "Time Adjustable Mixture Weights for Speaking Rate Fluctuation," in Proc. Eurospeech, Geneva, 2003.
[19] G. Zweig, "Bayesian Network Structures and Inference Techniques for Automatic Speech Recognition," Computer Speech and Language, vol. 17, 2003, pp. 173-193.
[20] Y. Yokoyama et al., "Unsupervised Language Model Adaptation Using Word Classes for Spontaneous Speech Recognition," in Proc. IEEE-ISCA Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, 2003, pp. 71-74.
[21] R. Taguma et al., "Parallel Computing-Based Architecture for Mixed-Initiative Spoken Dialogue," in Proc. IEEE Int. Conf. on Multimodal Interfaces (ICMI), Pittsburgh, 2002, pp. 53-58.
[22] S. Tamura et al., "A Robust Multi-Modal Speech Recognition Method Using Optical-Flow Analysis," in Proc. ISCA Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee, 2002.
[23] T. Yoshinaga et al., "Audio-Visual Speech Recognition Using Lip Movement Extracted from Side-Face Images," in Proc. Eurospeech, Geneva, 2003.
[24] S. Furui et al., "Speech-to-Speech and Speech-to-Text Summarization," in Proc. Int. Workshop on Language Understanding and Agents for Real World Interaction, Sapporo, 2003, pp. 100-106.
[25] T. Kikuchi et al., "Two-Stage Automatic Speech Summarization by Sentence Extraction and Compaction," in Proc. IEEE-ISCA Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, 2003, pp. 207-210.
[26] C. Hori et al., "A Statistical Approach to Automatic Speech Summarization," EURASIP Journal on Applied Signal Processing, 2003, pp. 128-139.

Cited By

  • (2018) "Robust Recognition of Noisy Speech Through Partial Imputation of Missing Data," Circuits, Systems, and Signal Processing, vol. 37, no. 4, pp. 1625-1648, doi:10.1007/s00034-017-0616-4. Online publication date: 1-Apr-2018.
  • (2011) "Nonlinear enhancement of noisy speech, using continuous attractor dynamics formed in recurrent neural networks," Neurocomputing, vol. 74, no. 17, pp. 2716-2724, doi:10.1016/j.neucom.2010.12.044. Online publication date: 1-Oct-2011.



Published In

Journal of VLSI Signal Processing Systems  Volume 41, Issue 3
November 2005
108 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. acoustic models
  2. adaptation
  3. corpus
  4. dialogue
  5. language models
  6. multi-modal
  7. robustness
  8. speech recognition
  9. speech understanding
  10. spontaneous speech
  11. summarization

Qualifiers

  • Article

