DOI: 10.1145/3171221.3171280
research-article
Open access

DNN-HMM based Automatic Speech Recognition for HRI Scenarios

Published: 26 February 2018

Abstract

In this paper, we propose replacing the classical black-box integration of automatic speech recognition (ASR) technology in HRI applications with an integration that incorporates a representation and model of the HRI environment, together with the states and contexts of the robot and the user. Accordingly, this paper focuses on environment representation and modeling: a deep neural network-hidden Markov model (DNN-HMM) based ASR engine is trained on clean utterances combined with the acoustic-channel responses and noise recorded in an HRI testbed built with a PR2 mobile manipulation robot. This method avoids having to record a training database in every possible acoustic environment of a given HRI scenario. Moreover, different recognition testing conditions were produced by recording two types of acoustic sources, i.e. a loudspeaker and human speakers, with a Microsoft Kinect mounted on top of the PR2 robot while it performed head rotations and movements towards and away from the fixed sources. In this generic HRI scenario, and with a limited amount of training data, the resulting ASR engine achieved a word error rate at least 26% and 38% lower than publicly available speech recognition APIs on the playback (i.e. loudspeaker) and human testing databases, respectively.
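The training strategy described above can be sketched in a few lines. The following is a minimal illustration, not the authors' pipeline: it assumes single-channel float arrays at a common sample rate, and the function names are ours. Each clean utterance is convolved with a measured acoustic-channel impulse response and mixed with recorded environment noise at a chosen SNR, which is how one corpus of clean speech can be expanded to many acoustic conditions without re-recording.

# Minimal sketch (assumptions as stated above, not the paper's actual code).
import numpy as np
from scipy.signal import fftconvolve

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, speech.shape)   # loop or trim noise to utterance length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12    # guard against silent noise files
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def simulate_training_utterance(clean, rir, noise, snr_db):
    """Pass clean speech through a measured channel (RIR), then add noise."""
    reverberant = fftconvolve(clean, rir)[: len(clean)]   # keep original length
    return add_noise_at_snr(reverberant, noise, snr_db)

Sweeping `rir`, `noise`, and `snr_db` over the channel responses and noises captured in the testbed yields the multi-condition training set on which the DNN-HMM acoustic model is then trained.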





Published In

HRI '18: Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction
February 2018
468 pages
ISBN: 9781450349536
DOI: 10.1145/3171221
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 February 2018


Author Tags

  1. dnn-hmm
  2. speech recognition
  3. time-varying acoustic channel

Qualifiers

  • Research-article

Funding Sources

  • ONRG
  • Conicyt-PCHA/Doctorado
  • Conicyt-Fondecyt

Conference

HRI '18

Acceptance Rates

HRI '18 Paper Acceptance Rate: 49 of 206 submissions, 24%.
Overall Acceptance Rate: 268 of 1,124 submissions, 24%.


Article Metrics

  • Downloads (last 12 months): 206
  • Downloads (last 6 weeks): 24
Reflects downloads up to 09 Jan 2025.


Cited By

  • (2025) Speech emotion recognition in real static and dynamic human-robot interaction scenarios. Computer Speech & Language, 89:101666. DOI: 10.1016/j.csl.2024.101666. Online publication date: Jan-2025.
  • (2024) Predicting transformer temperature field based on physics-informed neural networks. High Voltage. DOI: 10.1049/hve2.12435. Online publication date: 9-May-2024.
  • (2024) Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique. Neural Computing and Applications, 36(12):6875-6901. DOI: 10.1007/s00521-024-09435-1. Online publication date: 13-Feb-2024.
  • (2023) Automatic Detection of Dyspnea in Real Human-Robot Interaction Scenarios. Sensors, 23(17):7590. DOI: 10.3390/s23177590. Online publication date: 1-Sep-2023.
  • (2023) Voice Interaction Recognition Design in Real-Life Scenario Mobile Robot Applications. Applied Sciences, 13(5):3359. DOI: 10.3390/app13053359. Online publication date: 6-Mar-2023.
  • (2023) RoboClean: Contextual Language Grounding for Human-Robot Interactions in Specialised Low-Resource Environments. Proceedings of the 5th International Conference on Conversational User Interfaces, 1-11. DOI: 10.1145/3571884.3597137. Online publication date: 19-Jul-2023.
  • (2023) Multi-Feature and Multi-Modal Mispronunciation Detection and Diagnosis Method Based on the Squeezeformer Encoder. IEEE Access, 11:66245-66256. DOI: 10.1109/ACCESS.2023.3278837. Online publication date: 2023.
  • (2023) Gestural and Touchscreen Interaction for Human-Robot Collaboration: A Comparative Study. Intelligent Autonomous Systems 17, 122-138. DOI: 10.1007/978-3-031-22216-0_9. Online publication date: 18-Jan-2023.
  • (2022) Hidden-state modeling of a cross-section of geoelectric time series data can provide reliable intermediate-term probabilistic earthquake forecasting in Taiwan. Natural Hazards and Earth System Sciences, 22(6):1931-1954. DOI: 10.5194/nhess-22-1931-2022. Online publication date: 9-Jun-2022.
  • (2022) Learning relationships between audio signals based on reservoir networks. 2022 International Joint Conference on Neural Networks (IJCNN), 1-6. DOI: 10.1109/IJCNN55064.2022.9892009. Online publication date: 18-Jul-2022.
