DOI: 10.1145/3144457.3144503 · Research article · Public Access

Real Time Distant Speech Emotion Recognition in Indoor Environments

Published: 07 November 2017

Abstract

We develop solutions to challenges at each stage of the processing pipeline of a real-time indoor distant speech emotion recognition system, reducing the discrepancy between training and test conditions for distant emotion recognition. We use a novel combination of distorted-feature elimination, classifier optimization, and several signal cleaning techniques, and we train classifiers with synthetic reverberation obtained from a room impulse response generator to improve performance across a variety of rooms and source-to-microphone distances. Our comprehensive evaluation is based on a popular emotional corpus from the literature, two new customized datasets, and a dataset of YouTube videos. The two new datasets are the first distance-aware emotional corpora; we created them by 1) injecting room impulse responses, collected in a variety of rooms at various source-to-microphone distances, into a public emotional corpus, and 2) re-recording the emotional corpus with microphones placed at different distances. Overall, our system improves distant emotion detection by as much as 15.51% over baselines, with final emotion recognition accuracy between 79.44% and 95.89% across different rooms, acoustic configurations, and source-to-microphone distances. We experimentally evaluate the CPU time of the system components and demonstrate the system's real-time capability.
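As a rough illustration of the RIR-injection idea described in the abstract (step 1), the sketch below convolves a clean utterance with a toy reverberation tail to produce a "distant" training sample. This is a simplification for illustration only: the paper uses responses from a room impulse response generator and real recorded RIRs, whereas here the RIR is modeled as exponentially decaying noise, and all function names and parameters (`synthetic_rir`, `rt60`, etc.) are assumptions, not the authors' code.

```python
import numpy as np

def synthetic_rir(rt60=0.4, fs=16000, length=0.5, seed=0):
    """Toy room impulse response: exponentially decaying white noise.

    A crude stand-in for a proper RIR generator; rt60 is the assumed
    reverberation time in seconds (illustrative, not from the paper).
    """
    rng = np.random.default_rng(seed)
    n = int(length * fs)
    t = np.arange(n) / fs
    decay = np.exp(-6.9 * t / rt60)  # amplitude falls ~60 dB by t = rt60
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct-path impulse
    return rir / np.max(np.abs(rir))

def reverberate(clean, rir):
    """Inject reverberation by convolving clean speech with the RIR."""
    wet = np.convolve(clean, rir)[: len(clean)]  # truncate to input length
    return wet / (np.max(np.abs(wet)) + 1e-12)   # normalize to [-1, 1]

# Usage: turn a close-talking signal into a distance-aware training sample.
fs = 16000
clean = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)  # placeholder utterance
distant = reverberate(clean, synthetic_rir(rt60=0.6, fs=fs))
```

In practice one would replace `synthetic_rir` with measured or image-method responses for many rooms and distances, then train the emotion classifier on the reverberated copies so that training conditions match distant test conditions.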


Cited By

  • (2025) Speech emotion recognition in real static and dynamic human-robot interaction scenarios. Computer Speech & Language 89, 101666. DOI: 10.1016/j.csl.2024.101666. Online publication date: Jan-2025.
  • (2019) An Investigation of the Accuracy of Real Time Speech Emotion Recognition. Artificial Intelligence XXXVI, 336-349. DOI: 10.1007/978-3-030-34885-4_26. Online publication date: 17-Dec-2019.

Published In

MobiQuitous 2017: Proceedings of the 14th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services
November 2017
555 pages
ISBN:9781450353687
DOI:10.1145/3144457

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Emotion
  2. noise and reverberation
  3. speech

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MobiQuitous 2017
MobiQuitous 2017: Computing, Networking and Services
November 7 - 10, 2017
Melbourne, VIC, Australia

Acceptance Rates

Overall Acceptance Rate 26 of 87 submissions, 30%
