
Distant Emotion Recognition

Published: 11 September 2017

Abstract

Distant emotion recognition (DER) extends speech emotion recognition to the challenging setting of variable speaker-to-microphone distances. The performance of conventional emotion recognition systems degrades dramatically as soon as the microphone is moved away from the speaker's mouth, owing to a broad variety of effects such as background noise, feature distortion with distance, overlapping speech from other speakers, and reverberation. This paper presents a novel solution for DER that addresses the key challenges by identifying and removing from consideration features that are significantly distorted by distance, by introducing a novel feature modeling and overlapping speech filtering technique called Emo2vec, and by using an LSTM classifier to capture the temporal dynamics of the speech states found in emotions. A comprehensive evaluation is conducted on two acted datasets (with an artificially generated distance effect) as well as on a new emotional dataset of spontaneous family discussions recorded from multiple microphones placed at different distances. Our solution achieves average accuracies of 91.6%, 90.1%, and 89.5% for the emotions happy, angry, and sad, respectively, across various distances, a more than 16% average improvement in accuracy over the best baseline method.
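
The abstract names two model components: a word2vec-style feature embedding (Emo2vec) and an LSTM classifier over the resulting temporal sequence. As a rough illustration of how those two pieces can fit together, the sketch below embeds discretized "audio words" and classifies an utterance with an LSTM. It is a minimal sketch, not the authors' implementation: the vocabulary size, layer dimensions, class names, and the frame-quantization step are all assumptions made for illustration.

```python
# Illustrative sketch only: audio-word embeddings feeding an LSTM emotion
# classifier. All sizes and names are assumptions, not the paper's system.
import torch
import torch.nn as nn

class EmbeddingLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=500, embed_dim=64, hidden_dim=128,
                 num_emotions=3):  # e.g., happy, angry, sad
        super().__init__()
        # Assumed preprocessing: each acoustic frame is quantized (e.g., by
        # k-means over low-level descriptors) into one of `vocab_size`
        # "audio words" before reaching this model.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_emotions)

    def forward(self, audio_word_ids):
        # audio_word_ids: (batch, seq_len) integer indices of audio words
        x = self.embed(audio_word_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)       # h_n: (1, batch, hidden_dim)
        return self.out(h_n[-1])         # (batch, num_emotions) logits

# Usage example: classify a batch of two 100-frame utterances.
model = EmbeddingLSTMClassifier()
frames = torch.randint(0, 500, (2, 100))  # dummy audio-word indices
logits = model(frames)                    # shape: (2, 3)
```

Here the LSTM's final hidden state serves as a fixed-length summary of the whole utterance, which a linear layer maps to emotion-class logits; a real system would train the embeddings (word2vec-style or jointly with the classifier) on the quantized audio stream.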




Published In

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Issue 3
September 2017
2023 pages
EISSN: 2474-9567
DOI: 10.1145/3139486
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 September 2017
Accepted: 01 June 2017
Revised: 01 May 2017
Received: 01 February 2017
Published in IMWUT Volume 1, Issue 3


Author Tags

  1. Distant emotion detection
  2. word2vec

Qualifiers

  • Research-article
  • Research
  • Refereed


Article Metrics

  • Downloads (last 12 months): 83
  • Downloads (last 6 weeks): 6
Reflects downloads up to 21 Dec 2024.

Cited By

  • (2025) "Speech emotion recognition in real static and dynamic human-robot interaction scenarios." Computer Speech & Language 89, 101666. https://doi.org/10.1016/j.csl.2024.101666 (Jan 2025)
  • (2022) "Emotional Speech Recognition Method Based on Word Transcription." Sensors 22, 5, 1937. https://doi.org/10.3390/s22051937 (2 Mar 2022)
  • (2022) "Psychophysiological Arousal in Young Children Who Stutter." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 3, 1--32. https://doi.org/10.1145/3550326 (7 Sep 2022)
  • (2021) "Emotion Recognition Robust to Indoor Environmental Distortions and Non-targeted Emotions Using Out-of-distribution Detection." ACM Transactions on Computing for Healthcare 3, 2, 1--22. https://doi.org/10.1145/3492300 (20 Dec 2021)
  • (2021) "Robustness to noise for speech emotion classification using CNNs and attention mechanisms." Smart Health 19, 100165. https://doi.org/10.1016/j.smhl.2020.100165 (Mar 2021)
  • (2021) "Sentiment Analysis Model Based on the Word Structural Representation." Brain Informatics, 170--178. https://doi.org/10.1007/978-3-030-86993-9_16 (15 Sep 2021)
  • (2020) "A Survey of Speech Emotion Recognition in Natural Environment." Digital Signal Processing, 102951. https://doi.org/10.1016/j.dsp.2020.102951 (Dec 2020)
  • (2019) "ARASID: Artificial Reverberation-Adjusted Indoor Speaker Identification Dealing with Variable Distances." Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks, 154--165. https://doi.org/10.5555/3324320.3324339 (25 Feb 2019)
  • (2018) "A Weakly Supervised Learning Framework for Detecting Social Anxiety and Depression." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 2, 1--26. https://doi.org/10.1145/3214284 (5 Jul 2018)
