
Distant Emotion Recognition

Published: 11 September 2017

Abstract

Distant emotion recognition (DER) extends speech emotion recognition to the challenging setting of variable speaker-to-microphone distances. The performance of conventional emotion recognition systems degrades dramatically as soon as the microphone is moved away from the speaker's mouth, owing to a broad variety of effects such as background noise, feature distortion with distance, overlapping speech from other speakers, and reverberation. This paper presents a novel solution for DER that addresses the key challenges by identifying and removing from consideration features that are significantly distorted by distance, by introducing a novel feature modeling and overlapping speech filtering technique called Emo2vec, and by using an LSTM classifier to capture the temporal dynamics of the speech states found in emotions. A comprehensive evaluation is conducted on two acted datasets (with an artificially generated distance effect) as well as on a new emotional dataset of spontaneous family discussions recorded from multiple microphones placed at different distances. Our solution achieves average accuracies of 91.6%, 90.1%, and 89.5% for the emotions happy, angry, and sad, respectively, across various distances, a more than 16% average improvement in accuracy over the best baseline method.
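
The abstract names two model components: a word2vec-style feature embedding (Emo2vec) and an LSTM classifier over the resulting temporal sequence. As a rough illustration of how those two pieces can fit together, the sketch below embeds discretized "audio words" and classifies an utterance with an LSTM. It is a minimal sketch, not the authors' implementation: the vocabulary size, layer dimensions, class names, and the frame-quantization step are all assumptions made for illustration.

```python
# Illustrative sketch only: audio-word embeddings feeding an LSTM emotion
# classifier. All sizes and names are assumptions, not the paper's system.
import torch
import torch.nn as nn

class EmbeddingLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=500, embed_dim=64, hidden_dim=128,
                 num_emotions=3):  # e.g., happy, angry, sad
        super().__init__()
        # Assumed preprocessing: each acoustic frame is quantized (e.g., by
        # k-means over low-level descriptors) into one of `vocab_size`
        # "audio words" before reaching this model.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_emotions)

    def forward(self, audio_word_ids):
        # audio_word_ids: (batch, seq_len) integer indices of audio words
        x = self.embed(audio_word_ids)   # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)       # h_n: (1, batch, hidden_dim)
        return self.out(h_n[-1])         # (batch, num_emotions) logits

# Usage example: classify a batch of two 100-frame utterances.
model = EmbeddingLSTMClassifier()
frames = torch.randint(0, 500, (2, 100))  # dummy audio-word indices
logits = model(frames)                    # shape: (2, 3)
```

Here the LSTM's final hidden state serves as a fixed-length summary of the whole utterance, which a linear layer maps to emotion-class logits; a real system would train the embeddings (word2vec-style or jointly with the classifier) on the quantized audio stream.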




Published In

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 1, Issue 3
September 2017
2023 pages
EISSN: 2474-9567
DOI: 10.1145/3139486
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 September 2017
Accepted: 01 June 2017
Revised: 01 May 2017
Received: 01 February 2017
Published in IMWUT Volume 1, Issue 3


Author Tags

  1. Distant emotion detection
  2. word2vec

Qualifiers

  • Research-article
  • Research
  • Refereed


Article Metrics

  • Downloads (last 12 months): 83
  • Downloads (last 6 weeks): 6
Reflects downloads up to 21 Dec 2024.

Cited By

  • (2025) "Speech emotion recognition in real static and dynamic human-robot interaction scenarios." Computer Speech & Language 89, 101666. https://doi.org/10.1016/j.csl.2024.101666 (Jan 2025)
  • (2022) "Emotional Speech Recognition Method Based on Word Transcription." Sensors 22, 5, 1937. https://doi.org/10.3390/s22051937 (2 Mar 2022)
  • (2022) "Psychophysiological Arousal in Young Children Who Stutter." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 3, 1--32. https://doi.org/10.1145/3550326 (7 Sep 2022)
  • (2021) "Emotion Recognition Robust to Indoor Environmental Distortions and Non-targeted Emotions Using Out-of-distribution Detection." ACM Transactions on Computing for Healthcare 3, 2, 1--22. https://doi.org/10.1145/3492300 (20 Dec 2021)
  • (2021) "Robustness to noise for speech emotion classification using CNNs and attention mechanisms." Smart Health 19, 100165. https://doi.org/10.1016/j.smhl.2020.100165 (Mar 2021)
  • (2021) "Sentiment Analysis Model Based on the Word Structural Representation." Brain Informatics, 170--178. https://doi.org/10.1007/978-3-030-86993-9_16 (15 Sep 2021)
  • (2020) "A Survey of Speech Emotion Recognition in Natural Environment." Digital Signal Processing, 102951. https://doi.org/10.1016/j.dsp.2020.102951 (Dec 2020)
  • (2019) "ARASID: Artificial Reverberation-Adjusted Indoor Speaker Identification Dealing with Variable Distances." Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks, 154--165. https://doi.org/10.5555/3324320.3324339 (25 Feb 2019)
  • (2018) "A Weakly Supervised Learning Framework for Detecting Social Anxiety and Depression." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 2, 1--26. https://doi.org/10.1145/3214284 (5 Jul 2018)
