Short Paper · DOI: 10.1145/3340555.3353762

Exploring Transfer Learning between Scripted and Spontaneous Speech for Emotion Recognition

Published: 14 October 2019

Abstract

Internet of Things technologies yield large amounts of real-life speech data related to human emotions. Yet, labelled data of human emotion from spontaneous speech are extremely limited, owing to the difficulty of annotating such large volumes of audio samples. A potential way to address this limitation is to augment emotion models of spontaneous speech with fully annotated data collected using scripted scenarios. We investigate whether, and to what extent, knowledge related to speech emotional content can be transferred between datasets of scripted and spontaneous speech. We implement transfer learning through: (1) a feed-forward neural network trained on the source data, whose last layers are fine-tuned on the target data; and (2) a progressive neural network that retains a pool of pre-trained models and learns lateral connections between the source and target tasks. We evaluate the effectiveness of the proposed approaches using four publicly available datasets of emotional speech. Our results indicate that transfer learning can effectively leverage corpora of scripted data to improve emotion recognition performance for spontaneous speech.
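
The following minimal PyTorch sketch illustrates approach (1). It is not the authors' released code: the layer sizes, the 384-dimensional input (e.g., an openSMILE-style utterance-level feature vector), the four-class label set, the checkpoint name, and the optimizer settings are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmotionFFN(nn.Module):
    """Feed-forward emotion classifier; all sizes are illustrative assumptions."""
    def __init__(self, n_features=384, n_classes=4):
        super().__init__()
        # Early layers, trained on the scripted (source) corpus.
        self.shared = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Last layers, fine-tuned on the spontaneous (target) corpus.
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.head(self.shared(x))

model = EmotionFFN()
# model.load_state_dict(torch.load("source_pretrained.pt"))  # hypothetical checkpoint

# Freeze the source-trained layers so that only the last layers are
# updated by gradients computed on the target (spontaneous) data.
for p in model.shared.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
```

Approach (2) can be sketched in the same spirit, following the general progressive-network recipe: a frozen, pre-trained source column plus a target column that learns lateral connections from the source column's hidden activations. The single lateral connection below is a simplification of the full scheme and is likewise only an assumption about the architecture.

```python
class ProgressiveColumn(nn.Module):
    """Target-task column with a lateral connection from a frozen source column."""
    def __init__(self, source: EmotionFFN, n_features=384, n_classes=4):
        super().__init__()
        self.source = source
        for p in self.source.parameters():
            p.requires_grad = False            # the pre-trained pool stays fixed
        self.fc1 = nn.Linear(n_features, 256)
        self.fc2 = nn.Linear(256, 128)
        self.lateral = nn.Linear(256, 128)     # source hidden layer 1 -> target layer 2
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):
        s1 = torch.relu(self.source.shared[0](x))        # frozen source activations
        h1 = torch.relu(self.fc1(x))
        h2 = torch.relu(self.fc2(h1) + self.lateral(s1)) # lateral combination
        return self.out(h2)

target_model = ProgressiveColumn(source=model)
```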

Published In

ICMI '19: 2019 International Conference on Multimodal Interaction
October 2019
601 pages
ISBN:9781450368605
DOI:10.1145/3340555
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. (progressive) neural network
  2. fine tuning
  3. speech emotion recognition
  4. transfer learning

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

ICMI '19

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Cited By

  • (2024) Noise-Robust Deep Learning Model for Emotion Classification Using Facial Expressions. IEEE Access 12, 143074–143089. DOI: 10.1109/ACCESS.2024.3436881
  • (2023) Improving Speech Emotion Recognition with Data Expression Aware Multi-Task Learning. 31st European Signal Processing Conference (EUSIPCO), 386–390. DOI: 10.23919/EUSIPCO58844.2023.10289986
  • (2023) Two Birds With One Stone: Knowledge-Embedded Temporal Convolutional Transformer for Depression Detection and Emotion Recognition. IEEE Transactions on Affective Computing 14(4), 2595–2613. DOI: 10.1109/TAFFC.2023.3282704
  • (2023) Few-Shot Learning in Emotion Recognition of Spontaneous Speech Using a Siamese Neural Network With Adaptive Sample Pair Formation. IEEE Transactions on Affective Computing 14(2), 1627–1633. DOI: 10.1109/TAFFC.2021.3109485
  • (2023) Elicitation-Based Curriculum Learning for Improving Speech Emotion Recognition. IECON 2023, 49th Annual Conference of the IEEE Industrial Electronics Society, 1–6. DOI: 10.1109/IECON51785.2023.10311721
  • (2023) Multimodal Transfer Learning for Oral Presentation Assessment. IEEE Access 11, 84013–84026. DOI: 10.1109/ACCESS.2023.3295832
  • (2022) Generalization of Deep Acoustic and NLP Models for Large-Scale Depression Screening. Biomedical Sensing and Analysis, 99–132. DOI: 10.1007/978-3-030-99383-2_3
  • (2020) A Review of Generalizable Transfer Learning in Automatic Emotion Recognition. Frontiers in Computer Science 2. DOI: 10.3389/fcomp.2020.00009
