Short Paper · DOI: 10.1145/3340555.3353762

Exploring Transfer Learning between Scripted and Spontaneous Speech for Emotion Recognition

Published: 14 October 2019

Abstract

Internet of Things technologies yield large amounts of real-life speech data related to human emotions. Yet, labelled data of human emotion from spontaneous speech are extremely limited, owing to the difficulty of annotating such large volumes of audio samples. A potential way to address this limitation is to augment emotion models of spontaneous speech with fully annotated data collected using scripted scenarios. We investigate whether, and to what extent, knowledge related to speech emotional content can be transferred between datasets of scripted and spontaneous speech. We implement transfer learning through: (1) a feed-forward neural network trained on the source data, whose last layers are fine-tuned on the target data; and (2) a progressive neural network that retains a pool of pre-trained models and learns lateral connections between the source and target tasks. We evaluate the effectiveness of the proposed approaches using four publicly available datasets of emotional speech. Our results indicate that transfer learning can effectively leverage corpora of scripted data to improve emotion recognition performance for spontaneous speech.
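
The following minimal PyTorch sketch illustrates approach (1). It is not the authors' released code: the layer sizes, the 384-dimensional input (e.g., an openSMILE-style utterance-level feature vector), the four-class label set, the checkpoint name, and the optimizer settings are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmotionFFN(nn.Module):
    """Feed-forward emotion classifier; all sizes are illustrative assumptions."""
    def __init__(self, n_features=384, n_classes=4):
        super().__init__()
        # Early layers, trained on the scripted (source) corpus.
        self.shared = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Last layers, fine-tuned on the spontaneous (target) corpus.
        self.head = nn.Sequential(
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.head(self.shared(x))

model = EmotionFFN()
# model.load_state_dict(torch.load("source_pretrained.pt"))  # hypothetical checkpoint

# Freeze the source-trained layers so that only the last layers are
# updated by gradients computed on the target (spontaneous) data.
for p in model.shared.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
```

Approach (2) can be sketched in the same spirit, following the general progressive-network recipe: a frozen, pre-trained source column plus a target column that learns lateral connections from the source column's hidden activations. The single lateral connection below is a simplification of the full scheme and is likewise only an assumption about the architecture.

```python
class ProgressiveColumn(nn.Module):
    """Target-task column with a lateral connection from a frozen source column."""
    def __init__(self, source: EmotionFFN, n_features=384, n_classes=4):
        super().__init__()
        self.source = source
        for p in self.source.parameters():
            p.requires_grad = False            # the pre-trained pool stays fixed
        self.fc1 = nn.Linear(n_features, 256)
        self.fc2 = nn.Linear(256, 128)
        self.lateral = nn.Linear(256, 128)     # source hidden layer 1 -> target layer 2
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):
        s1 = torch.relu(self.source.shared[0](x))        # frozen source activations
        h1 = torch.relu(self.fc1(x))
        h2 = torch.relu(self.fc2(h1) + self.lateral(s1)) # lateral combination
        return self.out(h2)

target_model = ProgressiveColumn(source=model)
```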

Published In

ICMI '19: 2019 International Conference on Multimodal Interaction
October 2019
601 pages
ISBN:9781450368605
DOI:10.1145/3340555
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. (progressive) neural network
  2. fine tuning
  3. speech emotion recognition
  4. transfer learning

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

ICMI '19

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Cited By

  • (2024) Noise-Robust Deep Learning Model for Emotion Classification Using Facial Expressions. IEEE Access 12, 143074–143089. DOI: 10.1109/ACCESS.2024.3436881
  • (2023) Improving Speech Emotion Recognition with Data Expression Aware Multi-Task Learning. 31st European Signal Processing Conference (EUSIPCO), 386–390. DOI: 10.23919/EUSIPCO58844.2023.10289986
  • (2023) Two Birds With One Stone: Knowledge-Embedded Temporal Convolutional Transformer for Depression Detection and Emotion Recognition. IEEE Transactions on Affective Computing 14(4), 2595–2613. DOI: 10.1109/TAFFC.2023.3282704
  • (2023) Few-Shot Learning in Emotion Recognition of Spontaneous Speech Using a Siamese Neural Network With Adaptive Sample Pair Formation. IEEE Transactions on Affective Computing 14(2), 1627–1633. DOI: 10.1109/TAFFC.2021.3109485
  • (2023) Elicitation-Based Curriculum Learning for Improving Speech Emotion Recognition. IECON 2023, 49th Annual Conference of the IEEE Industrial Electronics Society, 1–6. DOI: 10.1109/IECON51785.2023.10311721
  • (2023) Multimodal Transfer Learning for Oral Presentation Assessment. IEEE Access 11, 84013–84026. DOI: 10.1109/ACCESS.2023.3295832
  • (2022) Generalization of Deep Acoustic and NLP Models for Large-Scale Depression Screening. Biomedical Sensing and Analysis, 99–132. DOI: 10.1007/978-3-030-99383-2_3
  • (2020) A Review of Generalizable Transfer Learning in Automatic Emotion Recognition. Frontiers in Computer Science 2. DOI: 10.3389/fcomp.2020.00009
