The Influence of Blind Source Separation on Mixed Audio Speech and Music Emotion Recognition

Published: 27 December 2020

Abstract

While both speech emotion recognition and music emotion recognition have been studied extensively in their respective communities, little research has gone into recognizing emotion from mixed audio sources, i.e., audio in which both speech and music are present. Yet many application scenarios, such as television content, require models that can extract emotion from mixed audio. This paper studies how mixed audio affects both speech and music emotion recognition using a random forest and a deep neural network model, and investigates whether applying blind source separation to the mixed signal beforehand is beneficial. We created a mixed audio dataset with 25% speech-music overlap and no contextual relationship between the two. We show that specialized models for speech-only or music-only audio achieved merely chance-level performance on mixed audio. For speech, above-chance performance was achieved when training on raw mixed audio, but the best performance was achieved on audio that had been blind source separated beforehand. Music emotion recognition models on mixed audio achieved performance approaching, or even surpassing, their performance on music-only audio, both with and without blind source separation. Our results are important for estimating emotion from real-world data, where individual speech and music tracks are often not available.
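
As a concrete illustration of the pipeline the abstract describes, the sketch below separates a mixed recording into speech and music stems with the Spleeter source separation tool, summarizes each stem with MFCC statistics, and trains a random forest classifier. This is a minimal sketch under stated assumptions: the file paths, the MFCC feature set, and the emotion labels are illustrative placeholders, not the paper's exact features, dataset, or label set.

    # Minimal sketch of a separate-then-classify pipeline. Assumptions: file
    # paths, features, and labels below are placeholders, not the paper's setup.
    # Requires: pip install spleeter librosa scikit-learn
    import numpy as np
    import librosa
    from spleeter.separator import Separator
    from sklearn.ensemble import RandomForestClassifier

    def separate_stems(mixed_path, out_dir="separated"):
        """Blind source separation: split a mixture into vocals/accompaniment."""
        separator = Separator("spleeter:2stems")  # pretrained 2-stems model
        separator.separate_to_file(mixed_path, out_dir)
        # Spleeter writes <out_dir>/<track_name>/vocals.wav and accompaniment.wav.

    def mfcc_features(wav_path, sr=16000, n_mfcc=13):
        """Summarize a clip as per-coefficient MFCC means and standard deviations."""
        y, sr = librosa.load(wav_path, sr=sr, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Hypothetical training data: separated speech stems with emotion labels.
    train_paths = ["separated/clip01/vocals.wav", "separated/clip02/vocals.wav"]
    train_labels = ["happy", "sad"]  # placeholder label set
    X = np.stack([mfcc_features(p) for p in train_paths])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, train_labels)

At inference time one would run separate_stems on a new mixture, then classify the vocals stem for speech emotion or the accompaniment stem for music emotion; training directly on the raw mixture corresponds to the no-separation condition studied in the paper.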


Cited By

  • (2023) A Robust Hybrid Neural Network Architecture for Blind Source Separation of Speech Signals Exploiting Deep Learning. IEEE Access, 11, 100414-100437. https://doi.org/10.1109/ACCESS.2023.3313972
  • (2023) A survey of artificial intelligence approaches in blind source separation. Neurocomputing, 561, 126895. https://doi.org/10.1016/j.neucom.2023.126895
  • (2022) End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild. Multimodal Technologies and Interaction, 6(2), 11. https://doi.org/10.3390/mti6020011

Index Terms

  1. The Influence of Blind Source Separation on Mixed Audio Speech and Music Emotion Recognition
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      ICMI '20 Companion: Companion Publication of the 2020 International Conference on Multimodal Interaction
      October 2020
      548 pages
ISBN: 9781450380027
DOI: 10.1145/3395035

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. blind source separation
      2. multi-modal
      3. music emotion recognition
      4. speech emotion recognition

      Qualifiers

      • Short-paper

      Conference

ICMI '20: International Conference on Multimodal Interaction
October 25-29, 2020
Virtual Event, Netherlands

      Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions (42%)
