Abstract
As a common multi-modal information carrier, music frequently conveys emotion through lyrics and melodies. Besides lyrics (text) and melodies (audio), the structure of a song is another indicator of emotion and helps create strong resonance with listeners. Typically, a pop song is composed of verses and choruses. To improve the performance of existing music emotion recognition models, we first propose a hierarchical model to analyze music structure. We then develop a cross-modal interaction method that extracts emotional features from each modality and lets them interact. Finally, we perform music emotion recognition by combining music structure analysis with cross-modal interaction. Extensive experiments are conducted on a dataset crawled from Netease Cloud Music, and the results demonstrate the effectiveness of both music structure analysis and cross-modal interaction. The proposed model, COSMIC, achieves state-of-the-art performance on music emotion recognition tasks.
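The idea of combining song structure with cross-modal interaction can be illustrated with a minimal sketch. It is not the COSMIC architecture: scaled dot-product attention stands in for the cross-modal interaction step, mean pooling stands in for the hierarchical aggregation over sections, and all function names and the `(label, lyric_vectors, audio_vectors)` input layout are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Let each query vector attend over the other modality's vectors.

    Uses scaled dot-product attention (an assumed stand-in for the
    paper's cross-modal interaction method).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * k[i] for w, k in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def song_representation(sections):
    """Fuse lyrics and audio section by section, then pool over sections.

    sections: list of (label, lyric_vectors, audio_vectors), where label is
    a hypothetical structure tag such as "verse" or "chorus".
    """
    section_vectors = []
    for _label, lyric, audio in sections:
        lyric_ctx = cross_attend(lyric, audio)  # lyrics informed by audio
        audio_ctx = cross_attend(audio, lyric)  # audio informed by lyrics
        # Concatenate the two pooled views into one section-level vector.
        section_vectors.append(mean_pool(lyric_ctx) + mean_pool(audio_ctx))
    return mean_pool(section_vectors)
```

In this sketch the fused song vector would be passed to an emotion classifier; a real system would use learned projections and a structure-aware encoder rather than plain pooling.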
Data Availability
The experiments conducted in this article used both publicly available datasets and a custom-built dataset. The publicly available datasets used in this study can be accessed through their original sources as cited in the references. The custom-built dataset used in this study was created by the authors and cannot be publicly shared due to potential copyright issues with some of the data sources.
References
Agrawal Y, Shanker RGR, Alluri V (2021) Transformer-based approach towards music emotion recognition from lyrics. In: European conference on information retrieval, pp 167–175. Springer
Aljanaki A, Yang Y-H, Soleymani M (2017) Developing a benchmark for emotional analysis of music. PLoS ONE 12(3):e0173392
Baccianella S, Esuli A, Sebastiani F (2010) SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10)
Benward B, Saker MN (1997) Music in theory and practice vol. 7. McGraw-Hill
Bertin-Mahieux T, Ellis DPW, Whitman B, Lamere P (2011) The million song dataset. In: Proceedings of the 12th International Society for Music Information Retrieval Conference, ISMIR, pp 591–596
Bhattacharya A, Kadambari K (2018) A multimodal approach towards emotion recognition of music using audio and lyrical content. arXiv:1811.05760
Carr D (2004) Music, meaning, and emotion. J Aesthet Art Crit 62(3):225–234
Choi K, Fazekas G, Sandler MB, Cho K (2017) Convolutional recurrent neural networks for music classification. In: 2017 IEEE International conference on acoustics, speech and signal processing, ICASSP, pp 2392–2396. IEEE
Delbouys R, Hennequin R, Piccoli F, Royo-Letelier J, Moussallam M (2018) Music mood detection based on audio and lyrics with deep neural net. In: Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, Paris, pp 370–375
Devlin J, Chang M-W, Lee K, Toutanova K (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp 4171–4186
Dhariwal P, Jun H, Payne C, Kim JW, Radford A, Sutskever I (2020) Jukebox: A generative model for music. arXiv:2005.00341
Dong Y, Yang X, Zhao X, Li J (2019) Bidirectional convolutional recurrent sparse network (BCRSN): an efficient model for music emotion recognition. IEEE Trans Multimed 21(12):3150–3163
Eyben F, Weninger F, Gross F, Schuller B (2013) Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In: Proceedings of the 21st ACM International conference on multimedia, pp 835–838
Ferreira LN, Whitehead J (2019) Learning to generate music with sentiment. In: Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR, Delft, pp 384–390
Finnegan R (2012) Music, experience, and the anthropology of emotion. In: The cultural study of music, pp 375–385. Routledge
Garg A, Chaturvedi V, Kaur AB, Varshney V, Parashar A (2022) Machine learning model for mapping of music mood and human emotion based on physiological signals. Multimed Tools Appl 81(4):5137–5177
Han B-J, Rho S, Jun S, Hwang E (2010) Music emotion classification and context-based music recommendation. Multimed Tools Appl 47(3):433–460
Hennequin R, Khlif A, Voituret F, Moussallam M (2020) Spleeter: a fast and efficient music source separation tool with pre-trained models. J Open Source Softw 5(50):2154
Hizlisoy S, Yildirim S, Tufekci Z (2021) Music emotion recognition using convolutional long short term memory deep neural networks. Eng Sci Technol an Int J 24(3):760–767
Hung H-T, Ching J, Doh S, Kim N, Nam J, Yang Y-H (2021) EMOPIA: A multi-modal pop piano dataset for emotion recognition and emotion-based music generation. In: Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR, Online, pp 318–325
Kumar V, Minz S (2013) Mood classifiaction of lyrics using sentiwordnet. In: 2013 International conference on computer communication and informatics, pp 1–5. IEEE
Laurier C, Grivolla J, Herrera P (2008) Multimodal music mood classification using audio and lyrics. In: 2008 7th International conference on machine learning and applications, pp 688–693. IEEE
Mo S, Niu J (2019) A novel method based on OMPGW method for feature extraction in automatic music mood classification. IEEE Trans Affect Comput 10(3):313–324
Panagakis Y, Kotropoulos C (2013) Music classification by low-rank semantic mappings. EURASIP J Audio Speech Music Process 2013(1):13
Panda R, Malheiro R, Paiva RP (2020) Novel audio features for music emotion recognition. IEEE Trans Affect Comput 11(4):614–626
Panda RES, Malheiro R, Rocha B, Oliveira AP, Paiva RP (2013) Multi-modal music emotion recognition: a new dataset, methodology and comparative analysis. In: 10th International symposium on computer music multidisciplinary research (CMMR 2013), pp 570–582
Parisi L, Francia S, Olivastri S, Tavella MS (2019) Exploiting synchronized lyrics and vocal features for music emotion detection. arXiv:1901.04831
Rahman JS, Gedeon T, Caldwell S, Jones R, Jin Z (2021) Towards effective music therapy for mental health care using machine learning tools: Human affective reasoning and music genres. J Artif Intell Soft Comput Res 11(1):5–20
Robinson J (2005) Deeper than Reason: Emotion and its role in literature, music, and art. Oxford University Press on Demand, NY
Shen Y, Tan S, Sordoni A, Courville AC (2019) Ordered neurons: Integrating tree structures into recurrent neural networks. In: 7th International conference on learning representations, ICLR 2019, New Orleans
Stein D (2005) Engaging music: Essays in music analysis. Oxford University Press, USA
Won M, Oramas S, Nieto O, Gouyon F, Serra X (2021) Multimodal metric learning for tag-based music retrieval. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 591–595. IEEE
Won M, Salamon J, Bryan NJ, Mysore GJ, Serra X (2021) Emotion embedding spaces for matching music to stories. In: Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR, Online, pp 777–785
Xiong Y, Su F, Wang Q (2017) Automatic music mood classification by learning cross-media relevance between audio and lyrics. In: 2017 IEEE International Conference on Multimedia and Expo (ICME), pp 961–966. IEEE
Xu M, Li X, Xianyu H, Tian J, Meng F, Chen W (2015) Multi-scale approaches to the MediaEval 2015 “Emotion in Music” task. In: Working notes proceedings of the MediaEval 2015 workshop. CEUR Workshop proceedings, vol. 1436. CEUR-WS.org
Yousefian Jazi S, Kaedi M, Fatemi A (2021) An emotion-aware music recommender system: bridging the user’s interaction and music recommendation. Multimed Tools Appl 80(9):13559–13574
Zhang Y, Jiang J, Xia G, Dixon S (2022) Interpreting song lyrics with an audio-informed pre-trained language model. In: Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR, Bengaluru, pp 19–26
Zhang K, Zhang H, Li S, Yang C, Sun L (2018) The PMEmo dataset for music emotion recognition. In: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, ICMR, Yokohama, pp 135–142
Zhang M, Zhu Y, Zhang W, Zhu Y, Feng T (2022) Modularized composite attention network for continuous music emotion recognition. Multimed Tools Appl, 1–23
Zhao J, Ru G, Yu Y, Wu Y, Li D, Li W (2022) Multimodal music emotion recognition with hierarchical cross-modal attention network. In: IEEE International conference on multimedia and expo, ICME 2022, pp 1–6. IEEE
Zhou J, Chen X, Yang D (2019) Multimodel music emotion recognition using unsupervised deep neural networks. In: Proceedings of the 6th Conference on Sound and Music Technology (CSMT), pp 27–39. Springer
Ethics declarations
Conflict of Interests
The authors declare that they have no conflicts of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Workflow diagram
The full workflow of the proposed COSMIC framework is shown in Fig. 4.
Appendix B: Algorithm
The pseudocode for our proposed algorithm is presented as follows.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, L., Shen, Z., Zeng, J. et al. COSMIC: Music emotion recognition combining structure analysis and modal interaction. Multimed Tools Appl 83, 12519–12534 (2024). https://doi.org/10.1007/s11042-023-15376-z
DOI: https://doi.org/10.1007/s11042-023-15376-z