Abstract
This paper presents a novel approach to music emotion classification that addresses several limitations of existing methods, including limited emotion categories, low accuracy, and results that run counter to the Valence-Arousal (V-A) model. The proposed method first separates music sources with the Multi-scale Multi-band DenseNet (MMDenseNet) model, decomposing the complex musical structure into four audio tracks: vocals, drums, bass, and other accompaniment. Time-series features and fixed attribute features are extracted from each track. The time-series features are fed into the Conformer model, which uses convolutional neural networks to extract local features and Transformer encoders to capture long-distance dependencies. The Conformer output is concatenated with the fixed attribute features and passed through fully connected layers for emotion classification. Experiments on a NetEase Cloud Music dataset with 12 emotion categories show that the proposed MSB-Conformer model outperforms comparative models (CNN+LSTM, WaveNet, Transformer), achieving an average accuracy of 94.24% and surpassing state-of-the-art methods in the number of emotion categories covered. The effectiveness and generality of the model were further validated on the Emotify dataset. By simplifying musical complexity through source separation and using the Conformer to jointly model local and global audio features, the proposed method offers a novel approach to music emotion classification tasks.
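The pipeline described above (source separation into stems, per-track feature extraction, a Conformer encoder, and a fully connected classifier head) can be summarized in a minimal sketch. The PyTorch code below is illustrative only and is not the authors' implementation: the feature dimensions, number of Conformer blocks, block internals, and the names ConformerBlock and MSBConformerClassifier are assumptions made for this example.

```python
# Minimal sketch of an MSB-Conformer-style pipeline, assuming illustrative
# dimensions; this is NOT the authors' code.
import torch
import torch.nn as nn


class ConformerBlock(nn.Module):
    """One Conformer-style block: self-attention for global context,
    a depthwise convolution module for local patterns."""
    def __init__(self, dim: int = 256, heads: int = 4, kernel_size: int = 15):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.BatchNorm1d(dim),
            nn.SiLU(),
        )
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.SiLU(), nn.Linear(dim * 4, dim))

    def forward(self, x):  # x: (batch, time, dim)
        h = self.attn_norm(x)
        a, _ = self.attn(h, h, h)
        x = x + a                                    # long-distance dependencies
        c = self.conv(self.conv_norm(x).transpose(1, 2)).transpose(1, 2)
        x = x + c                                    # local features
        return x + self.ffn(x)


class MSBConformerClassifier(nn.Module):
    """Per-stem time-series features -> Conformer blocks -> pooled embedding,
    concatenated with fixed attribute features -> fully connected classifier."""
    def __init__(self, n_stems: int = 4, feat_dim: int = 80, dim: int = 256,
                 n_fixed: int = 16, n_classes: int = 12, n_blocks: int = 2):
        super().__init__()
        self.proj = nn.Linear(n_stems * feat_dim, dim)
        self.blocks = nn.Sequential(*[ConformerBlock(dim) for _ in range(n_blocks)])
        self.head = nn.Sequential(nn.Linear(dim + n_fixed, 128), nn.ReLU(),
                                  nn.Linear(128, n_classes))

    def forward(self, stem_feats, fixed_feats):
        # stem_feats: (batch, time, n_stems * feat_dim), e.g. time-series features
        # from the vocals / drums / bass / other tracks produced by a separator.
        # fixed_feats: (batch, n_fixed) fixed attribute features.
        x = self.blocks(self.proj(stem_feats))
        pooled = x.mean(dim=1)                       # average over time
        return self.head(torch.cat([pooled, fixed_feats], dim=-1))


# Toy usage: 8 clips, 200 frames, 4 stems x 80-dim features, 16 fixed attributes.
logits = MSBConformerClassifier()(torch.randn(8, 200, 4 * 80), torch.randn(8, 16))
print(logits.shape)  # torch.Size([8, 12])
```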
M. Fang, X. Li, S. Zhang and Q. Deng—These authors contributed equally to this work.
References
Takahashi, N., Mitsufuji, Y.: Multi-scale multi-band DenseNets for audio source separation. In: 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, pp. 21–25. IEEE (2017)
Peng, Z., Guo, Z., Huang, W., et al.: Conformer: local features coupling global representations for recognition and detection. IEEE Trans. Pattern Anal. Mach. Intell. 45(8), 9454–9468 (2023)
Fischer, M.T., Arya, D., Streeb, D., et al.: Visual analytics for temporal hypergraph model exploration. IEEE Trans. Vis. Comput. Graph. 27(2), 550–560 (2021)
Nilsonne, G., Harrell, F.E.: EEG-based model and antidepressant response. Nat. Biotechnol. 39(1), 27 (2020)
Li, B., Liu, X., Dinesh, K., et al.: Creating a multitrack classical music performance dataset for multimodal music analysis: challenges, insights, and applications. IEEE Trans. Multimed. 21(2), 522–535 (2019)
Tong, G.: Music emotion classification method using improved deep belief network. Mob. Inf. Syst. 2022, 1–7 (2022)
Xia, Y., Xu, F.: Study on music emotion recognition based on the machine learning model clustering algorithm. Math. Probl. Eng. 2022, 1–11 (2022)
Tzanetakis, G., Essl, G., Cook, P.: Audio analysis using the discrete wavelet transform. In: Proceedings of the WSES International Conference on Acoustics and Music: Theory and Applications (2001)
Pao, T.-L., Liao, W.-Y., Chen, Y.-T.: A weighted discrete KNN method for Mandarin speech and emotion recognition. InTech (2008)
Jin, A.W.Q.: Application of LDA to speaker recognition (2010)
Wang, Y., et al.: UniSpeech: unified speech representation learning with labeled and unlabeled data. In: Proceedings of the International Conference on Machine Learning, pp. 10937–10947. PMLR (2021)
Chen, S., et al.: WavLM: large-scale self-supervised pre-training for full stack speech processing. arXiv preprint arXiv:2110.13900 (2021)
Chen, Z., et al.: Large-scale self-supervised speech representation learning for automatic speaker verification. arXiv preprint arXiv:2110.05777 (2021)
Chen, S., et al.: Continuous speech separation with conformer. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5749–5753. IEEE (2021)
Wu, Y.-C., Hayashi, T., Tobing, P.L., et al.: Quasi-periodic WaveNet: an autoregressive raw waveform generative model with pitch-dependent dilated convolution neural network. IEEE/ACM Trans. Audio Speech Lang. Process. (TBD)
Yu, L., Wang, L., Yang, C., et al.: Analysis and implementation of a single stage transformer-less converter with high step-down voltage gain for voltage regulator modules. IEEE Trans. Ind. Electron. 68(12), 12239–12249 (2021)
Music.163. https://music.163.com/
Yu, Y., Tong, X., Wang, H., et al.: Semantic-aware spatio-temporal app usage representation via graph convolutional network. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4(3), 1–24 (2020)
Aljanaki, A., Wiering, F., Veltkamp, R.C.: Studying emotion induced by music through a crowdsourcing game. Inf. Process. Manag. (2015)
Zentner, M., Grandjean, D., Scherer, K.R.: Emotions evoked by the sound of music: characterization, classification, and measurement. Emotion 8(4), 494–521 (2008)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Fang, M. et al. (2024). Music Emotion Classification with Source Separation Based MSB-Conformer. In: Huang, DS., Zhang, C., Chen, W. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14865. Springer, Singapore. https://doi.org/10.1007/978-981-97-5591-2_23
DOI: https://doi.org/10.1007/978-981-97-5591-2_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5590-5
Online ISBN: 978-981-97-5591-2
eBook Packages: Computer Science, Computer Science (R0)