Abstract
This paper introduces SATIN, the Set of Audio Tags and Identifiers Normalized. SATIN is a database of 400k audio-related metadata entries and identifiers that aims to facilitate reproducibility and comparison of Music Information Retrieval (MIR) algorithms. The idea is to take advantage of partnerships between scientists and private companies that host millions of tracks: scientists send their feature extraction algorithm to a company along with SATIN identifiers and retrieve the corresponding features. This procedure gives the MIR community access to many more tracks for classification purposes. Scientists can then publish the classification result for each track, which can be compared with the results of other algorithms. SATIN thus addresses the major problems of accessing more tracks, handling copyright restrictions, saving computation time, and guaranteeing consistency across research databases. We introduce SOFT1, the first Set Of FeaTures extracted by a company thanks to SATIN. We describe a supporting experiment classifying tracks as instrumentals or songs to illustrate a possible use of SATIN, comparing a deep learning approach, which has emerged in recent years in MIR, with a knowledge-based approach.
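To make this exchange concrete, the following minimal sketch shows one way the workflow could look in Python. The scientist ships a feature extractor together with a list of SATIN identifiers (ISRCs); the hosting company runs the extractor locally on its copyrighted audio and returns one feature vector per track; the scientist then trains a classifier on the returned features and publishes per-track predictions. The CSV layout, the `path_for` lookup, and the MFCC-based extractor are illustrative assumptions, not part of SATIN's actual schema or of any company's API.

```python
# Minimal sketch of the SATIN exchange workflow described above.
# Hypothetical assumptions (not part of SATIN's published schema):
# - identifiers live in a CSV with columns "isrc" and "tag"
#   ("instrumental" or "song"),
# - the partner company returns one feature vector per ISRC.
import csv

import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def extract_features(audio_path, sr=22050, n_mfcc=13):
    """Feature extractor the scientist sends to the company.

    It runs on the company's servers, so the audio itself never
    has to leave the rights holder's infrastructure.
    """
    y, _ = librosa.load(audio_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Summarize each track by the mean and std of its MFCCs.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def load_identifiers(csv_path):
    """Read (ISRC, tag) pairs from a hypothetical SATIN export."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    return [r["isrc"] for r in rows], [r["tag"] for r in rows]


# Company side (runs where the audio is hosted); path_for is a
# hypothetical lookup from ISRC to a local audio file:
# features = np.stack([extract_features(path_for(isrc)) for isrc in isrcs])

# Scientist side: evaluate a classifier on the returned features,
# then publish the per-track predictions for later comparison.
# isrcs, tags = load_identifiers("satin_subset.csv")
# clf = RandomForestClassifier(n_estimators=100, random_state=0)
# print(cross_val_score(clf, features, tags, cv=5).mean())
```

Because only the extractor and the identifiers cross the company boundary, the audio never leaves the servers that hold its rights, which is the copyright-handling property the abstract describes.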
Acknowledgements
The authors thank Musixmatch for providing metadata and the Research and Development team at Deezer for extracting the audio features. They also thank Florian Iragne from Simbals for his help with ISRC and musical metadata handling, and Fidji Berio and Kimberly Malcolm for insightful proofreading.
This work has been partially funded by Charles University (project GA UK No. 1580317 and project SVV 260451), by the internal grant agency of VŠB - Technical University of Ostrava under project no. SP2017/177 “Optimization of machine learning algorithms for the HPC platform”, and by the Ministry of Education, Youth and Sports of the Czech Republic through the National Programme of Sustainability (NPU II) project “IT4Innovations excellence in science - LQ1602” and the Large Infrastructures for Research, Experimental Development and Innovations project “IT4Innovations National Supercomputing Center – LM2015070”. All findings and points of view expressed in this paper are those of the authors and do not necessarily reflect the views of their academic and industrial partners.
Part of the computer time for this study was provided by the MCIA (Mésocentre de Calcul Intensif Aquitain) computing facilities of the Université de Bordeaux and the Université de Pau et des Pays de l’Adour.