Abstract
Video of a musical performance can be exploited to address the challenges of acoustic-based music information retrieval tasks such as automatic music transcription. This paper presents a new real-time, learning-based system that visually transcribes piano music by classifying the pressed black and white keys with a CNN-SVM model. The entire process relies on visual analysis of the piano keyboard and of the pianist's hands and fingers. The system achieves high accuracy, with an average F1 score of 0.95, even under non-ideal camera views, hand occlusion, and lighting conditions, and transcribes music in real time with low latency (about 20 ms). In addition, a new dataset for visual transcription of piano music has been created and made available to researchers in this area. Because not all possible variations of the data can be captured in advance, an online learning approach is applied to efficiently update the trained model as new data are added to the training set.
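The sketch below illustrates the general shape of a CNN-feature plus SVM pipeline with incremental updates, as described in the abstract. It is a minimal sketch, not the authors' implementation: the extract_features helper is a hypothetical stand-in for the CNN feature extractor, the toy arrays substitute for real key-region image patches, and scikit-learn's hinge-loss SGDClassifier is used to approximate a linear SVM whose partial_fit method supports the online-learning step.

```python
# Minimal sketch (assumptions noted): CNN features + linear SVM with online updates.
# Not the paper's implementation; extract_features is a stand-in for the CNN.
import numpy as np
from sklearn.linear_model import SGDClassifier

def extract_features(key_patches):
    """Stand-in for the CNN feature extractor: flattens grayscale key patches.
    In the described system, each key region would instead be passed through a
    trained CNN and an intermediate-layer activation used as the feature vector."""
    return key_patches.reshape(len(key_patches), -1).astype(np.float32) / 255.0

# Hinge loss gives a linear-SVM-style objective; partial_fit enables online learning.
clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)
classes = np.array([0, 1])  # 0 = key not pressed, 1 = key pressed

# Initial (offline) training on an existing batch of labelled key patches (toy data here).
train_patches = np.random.randint(0, 256, size=(200, 16, 32))
train_labels = np.random.randint(0, 2, size=200)
clf.partial_fit(extract_features(train_patches), train_labels, classes=classes)

# Real-time use: classify the key patches extracted from each incoming video frame.
frame_patches = np.random.randint(0, 256, size=(88, 16, 32))  # one patch per piano key
pressed = clf.predict(extract_features(frame_patches))

# Online update: when newly labelled examples are collected, refine the existing
# model incrementally instead of retraining from scratch.
new_patches = np.random.randint(0, 256, size=(20, 16, 32))
new_labels = np.random.randint(0, 2, size=20)
clf.partial_fit(extract_features(new_patches), new_labels)
```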
Notes
All videos can be downloaded from http://www.sfu.ca/akbari/MTA/Dataset.
The videos can be downloaded from http://www.sfu.ca/akbari/MTA/OnlineLearningExperiments.
The test videos and the classification results can be downloaded from http://www.sfu.ca/akbari/MTA.
Acknowledgements
This work was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada under grants RGPIN312262, STPGP447223, RGPAS478109, and RGPIN288300.
Cite this article
Akbari, M., Liang, J. & Cheng, H. A real-time system for online learning-based visual transcription of piano music. Multimed Tools Appl 77, 25513–25535 (2018). https://doi.org/10.1007/s11042-018-5803-1