High-Quality and Reproducible Automatic Drum Transcription from Crowdsourced Data
Figure 1. Detailed architecture for the frame-synchronous CNN encoder (left) and the tatum-synchronous CNN encoder (right). Each architecture has two stacks of two convolutional layers (cyan) with batch normalization, followed by max-pooling and dropout layers. Tatum synchronicity is achieved with max-pooling on the frame dimension (blue).
Figure 2. Detailed architecture for the RNN decoder (left) and the self-attention decoder (right). In summary, the RNN consists of three layers of bi-directional gated recurrent units (green). The self-attention mechanism consists of L stacks of multi-head self-attention (orange).
Figure 3. A beat divided into 12 even intervals accommodates 16th notes and 16th-note triplets.
Figure 4. Distribution of the tatum intervals, derived from Madmom’s beats subdivided 12 times, for each dataset.
Figure 5. Genre distribution for both ADTOF datasets.
Figure A1. F-measure of the frame RNN model with a varying hit rate tolerance on both ADTOF datasets, before and after alignment.
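The captions above describe the tatum grid used throughout the paper: each estimated beat is split into 12 equal intervals, since 12 is the smallest subdivision that contains both 16th notes (4 per beat) and 16th-note triplets (6 per beat). As a minimal illustration only (not the authors' code; the function names and toy beat times are invented, and the paper obtains the beats with Madmom), the sketch below builds such a grid from given beat times and snaps onsets to the nearest tatum:

```python
import numpy as np

def tatum_grid(beats, subdivisions=12):
    """Subdivide consecutive beat times into equal tatum intervals.

    12 subdivisions per beat cover both 16th notes (every 3rd tatum)
    and 16th-note triplets (every 2nd tatum), since lcm(4, 6) = 12.
    """
    beats = np.asarray(beats, dtype=float)
    grid = []
    for start, end in zip(beats[:-1], beats[1:]):
        # endpoint=False avoids duplicating the next beat position
        grid.append(np.linspace(start, end, subdivisions, endpoint=False))
    grid.append(beats[-1:])  # keep the final beat as the last grid point
    return np.concatenate(grid)

def quantize_onsets(onsets, grid):
    """Snap each onset time to the index of the nearest tatum."""
    onsets = np.asarray(onsets, dtype=float)
    return np.abs(onsets[:, None] - grid[None, :]).argmin(axis=1)

# Toy usage: beats at 120 BPM, one onset slightly off the grid.
beats = np.arange(0.0, 4.0, 0.5)           # beat times in seconds
grid = tatum_grid(beats, subdivisions=12)
print(quantize_onsets([1.26], grid))        # index of the tatum closest to 1.26 s
```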
Abstract
1. Introduction
- To identify an optimal architecture, we compared multiple architectures exploiting recent techniques in DL;
- To mitigate noise and bias in crowdsourced datasets, we curated a new dataset to be used conjointly with existing ones;
- To evaluate the datasets, we compared multiple training procedures by combining different mixtures of data.
2. Related Works
2.1. Tasks and Vocabulary
2.2. Architecture
2.3. Training Procedure
2.4. Training Data
3. Materials and Methods
3.1. Deep Learning Architectures
3.1.1. Frame- and Tatum-Synchronous Encoders
Input
CNN
Tatum Max-Pooling
3.1.2. RNN or Self-Attention Decoders
RNN
Self-Attention
Output
3.2. Training Procedure
3.2.1. Datasets
3.2.2. Sampling Procedure
Mixing Datasets
Pre-Processing
Further Details
4. Results
4.1. Evaluation of Architectures
4.2. Evaluating Training Procedures
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
ADT | Automatic drum transcription |
AMT | Automatic music transcription |
BD | Bass drum |
CNN | Convolutional neural network |
CY | Cymbal |
DL | Deep learning |
DTD | Drum transcription of drum-only recordings |
DTM | Drum transcription in the presence of melodic instruments |
DTP | Drum transcription in the presence of percussion |
HH | Hi-hat |
MIT | Multi-instrument transcription |
RD | Ride |
RNN | Recurrent neural network |
SD | Snare drum |
SS | Source separation |
Appendix A. Quantitative Study of the Cleansing Procedure
Original Game Label | KD | SD | TT | HH | CY + RD |
---|---|---|---|---|---|
Orange drum | 3,219,032 | | | | |
Red drum | | 1,855,106 | | 17,531 | 23 |
Yellow drum | | | 214,105 | | |
Blue drum | | 213 | 188,620 | | |
Green drum | | 412 | 198,367 | | |
Yellow cymbal | | 10,016 | | 1,043,200 | 25,598 |
Blue cymbal | | | | | 993,918 |
Green cymbal | | | | | 610,726 |
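The table above is effectively a contingency table between the original game labels and the five-class vocabulary obtained after cleansing. One statistic that can be read directly from its rows, independently of which class the smaller counts fall under, is how consistently each game label maps to a single class. A minimal sketch of that computation (not the authors' analysis code; only the row counts from the table are used):

```python
# Share of hits that end up in each game label's most frequent class after cleansing.
counts = {
    "Orange drum":   [3_219_032],
    "Red drum":      [1_855_106, 17_531, 23],
    "Yellow drum":   [214_105],
    "Blue drum":     [213, 188_620],
    "Green drum":    [412, 198_367],
    "Yellow cymbal": [10_016, 1_043_200, 25_598],
    "Blue cymbal":   [993_918],
    "Green cymbal":  [610_726],
}

for label, row in counts.items():
    total = sum(row)
    purity = max(row) / total
    print(f"{label:14s} {total:>9,d} hits, {purity:6.2%} in the dominant class")
```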
Appendix B. Dataset Accessibility
References
- Wu, C.W.; Dittmar, C.; Southall, C.; Vogl, R.; Widmer, G.; Hockman, J.; Müller, M.; Lerch, A. A Review of Automatic Drum Transcription. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1457–1483. [Google Scholar] [CrossRef]
- Vogl, R.; Widmer, G.; Knees, P. Towards multi-instrument drum transcription. In Proceedings of the 21st International Conference on Digital Audio Effects (DAFx-18), Aveiro, Portugal, 4–8 September 2018. [Google Scholar]
- Zehren, M.; Alunno, M.; Bientinesi, P. ADTOF: A large dataset of non-synthetic music for automatic drum transcription. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), Online, 7–12 November 2021; pp. 818–824. [Google Scholar]
- Wei, I.C.; Wu, C.W.; Su, L. Improving Automatic Drum Transcription Using Large-Scale Audio-to-Midi Aligned Data. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: Toronto, ON, Canada, 2021; pp. 246–250. [Google Scholar] [CrossRef]
- Ishizuka, R.; Nishikimi, R.; Nakamura, E.; Yoshii, K. Tatum-Level Drum Transcription Based on a Convolutional Recurrent Neural Network with Language Model-Based Regularized Training. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; pp. 359–364. [Google Scholar]
- Ishizuka, R.; Nishikimi, R.; Yoshii, K. Global Structure-Aware Drum Transcription Based on Self-Attention Mechanisms. Signals 2021, 2, 508–526. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Jacques, C.; Roebel, A. Automatic drum transcription with convolutional neural networks. In Proceedings of the 21st International Conference on Digital Audio Effects (DAFx-18), Aveiro, Portugal, 4–8 September 2018; pp. 80–86. [Google Scholar]
- Cartwright, M.; Bello, J.P. Increasing Drum Transcription Vocabulary Using Data Synthesis. In Proceedings of the 21st International Conference on Digital Audio Effects (DAFx-18), Aveiro, Portugal, 4–8 September 2018; pp. 72–79. [Google Scholar]
- Choi, K.; Cho, K. Deep Unsupervised Drum Transcription. arXiv 2019, arXiv:1906.03697. [Google Scholar]
- Jacques, C.; Roebel, A. Data Augmentation for Drum Transcription with Convolutional Neural Networks. arXiv 2019, arXiv:1903.01416. [Google Scholar]
- Callender, L.; Hawthorne, C.; Engel, J. Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset. arXiv 2020, arXiv:2004.00188. [Google Scholar]
- Manilow, E.; Seetharaman, P.; Pardo, B. Simultaneous Separation and Transcription of Mixtures with Multiple Polyphonic and Percussive Instruments. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Barcelona, Spain, 2020; pp. 771–775. [Google Scholar] [CrossRef]
- Wang, Y.; Salamon, J.; Cartwright, M.; Bryan, N.J.; Bello, J.P. Few-Shot Drum Transcription in Polyphonic Music. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montréal, QC, Canada, 11–15 October 2020; pp. 117–124. [Google Scholar]
- Cheuk, K.W.; Herremans, D.; Su, L. ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 3918–3926. [Google Scholar] [CrossRef]
- Gardner, J.; Simon, I.; Manilow, E.; Hawthorne, C.; Engel, J. MT3: Multi-Task Multitrack Music Transcription. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022; p. 21. [Google Scholar]
- Simon, I.; Gardner, J.; Hawthorne, C.; Manilow, E.; Engel, J. Scaling Polyphonic Transcription with Mixtures of Monophonic Transcriptions. In Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 4–8 December 2022; p. 8. [Google Scholar]
- Cheuk, K.W.; Choi, K.; Kong, Q.; Li, B.; Won, M.; Hung, A.; Wang, J.C.; Herremans, D. Jointist: Joint Learning for Multi-instrument Transcription and Its Applications. arXiv 2022, arXiv:2206.10805. [Google Scholar]
- Hennequin, R.; Khlif, A.; Voituret, F.; Moussallam, M. Spleeter: A fast and efficient music source separation tool with pre-trained models. J. Open Source Softw. 2020, 5, 2154. [Google Scholar] [CrossRef]
- Manilow, E.; Wichern, G.; Seetharaman, P.; Le Roux, J. Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 20–23 October 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
- Ostermann, F.; Vatolkin, I.; Ebeling, M. AAM: A dataset of Artificial Audio Multitracks for diverse music information retrieval tasks. EURASIP J. Audio Speech Music. Process. 2023, 2023, 13. [Google Scholar] [CrossRef]
- Raffel, C. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching; Columbia University: New York, NY, USA, 2016. [Google Scholar] [CrossRef]
- Böck, S.; Korzeniowski, F.; Schlüter, J.; Krebs, F.; Widmer, G. madmom: A New Python Audio and Music Signal Processing Library. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 1174–1178. [Google Scholar] [CrossRef]
- Böck, S.; Krebs, F.; Widmer, G. Joint Beat and Downbeat Tracking with Recurrent Neural Networks. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), New York, NY, USA, 7–11 August 2016; pp. 255–261. [Google Scholar] [CrossRef]
- Gillet, O.; Richard, G. ENST-Drums: An extensive audio-visual database for drum signals processing. In Proceedings of the 7th International Society for Music Information Retrieval Conference (ISMIR), Victoria, BC, Canada, 8–12 October 2006; pp. 156–159. [Google Scholar]
- Southall, C.; Wu, C.W.; Lerch, A.; Hockman, J. MDB drums—An annotated subset of MedleyDB for Automatic Drum Transcription. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017. [Google Scholar]
- Vogl, R.; Dorfer, M.; Knees, P. Drum transcription from polyphonic music with recurrent neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; IEEE: New Orleans, LA, USA, 2017; pp. 201–205. [Google Scholar] [CrossRef]
- Bittner, R.; Salamon, J.; Tierney, M.; Mauch, M.; Cannam, C.; Bello, J. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 27–31 October 2014; pp. 155–160. [Google Scholar]
- Driedger, J.; Schreiber, H.; Bas de Haas, W.; Müller, M. Towards automatically correcting tapped beat annotations for music recordings. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 4–8 November 2019; pp. 200–207. [Google Scholar]
- Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the Conference of the North American Chapter of The Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498. [Google Scholar] [CrossRef]
- Böck, S.; Davies, M.E.P. Deconstruct, Analyse, Reconstruct: How to improve Tempo, Beat, and Downbeat Estimation. In Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Montréal, QC, Canada, 11–15 October 2020; pp. 574–582. [Google Scholar]
- Hung, Y.N.; Wang, J.C.; Song, X.; Lu, W.T.; Won, M. Modeling Beats and Downbeats with a Time-Frequency Transformer. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 401–405. [Google Scholar] [CrossRef]
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-scale machine learning on heterogeneous systems. arXiv 2016, arXiv:1603.04467. [Google Scholar]
- Raffel, C.; Ellis, D.P.W. Intuitive Analysis, Creation and Manipulation of MIDI Data with pretty_midi. In Proceedings of the 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 27–31 October 2014. [Google Scholar]
- Raffel, C.; McFee, B.; Humphrey, E.J.; Salamon, J.; Nieto, O.; Liang, D.; Ellis, D.P.W. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 27–31 October 2014; pp. 367–372. [Google Scholar]
- Hernandez, D.; Kaplan, J.; Henighan, T.; McCandlish, S. Scaling Laws for Transfer. arXiv 2021, arXiv:2102.01293. [Google Scholar]
- Rolnick, D.; Veit, A.; Belongie, S.; Shavit, N. Deep Learning is Robust to Massive Label Noise. arXiv 2018, arXiv:1705.10694. [Google Scholar]
- Nieto, O.; McCallum, M.; Davies, M.E.P.; Robertson, A.; Stark, A.; Egozy, E. The Harmonix Set: Beats, Downbeats, and Functional Segment Annotations of Western Popular Music. In Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 4–8 November 2019; pp. 565–572. [Google Scholar]
Year | Work | Task | Voc. | Architecture | Training Proc. | Training Data |
---|---|---|---|---|---|---|
2018 | Cartwright and Bello [9] | DTM + BD | 14 | CRNN | Supervised | ENST, MDB, RBMA, SDDS 1, etc. |
2018 | Jacques and Roebel [8] | DTM | 3 | CNN | Supervised | ENST, RWC 1 |
2018 | Vogl et al. [2] | DTM | 18 | CRNN | Supervised | ENST, MDB, RBMA, TMIDT 1 |
2019 | Choi and Cho [10] | DTP | 11 | CRNN | Unsupervised | In-house dataset |
2019 | Jacques and Roebel [11] | DTM | 3 | CNN | Supervised | MIREX 2018 2 |
2020 | Callender et al. [12] | DTD + V | 7 | CRNN | Supervised | E-GMD 1,2 |
2020 | Ishizuka et al. [5] | DTM | 3 | CRNN, LM | Supervised | RWC 3,4 |
2020 | Manilow et al. [13] | MIT + SS | 88 | RNN | Supervised | MAPS, Slakh 1, GuitarSet |
2020 | Wang et al. [14] | DTM | Open | CNN | Few-shot | Slakh 1 |
2021 | Cheuk et al. [15] | MIT | 88 | CNN-SelfAtt | Semi-supervised | MAPS 1, MusicNet |
2021 | Gardner et al. [16] | MIT | 128 | SelfAtt | Supervised | Cerberus4 1, Slakh 1, etc. |
2021 | Ishizuka et al. [6] | DTM | 3 | CNN-SelfAtt, LM | Supervised | Slakh 1,4, RWC 3,4 |
2021 | Wei et al. [4] | DTM | 3 | CNN-SelfAtt | Supervised | A2MD (TS) |
2021 | Zehren et al. [3] | DTM | 5 | CRNN | Supervised | ADTOF |
2022 | Simon et al. [17] | MIT | 128 | SelfAtt | Self-supervised | In-house dataset, Cerberus4 1, etc. |
2022 | Cheuk et al. [18] | MIT+SS | 128 | CRNN | Supervised | Slakh 1 |
Beats | Subdivision | ADTOF-RGW Conflict | ADTOF-RGW Far | ADTOF-YT Conflict | ADTOF-YT Far | RBMA Conflict | RBMA Far | ENST Conflict | ENST Far | MDB Conflict | MDB Far |
---|---|---|---|---|---|---|---|---|---|---|---|
Madmom | 4 | 1.21% | 2.13% | 6.05% | 6.97% | 5.44% | 3.30% | 1.84% | 4.56% | 3.25% | 8.74% |
Madmom | 12 | 0.05% | 0.05% | 0.11% | 0.07% | 1.17% | 0.01% | 0.50% | 0.24% | 0.28% | 0.28% |
Ground Truth | 4 | 1.52% | 3.84% | 6.65% | 9.87% | 6.14% | 4.42% | - | - | - | - |
Ground Truth | 12 | 0.05% | 0.06% | 0.24% | 0.22% | 1.79% | 0.39% | - | - | - | - |
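A plausible way to obtain percentages like those in the table above is to quantize each class's onsets to the tatum grid and count, per class, the onsets that collide on the same tatum ("Conflict") and the onsets lying farther than some tolerance from the nearest tatum ("Far"). Both definitions, the tolerance value, and the helper below are assumptions made for this sketch, not the paper's exact formulation:

```python
import numpy as np

def quantization_stats(onsets, grid, tolerance=0.025):
    """Rate of onsets degraded when snapping one class to a tatum grid.

    'conflict': two or more onsets of the class snap to the same tatum,
    so all but one would be merged.  'far': the nearest tatum is more
    than `tolerance` seconds away.  Both are assumed definitions.
    """
    onsets = np.asarray(onsets, dtype=float)
    idx = np.abs(onsets[:, None] - grid[None, :]).argmin(axis=1)
    dist = np.abs(onsets - grid[idx])
    n = len(onsets)
    conflict = (n - len(np.unique(idx))) / n
    far = float(np.mean(dist > tolerance))
    return conflict, far

# Toy usage with a 12-subdivision grid built from evenly spaced beats.
beats = np.arange(0.0, 4.0, 0.5)
grid = np.concatenate([np.linspace(a, b, 12, endpoint=False)
                       for a, b in zip(beats[:-1], beats[1:])] + [beats[-1:]])
print(quantization_stats([0.50, 0.51, 1.30], grid))
```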
Dataset | Hours | Vocabulary | #Tracks | Real Music | Beat |
---|---|---|---|---|---|
ENST [25] | 1.0 | 20 | 64 | 🗸 | |
MDB [26] | 0.4 | 21 | 23 | 🗸 | |
RBMA [27] | 1.7 | 24 | 30 | 🗸 | 🗸 |
TMIDT [2] | 259 | 18 | 4197 | | 🗸 |
A2MD [4] | 35 | 3 | 1565 | 🗸 | 🗸 |
ADTOF-RGW [3] | 114 | 5 | 1739 | 🗸 | 🗸 |
ADTOF-YT | 245 | 5 | 2924 | 🗸 | 🗸 |
Encoder | Decoder | ADTOF-RGW | ADTOF-YT | RBMA | ENST | MDB |
---|---|---|---|---|---|---|
Frame | RNN | 0.83 | 0.85 | 0.65 | 0.78 | 0.81 |
Frame | Self-att | 0.83 | 0.85 | 0.64 | 0.79 | 0.79 |
Tatum | RNN | 0.81 | 0.83 | 0.62 | 0.75 | 0.81 |
Tatum | Self-att | 0.82 | 0.83 | 0.62 | 0.79 | 0.80 |
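The scores above are onset-level F-measures per drum class. A minimal sketch of how such a score can be computed with mir_eval's onset matching is given below; the 50 ms tolerance is the common default and an assumption here, the onset times are made up, and averaging plainly over classes (rather than over tracks or hits) is also an assumption of this sketch:

```python
import numpy as np
import mir_eval

# Per-class onset F-measure with a 50 ms tolerance, then averaged over classes.
reference = {
    "BD": np.array([0.00, 0.50, 1.00, 1.50]),
    "SD": np.array([0.25, 0.75, 1.25, 1.75]),
    "HH": np.array([0.00, 0.25, 0.50, 0.75, 1.00]),
}
estimated = {
    "BD": np.array([0.01, 0.51, 1.02, 1.49]),
    "SD": np.array([0.26, 0.74, 1.80]),
    "HH": np.array([0.00, 0.25, 0.49, 0.76, 1.00, 1.40]),
}

scores = []
for drum, ref_onsets in reference.items():
    f, p, r = mir_eval.onset.f_measure(ref_onsets, estimated[drum], window=0.05)
    scores.append(f)
    print(f"{drum}: F={f:.2f} P={p:.2f} R={r:.2f}")
print(f"mean F-measure: {np.mean(scores):.2f}")
```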
# | Pre-Training | Training Dataset(s) | ADTOF-RGW | ADTOF-YT | RBMA | ENST | MDB |
---|---|---|---|---|---|---|---|
1 | | All five | 0.83 | 0.85 | 0.65 | 0.78 | 0.81 |
2 | | ADTOF-RGW, ADTOF-YT | 0.82 | 0.85 | 0.65 | 0.78 | 0.81 |
3 | | ADTOF-RGW (Zehren et al. [3]) | 0.79 | 0.73 | 0.63 | 0.72 | 0.76 |
4 | | ADTOF-YT | 0.76 | 0.85 | 0.57 | 0.73 | 0.78 |
5 | | RBMA, ENST, MDB | 0.63 | 0.48 | 0.57 | 0.73 | 0.74 |
6 | TMIDT | All five | 0.79 | 0.81 | 0.62 | 0.77 | 0.77 |
7 | TMIDT | ADTOF-RGW, ADTOF-YT | 0.79 | 0.81 | 0.62 | 0.76 | 0.77 |
8 | TMIDT | RBMA, ENST, MDB (Vogl et al. [2]) | 0.70 | 0.56 | 0.63 | 0.76 | 0.75 |
9 | TMIDT | - | 0.70 | 0.62 | 0.60 | 0.75 | 0.68 |