GitHub - loubbrad/aria-midi: Official repository for Aria-MIDI: a MIDI dataset of 1,186,253 transcribed solo-piano recordings.

The Aria-MIDI Dataset

Paper / Huggingface

The Aria-MIDI dataset is a collection of 1,186,253 MIDI files, comprising approximately 100,629 hours of transcribed solo-piano recordings, with metadata in categories including genre, composer, and performer, as well as compositional identifiers. We developed Aria-MIDI as a dataset for pre-training generative music models, and are releasing it to facilitate open research in music information retrieval and generative modelling for symbolic music.

NOTE: For applications to generative modelling, we highly recommend using the pruned subset, which has been filtered and post-processed for that purpose.

Download (V1)

Along with the full dataset, we provide several subsets which may be appropriate for different use cases:

| Subset | # Files | Deduplication [1] | Pre-processing filters [2] | Example Application |
|---|---|---|---|---|
| Full [download] | 1,186,253 | No | None | Data analysis |
| Pruned [download] | 820,944 | 10 | Light | Foundation model pre-training |
| Deduped [download] | 371,053 | 1 | Heavy | Generative modelling |
| Unique [download] | 32,522 | 1 | Compositional metadata [3] | Composition fingerprints |

Dataset

As detailed in our paper, Aria-MIDI was created by transcribing publicly available solo-piano audio recordings into MIDI files. To accomplish this, we developed several tools and integrated them into our data pipeline. We also release these tools under the Apache-2.0 license. In particular:

Aria-AMT. A seq-to-seq piano transcription model which accurately transcribes notes from solo-piano recordings to MIDI. We designed this model to be robust, aiming to maintain note-identification accuracy when transcribing realistic audio from a wide range of recording environments. We also provide an inference engine for Linux/CUDA, supporting batched processing as well as multi-GPU setups.

Aria-CL. A solo-piano audio-classification model capable of detecting and isolating (i.e., segmenting) solo-piano content from arbitrary audio files. We designed our approach to additionally exclude solo-piano audio which may cause problems for transcription models, e.g., recordings with significant audio artifacts.

We provide the dataset as a tarball. Filenames follow the format <file_id>_<segment_number>.mid, where each segment number corresponds to a distinct (i.e., non-overlapping), contiguous portion of solo-piano content from each audio file, identified and transcribed independently. Metadata is provided in the file metadata.json, structured as:

<file_id>: {
  "metadata": {
    <metadata_category>: <metadata>,
    ...,
  },
  "audio_scores": {
    <segment_number>: <score>,
    ...,
  }
}

Metadata was extracted from textual information associated with each original audio file and may vary in accuracy. Audio scores are calculated as the average score assigned by our classifier across segments, designed to correlate with the audio quality of the underlying solo-piano recording.
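The filename convention and metadata layout described above can be tied together with a few lines of Python. This is a minimal sketch, not part of the official tooling: the helper names `parse_midi_name` and `lookup` are hypothetical, the sample entry (file id, tag values, scores) is invented for illustration, and it assumes that both file ids and segment numbers appear as string keys, as JSON requires.

```python
import json
from pathlib import Path


def parse_midi_name(path):
    """Split '<file_id>_<segment_number>.mid' into (file_id, segment_number) strings."""
    stem = Path(path).stem
    # Split on the last underscore so a file id containing '_' would still parse.
    file_id, segment_number = stem.rsplit("_", 1)
    return file_id, segment_number


def lookup(metadata, path):
    """Return (metadata dict, audio score or None) for one MIDI file."""
    file_id, segment_number = parse_midi_name(path)
    entry = metadata[file_id]
    score = entry["audio_scores"].get(segment_number)
    return entry["metadata"], score


# Toy stand-in for the contents of metadata.json; all values here are invented.
example = json.loads("""
{
  "000123": {
    "metadata": {"composer": "chopin", "genre": "classical"},
    "audio_scores": {"0": 0.97, "1": 0.88}
  }
}
""")

tags, score = lookup(example, "data/000123_1.mid")
print(tags["composer"], score)
```

With the real dataset, `example` would instead be loaded from the extracted tarball, e.g. `json.load(open("metadata.json"))`, assuming that file sits in the working directory.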

Citation/License

Aria-MIDI is distributed under the CC-BY-NC-SA 4.0 license. By accessing this dataset, you declare that you agree to our disclaimer. If you use the dataset, or any of the components in our data pipeline, please cite the paper in which they were introduced:

@inproceedings{bradshawaria,
  title={Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling},
  author={Bradshaw, Louis and Colton, Simon},
  booktitle={International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=X5hrhgndxW}, 
}

Footnotes

  1. For popular composers, we retain at most X instances for each opus/piece-number pair, where X is the value listed in the Deduplication column, and discard files lacking compositional identifiers.

  2. Heuristic-based filtering, considering note density, pitch and duration entropy, silence, number of segments, and indicators of repetitive content, to exclude problematic entries.

  3. Exclude all files lacking exact compositional identifiers.
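
The capping rule in footnote 1 can be illustrated with a short sketch. This is not the authors' pipeline: the function name `cap_per_piece` and the record fields `composer`, `opus`, and `piece_number` are hypothetical, the sample records are invented, and the sketch makes two simplifying assumptions: it applies the cap to every composer rather than only popular ones, and it keeps whichever instances come first in the input.

```python
from collections import defaultdict


def cap_per_piece(records, max_instances):
    """Keep at most `max_instances` records per (composer, opus, piece_number)."""
    counts = defaultdict(int)
    kept = []
    for rec in records:
        key = (rec.get("composer"), rec.get("opus"), rec.get("piece_number"))
        if any(part is None for part in key):
            continue  # lacks compositional identifiers, so it is discarded
        if counts[key] < max_instances:
            counts[key] += 1
            kept.append(rec)
    return kept


# Invented records; field names are assumptions, not the dataset's schema.
records = [
    {"file": "a", "composer": "chopin", "opus": 10, "piece_number": 4},
    {"file": "b", "composer": "chopin", "opus": 10, "piece_number": 4},
    {"file": "c", "composer": "chopin", "opus": 25, "piece_number": 1},
    {"file": "d", "composer": "chopin"},
]

kept = cap_per_piece(records, max_instances=1)
print([r["file"] for r in kept])
```

Setting `max_instances=10` would correspond to the cap listed for the Pruned subset, and `max_instances=1` to the Deduped and Unique subsets.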
