Code accompanying the NIPS 2017 paper "Generating Black Metal and Math Rock: Beyond Bach, Beethoven, and Beatles" and the MUME 2018 paper "Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands".
We modified a SampleRNN architecture to generate music in modern genres such as black metal, math rock, skate punk, and beatbox.
This early example of neural synthesis is a proof-of-concept for how machine learning can drive new types of music software. Creating music can be as simple as specifying a set of musical influences on which a model trains. We demonstrate a method for generating albums that imitate bands in experimental music genres previously unrealized by traditional synthesis techniques (e.g. additive, subtractive, FM, granular, concatenative). Unlike MIDI and symbolic models, SampleRNN generates raw audio in the time domain. This capability becomes increasingly important in modern music styles, where timbre and space are used compositionally. Long developmental compositions with rapid transitions between sections are possible by increasing the depth of the network beyond the number used for speech datasets. We are delighted by the unique characteristic artifacts of neural synthesis.
We've created several albums this way. Read our papers for more explanation of how we use this as part of a creative workflow, how to choose good datasets, and more.
Dadabots is CJ Carr [github] [website] and Zack Zukowski [github] [website]
Original SampleRNN paper SampleRNN: An Unconditional End-to-End Neural Audio Generation Model.
- Load a dataset of audio
- Train a model on that audio to predict "given what just happened, what comes next?"
- Generate new audio by iteratively choosing "what comes next" indefinitely (sketched below)
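As a minimal sketch of that loop (our toy illustration, not the repo's code; predict_next_distribution is a hypothetical stand-in for the trained two-tier network):

import numpy as np

def predict_next_distribution(history, q_levels=256):
    # Hypothetical stand-in: a real SampleRNN runs its RNN tiers over
    # the history and returns a softmax over the quantization bins.
    # Uniform probabilities here, just so the loop is runnable.
    return np.full(q_levels, 1.0 / q_levels)

history = [128] * 16        # seed with mid-scale bins ("silence")
for _ in range(16000 * 4):  # ~4 seconds of new audio at 16 kHz
    probs = predict_next_distribution(history)
    # Draw from the distribution rather than taking argmax
    # (see the modification list below)
    history.append(np.random.choice(len(probs), p=probs))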
- Auto-preprocessing (audio conversion, concatenation, chunking, and saving .npy files). We find splitting an album into 3200 overlapping chunks of 8 seconds each gives us good results.
- New scripts for generating 100s of audio examples in parallel from a trained net.
- New scripts for different sample rates are available (16k, 32k). 32k audio sounds better, but the nets take longer to train, and they don't learn structure as well as 16k.
- Any processed dataset can be loaded into the two-tier network via arguments. This significantly speeds up the workflow without having to change code.
- Sampling draws from the predicted distribution (not argmax). This makes sense because certain sounds (noise, texture, the "s" sound in speech) are inherently stochastic. It is also significant for avoiding traps where the generated audio gets stuck in a loop (see the sketch after this list).
- Any number of RNN layers is now possible (until you run out of memory). This was significant to getting good results: the original limit was insufficient for music, and we get good results with 5 layers.
- Local conditioning. Although we haven't fully researched the possibilities of local conditioning, we coded it in.
- Fix bad amplitude normalization causing DC offsets (see issue)
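Here is the distribution-versus-argmax point from the list above as a toy sketch (toy numbers, not the repo's code):

import numpy as np

logits = np.array([2.0, 1.0, 0.5, 0.1])        # toy network output over 4 bins
probs = np.exp(logits) / np.exp(logits).sum()  # softmax

np.argmax(probs)                       # always bin 0: deterministic, loop-prone
np.random.choice(len(probs), p=probs)  # varies run to run, keeps noise and texture alive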
The original code lists these dependencies:
- cuDNN 5105
- Python 2.7.12
- Numpy 1.11.1
- Theano 0.8.2
- Lasagne 0.2.dev1
- ffmpeg (libav-tools)
But we get much faster training using the next generation of GPU architecture with:
- CUDA 9.2
- cuDNN 8.0
- Theano 1.0
- NVIDIA V100 GPU
A detailed description of how we set up this code on Ubuntu 16.04 with an NVIDIA V100 GPU can be found here.
To create a new dataset, place your audio here:
datasets/music/downloads/
then run the new experiment python script located in the datasets/music directory:
16k sample rate:
cd datasets/music/
sudo python new_experiment16k.py krallice downloads/
32k sample rate:
cd datasets/music/
sudo python new_experiment32k.py krallice downloads/
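The new_experiment scripts do the conversion, concatenation, and chunking for you. As a rough sketch of the overlapping-chunk idea only (a simplified re-implementation under our own assumptions, not the scripts' exact logic or file layout):

import numpy as np

def chunk_album(samples, sr=16000, chunk_secs=8, n_chunks=3200):
    # Evenly spaced start points whose 8-second windows overlap,
    # turning one album into n_chunks training examples.
    chunk_len = sr * chunk_secs
    hop = (len(samples) - chunk_len) // (n_chunks - 1)
    return np.stack([samples[i * hop : i * hop + chunk_len]
                     for i in range(n_chunks)])

album = np.zeros(16000 * 60 * 40, dtype=np.float32)  # e.g. a 40-minute album
np.save("krallice_chunks.npy", chunk_album(album))   # illustrative .npy output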
To train a model on an existing dataset with accelerated GPU processing, run the following from the root of the dadabots_sampleRNN folder; the full command further below corresponds to our best-found set of hyperparameters.
Mission control center:
$ pwd
/root/cj/dadabots_sampleRNN
$ python models/two_tier/two_tier32k.py -h
usage: two_tier.py [-h] [--exp EXP] --n_frames N_FRAMES --frame_size
FRAME_SIZE --weight_norm WEIGHT_NORM --emb_size EMB_SIZE
--skip_conn SKIP_CONN --dim DIM --n_rnn {1,2,3,4,5}
--rnn_type {LSTM,GRU} --learn_h0 LEARN_H0 --q_levels
Q_LEVELS --q_type {linear,a-law,mu-law} --which_set
{...} --batch_size {64,128,256} [--debug]
[--resume]
two_tier.py No default value! Indicate every argument.
optional arguments:
-h, --help show this help message and exit
--exp EXP Experiment name (name it anything you want)
--n_frames N_FRAMES How many "frames" to include in each Truncated BPTT
pass
--frame_size FRAME_SIZE
How many samples per frame
--weight_norm WEIGHT_NORM
Adding learnable weight normalization to all the
linear layers (except for the embedding layer)
--emb_size EMB_SIZE Size of embedding layer (0 to disable)
--skip_conn SKIP_CONN
Add skip connections to RNN
--dim DIM Dimension of RNN and MLPs
--n_rnn {1,2,3,4,5,6,7,8,9,10,11,12,n,...}
Number of layers in the stacked RNN
--rnn_type {LSTM,GRU}
GRU or LSTM
--learn_h0 LEARN_H0 Whether to learn the initial state of RNN
--q_levels Q_LEVELS Number of bins for quantization of audio samples.
Should be 256 for mu-law.
--q_type {linear,a-law,mu-law}
Quantization in linear-scale, a-law-companding, or mu-
law companding. With mu-/a-law, quantization level should
be set to 256
--which_set {...}
The name of the dataset you created; in the above example, "krallice"
--batch_size {64,128,256}
size of mini-batch
--debug Debug mode
--resume Resume the same model from the last checkpoint. Order
of params are important. [for now]
If you're using CUDA 9 with V100 GPUs, you need "device=cuda0"
If you're using CUDA 8 with K80 GPUs or earlier, you may need "device=gpu0" instead
If you have 8 GPUs, you can run up to 8 experiments in parallel by setting the device to cuda0, cuda1, cuda2, cuda3... cuda7
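For example (illustrative; the experiment names are placeholders and the trailing arguments are elided, see the full command below):

$ THEANO_FLAGS=mode=FAST_RUN,device=cuda0,floatX=float32 python -u models/two_tier/two_tier16k.py --exp band_a ... &
$ THEANO_FLAGS=mode=FAST_RUN,device=cuda1,floatX=float32 python -u models/two_tier/two_tier16k.py --exp band_b ... &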
After training hundreds of models with different hyperparameters, these were our best hyperparameters (at the limits of the V100 hardware) for the kind of music we wanted to generate. Further explanation of our choices can be found in our papers.
THEANO_FLAGS=mode=FAST_RUN,device=cuda0,floatX=float32 python -u models/two_tier/two_tier16k.py --exp krallice_experiment --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn True --dim 1024 --n_rnn 5 --rnn_type LSTM --q_levels 256 --q_type mu-law --batch_size 128 --weight_norm True --learn_h0 False --which_set krallice
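For reference, the --q_type mu-law option refers to standard 8-bit mu-law companding, which is why --q_levels should be 256. A minimal sketch of the formula (our function names, not the repo's):

import numpy as np

def mu_law_quantize(x, q_levels=256):
    # x in [-1, 1]; compand, then map to integer bins [0, q_levels - 1]
    mu = q_levels - 1
    companded = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((companded + 1) / 2 * mu).astype(np.int64)

def mu_law_expand(bins, q_levels=256):
    # Inverse mapping, for turning generated bins back into audio
    mu = q_levels - 1
    companded = 2 * bins.astype(np.float64) / mu - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu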
There's also a 3-tier option, but we initially had better results with 2-tier, so we don't use 3-tier. It doesn't have the modifications we made to 2-tier.
$ python models/three_tier/three_tier.py -h
usage: three_tier16k.py [-h] [--exp EXP] --seq_len SEQ_LEN --big_frame_size
BIG_FRAME_SIZE --frame_size FRAME_SIZE --weight_norm
WEIGHT_NORM --emb_size EMB_SIZE --skip_conn SKIP_CONN
--dim DIM --n_rnn {1,2,3,4,5} --rnn_type {LSTM,GRU}
--learn_h0 LEARN_H0 --q_levels Q_LEVELS --q_type
{linear,a-law,mu-law} --which_set {ONOM,BLIZZ,MUSIC}
--batch_size {64,128,256} [--debug] [--resume]
three_tier.py No default value! Indicate every argument.
optional arguments:
-h, --help show this help message and exit
--exp EXP Experiment name
--seq_len SEQ_LEN How many samples to include in each Truncated BPTT
pass
--big_frame_size BIG_FRAME_SIZE
How many samples per big frame in tier 3
--frame_size FRAME_SIZE
How many samples per frame in tier 2
--weight_norm WEIGHT_NORM
Adding learnable weight normalization to all the
linear layers (except for the embedding layer)
--emb_size EMB_SIZE Size of embedding layer (> 0)
--skip_conn SKIP_CONN
Add skip connections to RNN
--dim DIM Dimension of RNN and MLPs
--n_rnn {1,2,3,4,5} Number of layers in the stacked RNN
--rnn_type {LSTM,GRU}
GRU or LSTM
--learn_h0 LEARN_H0 Whether to learn the initial state of RNN
--q_levels Q_LEVELS Number of bins for quantization of audio samples.
Should be 256 for mu-law.
--q_type {linear,a-law,mu-law}
Quantization in linear-scale, a-law-companding, or mu-
law companding. With mu-/a-law, quantization level should
be set to 256
--which_set WHICH_SET
any preprocessed set in the datasets/music/ directory
--batch_size {64,128,256}
size of mini-batch
--debug Debug mode
--resume Resume the same model from the last checkpoint. Order
of params are important. [for now]
To run:
$ THEANO_FLAGS=mode=FAST_RUN,device=cuda0,floatX=float32 python -u models/three_tier/three_tier.py --exp 3TIER --seq_len 512 --big_frame_size 8 --frame_size 2 --emb_size 256 --skip_conn False --dim 1024 --n_rnn 1 --rnn_type GRU --q_levels 256 --q_type linear --batch_size 128 --weight_norm True --learn_h0 True --which_set MUSIC
Generate 100 songs (4 minutes each) from a trained 32k model:
$ THEANO_FLAGS=mode=FAST_RUN,device=gpu0,floatX=float32 python -u models/two_tier/two_tier_generate32k.py --exp krallice_experiment --n_frames 64 --frame_size 16 --emb_size 256 --skip_conn True --dim 1024 --n_rnn 5 --rnn_type LSTM --q_levels 256 --q_type mu-law --batch_size 128 --weight_norm True --learn_h0 False --which_set krallice --n_secs 240 --n_seqs 100
All the parameters have to be the same as when you trained it. Notice we're calling two_tier_generate32k.py with two new flags: --n_secs and --n_seqs.
Generating 100 songs takes just as much time as generating 5, because they are created in parallel (up to a hardware memory limit).
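A toy illustration of the batching (stand-in distributions; the expensive part in practice is the network forward pass, which the GPU computes for all sequences at once):

import numpy as np

n_seqs, q_levels = 100, 256
probs = np.random.dirichlet(np.ones(q_levels), size=n_seqs)  # stand-in for one model step
u = np.random.rand(n_seqs, 1)
next_samples = (probs.cumsum(axis=1) < u).sum(axis=1)  # inverse-CDF sample, one per song
assert next_samples.shape == (n_seqs,)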
This will generate from the latest checkpoint. However, we found the latest checkpoint does not always create the best music. Instead we listen to the test audio generated at each checkpoint, choose our favorite checkpoint, and delete the newer checkpoints, before generating a huge batch with this script.
At this point, we suggest human curation. Listen through the generated audio, find the best parts, and use them in your music. Read our MUME 2018 paper to see how our workflow changed over the course of six albums.
If you are using this code, please cite our paper:
Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands. CJ Carr, Zack Zukowski (MUME 2018).
And the original paper:
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron Courville, Yoshua Bengio, 5th International Conference on Learning Representations (ICLR 2017).
This documentation licensed CC-BY 4.0
The source code is licensed Apache 2.0