🚀🚀🚀 Official implementation of SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Zihan Liu,
Shuangrui Ding,
Zhixiong Zhang,
Xiaoyi Dong,
Pan Zhang,
Yuhang Zang,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
🚀 [2025/3/18] We released the checkpoint of SongGen Mixed_Pro on Hugging Face 🤗.
🚀 [2025/2/19] The paper and demo page are released!
- 🔥We introduce SongGen, a single-stage auto-regressive transformer for text-to-song generation, offering versatile control via lyrics, descriptive text, and an optional reference voice.
- 🔥SongGen supports both mixed and dual-track modes to accommodate diverse requirements. Our experiments provide valuable insights for optimizing both modes.
- 🔥By releasing the model weights, code, annotated data, and preprocessing pipeline, we aim to establish a simple yet effective baseline for future song generation research.
- Release annotated data and preprocessing pipeline
- Release SongGen training code
- Develop an audio upsampling renderer
- Release SongGen (Interleaving A-V) checkpoint
- Release SongGen Mixed_pro checkpoint
- Release SongGen inference code
- SongGen demo
git clone https://github.com/LiuZH-19/SongGen.git
cd SongGen
# We recommend using conda to create a new environment.
conda create -n songgen_env python=3.9.18
conda activate songgen_env
# Install CUDA >= 11.8 and PyTorch, e.g.,
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn==2.6.1 --no-build-isolation
To use SongGen for inference only, install it with:
pip install -e .
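After installation, a quick sanity check can confirm that the environment matches the pip commands above (the expected versions in the comments come directly from those commands):

```python
# Minimal environment check; expected versions match the install commands above.
import torch

print(torch.__version__)          # expect 2.3.0
print(torch.cuda.is_available())  # expect True on a CUDA >= 11.8 machine

import songgen  # should import cleanly after `pip install -e .`
```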
Download the X-Codec checkpoint from 🤗 and place it in the following directory: `SongGen/songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more`
xcodec_infer
├── ckpts
│ └── general_more
│ ├── config_hubert_general.yaml
│ └── xcodec_hubert_general_audio_v2.pth
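If you prefer to script the download, here is a sketch using `huggingface_hub` (the repo ID below is a placeholder; substitute the actual X-Codec repository linked above):

```python
# Sketch only: the repo ID is a placeholder, not a real repository name.
from huggingface_hub import hf_hub_download

repo_id = "<x-codec-repo-id>"  # replace with the repository linked above
for fname in ("config_hubert_general.yaml", "xcodec_hubert_general_audio_v2.pth"):
    hf_hub_download(
        repo_id=repo_id,
        filename=fname,
        # Target path relative to the SongGen repo root
        local_dir="songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more",
    )
```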
import torch
import os
from songgen import (
VoiceBpeTokenizer,
SongGenMixedForConditionalGeneration,
SongGenProcessor
)
import soundfile as sf
ckpt_path = "LiuZH-19/SongGen_mixed_pro" # Path to the pretrained model
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = SongGenMixedForConditionalGeneration.from_pretrained(
    ckpt_path,
    attn_implementation='sdpa',
).to(device)
processor = SongGenProcessor(ckpt_path, device)
# Define input text and lyrics
lyrics = "..." # The lyrics text
text = "..." # The music description text
ref_voice_path = 'path/to/your/reference_audio.wav'  # Path to the reference audio (optional)
separate = True  # Whether to separate the vocal track from the reference voice audio
model_inputs = processor(text=text, lyrics=lyrics, ref_voice_path=ref_voice_path, separate=separate)
generation = model.generate(**model_inputs, do_sample=True)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("songgen_out.wav", audio_arr, model.config.sampling_rate)
import torch
import os
from songgen import (
VoiceBpeTokenizer,
SongGenDualTrackForConditionalGeneration,
SongGenProcessor
)
import soundfile as sf
ckpt_path = "..." # Path to the pretrained model
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = SongGenDualTrackForConditionalGeneration.from_pretrained(
    ckpt_path,
    attn_implementation='sdpa',
).to(device)
processor = SongGenProcessor(ckpt_path, device)
# Define input text and lyrics
lyrics = "..." # The lyrics text
text = "..." # The music description text
ref_voice_path = 'path/to/your/reference_audio.wav'  # Path to the reference audio (optional)
separate = True  # Whether to separate the vocal track from the reference voice audio
model_inputs = processor(text=text, lyrics=lyrics, ref_voice_path=ref_voice_path, separate=separate)
generation = model.generate(**model_inputs, do_sample=True)
# The dual-track model returns the accompaniment and vocal tracks separately.
acc_array = generation[0].cpu().numpy()
vocal_array = generation[1].cpu().numpy()
# Trim both tracks to a common length, then mix them into the final song.
min_len = min(vocal_array.shape[0], acc_array.shape[0])
acc_array = acc_array[:min_len]
vocal_array = vocal_array[:min_len]
audio_arr = vocal_array + acc_array
sf.write("songgen_out.wav", audio_arr, model.config.sampling_rate)
This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!
Special thanks to:
- Parler-TTS: The codebase we built upon.
- X-Codec: The audio codec utilized in our research.
- lp-music-caps: A project aimed at generating captions for music.
We deeply appreciate all the support we've received along the way.
This is a research project focused on text-to-song generation. Due to limitations of the current training dataset, the model is restricted to generating English songs with a maximum duration of 30 seconds. Even so, despite being trained on only 2k hours of data with a 1.3B-parameter model, our approach demonstrates strong effectiveness and promising potential in generating coherent and expressive songs. We believe that scaling up both the data and the model size will further improve lyric alignment and musicality. That said, scaling the dataset is time-consuming and challenging, and we welcome collaborations and discussions on new ways to improve the model and extend its capabilities.

For any inquiries or potential collaborations, feel free to reach out: Zihan Liu (liuzihan@pjlab.org.cn) and Jiaqi Wang (wangjiaqi@pjlab.org.cn).
If you find our work helpful for your research, please consider giving us a star ⭐ and citing our work 📝:
@misc{liu2025songgen,
title={SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation},
author={Zihan Liu and Shuangrui Ding and Zhixiong Zhang and Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Dahua Lin and Jiaqi Wang},
year={2025},
eprint={2502.13128},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2502.13128},
}