🚀🚀🚀 Official implementation of SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Zihan Liu,
Shuangrui Ding,
Zhixiong Zhang,
Xiaoyi Dong,
Pan Zhang,
Yuhang Zang,
Yuhang Cao,
Dahua Lin,
Jiaqi Wang
🚀 [2025/3/18] We released the checkpoint of SongGen Mixed_Pro on Hugging Face 🤗.
🚀 [2025/2/19] The paper and demo page are released!
- 🔥We introduce SongGen, a single-stage auto-regressive transformer for text-to-song generation, offering versatile control via lyrics, descriptive text, and an optional reference voice.
- 🔥SongGen supports both mixed and dual-track modes to accommodate diverse requirements. Our experiments provide valuable insights for optimizing both modes.
- 🔥By releasing the model weights, code, annotated data, and preprocessing pipeline, we aim to establish a simple yet effective baseline for future song generation research.
- Release annotated data and preprocessing pipeline
- Release SongGen training code
- Develop an audio upsampling renderer
- Release SongGen (Interleaving A-V) checkpoint
- Release SongGen Mixed_pro checkpoint
- Release SongGen inference code
- SongGen demo
git clone https://github.com/LiuZH-19/SongGen.git
cd SongGen
# We recommend using conda to create a new environment.
conda create -n songgen_env python=3.9.18
conda activate songgen_env
# Install CUDA >= 11.8 and PyTorch, e.g.,
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn==2.6.1 --no-build-isolation
To use SongGen for inference only, install it with:
pip install -e .
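After installation, a quick sanity check can confirm that the environment matches the pip commands above (the expected versions in the comments come directly from those commands):

```python
# Minimal environment check; expected versions match the install commands above.
import torch

print(torch.__version__)          # expect 2.3.0
print(torch.cuda.is_available())  # expect True on a CUDA >= 11.8 machine

import songgen  # should import cleanly after `pip install -e .`
```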
Download the X-Codec checkpoint from 🤗 and place it in the following directory: `SongGen/songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more`
xcodec_infer
├── ckpts
│ └── general_more
│ ├── config_hubert_general.yaml
│ └── xcodec_hubert_general_audio_v2.pth
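If you prefer to script the download, here is a sketch using `huggingface_hub` (the repo ID below is a placeholder; substitute the actual X-Codec repository linked above):

```python
# Sketch only: the repo ID is a placeholder, not a real repository name.
from huggingface_hub import hf_hub_download

repo_id = "<x-codec-repo-id>"  # replace with the repository linked above
for fname in ("config_hubert_general.yaml", "xcodec_hubert_general_audio_v2.pth"):
    hf_hub_download(
        repo_id=repo_id,
        filename=fname,
        # Target path relative to the SongGen repo root
        local_dir="songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more",
    )
```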
import torch
import os
from songgen import (
VoiceBpeTokenizer,
SongGenMixedForConditionalGeneration,
SongGenProcessor
)
import soundfile as sf
ckpt_path = "LiuZH-19/SongGen_mixed_pro" # Path to the pretrained model
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = SongGenMixedForConditionalGeneration.from_pretrained(
    ckpt_path,
    attn_implementation='sdpa',
).to(device)
processor = SongGenProcessor(ckpt_path, device)
# Define input text and lyrics
lyrics = "..." # The lyrics text
text = "..." # The music description text
ref_voice_path = 'path/to/your/reference_audio.wav'  # Path to the reference audio (optional)
separate = True  # Whether to separate the vocal track from the reference voice audio
model_inputs = processor(text=text, lyrics=lyrics, ref_voice_path=ref_voice_path, separate=separate)
generation = model.generate(**model_inputs, do_sample=True)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("songgen_out.wav", audio_arr, model.config.sampling_rate)
import torch
import os
from songgen import (
VoiceBpeTokenizer,
SongGenDualTrackForConditionalGeneration,
SongGenProcessor
)
import soundfile as sf
ckpt_path = "..." # Path to the pretrained model
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = SongGenDualTrackForConditionalGeneration.from_pretrained(
    ckpt_path,
    attn_implementation='sdpa',
).to(device)
processor = SongGenProcessor(ckpt_path, device)
# Define input text and lyrics
lyrics = "..." # The lyrics text
text = "..." # The music description text
ref_voice_path = 'path/to/your/reference_audio.wav'  # Path to the reference audio (optional)
separate = True  # Whether to separate the vocal track from the reference voice audio
model_inputs = processor(text=text, lyrics=lyrics, ref_voice_path=ref_voice_path, separate=separate)
generation = model.generate(**model_inputs, do_sample=True)
# The dual-track model returns the accompaniment and vocal tracks separately.
acc_array = generation[0].cpu().numpy()
vocal_array = generation[1].cpu().numpy()
# Trim both tracks to a common length, then mix them into the final song.
min_len = min(vocal_array.shape[0], acc_array.shape[0])
acc_array = acc_array[:min_len]
vocal_array = vocal_array[:min_len]
audio_arr = vocal_array + acc_array
sf.write("songgen_out.wav", audio_arr, model.config.sampling_rate)
This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!
Special thanks to:
- Parler-TTS: The codebase we built upon.
- X-Codec: The audio codec utilized in our research.
- lp-music-caps: A project aimed at generating captions for music.
We deeply appreciate all the support we've received along the way.
This is a research project focused on text-to-song generation. Due to limitations of the current training dataset, the model is restricted to generating English songs with a maximum duration of 30 seconds. Even so, despite being trained on only 2k hours of data with a 1.3B-parameter model, our approach demonstrates strong effectiveness and promising potential in generating coherent and expressive songs. We believe that scaling up both the data and the model size will further improve lyric alignment and musicality. That said, scaling the dataset is time-consuming and challenging, and we welcome collaborations and discussions on new ways to improve the model and extend its capabilities.

For any inquiries or potential collaborations, feel free to reach out: Zihan Liu (liuzihan@pjlab.org.cn) and Jiaqi Wang (wangjiaqi@pjlab.org.cn).
If you find our work helpful for your research, please consider giving us a star ⭐ and citing our work 📝:
@misc{liu2025songgen,
title={SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation},
author={Zihan Liu and Shuangrui Ding and Zhixiong Zhang and Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Dahua Lin and Jiaqi Wang},
year={2025},
eprint={2502.13128},
archivePrefix={arXiv},
primaryClass={cs.SD},
url={https://arxiv.org/abs/2502.13128},
}