8000 GitHub - LiuZH-19/SongGen
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

LiuZH-19/SongGen

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

🚀🚀🚀 Official implementation of SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao,
Dahua Lin, Jiaqi Wang

📜 News

🚀 [2025/3/18] We released the checkpoint of SongGen Mixed_Pro at Huggingface🤗.

🚀 [2025/2/19] The paper and demo page are released!

💡 Highlights

  • 🔥We introduce SongGen, a single-stage auto-regressive transformer for text-to-song generation, offering versatile control via lyrics, descriptive text, and an optional reference voice.
  • 🔥SongGen supports both mixed and dual-track mode to accommodate diverse requirements. Our experiments provide valuable insights for optimizing both modes.
  • 🔥By releasing the model weights, code, annotated data, and preprocessing pipeline, we aim to establish a simple yet effective baseline for future song generation research.

👨‍💻 Todo

  • Release annotated data and preprocessing pipeline
  • Release SongGen training code
  • Develop an audio upsampling renderer
  • Release SongGen (Interleaving A-V) checkpoint
  • Release SongGen Mixed_pro checkpoint
  • Release SongGen inference code
  • SongGen demo

🛠️ Usage

1. Install environment and dependencies

git clone https://github.com/LiuZH-19/SongGen.git
cd SongGen
# We recommend using conda to create a new environment.
conda create -n songgen_env python=3.9.18 
conda activate songgen_env
# Install CUDA >= 11.8 and PyTorch, e.g.,
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn==2.6.1 --no-build-isolation

To use SongGen only in inference mode, install it using:

pip install -e .

2. Download the xcodec

Download the X-Codec checkpoint from 🤗 and place it in the following directory : SongGen/songgen/xcodec_wrapper/xcodec_infer/ckpts/general_more

xcodec_infer
    ├── ckpts
    │   └── general_more
    │       ├── config_hubert_general.yaml
    │       └── xcodec_hubert_general_audio_v2.pth

3. Run the inference

(1). Mixed Pro Mode

import torch
import os
from songgen import (
    VoiceBpeTokenizer,
    SongGenMixedForConditionalGeneration,
    SongGenProcessor
)
import soundfile as sf

ckpt_path = "LiuZH-19/SongGen_mixed_pro" # Path to the pretrained model
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = SongGenMixedForConditionalGeneration.from_pretrained(
    ckpt_path,
    attn_implementation='sdpa').to(device)
processor = SongGenProcessor(ckpt_path, device)

# Define input text and lyrics
lyrics = "..." # The lyrics text
text = "..." # The music description text
ref_voice_path = 'path/to/your/reference_audio.wav' # Path to reference audio, optional
separate= True # Whether to separate the vocal track from the reference voice audio

model_inputs = processor(text=text, lyrics=lyrics, ref_voice_path=ref_voice_path, separate=separate) 
generation = model.generate(**model_inputs,
                do_sample=True,
            )
audio_arr = generation.cpu().numpy().squeeze()
sf.write("songgen_out.wav", audio_arr, model.config.sampling_rate)

(2). Interleaving A-V (Dual-track mode)

import torch
import os
from songgen import (
    VoiceBpeTokenizer,
    SongGenDualTrackForConditionalGeneration,
    SongGenProcessor
)
import soundfile as sf

ckpt_path = "..." # Path to the pretrained model
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = SongGenDualTrackForConditionalGeneration.from_pretrained(
    ckpt_path,
    attn_implementation='sdpa').to(device)
processor = SongGenProcessor(ckpt_path, device)

# Define input text and lyrics
lyrics = "..." # The lyrics text
text = "..." # The music description text
ref_voice_path = 'path/to/your/reference_audio.wav' # Path to reference audio, optional
separate= True # Whether to separate the vocal track from the reference voice audio

model_inputs = processor(text=text, lyrics=lyrics, ref_voice_path=ref_voice_path, separate=True) 
generation = model.generate(**model_inputs,
                do_sample=True,
            )

acc_array = generation[0].cpu().numpy()
vocal_array = generation[1].cpu().numpy()
min_len =min(vocal_array.shape[0], acc_array.shape[0])
acc_array = acc_array[:min_len]
vocal_array = vocal_array[:min_len]
audio_arr = vocal_array + acc_array
sf.write("songgen_out.wav", audio_arr, model.config.sampling_rate)

❤️ Acknowledgments

This library builds on top of a number of open-source giants, to whom we'd like to extend our warmest thanks for providing these tools!

Special thanks to:

  • Parler-tts: The codebase we built upon.
  • X-Codec: The audio codec utilized in our research.
  • lp-music-caps: A project aimed at generating captions for music.

We deeply appreciate all the support we've received along the way.

☎️ Limitation and Future Work

This is a research work focused on text-to-song generation. Due to the limitations of the current training dataset, our model is currently restricted to generating English songs with a maximum duration of 30 seconds. However, despite being trained on only 2k hours of data with a 1.3B parameter model, our approach has demonstrated strong effectiveness and promising potential in generating coherent and expressive songs. We believe that scaling up both data and model size will further enhance lyrics alignment and musicality. That being said, scaling the dataset is time-consuming and challenging. We welcome collaborations and discussions to explore new ways to improve the model and extend its capabilities. For any inquiries or potential collaborations, feel free to reach out: Zihan Liu (liuzihan@pjlab.org.cn) and Jiaqi Wang (wangjiaqi@pjlab.org.cn).

✒️ Citation

If you find our work helpful for your research, please consider giving a star ⭐ and citation 📝

@misc{liu2025songgen,
      title={SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation}, 
      author={Zihan Liu and Shuangrui Ding and Zhixiong Zhang and Xiaoyi Dong and Pan Zhang and Yuhang Zang and Yuhang Cao and Dahua Lin and Jiaqi Wang},
      year={2025},
      eprint={2502.13128},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2502.13128}, 
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

0