
VideoTuna

🤗🤗🤗 VideoTuna is a useful codebase for text-to-video applications.
🌟 VideoTuna is, to the best of our knowledge, the first repo that integrates multiple AI video generation models for inference and fine-tuning, covering text-to-video (T2V), image-to-video (I2V), text-to-image (T2I), and video-to-video (V2V) generation.
🌟 VideoTuna is, to the best of our knowledge, the first repo that provides comprehensive video generation pipelines covering fine-tuning, pre-training, continuous training, and post-training (alignment).

🔆 Features

🌟 All-in-one framework: Inference and fine-tuning of various up-to-date pre-trained video generation models.
🌟 Continuous training: Keep improving your model with new data.
🌟 Fine-tuning: Adapt pre-trained models to specific domains.
🌟 Human preference alignment: Leverage RLHF to align with human preferences.
🌟 Post-processing: Enhance and rectify videos with a video-to-video enhancement model.

🔆 Updates

  • [2025-04-22] 🐟 Supported inference for Wan2.1 and Step Video and fine-tuning for HunyuanVideo T2V, with a unified codebase architecture.
  • [2025-02-03] 🐟 Supported automatic code formatting via PR#27. Thanks @samidarko!
  • [2025-02-01] 🐟 Migrated to Poetry for streamlined dependency and script management (PR#25). Thanks @samidarko!
  • [2025-01-20] 🐟 Supported fine-tuning for Flux-T2I.
  • [2025-01-01] 🐟 Released training for VideoVAE+ in the VideoVAEPlus repo.
  • [2025-01-01] 🐟 Supported inference for Hunyuan Video and Mochi.
  • [2024-12-24] 🐟 Released VideoVAE+: a SOTA video VAE model, now available in this repo! It achieves better video reconstruction than NVIDIA's Cosmos-Tokenizer.
  • [2024-12-01] 🐟 Supported inference for CogVideoX-1.5-T2V&I2V and Video-to-Video Enhancement from ModelScope.
  • [2024-12-01] 🐟 Supported fine-tuning for CogVideoX.
  • [2024-11-01] 🐟 🎉 Released VideoTuna v0.1.0!
    Initial support includes inference for VideoCrafter1-T2V&I2V, VideoCrafter2-T2V, DynamiCrafter-I2V, OpenSora-T2V, CogVideoX-1-2B-T2V, CogVideoX-1-T2V, Flux-T2I, and training/fine-tuning of VideoCrafter, DynamiCrafter, and Open-Sora.

🔆 Get started

1. Prepare environment

(1) If you use Linux and Conda (recommended)

conda create -n videotuna python=3.10 -y
conda activate videotuna
pip install poetry
poetry install
  • ↑ This takes around 3 minutes.
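
Once `poetry install` finishes, a quick sanity check (a minimal sketch; it assumes PyTorch is among the installed dependencies) confirms that the environment works and whether a CUDA GPU is visible:

poetry run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"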

Optional: Flash-attn installation

The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model will run in normal (slower) mode. Install flash-attn via:

poetry run install-flash-attn 
  • ↑ This takes around 1 minute.
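
To confirm the optional install succeeded, flash-attn can be imported directly (a quick check, assuming the package exposes the usual flash_attn module name and a __version__ attribute):

poetry run python -c "import flash_attn; print(flash_attn.__version__)"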

Optional: Video-to-video enhancement

poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
  • If this command ↑ gets stuck, kill it and re-run it; that should resolve the issue.
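
A bare import is enough to confirm that the enhancement dependencies are in place (a minimal check; it does not verify that the enhancement model weights have been downloaded):

poetry run python -c 'import modelscope; print("modelscope OK")'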

(2) If you use Linux and Poetry (without Conda):

Click to check instructions

Install Poetry: https://python-poetry.org/docs/#installation
Then:

poetry config virtualenvs.in-project true # optional but recommended: creates the virtual env inside the project root (.venv)
poetry config virtualenvs.create true # ensures Poetry creates a dedicated virtual env instead of reusing the current environment
poetry env use python3.10 # creates the virtual env; check with `ls -l .venv`
poetry env activate # optional: Poetry commands (e.g. `poetry install` or `poetry run <command>`) automatically use the virtual env
poetry install

Optional: Flash-attn installation

The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model will run in normal (slower) mode. Install flash-attn via:

poetry run install-flash-attn

Optional: Video-to-video enhancement

poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
  • If this command ↑ gets stuck, kill it and re-run it; that should resolve the issue.

(3) If you use macOS

Click to check instructions

On macOS with an Apple Silicon chip, use Docker Compose, because some dependencies do not support arm64 (e.g. bitsandbytes, decord, xformers).

First build:

docker compose build videotuna

To preserve the project's file permissions, set these environment variables:

export HOST_UID=$(id -u)
export HOST_GID=$(id -g)

Install dependencies:

docker compose run --remove-orphans videotuna poetry env use /usr/local/bin/python
docker compose run --remove-orphans videotuna poetry run python -m pip install --upgrade pip setuptools wheel
docker compose run --remove-orphans videotuna poetry install
docker compose run --remove-orphans videotuna poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Note: installing swissarmytransformer might hang. Just try again and it should work.

Add a dependency:

docker compose run --remove-orphans videotuna poetry add wheel

Check dependencies:

docker compose run --remove-orphans videotuna poetry run pip freeze

Run Poetry commands:

docker compose run --remove-orphans videotuna poetry run format

Start a terminal:

docker compose run -it --remove-orphans videotuna bash
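
Once the image is built and dependencies are installed, any of the Poetry entry points from the later sections can be run with the same pattern (a sketch; inference-vc2-t2v-320x512 is one of the commands from the inference tables below and assumes the corresponding checkpoints have been prepared inside the container or a mounted volume):

docker compose run --remove-orphans videotuna poetry run inference-vc2-t2v-320x512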

2. Prepare checkpoints

3. Inference of state-of-the-art T2V/I2V/T2I models

Run the following commands to run inference with these models. T2V/T2I generation automatically uses the prompts in inputs/t2v/prompts.txt, and I2V uses the images and prompts in inputs/i2v/576x1024.
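
For example, a minimal T2V run could look like the following sketch (the example prompt is hypothetical, the entry point is taken from the T2V table below, and the checkpoints from step 2 must already be in place; note that the redirect overwrites the sample prompts shipped in inputs/t2v/prompts.txt):

echo "A corgi surfing a wave at sunset" > inputs/t2v/prompts.txt
poetry run inference-vc2-t2v-320x512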

T2V

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| T2V | HunyuanVideo | `poetry run inference-hunyuan-t2v` | 129 | 720x1280 | 32 min | 60 |
| T2V | WanVideo | `poetry run inference-wanvideo-t2v-720p` | 81 | 720x1280 | 32 min | 70 |
| T2V | StepVideo | `poetry run inference-stepvideo-t2v-544x992` | 51 | 544x992 | 8 min | 61 |
| T2V | Mochi | `poetry run inference-mochi` | 84 | 480x848 | 2 min | 26 |
| T2V | CogVideoX-5b | `poetry run inference-cogvideo-t2v-diffusers` | 49 | 480x720 | 2 min | 3 |
| T2V | CogVideoX-2b | `poetry run inference-cogvideo-t2v-diffusers` | 49 | 480x720 | 2 min | 3 |
| T2V | Open Sora V1.0 | `poetry run inference-opensora-v10-16x256x256` | 16 | 256x256 | 11 s | 24 |
| T2V | VideoCrafter-V2-320x512 | `poetry run inference-vc2-t2v-320x512` | 16 | 320x512 | 26 s | 11 |
| T2V | VideoCrafter-V1-576x1024 | `poetry run inference-vc1-t2v-576x1024` | 16 | 576x1024 | 2 min | 15 |

I2V

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| I2V | WanVideo | `poetry run inference-wanvideo-i2v-720p` | 81 | 720x1280 | 28 min | 77 |
| I2V | HunyuanVideo | `poetry run inference-hunyuan-i2v-720p` | 129 | 720x1280 | 29 min | 43 |
| I2V | CogVideoX-5b-I2V | `poetry run inference-cogvideox-15-5b-i2v` | 49 | 480x720 | 5 min | 5 |
| I2V | DynamiCrafter | `poetry run inference-dc-i2v-576x1024` | 16 | 576x1024 | 2 min | 53 |
| I2V | VideoCrafter-V1 | `poetry run inference-vc1-i2v-320x512` | 16 | 320x512 | 26 s | 11 |

T2I

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|------|-------|---------|------------------|------------|----------------|-----------------|
| T2I | Flux-dev | `poetry run inference-flux-dev` | 1 | 768x1360 | 4 s | 37 |
| T2I | Flux-dev | `poetry run inference-flux-dev --enable_vae_tiling --enable_sequential_cpu_offload` | 1 | 768x1360 | 4.2 min | 2 |
| T2I | Flux-schnell | `poetry run inference-flux-schnell` | 1 | 768x1360 | 1 s | 37 |
| T2I | Flux-schnell | `poetry run inference-flux-schnell --enable_vae_tiling --enable_sequential_cpu_offload` | 1 | 768x1360 | 24 s | 2 |

4. Fine-tune T2V/I2V/T2I models

(1) Prepare dataset

Please follow docs/datasets.md to try the provided toy dataset or build your own dataset.

(2) Fine-tune

All training commands were tested on H800 80G GPUs.
T2V

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| T2V | Hunyuan Video | LoRA Fine-tune | `poetry run train-hunyuan-t2v-lora` | docs/finetune_hunyuanvideo.md | 2 |
| T2V | CogVideoX | LoRA Fine-tune | `poetry run train-cogvideox-t2v-lora` | docs/finetune_cogvideox.md | 1 |
| T2V | CogVideoX | Full Fine-tune | `poetry run train-cogvideox-t2v-fullft` | docs/finetune_cogvideox.md | 4 |
| T2V | Open-Sora v1.0 | Full Fine-tune | `poetry run train-opensorav10` | - | 1 |
| T2V | VideoCrafter | LoRA Fine-tune | `poetry run train-videocrafter-lora` | docs/finetune_videocrafter.md | 1 |
| T2V | VideoCrafter | Full Fine-tune | `poetry run train-videocrafter-v2` | docs/finetune_videocrafter.md | 1 |

I2V

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| I2V | CogVideoX | LoRA Fine-tune | `poetry run train-cogvideox-i2v-lora` | docs/finetune_cogvideox.md | 1 |
| I2V | CogVideoX | Full Fine-tune | `poetry run train-cogvideox-i2v-fullft` | docs/finetune_cogvideox.md | 4 |

T2I

| Task | Model | Mode | Command | More Details | #GPUs |
|------|-------|------|---------|--------------|-------|
| T2I | Flux | LoRA Fine-tune | `poetry run train-flux-lora` | docs/finetune_flux.md | 1 |
5. Evaluation

We support VBench evaluation of T2V generation performance. Please check eval/README.md for details.

Contribute

Git hooks

Git hooks are handled with the pre-commit library.

Hooks installation

Run the following commands to install the hooks. On each commit they will check formatting, linting, and types.

poetry run pre-commit install
poetry run pre-commit install --hook-type commit-msg

Running the hooks without committing

poetry run pre-commit run --all-files

Acknowledgement

We thank the following repos for sharing their awesome models and codes!

  • Wan2.1: Wan: Open and Advanced Large-Scale Video Generative Models.
  • HunyuanVideo: A Systematic Framework For Large Video Generation Model.
  • Step-Video: A text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames.
  • Mochi: A new SOTA in open-source video generation models
  • VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
  • VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
  • DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
  • Open-Sora: Democratizing Efficient Video Production for All
  • CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
  • VADER: Video Diffusion Alignment via Reward Gradients
  • VBench: Comprehensive Benchmark Suite for Video Generative Models
  • Flux: Text-to-image models from Black Forest Labs.
  • SimpleTuner: A fine-tuning kit for text-to-image generation.

Some Resources

🍻 Contributors

📋 License

This project is released under the CC-BY-NC-ND license. If you want a license authorization, please contact the project leads Yingqing He (yhebm@connect.ust.hk) and Yazhou Xing (yxingag@connect.ust.hk).

😊 Citation

@software{videotuna,
  author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
  title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
  month = {Nov},
  year = {2024},
  url = {https://github.com/VideoVerses/VideoTuna}
}

Star History

Star History Chart
