VideoTuna is a useful codebase for text-to-video applications.
VideoTuna is the first repo that integrates multiple AI video generation models for model inference and fine-tuning, including text-to-video (T2V), image-to-video (I2V), text-to-image (T2I), and video-to-video (V2V) generation (to the best of our knowledge).
VideoTuna is the first repo that provides comprehensive pipelines for video generation, from fine-tuning to pre-training, continuous training, and post-training (alignment) (to the best of our knowledge).
- All-in-one framework: run inference on and fine-tune various up-to-date pre-trained video generation models.
- Continuous training: keep improving your model with new data.
- Fine-tuning: adapt pre-trained models to specific domains.
- Human preference alignment: leverage RLHF to align models with human preferences.
- Post-processing: enhance and rectify videos with a video-to-video enhancement model.
- [2025-04-22] Supported inference for Wan2.1 and Step Video, and fine-tuning for HunyuanVideo T2V, with a unified codebase architecture.
- [2025-02-03] Supported automatic code formatting via PR#27. Thanks @samidarko!
- [2025-02-01] Migrated to Poetry for streamlined dependency and script management (PR#25). Thanks @samidarko!
- [2025-01-20] Supported fine-tuning for Flux-T2I.
- [2025-01-01] Released training for VideoVAE+ in the VideoVAEPlus repo.
- [2025-01-01] Supported inference for Hunyuan Video and Mochi.
- [2024-12-24] Released VideoVAE+: a SOTA video VAE model, now available in this repo! It achieves better video reconstruction than NVIDIA's Cosmos-Tokenizer.
- [2024-12-01] Supported inference for CogVideoX-1.5-T2V&I2V and Video-to-Video Enhancement from ModelScope.
- [2024-12-01] Supported fine-tuning for CogVideoX.
- [2024-11-01] Released VideoTuna v0.1.0! Initial support includes inference for VideoCrafter1-T2V&I2V, VideoCrafter2-T2V, DynamiCrafter-I2V, OpenSora-T2V, CogVideoX-1-2B-T2V, CogVideoX-1-T2V, Flux-T2I, and training/fine-tuning of VideoCrafter, DynamiCrafter, and Open-Sora.
conda create -n videotuna python=3.10 -y
conda activate videotuna
pip install poetry
poetry install
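As a quick sanity check after installation (a minimal sketch, assuming PyTorch is pulled in by `poetry install`, which the video models require), confirm the environment resolves and sees a GPU:
poetry run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"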
- It takes around 3 minutes.
Optional: Flash-attn installation
The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model will run in normal mode. Install flash-attn via:
poetry run install-flash-attn
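To verify the optional dependency was built correctly (a quick check, assuming the package is importable as `flash_attn`):
poetry run python -c "import flash_attn; print('flash-attn available')"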
- It takes around 1 minute.
Optional: Video-to-video enhancement
poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
- If this command gets stuck, kill it and re-run it; that usually resolves the issue.
Click to check instructions
Install Poetry: https://python-poetry.org/docs/#installation
Then:
poetry config virtualenvs.in-project true # optional but recommended, will ensure the virtual env is created in the project root
poetry config virtualenvs.create true # ensure Poetry creates a dedicated virtual env if none is active (enabled by default)
poetry env use python3.10 # will create the virtual env, check with `ls -l .venv`.
poetry env activate # optional because Poetry commands (e.g. `poetry install` or `poetry run <command>`) will always automatically load the virtual env.
poetry install
Optional: Flash-attn installation
The Hunyuan model uses flash-attn to reduce memory usage and speed up inference. If it is not installed, the model will run in normal mode. Install flash-attn via:
poetry run install-flash-attn
Optional: Video-to-video enhancement
poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
- If this command gets stuck, kill it and re-run it; that usually resolves the issue.
Click to check instructions
On macOS with an Apple Silicon chip, use Docker Compose, because some dependencies do not support arm64 (e.g. bitsandbytes, decord, xformers).
First build:
docker compose build videotuna
To preserve the project's file permissions, set these environment variables:
export HOST_UID=$(id -u)
export HOST_GID=$(id -g)
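Optionally, since Docker Compose also reads variables from a `.env` file in the project root, you can persist these values instead of exporting them in every shell (a small convenience sketch, assuming the compose file consumes HOST_UID/HOST_GID as the exports above imply):
echo "HOST_UID=$(id -u)" >> .env   # written once, picked up automatically by docker compose
echo "HOST_GID=$(id -g)" >> .env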
Install dependencies:
docker compose run --remove-orphans videotuna poetry env use /usr/local/bin/python
docker compose run --remove-orphans videotuna poetry run python -m pip install --upgrade pip setuptools wheel
docker compose run --remove-orphans videotuna poetry install
docker compose run --remove-orphans videotuna poetry run pip install "modelscope[cv]" -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html
Note: installing swissarmytransformer might hang. Just try again and it should work.
Add a dependency:
docker compose run --remove-orphans videotuna poetry add wheel
Check dependencies:
docker compose run --remove-orphans videotuna poetry run pip freeze
Run Poetry commands:
docker compose run --remove-orphans videotuna poetry run format
Start a terminal:
docker compose run -it --remove-orphans videotuna bash
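Inside that container shell, the usual Poetry entry points can be run directly, for example (any command from the inference tables below should work the same way):
poetry run inference-cogvideo-t2v-diffusers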
- Please follow docs/checkpoints.md to download model checkpoints.
- After downloading, place the model checkpoints as described in the Checkpoint Structure.
Run the following commands to run inference with the models. They automatically perform T2V/T2I based on the prompts in inputs/t2v/prompts.txt, and I2V based on the images and prompts in inputs/i2v/576x1024.
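For example, a minimal T2V run might look like this (a sketch, assuming the default prompt file lists one text prompt per line):
cat inputs/t2v/prompts.txt             # inspect or edit the prompts (assumed format: one prompt per line)
poetry run inference-vc2-t2v-320x512   # or any other T2V command from the table below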
T2V

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|---|---|---|---|---|---|---|
| T2V | HunyuanVideo | poetry run inference-hunyuan-t2v | 129 | 720x1280 | 32min | 60G |
| T2V | WanVideo | poetry run inference-wanvideo-t2v-720p | 81 | 720x1280 | 32min | 70G |
| T2V | StepVideo | poetry run inference-stepvideo-t2v-544x992 | 51 | 544x992 | 8min | 61G |
| T2V | Mochi | poetry run inference-mochi | 84 | 480x848 | 2min | 26G |
| T2V | CogVideoX-5b | poetry run inference-cogvideo-t2v-diffusers | 49 | 480x720 | 2min | 3G |
| T2V | CogVideoX-2b | poetry run inference-cogvideo-t2v-diffusers | 49 | 480x720 | 2min | 3G |
| T2V | Open Sora V1.0 | poetry run inference-opensora-v10-16x256x256 | 16 | 256x256 | 11s | 24G |
| T2V | VideoCrafter-V2-320x512 | poetry run inference-vc2-t2v-320x512 | 16 | 320x512 | 26s | 11G |
| T2V | VideoCrafter-V1-576x1024 | poetry run inference-vc1-t2v-576x1024 | 16 | 576x1024 | 2min | 15G |
I2V

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|---|---|---|---|---|---|---|
| I2V | WanVideo | poetry run inference-wanvideo-i2v-720p | 81 | 720x1280 | 28min | 77G |
| I2V | HunyuanVideo | poetry run inference-hunyuan-i2v-720p | 129 | 720x1280 | 29min | 43G |
| I2V | CogVideoX-5b-I2V | poetry run inference-cogvideox-15-5b-i2v | 49 | 480x720 | 5min | 5G |
| I2V | DynamiCrafter | poetry run inference-dc-i2v-576x1024 | 16 | 576x1024 | 2min | 53G |
| I2V | VideoCrafter-V1 | poetry run inference-vc1-i2v-320x512 | 16 | 320x512 | 26s | 11G |
T2I

| Task | Model | Command | Length (#Frames) | Resolution | Inference Time | GPU Memory (GB) |
|---|---|---|---|---|---|---|
| T2I | Flux-dev | poetry run inference-flux-dev | 1 | 768x1360 | 4s | 37G |
| T2I | Flux-dev | poetry run inference-flux-dev --enable_vae_tiling --enable_sequential_cpu_offload | 1 | 768x1360 | 4.2min | 2G |
| T2I | Flux-schnell | poetry run inference-flux-schnell | 1 | 768x1360 | 1s | 37G |
| T2I | Flux-schnell | poetry run inference-flux-schnell --enable_vae_tiling --enable_sequential_cpu_offload | 1 | 768x1360 | 24s | 2G |
Please follow docs/datasets.md to try the provided toy dataset or build your own datasets.
All training commands were tested on H800 80G GPUs.
T2V

| Task | Model | Mode | Command | More Details | #GPUs |
|---|---|---|---|---|---|
| T2V | Hunyuan Video | LoRA Fine-tune | poetry run train-hunyuan-t2v-lora | docs/finetune_hunyuanvideo.md | 2 |
| T2V | CogVideoX | LoRA Fine-tune | poetry run train-cogvideox-t2v-lora | docs/finetune_cogvideox.md | 1 |
| T2V | CogVideoX | Full Fine-tune | poetry run train-cogvideox-t2v-fullft | docs/finetune_cogvideox.md | 4 |
| T2V | Open-Sora v1.0 | Full Fine-tune | poetry run train-opensorav10 | - | 1 |
| T2V | VideoCrafter | LoRA Fine-tune | poetry run train-videocrafter-lora | docs/finetune_videocrafter.md | 1 |
| T2V | VideoCrafter | Full Fine-tune | poetry run train-videocrafter-v2 | docs/finetune_videocrafter.md | 1 |
I2V

| Task | Model | Mode | Command | More Details | #GPUs |
|---|---|---|---|---|---|
| I2V | CogVideoX | LoRA Fine-tune | poetry run train-cogvideox-i2v-lora | docs/finetune_cogvideox.md | 1 |
| I2V | CogVideoX | Full Fine-tune | poetry run train-cogvideox-i2v-fullft | docs/finetune_cogvideox.md | 4 |
T2I

| Task | Model | Mode | Command | More Details | #GPUs |
|---|---|---|---|---|---|
| T2I | Flux | LoRA Fine-tune | poetry run train-flux-lora | docs/finetune_flux.md | 1 |
We support VBench evaluation to evaluate the T2V generation performance. Please check eval/README.md for details.
Git hooks are managed with the pre-commit library. Run the following commands to install the hooks; they will check formatting, linting, and types on each commit.
poetry run pre-commit install
poetry run pre-commit install --hook-type commit-msg
poetry run pre-commit run --all-files
We thank the following repos for sharing their awesome models and codes!
- Wan2.1: Wan: Open and Advanced Large-Scale Video Generative Models.
- HunyuanVideo: A Systematic Framework For Large Video Generation Model.
- Step-Video: A text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames.
- Mochi: A new SOTA in open-source video generation models.
- VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors
- Open-Sora: Democratizing Efficient Video Production for All
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
- VADER: Video Diffusion Alignment via Reward Gradients
- VBench: Comprehensive Benchmark Suite for Video Generative Models
- Flux: Text-to-image models from Black Forest Labs.
- SimpleTuner: A fine-tuning kit for text-to-image generation.
- LLMs-Meet-MM-Generation: A paper collection of utilizing LLMs for multimodal generation (image, video, 3D and audio).
- MMTrail: A multimodal trailer video dataset with language and music descriptions.
- Seeing-and-Hearing: A versatile framework for Joint VA generation, V2A, A2V, and I2A.
- Self-Cascade: A Self-Cascade model for higher-resolution image and video generation.
- ScaleCrafter and HiPrompt: Free method for higher-resolution image and video generation.
- FreeTraj and FreeNoise: Free method for video trajectory control and longer-video generation.
- Follow-Your-Emoji, Follow-Your-Click, and Follow-Your-Pose: Follow family for controllable video generation.
- Animate-A-Story: A framework for storytelling video generation.
- LVDM: Latent Video Diffusion Model for long video generation and text-to-video generation.
Please follow the CC-BY-NC-ND license. For license authorization beyond its terms, please contact the project leads Yingqing He (yhebm@connect.ust.hk) and Yazhou Xing (yxingag@connect.ust.hk).
@software{videotuna,
author = {Yingqing He and Yazhou Xing and Zhefan Rao and Haoyu Wu and Zhaoyang Liu and Jingye Chen and Pengjun Fang and Jiajun Li and Liya Ji and Runtao Liu and Xiaowei Chi and Yang Fei and Guocheng Shao and Yue Ma and Qifeng Chen},
title = {VideoTuna: A Powerful Toolkit for Video Generation with Model Fine-Tuning and Post-Training},
month = {Nov},
year = {2024},
url = {https://github.com/VideoVerses/VideoTuna}
}