Official implementation of VIRES: Video Instance Repainting with Sketch and Text Guidance, accepted to CVPR 2025.
Demo video: `demo_video.mp4`

Input | Mask | Output |
---|---|---|
`input.mp4` | `mask.mp4` | `VIRES.mp4` |
`input.mp4` | `mask.mp4` | `VIRES.mp4` |
For CUDA 12.1, you can install the dependencies with the following commands. Otherwise, you need to manually install `torch`, `torchvision`, and `xformers`.
```bash
# create a virtual env and activate (conda as an example)
conda create -n vires python=3.9
conda activate vires

# download the repo
git clone https://github.com/suimuc/VIRES
cd VIRES

# install torch, torchvision and xformers
pip install -r requirements-cu121.txt

# install other packages
pip install --no-deps -r requirements.txt
pip install -v -e .

# install flash attention
# set enable_flash_attn=True in config to enable flash attention
pip install flash-attn==2.7.4.post1

# install apex
# set enable_layernorm_kernel=True and shardformer=True in config to enable apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git@24.04.01
```
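As an optional sanity check (not part of the repo), the following snippet verifies that the key packages import and that CUDA is visible:

```python
# Optional environment check; not part of the VIRES repo.
import torch
import torchvision
import xformers

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("torchvision:", torchvision.__version__)
print("xformers:", xformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```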
Model | Download Link |
---|---|
VIRES | [🤗 HuggingFace](https://huggingface.co/suimu/VIRES) |
VIRES-VAE | [🤗 HuggingFace](https://huggingface.co/suimu/VIRES_VAE) |
T5 | [🤗 HuggingFace](https://huggingface.co/DeepFloyd/t5-v1_1-xxl) |
To run VIRES, please follow these steps:
1- Download models using huggingface-cli:
```bash
huggingface-cli download suimu/VIRES --local-dir ./checkpoints/VIRES
huggingface-cli download suimu/VIRES_VAE --local-dir ./checkpoints/VIRES_VAE
huggingface-cli download DeepFloyd/t5-v1_1-xxl --local-dir ./checkpoints/t5-v1_1-xxl
```
2- Prepare a config file under `configs/vires/inference`, and make sure that the model paths for `model`, `text_encoder`, and `vae` in the config file match the paths of the models you just downloaded; a rough sketch of these fields is shown below.
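For reference, the relevant fields typically look roughly like this sketch. Only the `from_pretrained` paths correspond to the downloads above; other keys (such as the model type) are omitted here, and the exact layout may differ in the shipped config:

```python
# Illustrative fragment of an inference config; see configs/vires/inference for the real file.
model = dict(
    from_pretrained="./checkpoints/VIRES",        # VIRES weights
    enable_flash_attn=False,                      # set True if flash-attn is installed
    enable_layernorm_kernel=False,                # set True if apex is installed
)
vae = dict(
    from_pretrained="./checkpoints/VIRES_VAE",    # VIRES-VAE weights
)
text_encoder = dict(
    from_pretrained="./checkpoints/t5-v1_1-xxl",  # T5 text encoder
)
```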
3- Run the inference script. The basic command-line invocation is as follows:
```bash
python scripts/inference.py configs/vires/inference/config.py \
    --save-dir ./outputs/ --input_video "assets/clothes_input.mp4" \
    --sketch_video "assets/clothes_sketch.avi" --mask_video "assets/clothes_mask.avi" \
    --prompt "The video features a man and a woman walking side by side on a paved pathway. The man is dressed in a blue jacket, with his hands clasped behind his back." \
    --cfg_guidance_scale 7 \
    --sampling_steps 30
```
To enable sequence parallelism, you need to use torchrun to run the inference script. The following command will run the inference with 2 GPUs:
```bash
torchrun --nproc_per_node 2 scripts/inference.py configs/vires/inference/config.py \
    --save-dir ./outputs/ --input_video "assets/clothes_input.mp4" \
    --sketch_video "assets/clothes_sketch.avi" --mask_video "assets/clothes_mask.avi" \
    --prompt "The video features a man and a woman walking side by side on a paved pathway. The man is dressed in a blue jacket, with his hands clasped behind his back." \
    --cfg_guidance_scale 7 \
    --sampling_steps 30
```
4- The results will be generated under the `save-dir` directory. Note that by default, the configuration only edits the first 51 frames of the given video (frames 0 to 50). If you want to edit an arbitrary window of 51 frames, use the `--start_frame` option. When the command is run with `--start_frame`, the terminal will prompt you to enter the starting frame number, and the script will then process frames from `start_frame` to `start_frame + 50` (for example, entering 100 edits frames 100 to 150).
To run our Gradio-based web demo, prepare all the model weights, ensure the config file has the correct model paths, and then run the following command:
```bash
python scripts/app.py configs/vires/inference/config.py
```
Download VIRESET from Hugging Face. After obtaining the CSV file that contains the absolute paths of the video clips and their corresponding JSON files, provide the path of the CSV file to the `VariableVideoTextGrayHintMaskDataset` in `opensora/datasets/datasets.py`. The dataset will then return a video tensor of the specified `image_size` (normalized between -1 and 1) and a mask tensor (with values of 0 or 1).
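As a rough illustration of how a sample can be inspected, the sketch below constructs the dataset directly; the constructor arguments (`data_path`, `num_frames`, `image_size`) and the sample key names are assumptions, so check the class definition for the exact signature:

```python
# Hypothetical usage sketch; argument and key names may differ from the actual class.
from opensora.datasets.datasets import VariableVideoTextGrayHintMaskDataset

dataset = VariableVideoTextGrayHintMaskDataset(
    data_path="/path/to/vireset.csv",  # CSV with absolute paths to clips and JSON files
    num_frames=51,
    image_size=(512, 512),
)

sample = dataset[0]
# Per the description above: video normalized to [-1, 1], mask with values {0, 1}.
print(sample["video"].shape, sample["video"].min(), sample["video"].max())
print(sample["mask"].shape, sample["mask"].unique())
```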
To train VIRES, please follow these steps:
1- Download VIRESET as described in the Dataset section above.
2- Prepare a config file under `configs/vires/train`, specify the CSV file path in the `data_path` field of the dataset dictionary, and make sure the model paths for `model`, `text_encoder`, and `vae` in the config file match the paths of the models you just downloaded; an illustrative dataset entry is sketched below.
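The dataset entry in the training config might look roughly like this (only `data_path` and the class name are taken from the text above; the remaining keys are illustrative):

```python
# Illustrative dataset entry for the training config.
dataset = dict(
    type="VariableVideoTextGrayHintMaskDataset",
    data_path="/path/to/vireset.csv",  # CSV prepared in the Dataset step
    num_frames=51,                     # assumption: matches the 51-frame editing window
    image_size=(512, 512),             # assumption: target resolution
)
```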
3- Download the HED model for sketch generation.
```bash
wget -O "checkpoints/ControlNetHED.pth" "https://huggingface.co/lllyasviel/Annotators/resolve/main/ControlNetHED.pth"
```
4- Run the following command:
```bash
torchrun --standalone --nproc_per_node 8 scripts/train.py configs/vires/train/config.py --outputs your_experiment_dir
```
5- During training, all data, including model weights, optimizer states, and loss values logged in TensorBoard format, will be saved in `your_experiment_dir`. You can configure the parameters `epochs`, `log_every`, and `ckpt_every` in the configuration file to specify the number of training epochs, the interval for logging the loss, and the interval for saving checkpoints, respectively.
6- If training is interrupted, you can set the `load` parameter in the configuration file to resume training from a specific step. Alternatively, you can pass `--load` on the command line. An illustrative config fragment covering these parameters follows.
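For example, these parameters might appear in the config roughly as follows (all values are placeholders, not recommended settings):

```python
# Illustrative config fragment; values are placeholders.
epochs = 100      # number of training epochs
log_every = 10    # interval (in steps) for logging the loss
ckpt_every = 500  # interval (in steps) for saving checkpoints
load = None       # set to a checkpoint saved under your_experiment_dir to resume training
```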
NOTE: Due to the high number of input and output channels in the 3D convolutions of the Sequential ControlNet, GPU memory usage is significantly increased. As a result, VIRES was trained on a setup with 8 H100 GPUs (96 GB each). We recommend that users reduce the input and output channels of the Sequential ControlNet's 3D convolutions, which can be found in the `opensora/models/stdit/vires.py` file, lines 168-183. When making these changes, you only need to ensure that the last `out_channels` of `self.hint_mid_convs` matches `config.hidden_size`. Rest assured, reducing the channels will not significantly degrade the model's performance, but it will greatly reduce memory usage.
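As an illustration only, a reduced-channel convolution stack might look like the sketch below; the actual layer definitions, channel counts, and activations in `opensora/models/stdit/vires.py` differ, and only the constraint on the final `out_channels` is taken from the note above:

```python
# Hypothetical sketch of reduced-channel 3D convolutions; the real layers in
# opensora/models/stdit/vires.py (lines 168-183) use different definitions.
import torch.nn as nn

hidden_size = 1152  # stands in for config.hidden_size

# In the model this would be assigned to self.hint_mid_convs.
hint_mid_convs = nn.Sequential(
    nn.Conv3d(64, 128, kernel_size=3, padding=1),           # reduced intermediate channels
    nn.SiLU(),
    nn.Conv3d(128, hidden_size, kernel_size=3, padding=1),  # last out_channels must equal config.hidden_size
)
```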
This model is a fine-tuned derivative of the Open-Sora model. Its original code and model parameters are governed by the Open-Sora LICENSE.
As a derivative work of Open-Sora, the use, distribution, and modification of this model must comply with the license terms of Open-Sora.
```bibtex
@inproceedings{vires,
  title={VIRES: Video Instance Repainting via Sketch and Text Guided Generation},
  author={Weng, Shuchen and Zheng, Haojie and Zhang, Peixuan and Hong, Yuchen and Jiang, Han and Li, Si and Shi, Boxin},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={28416--28425},
  year={2025}
}
```