[CVPR2025] VIRES: Video Instance Repainting via Sketch and Text Guided Generation

Official implementation of VIRES: Video Instance Repainting with Sketch and Text Guidance, accepted to CVPR 2025.

demo_video.mp4

Showcase

Input        Mask         Output
input.mp4    mask.mp4     VIRES.mp4
input.mp4    mask.mp4     VIRES.mp4

Installation

Setup Environment

For CUDA 12.1, you can install the dependencies with the following commands. Otherwise, you need to manually install torch, torchvision and xformers.

# create a virtual env and activate (conda as an example)
conda create -n vires python=3.9
conda activate vires

# download the repo
git clone https://github.com/suimuc/VIRES
cd VIRES

# install torch, torchvision and xformers
pip install -r requirements-cu121.txt

# install other packages
pip install --no-deps -r requirements.txt
pip install -v -e .
# install flash attention
# set enable_flash_attn=True in config to enable flash attention
pip install flash-attn==2.7.4.post1
# install apex
# set enable_layernorm_kernel=True and shardformer=True in config to enable apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git@24.04.01
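
As a quick sanity check after installation (optional; not part of the repository), you can confirm that the core dependencies import and that a CUDA device is visible:

# Optional sanity check: confirm torch, torchvision, and xformers import,
# and that CUDA is visible, before moving on to inference.
import torch
import torchvision
import xformers

print("torch:", torch.__version__, "| torchvision:", torchvision.__version__)
print("xformers:", xformers.__version__)
print("CUDA available:", torch.cuda.is_available())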

Inference

Model Weights

Model        Download Link
VIRES        https://huggingface.co/suimu/VIRES
VIRES-VAE    https://huggingface.co/suimu/VIRES_VAE
T5           https://huggingface.co/DeepFloyd/t5-v1_1-xxl

To run VIRES, please follow these steps:

1- Download models using huggingface-cli:

huggingface-cli download suimu/VIRES --local-dir ./checkpoints/VIRES
huggingface-cli download suimu/VIRES_VAE --local-dir ./checkpoints/VIRES_VAE
huggingface-cli download DeepFloyd/t5-v1_1-xxl --local-dir ./checkpoints/t5-v1_1-xxl
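
If you prefer Python over the CLI, the same downloads can be done with huggingface_hub's snapshot_download (a sketch equivalent to the commands above):

# Same downloads via the huggingface_hub Python API.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="suimu/VIRES", local_dir="./checkpoints/VIRES")
snapshot_download(repo_id="suimu/VIRES_VAE", local_dir="./checkpoints/VIRES_VAE")
snapshot_download(repo_id="DeepFloyd/t5-v1_1-xxl", local_dir="./checkpoints/t5-v1_1-xxl")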

2- Prepare a config file under configs/vires/inference, and make sure that the model paths for model, text_encoder, and vae in the config file match the paths of the models you just downloaded.
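
For orientation, the relevant entries look roughly like the sketch below. The exact keys come from the shipped configs/vires/inference/config.py (the from_pretrained key is an assumption based on Open-Sora-style configs); only the paths should need editing.

# Illustrative excerpt only -- verify the exact field names against
# configs/vires/inference/config.py.
model = dict(
    from_pretrained="./checkpoints/VIRES",        # assumed key name
)
vae = dict(
    from_pretrained="./checkpoints/VIRES_VAE",
)
text_encoder = dict(
    from_pretrained="./checkpoints/t5-v1_1-xxl",
)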

3- Run the following command:

The basic command-line inference is as follows:

python scripts/inference.py configs/vires/inference/config.py \
--save-dir ./outputs/ --input_video "assets/clothes_input.mp4" \
--sketch_video "assets/clothes_sketch.avi" --mask_video "assets/clothes_mask.avi" \
--prompt "The video features a man and a woman walking side by side on a paved pathway. The man is dressed in a blue jacket, with his hands clasped behind his back." \
--cfg_guidance_scale 7 \
--sampling_steps 30

To enable sequence parallelism, you need to use torchrun to run the inference script. The following command will run the inference with 2 GPUs:

torchrun --nproc_per_node 2 scripts/inference.py configs/vires/inference/config.py \
--save-dir ./outputs/ --input_video "assets/clothes_input.mp4" \
--sketch_video "assets/clothes_sketch.avi" --mask_video "assets/clothes_mask.avi" \
--prompt "The video features a man and a woman walking side by side on a paved pathway. The man is dressed in a blue jacket, with his hands clasped behind his back." \
--cfg_guidance_scale 7 \
--sampling_steps 30

4- The results will be written under the directory given by --save-dir. Note that, by default, the configuration edits only the first 51 frames of the given video (frames 0 to 50). If you want to edit an arbitrary 51-frame window instead, use the --start_frame option: after running the command with --start_frame, the terminal prompts you to enter the starting frame number, and the script then processes frames start_frame through start_frame + 50 (see the sketch below).
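
The window arithmetic is simple; a sketch (51 frames is the default window length described above):

# Editing window: entering start_frame at the prompt edits
# frames start_frame .. start_frame + 50, inclusive (51 frames total).
num_frames = 51
start_frame = 120                           # example value typed at the prompt
last_edited = start_frame + num_frames - 1  # = 170 in this example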

WebUI Demo

To run our Gradio-based web demo, prepare all the model weights, make sure the config file points to the correct model paths, and then run the following command:

python scripts/app.py configs/vires/inference/config.py

Dataset

Download VIRESET from Hugging Face. After obtaining the CSV file that lists the absolute paths of the video clips and their corresponding JSON files, pass the CSV path to VariableVideoTextGrayHintMaskDataset in opensora/datasets/datasets.py. The dataset then returns a video tensor at the specified image_size (normalized to the range -1 to 1) and a mask tensor (with values 0 or 1).
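
A hypothetical usage sketch follows; the class and module names come from the repository, but the constructor argument names are assumptions based on the description above, so check opensora/datasets/datasets.py for the real signature.

# Hypothetical usage sketch -- argument names are assumptions.
from opensora.datasets.datasets import VariableVideoTextGrayHintMaskDataset

dataset = VariableVideoTextGrayHintMaskDataset(
    data_path="/path/to/vireset.csv",  # CSV listing absolute clip/JSON paths
    image_size=(512, 512),             # assumed parameter name
)
sample = dataset[0]
# Per the description above: the video tensor is normalized to [-1, 1]
# and the mask tensor contains values 0 or 1.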

Training

To train VIRES, please follow these steps:

1- Download VIRESET (see the Dataset section above).

2- Prepare a config file under configs/vires/train, specify the CSV file path in the data_path field of the dataset dictionary and make sure the model paths for model, text_encoder, and vae in the config file match the paths of the models you just downloaded.
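
For reference, the dataset entry in the training config looks roughly like this (only data_path is named above; the other field is a placeholder to show the shape, so verify against the shipped configs/vires/train/config.py):

# Illustrative excerpt of the training config's dataset dictionary.
dataset = dict(
    type="VariableVideoTextGrayHintMaskDataset",  # assumed key/value
    data_path="/path/to/vireset.csv",             # CSV from the Dataset section
)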

3- Download the HED model for sketch generation.

wget -O "checkpoints/ControlNetHED.pth" "https://huggingface.co/lllyasviel/Annotators/resolve/main/ControlNetHED.pth"

4- Run the following command:

torchrun --standalone --nproc_per_node 8 scripts/train.py configs/vires/train/config.py --outputs your_experiment_dir

5- During training, all data, including model weights, optimizer states, and loss values logged in TensorBoard format, will be saved in your_experiment_dir. You can set the epochs, log_every, and ckpt_every parameters in the configuration file to control the number of training epochs, the loss-logging interval, and the checkpoint-saving interval, respectively.

6- If training is interrupted, you can configure the load parameter in the configuration file to resume training from a specific step. Alternatively, you can use --load in the command line.
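
These knobs are plain fields in the training config; an illustrative excerpt with example values (defaults live in configs/vires/train/config.py):

# Example values only -- the field names are the ones described above.
epochs = 10         # number of training epochs
log_every = 10      # log the loss every N steps
ckpt_every = 500    # save a checkpoint every N steps
load = None         # set to a checkpoint to resume interrupted training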

NOTE: Due to the high number of input and output channels in the 3D convolutions of the Sequential ControlNet, GPU memory usage is significantly increased. As a result, VIRES was trained on a setup of 8 H100 GPUs with 96 GB of memory each.

We recommend reducing the input and output channels of the Sequential ControlNet's 3D convolutions, found in opensora/models/stdit/vires.py, lines 168-183. When making these changes, you only need to ensure that the last out_channels of self.hint_mid_convs matches config.hidden_size.

Rest assured, reducing the channels will not significantly degrade the model's performance, but it will greatly reduce memory usage.
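
The sketch below illustrates the idea (it is not the repository's actual code): shrink the intermediate channel widths of the hint convolutions, but keep the final out_channels equal to config.hidden_size so the hint features still match the transformer width.

# Hypothetical illustration of the channel-reduction advice; the real layers
# live in opensora/models/stdit/vires.py, lines 168-183.
import torch.nn as nn

hidden_size = 1152  # example value; must equal config.hidden_size

hint_mid_convs = nn.Sequential(
    nn.Conv3d(256, 64, kernel_size=3, padding=1),         # reduced widths (assumed numbers)
    nn.SiLU(),
    nn.Conv3d(64, hidden_size, kernel_size=3, padding=1),  # last out_channels must equal hidden_size
)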

Acknowledgements

This model is a fine-tuned derivative of the Open-Sora model. Its original code and model parameters are governed by the Open-Sora LICENSE.

As a derivative work of Open-Sora, the use, distribution, and modification of this model must comply with the license terms of Open-Sora.

Citation

@inproceedings{vires,
      title={VIRES: Video Instance Repainting via Sketch and Text Guided Generation},
      author={Weng, Shuchen and Zheng, Haojie and Zhang, Peixuan and Hong, Yuchen and Jiang, Han and Li, Si and Shi, Boxin},
      booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
      pages={28416--28425},
      year={2025}
}
