SVD Xtend

Stable Video Diffusion Training Code and Extensions 🚀

SVD_Xtend Project Documentation

Introduction

This document outlines how to set up and run the SVD_Xtend project: preparing the environment, downloading a dataset, training the Stable Video Diffusion model with LoRA, and running inference.

Environment Setup

Create and Activate Conda Environment

conda create --name py310 python=3.10
conda activate py310
pip install ipykernel
python -m ipykernel install --user --name py310 --display-name "py310"

Install Required Packages

sudo apt-get update
sudo apt-get install git-lfs ffmpeg cbm
pip install -U diffusers transformers accelerate huggingface_hub torch peft sentencepiece "httpx[socks]" opencv-python einops
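
As an optional sanity check (a minimal sketch, not part of the repository's instructions), the following snippet confirms that the key libraries import and that a CUDA GPU is visible:

import torch
import diffusers
import transformers

# Print versions and GPU availability before moving on to training.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)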

Dataset Preparation

Clone the Repository and Download Dataset

git clone https://github.com/svjack/SVD_Xtend
cd SVD_Xtend
wget http://dl.yf.io/bdd100k/mot20/images20-track-train-1.zip
unzip images20-track-train-1.zip

Verify Dataset Structure

# Run in a Jupyter notebook; the "!" lines are shell commands.
from train_svd_lora import *
!ls bdd100k/images/track/train
!ls bdd100k/images/track/train/0000f77c-6257be58/

Visualization

Convert Video Frames to GIF

import os
from PIL import Image
from diffusers.utils import export_to_gif

# Load the sorted frames of one clip and write them out as a GIF.
folder_path = "bdd100k/images/track/train/0000f77c-6257be58/"
frame_names = sorted(os.listdir(folder_path))
export_to_gif(
    [Image.open(os.path.join(folder_path, name)) for name in frame_names],
    "0000f77c-6257be58.gif",
    fps=1,
)

from IPython import display
display.Image("0000f77c-6257be58.gif")

Training

Login to Hugging Face

huggingface-cli login

Launch Training

accelerate launch train_svd_lora.py \
--base_folder bdd100k/images/track/train \
--pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt-1-1 \
--per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
--max_train_steps=100 \
--width=512 \
--height=320 \
--checkpointing_steps=50 --checkpoints_total_limit=5 \
--learning_rate=1e-5 --lr_warmup_steps=0 \
--seed=123 \
--mixed_precision="fp16" \
--validation_steps=20

Inference

Inference on Original Model

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video, export_to_gif

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16, variant="fp16", local_files_only=True,
)
pipe.to("cuda:0")

image = load_image('bdd100k/images/track/train/0000f77c-6257be58/0000f77c-6257be58-0000001.jpg')
image = image.resize((1024, 576))

generator = torch.manual_seed(-1)
with torch.inference_mode():
    frames = pipe(image,
                num_frames=14,
                width=1024,
                height=576,
                decode_chunk_size=8, generator=generator, motion_bucket_id=127, fps=8, num_inference_steps=30).frames[0]

export_to_gif(frames, "0000f77c-6257be58-0000001_generated_ori.gif", fps=7)
from IPython import display
display.Image("0000f77c-6257be58-0000001_generated_ori.gif")

Inference on LoRA-Tuned UNet

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video, export_to_gif

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    subfolder="unet",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=False,
)
lora_folder = "outputs/pytorch_lora_weights.safetensors"
unet.load_attn_procs(lora_folder)
unet.to(torch.float16)
unet.requires_grad_(False)

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    unet=unet,
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16, variant="fp16", local_files_only=True,
)
pipe.to("cuda:0")

image = load_image('bdd100k/images/track/train/0000f77c-6257be58/0000f77c-6257be58-0000001.jpg')
image = image.resize((1024, 576))

generator = torch.manual_seed(-1)
with torch.inference_mode():
    frames = pipe(image,
                num_frames=14,
                width=1024,
                height=576,
                decode_chunk_size=8, generator=generator, motion_bucket_id=127, fps=8, num_inference_steps=30).frames[0]
export_to_gif(frames, "0000f77c-6257be58-0000001_generated_lora.gif", fps=7)
from IPython import display
display.Image("0000f77c-6257be58-0000001_generated_lora.gif")

Genshin Impact Building Example

Overview

This README provides instructions and code snippets for training and generating video frames using the Stable Video Diffusion model, specifically focusing on the process of fine-tuning with LoRA (Low-Rank Adaptation). Additionally, it includes insights and adjustments for enhancing the dynamic intensity of generated videos.

Preparing the Dataset (from https://github.com/svjack/katna)

  1. Copy the initial frame image to a demo file:
    cp ../genshin_frame_V2/BV12YDaYME9Z_interval_videos_interval_0_folder_0/frame_000000.png demo.jpg

Training Command

The following command trains the model with the default LoRA rank of 4:

accelerate launch train_svd_lora.py \
--base_folder ../genshin_frame_V2_tgt \
--pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt-1-1 \
--per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
--max_train_steps=1000 \
--width=512 \
--height=320 \
--checkpointing_steps=100 --checkpoints_total_limit=10 \
--learning_rate=1e-5 --lr_warmup_steps=0 \
--seed=123 \
--mixed_precision="fp16" \
--validation_steps=100

Displaying Images

To display an image in a Jupyter notebook:

from IPython import display
display.Image("莫娜.png")

(Input image 莫娜.png: Mona from Genshin Impact.)

Results Before Fine-Tuning

Code Snippet

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video, export_to_gif

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16, variant="fp16", local_files_only=True,
)
pipe.to("cuda:0")
print("Done")

image = load_image('莫娜.png')
image = image.resize((1024, 576))

generator = torch.manual_seed(-1)
with torch.inference_mode():
    frames = pipe(image,
                num_frames=14,
                width=1024,
                height=576,
                decode_chunk_size=8, generator=generator, fps=8, num_inference_steps=30).frames[0]

export_to_gif(frames, "demo_generated_ori.gif")
from IPython import display
display.Image("demo_generated_ori.gif")

(Generated result: demo_generated_ori.gif)

Results After Fine-Tuning

Code Snippet

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video, export_to_gif

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    subfolder="unet",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=False,
)
lora_folder = "outputs/checkpoint-200/"
lora_folder = "outputs/"
unet.load_attn_procs(lora_folder)
unet.to(torch.float16)
unet.requires_grad_(False)

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    unet=unet,
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16, variant="fp16", local_files_only=True,
)
pipe.to("cuda:0")
print("Done")

image = load_image('莫娜.png')
image = image.resize((1024, 576))

generator = torch.manual_seed(-1)
with torch.inference_mode():
    frames = pipe(image,
                num_frames=14,
                width=1024,
                height=576,
                decode_chunk_size=8, generator=generator, fps=8, num_inference_steps=30).frames[0]

export_to_gif(frames, "demo_generated_lora.gif")
from IPython import display
display.Image("demo_generated_lora.gif")

(Generated result: demo_generated_lora.gif)

Enhancing Dynamic Intensity

Insights from Issue #56

  • Original Model Dynamics: The original model also produces some static results, so static frames are not solely due to fine-tuning. However, LoRA training tends to make the generated frames more stable (i.e., more static).
  • noise_aug_strength:
    • Default Value: 0.02
    • Effect: Increasing this value (e.g., to 0.5) can introduce more motion, but may have side effects (e.g., the output drifting from the input image).
  • motion_bucket_id:
    • Default Value: 127
    • Effect: Higher values increase the amount of motion in the video.
  • Checkpoint Analysis:
    • Observation: The step-100 checkpoint produces fewer static frames than the step-1000 checkpoint, especially for inputs that were dynamic before fine-tuning (see the loading sketch below).
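
Based on that observation, one option is to load the LoRA attention weights from an earlier checkpoint rather than the final weights. A minimal sketch, assuming training saved a folder such as outputs/checkpoint-100/ (the exact folder name depends on your checkpointing_steps):

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline

# Base UNet, then LoRA attention weights from an earlier (less static) checkpoint.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    subfolder="unet",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=False,
)
unet.load_attn_procs("outputs/checkpoint-100/")  # assumed checkpoint folder
unet.to(torch.float16)
unet.requires_grad_(False)

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    unet=unet,
    torch_dtype=torch.float16, variant="fp16",
)
pipe.to("cuda:0")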

Parameters for Dynamic Intensity

  1. motion_bucket_id:
    • Effect: Controls the amount of motion in the video. Higher values increase motion.
    • Default: 127
  2. noise_aug_strength:
    • Effect: Controls the noise added to the initial image. Higher values increase dynamic intensity.
    • Default: 0.02
  3. num_frames:
    • Effect: Number of frames in the generated video. More frames may increase dynamic intensity.
    • Default: Depends on model configuration (14 or 25)
  4. fps:
    • Effect: Frames per second. Higher values may increase dynamic intensity.
    • Default: 7
  5. num_inference_steps:
    • Effect: Number of denoising steps. More steps may increase dynamic intensity.
    • Default: 25

Summary

  • motion_bucket_id and noise_aug_strength directly influence dynamic intensity.
  • num_frames, fps, and num_inference_steps may indirectly influence dynamic intensity.

To increase the dynamic intensity of generated videos, consider increasing the values of motion_bucket_id and noise_aug_strength.
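
For example, here is a minimal sketch that reuses the pipe loaded in the previous snippets, with both values raised (the exact numbers are illustrative, not tuned):

import torch
from diffusers.utils import load_image, export_to_gif

# Assumes `pipe` is the StableVideoDiffusionPipeline already loaded above.
image = load_image('莫娜.png').resize((1024, 576))
generator = torch.manual_seed(42)

with torch.inference_mode():
    frames = pipe(
        image,
        num_frames=14,
        width=1024,
        height=576,
        decode_chunk_size=8,
        generator=generator,
        motion_bucket_id=180,     # default 127; higher values add more motion
        noise_aug_strength=0.1,   # default 0.02; more motion, but may drift from the input image
        fps=8,
        num_inference_steps=30,
    ).frames[0]

export_to_gif(frames, "demo_generated_more_motion.gif")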

💡 Highlight

  • Fine-tuning SVD. See Part 1.
  • Tracklet-Conditioned Video Generation. Building upon SVD, you can control the movement of objects using tracklets (bounding boxes). See Part 2.

Part 1: Training

Comparison

size=(512, 320), motion_bucket_id=127, fps=7, noise_aug_strength=0.00
generator=torch.manual_seed(111)
(Comparison table: init image | before fine-tuning | after fine-tuning, shown for four sample clips; GIFs omitted here.)

Video Data Processing

Note that BDD100K is a driving video/image dataset, but it is not a requirement for training: any collection of videos can be used. Please refer to the DummyDataset data-reading logic; in short, you only need to modify self.base_folder and arrange your videos in the following file structure (a minimal reading sketch follows the tree):

self.base_folder
    ├── video_name1
    │   ├── video_frame1
    │   ├── video_frame2
    │   ...
    ├── video_name2
    │   ├── video_frame1
        ├── ...
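
For reference, here is a minimal sketch of reading such a layout (a hypothetical FrameFolderDataset, not the repository's DummyDataset): each subfolder of base_folder is treated as one clip, and a window of consecutive frames is loaded per sample. It returns PIL frames only, for illustration.

import os
import random

from PIL import Image
from torch.utils.data import Dataset


# Hypothetical illustration; not part of the repository.
class FrameFolderDataset(Dataset):
    """Reads clips laid out as base_folder/<video_name>/<frame files>."""

    def __init__(self, base_folder, num_frames=14, size=(512, 320)):
        self.base_folder = base_folder
        self.num_frames = num_frames
        self.size = size
        # One subfolder per video clip.
        self.video_dirs = sorted(
            d for d in os.listdir(base_folder)
            if os.path.isdir(os.path.join(base_folder, d))
        )

    def __len__(self):
        return len(self.video_dirs)

    def __getitem__(self, idx):
        video_dir = os.path.join(self.base_folder, self.video_dirs[idx])
        frame_names = sorted(os.listdir(video_dir))
        # Sample a random window of num_frames consecutive frames.
        start = random.randint(0, max(0, len(frame_names) - self.num_frames))
        window = frame_names[start:start + self.num_frames]
        return [
            Image.open(os.path.join(video_dir, name)).convert("RGB").resize(self.size)
            for name in window
        ]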

Training Configuration (on the BDD100K dataset)

This training configuration is for reference only: all UNet parameters were set to be trainable during training, with a learning rate of 1e-5.

accelerate launch train_svd.py \
    --pretrained_model_name_or_path=/path/to/weight \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=50000 \
    --width=512 \
    --height=320 \
    --checkpointing_steps=1000 --checkpoints_total_limit=1 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=200

Part 2: Tracklet2Video

Tracklet2Video

We have attempted to incorporate layout control on top of img2video, which makes the motion of objects more controllable, as demonstrated in the examples below. The code and weights will be released soon. Note that we use a resolution of 512×320 for SVD to generate videos, so the generated videos appear to be of poor quality (which is somewhat unfair to SVD); our intention is to demonstrate the effectiveness of tracklet control, and we will address the video quality issue as soon as possible.

(Comparison: init image | video generated by SVD | video generated by ours, shown for two examples; GIFs omitted here.)

Methods

We have utilized the Self-Tracking training from Boximator and the Instance-Enhancer from TrackDiffusion. For more details, please refer to the paper.

🏷️ TODO List

  • Support text2video (WIP)
  • Support more conditional inputs, such as layout

♥️ Acknowledgement

Our model builds on Diffusers and Stability AI's Stable Video Diffusion. Thanks for their great work!

Thanks Boximator and GLIGEN for their awesome models.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{li2023trackdiffusion,
  title={Trackdiffusion: Multi-object tracking data generation via diffusion models},
  author={Li, Pengxiang and Liu, Zhili and Chen, Kai and Hong, Lanqing and Zhuge, Yunzhi and Yeung, Dit-Yan and Lu, Huchuan and Jia, Xu},
  journal={arXiv preprint arXiv:2312.00651},
  year={2023}
}
