SVD Xtend

Stable Video Diffusion Training Code and Extensions 🚀

SVD_Xtend Project Documentation

Introduction

This document outlines how to set up and run the SVD_Xtend project: preparing the environment, downloading a dataset, training the Stable Video Diffusion model with LoRA, and running inference.

Environment Setup

Create and Activate Conda Environment

conda create --name py310 python=3.10
conda activate py310
pip install ipykernel
python -m ipykernel install --user --name py310 --display-name "py310"

Install Required Packages

sudo apt-get update
sudo apt-get install git-lfs ffmpeg cbm
pip install -U diffusers transformers accelerate huggingface_hub torch peft sentencepiece "httpx[socks]" opencv-python einops
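
As an optional sanity check (a minimal sketch, not part of the repository's instructions), the following snippet confirms that the key libraries import and that a CUDA GPU is visible:

import torch
import diffusers
import transformers

# Print versions and GPU availability before moving on to training.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("diffusers:", diffusers.__version__)
print("transformers:", transformers.__version__)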

Dataset Preparation

Clone the Repository and Download Dataset

git clone https://github.com/svjack/SVD_Xtend
cd SVD_Xtend
wget http://dl.yf.io/bdd100k/mot20/images20-track-train-1.zip
unzip images20-track-train-1.zip

Verify Dataset Structure

# Run in a Jupyter notebook; the "!" lines are shell commands.
from train_svd_lora import *
!ls bdd100k/images/track/train
!ls bdd100k/images/track/train/0000f77c-6257be58/

Visualization

Convert Video Frames to GIF

import os
from PIL import Image
from diffusers.utils import export_to_gif

# Load the sorted frames of one clip and write them out as a GIF.
folder_path = "bdd100k/images/track/train/0000f77c-6257be58/"
frame_names = sorted(os.listdir(folder_path))
export_to_gif(
    [Image.open(os.path.join(folder_path, name)) for name in frame_names],
    "0000f77c-6257be58.gif",
    fps=1,
)

from IPython import display
display.Image("0000f77c-6257be58.gif")

Training

Login to Hugging Face

huggingface-cli login

Launch Training

accelerate launch train_svd_lora.py \
--base_folder bdd100k/images/track/train \
--pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt-1-1 \
--per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
--max_train_steps=100 \
--width=512 \
--height=320 \
--checkpointing_steps=50 --checkpoints_total_limit=5 \
--learning_rate=1e-5 --lr_warmup_steps=0 \
--seed=123 \
--mixed_precision="fp16" \
--validation_steps=20

Inference

Inference on Original Model

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video, export_to_gif

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16, variant="fp16", local_files_only=True,
)
pipe.to("cuda:0")

image = load_image('bdd100k/images/track/train/0000f77c-6257be58/0000f77c-6257be58-0000001.jpg')
image = image.resize((1024, 576))

generator = torch.manual_seed(-1)
with torch.inference_mode():
    frames = pipe(image,
                num_frames=14,
                width=1024,
                height=576,
                decode_chunk_size=8, generator=generator, motion_bucket_id=127, fps=8, num_inference_steps=30).frames[0]

export_to_gif(frames, "0000f77c-6257be58-0000001_generated_ori.gif", fps=7)
from IPython import display
display.Image("0000f77c-6257be58-0000001_generated_ori.gif")

Inference on LoRA-Tuned UNet

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video, export_to_gif

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    subfolder="unet",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=False,
)
lora_folder = "outputs/pytorch_lora_weights.safetensors"
unet.load_attn_procs(lora_folder)
unet.to(torch.float16)
unet.requires_grad_(False)

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    unet=unet,
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16, variant="fp16", local_files_only=True,
)
pipe.to("cuda:0")

image = load_image('bdd100k/images/track/train/0000f77c-6257be58/0000f77c-6257be58-0000001.jpg')
image = image.resize((1024, 576))

generator = torch.manual_seed(-1)
with torch.inference_mode():
    frames = pipe(image,
                num_frames=14,
                width=1024,
                height=576,
                decode_chunk_size=8, generator=generator, motion_bucket_id=127, fps=8, num_inference_steps=30).frames[0]
export_to_gif(frames, "0000f77c-6257be58-0000001_generated_lora.gif", fps=7)
from IPython import display
display.Image("0000f77c-6257be58-0000001_generated_lora.gif")

Genshin Impact Building Example

Overview

This README provides instructions and code snippets for training and generating video frames using the Stable Video Diffusion model, specifically focusing on the process of fine-tuning with LoRA (Low-Rank Adaptation). Additionally, it includes insights and adjustments for enhancing the dynamic intensity of generated videos.

Preparing the Dataset (from https://github.com/svjack/katna)

  1. Copy the initial frame image to a demo file:
    cp ../genshin_frame_V2/BV12YDaYME9Z_interval_videos_interval_0_folder_0/frame_000000.png demo.jpg

Training Command

The following command trains the model with the default LoRA rank of 4:

accelerate launch train_svd_lora.py \
--base_folder ../genshin_frame_V2_tgt \
--pretrained_model_name_or_path=stabilityai/stable-video-diffusion-img2vid-xt-1-1 \
--per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
--max_train_steps=1000 \
--width=512 \
--height=320 \
--checkpointing_steps=100 --checkpoints_total_limit=10 \
--learning_rate=1e-5 --lr_warmup_steps=0 \
--seed=123 \
--mixed_precision="fp16" \
--validation_steps=100

Displaying Images

To display an image in a Jupyter notebook:

from IPython import display
display.Image("莫娜.png")

(Input image 莫娜.png: Mona from Genshin Impact.)

Results Before Fine-Tuning

Code Snippet

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video, export_to_gif

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16, variant="fp16", local_files_only=True,
)
pipe.to("cuda:0")
print("Done")

image = load_image('莫娜.png')
image = image.resize((1024, 576))

generator = torch.manual_seed(-1)
with torch.inference_mode():
    frames = pipe(image,
                num_frames=14,
                width=1024,
                height=576,
                decode_chunk_size=8, generator=generator, fps=8, num_inference_steps=30).frames[0]

export_to_gif(frames, "demo_generated_ori.gif")
from IPython import display
display.Image("demo_generated_ori.gif")

(Generated result: demo_generated_ori.gif)

Results After Fine-Tuning

Code Snippet

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video, export_to_gif

unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    subfolder="unet",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=False,
)
lora_folder = "outputs/checkpoint-200/"
lora_folder = "outputs/"
unet.load_attn_procs(lora_folder)
unet.to(torch.float16)
unet.requires_grad_(False)

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    unet=unet,
    low_cpu_mem_usage=False,
    torch_dtype=torch.float16, variant="fp16", local_files_only=True,
)
pipe.to("cuda:0")
print("Done")

image = load_image('莫娜.png')
image = image.resize((1024, 576))

generator = torch.manual_seed(-1)
with torch.inference_mode():
    frames = pipe(image,
                num_frames=14,
                width=1024,
                height=576,
                decode_chunk_size=8, generator=generator, fps=8, num_inference_steps=30).frames[0]

export_to_gif(frames, "demo_generated_lora.gif")
from IPython import display
display.Image("demo_generated_lora.gif")

(Generated result: demo_generated_lora.gif)

Enhancing Dynamic Intensity

Insights from Issue #56

  • Original Model Dynamics: The original model also produces some static results, so static frames are not solely due to fine-tuning. However, LoRA training tends to make the generated frames more stable (i.e., more static).
  • noise_aug_strength:
    • Default Value: 0.02
    • Effect: Increasing this value (e.g., to 0.5) can introduce more motion, but may have side effects (e.g., the output drifting from the input image).
  • motion_bucket_id:
    • Default Value: 127
    • Effect: Higher values increase the amount of motion in the video.
  • Checkpoint Analysis:
    • Observation: The step-100 checkpoint produces fewer static frames than the step-1000 checkpoint, especially for inputs that were dynamic before fine-tuning (see the loading sketch below).
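
Based on that observation, one option is to load the LoRA attention weights from an earlier checkpoint rather than the final weights. A minimal sketch, assuming training saved a folder such as outputs/checkpoint-100/ (the exact folder name depends on your checkpointing_steps):

import torch
from diffusers import UNetSpatioTemporalConditionModel, StableVideoDiffusionPipeline

# Base UNet, then LoRA attention weights from an earlier (less static) checkpoint.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    subfolder="unet",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=False,
)
unet.load_attn_procs("outputs/checkpoint-100/")  # assumed checkpoint folder
unet.to(torch.float16)
unet.requires_grad_(False)

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt-1-1",
    unet=unet,
    torch_dtype=torch.float16, variant="fp16",
)
pipe.to("cuda:0")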

Parameters for Dynamic Intensity

  1. motion_bucket_id:
    • Effect: Controls the amount of motion in the video. Higher values increase motion.
    • Default: 127
  2. noise_aug_strength:
    • Effect: Controls the noise added to the initial image. Higher values increase dynamic intensity.
    • Default: 0.02
  3. num_frames:
    • Effect: Number of frames in the generated video. More frames may increase dynamic intensity.
    • Default: Depends on model configuration (14 or 25)
  4. fps:
    • Effect: Frames per second. Higher values may increase dynamic intensity.
    • Default: 7
  5. num_inference_steps:
    • Effect: Number of denoising steps. More steps may increase dynamic intensity.
    • Default: 25

Summary

  • motion_bucket_id and noise_aug_strength directly influence dynamic intensity.
  • num_frames, fps, and num_inference_steps may indirectly influence dynamic intensity.

To increase the dynamic intensity of generated videos, consider increasing the values of motion_bucket_id and noise_aug_strength.
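
For example, here is a minimal sketch that reuses the pipe loaded in the previous snippets, with both values raised (the exact numbers are illustrative, not tuned):

import torch
from diffusers.utils import load_image, export_to_gif

# Assumes `pipe` is the StableVideoDiffusionPipeline already loaded above.
image = load_image('莫娜.png').resize((1024, 576))
generator = torch.manual_seed(42)

with torch.inference_mode():
    frames = pipe(
        image,
        num_frames=14,
        width=1024,
        height=576,
        decode_chunk_size=8,
        generator=generator,
        motion_bucket_id=180,     # default 127; higher values add more motion
        noise_aug_strength=0.1,   # default 0.02; more motion, but may drift from the input image
        fps=8,
        num_inference_steps=30,
    ).frames[0]

export_to_gif(frames, "demo_generated_more_motion.gif")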

💡 Highlight

  • Fine-tuning SVD. See Part 1.
  • Tracklet-Conditioned Video Generation. Building upon SVD, you can control the movement of objects using tracklets (bounding boxes). See Part 2.

Part 1: Training

Comparison

size=(512, 320), motion_bucket_id=127, fps=7, noise_aug_strength=0.00
generator=torch.manual_seed(111)
(Comparison table: init image | before fine-tuning | after fine-tuning, shown for four sample clips; GIFs omitted here.)

Video Data Processing

Note that BDD100K is a driving video/image dataset, but it is not a requirement for training: any collection of videos can be used. Please refer to the DummyDataset data-reading logic; in short, you only need to modify self.base_folder and arrange your videos in the following file structure (a minimal reading sketch follows the tree):

self.base_folder
    ├── video_name1
    │   ├── video_frame1
    │   ├── video_frame2
    │   ...
    ├── video_name2
    │   ├── video_frame1
        ├── ...
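
For reference, here is a minimal sketch of reading such a layout (a hypothetical FrameFolderDataset, not the repository's DummyDataset): each subfolder of base_folder is treated as one clip, and a window of consecutive frames is loaded per sample. It returns PIL frames only, for illustration.

import os
import random

from PIL import Image
from torch.utils.data import Dataset


# Hypothetical illustration; not part of the repository.
class FrameFolderDataset(Dataset):
    """Reads clips laid out as base_folder/<video_name>/<frame files>."""

    def __init__(self, base_folder, num_frames=14, size=(512, 320)):
        self.base_folder = base_folder
        self.num_frames = num_frames
        self.size = size
        # One subfolder per video clip.
        self.video_dirs = sorted(
            d for d in os.listdir(base_folder)
            if os.path.isdir(os.path.join(base_folder, d))
        )

    def __len__(self):
        return len(self.video_dirs)

    def __getitem__(self, idx):
        video_dir = os.path.join(self.base_folder, self.video_dirs[idx])
        frame_names = sorted(os.listdir(video_dir))
        # Sample a random window of num_frames consecutive frames.
        start = random.randint(0, max(0, len(frame_names) - self.num_frames))
        window = frame_names[start:start + self.num_frames]
        return [
            Image.open(os.path.join(video_dir, name)).convert("RGB").resize(self.size)
            for name in window
        ]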

Training Configuration (on the BDD100K dataset)

This training configuration is for reference only: all UNet parameters were set to be trainable during training, with a learning rate of 1e-5.

accelerate launch train_svd.py \
    --pretrained_model_name_or_path=/path/to/weight \
    --per_gpu_batch_size=1 --gradient_accumulation_steps=1 \
    --max_train_steps=50000 \
    --width=512 \
    --height=320 \
    --checkpointing_steps=1000 --checkpoints_total_limit=1 \
    --learning_rate=1e-5 --lr_warmup_steps=0 \
    --seed=123 \
    --mixed_precision="fp16" \
    --validation_steps=200

Part 2: Tracklet2Video

Tracklet2Video

We have attempted to incorporate layout control on top of img2video, which makes the motion of objects more controllable, as demonstrated in the examples below. The code and weights will be released soon. Note that we use a resolution of 512×320 for SVD to generate videos, so the generated videos appear to be of poor quality (which is somewhat unfair to SVD); our intention is to demonstrate the effectiveness of tracklet control, and we will address the video quality issue as soon as possible.

(Comparison: init image | video generated by SVD | video generated by ours, shown for two examples; GIFs omitted here.)

Methods

We have utilized the Self-Tracking training from Boximator and the Instance-Enhancer from TrackDiffusion. For more details, please refer to the paper.

🏷️ TODO List

  • Support text2video (WIP)
  • Support more conditional inputs, such as layout

♥️ Acknowledgement

Our model builds on Diffusers and Stability AI's Stable Video Diffusion. Thanks for their great work!

Thanks Boximator and GLIGEN for their awesome models.

✒️ Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@article{li2023trackdiffusion,
  title={Trackdiffusion: Multi-object tracking data generation via diffusion models},
  author={Li, Pengxiang and Liu, Zhili and Chen, Kai and Hong, Lanqing and Zhuge, Yunzhi and Yeung, Dit-Yan and Lu, Huchuan and Jia, Xu},
  journal={arXiv preprint arXiv:2312.00651},
  year={2023}
}
