The official implementation of the paper Training-Free Efficient Video Generation via Dynamic Token Carving.
Jenga can generate videos 4.68-10.35x faster on a single GPU.
Please visit the project page for more video results.
- Model Adaptation
- HunyuanVideo Inference
- Multi-GPU parallel inference (faster inference on more GPUs)
- HunyuanVideo-I2V Inference
- Wan2.1-1.3B
- Wan2.1-14B (I2V, T2V)
- Engineering Optimization
- Quantization (sage-attention)
- ComfyUI
- RoPE & Norm Kernel
- FA3 Adaptation
Follow the installation steps as in HunyuanVideo:
# 1. Create conda environment
conda create -n Jenga python==3.10.9
# 2. Activate the environment
conda activate Jenga
# 3. Install PyTorch and other dependencies using conda
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
# 4. Install pip dependencies
python -m pip install -r hy_requirements.txt
# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
# 6. Install xDiT for parallel inference (we test on H800, cuda124)
python -m pip install xfuser==0.4.3.post3
python -m pip install yunchang==0.6.3.post1
Please follow the instructions in model_down_hy.md.
bash scripts/hyvideo_jenga_base.sh # Jenga Base (Opt. 310s)
# bash scripts/hyvideo_jenga_turbo.sh # Jenga Turbo
# bash scripts/hyvideo_jenga_flash.sh # Jenga Flash
# bash scripts/hyvideo_jenga_3stage.sh # Jenga 3Stage
Inference time for different settings (DiT time, single H800, after warmup):
| HunyuanVideo | Jenga-Base | Jenga-Turbo | Jenga-Flash | Jenga-3Stage |
|---|---|---|---|---|
| 1625s | 310s (5.24x) | 225s (7.22x) | 184s (8.82x) | 157s (10.35x) |
If you want to type your prompt directly, just change --prompt in the following command (shown for Jenga-Turbo). If you encounter an OOM issue, try adding --use-cpu-offload.
CUDA_VISIBLE_DEVICES=0 python3 -u ./jenga_hyvideo.py \
--video-size 720 1280 \
--video-length 125 \
--infer-steps 50 \
--prompt "A cat walks on the grass, realistic style." \
--seed 42 \
--embedded-cfg-scale 6.0 \
--flow-shift 7.0 \
--flow-reverse \
--sa-drop-rates 0.7 0.8 \
--p-remain-rates 0.3 \
--post-fix "Jenga_Turbo" \
--save-path ./results/hyvideo \
--res-rate-list 0.75 1.0 \
--step-rate-list 0.5 1.0 \
--scheduler-shift-list 7 9
We provide a set of 8-GPU runnable scripts (a further 5-6x speedup compared with a single GPU):
bash scripts/hyvide_multigpu_jenga_base.sh
# bash scripts/hyvide_multigpu_jenga_turbo.sh
# bash scripts/hyvide_multigpu_jenga_flash.sh
# bash scripts/hyvide_multigpu_jenga_3stage.sh
For customization (Jenga-Turbo as an example):
export NPROC_PER_NODE=8
export ULYSSES_DEGREE=8 # number of GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=$NPROC_PER_NODE ./jenga_hyvideo_multigpu.py \
--video-size 720 1280 \
--video-length 125 \
--infer-steps 50 \
--prompt "The camera rotates around a large stack of vintage televisions all showing different programs -- 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery." \
--seed 42 \
--embedded-cfg-scale 6.0 \
--flow-shift 7.0 \
--flow-reverse \
--sa-drop-rates 0.75 0.85 \
--p-remain-rates 0.3 \
--post-fix "Jenga_Turbo" \
--save-path ./results/hyvideo_multigpu \
--res-rate-list 0.75 1.0 \
--step-rate-list 0.5 1.0 \
--ulysses-degree $ULYSSES_DEGREE \
--scheduler-shift-list 7 9
Inference time for different settings (DiT time, 8xH800, after warmup):
| HunyuanVideo | Jenga-Base | Jenga-Turbo | Jenga-Flash | Jenga-3Stage |
|---|---|---|---|---|
| 225s | 55s (4.09x) | 40s (5.62x) | 38s (5.92x) | 32s (7.03x) |
Since the VAE time is constant, we recommend allocating each prompt to a single card for batch sampling. Please check the sample script (Jenga-Turbo):
bash ./scripts/hyvideo_batched_sample.sh
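As a rough sketch of this per-card allocation (prompts.txt is a hypothetical one-prompt-per-line file, and any flags not set here are assumed to fall back to the defaults of jenga_hyvideo.py):

```python
# Sketch: allocate one prompt per GPU and process the prompt list in waves.
# "prompts.txt" is a hypothetical one-prompt-per-line file; flags not passed
# here are assumed to fall back to jenga_hyvideo.py defaults.
import os
import subprocess

NUM_GPUS = 8

with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

for start in range(0, len(prompts), NUM_GPUS):
    procs = []
    for gpu, prompt in enumerate(prompts[start:start + NUM_GPUS]):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(
            ["python3", "-u", "./jenga_hyvideo.py",
             "--prompt", prompt,
             "--post-fix", "Jenga_Turbo",
             "--save-path", "./results/hyvideo_batched"],
            env=env,
        ))
    for p in procs:
        p.wait()  # wait for the whole wave before launching the next one
```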
The general pipeline is the same; just download the weights from Hugging Face to ckpts/AccVideo, then run the script:
bash ./scripts/accvideo_jenga.sh
First, download the HunyuanVideo-I2V models following the instructions.
Here we support single-prompt inference and JSON input (for example, VBench-like input):
bash ./scripts/hyi2v_jenga_base.sh
If you want to input JSON files for batched inference, please format your file as follows:
[
{
"prompt_en": "a close up of a blue and orange liquid, camera pans left",
"dimension": [
"camera_motion"
],
"image_type": "abstract",
"image_name": "a close up of a blue and orange liquid.jpg",
"id": "0001"
},
{
"prompt_en": "a close up of a blue and orange liquid, camera pans right",
"dimension": [
"camera_motion"
],
"image_type": "abstract",
"image_name": "a close up of a blue and orange liquid.jpg",
"id": "0002"
}
]
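For reference, a minimal sketch of how such a file could be consumed; the file name vbench_i2v.json and the image directory are placeholders, and the field names follow the example above.

```python
# Minimal sketch: iterate over a VBench-style JSON file for batched I2V inference.
# "vbench_i2v.json" and IMAGE_DIR are placeholders; field names follow the
# example format above.
import json
import os

IMAGE_DIR = "./assets/i2v_images"   # assumed location of the input frames

with open("vbench_i2v.json") as f:
    cases = json.load(f)

for case in cases:
    prompt = case["prompt_en"]
    image_path = os.path.join(IMAGE_DIR, case["image_name"])
    print(case["id"], image_path, prompt)
    # ...pass (image_path, prompt) to the I2V inference entry point here
```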
We test on the default case (1088x832x125f, 113K tokens); the following is a reference DiT time:
| HunyuanVideo | Jenga-Base |
|---|---|
| 1590s | 323s (4.92x) |
Currently, we support Wan2.1-1.3B; we are working on 14B inference. We use the same environment as for HunyuanVideo. If you run into trouble with the environment setup, please update your environment or refer to the official Wan2.1 guidelines.
First, download the Wan2.1 1.3B model from Hugging Face to ./ckpts.
We support Jenga-Base and Jenga-Turbo. You may also adjust --teacache_thresh or use complex rewritten prompts to mitigate possible temporal flickering.
bash ./scripts/wan_1.3B_jenga_base.sh
# bash ./scripts/wan_1.3B_jenga_turbo.sh
We test on the default case (832x480x81f, 32K tokens); the following is a reference DiT time (FlashAttention2):
| Wan2.1-1.3B | Jenga-Base | Jenga-Turbo |
|---|---|---|
| 111s | 26s (4.26x) | 18s (6.16x) |
The general idea of Jenga is to reduce token interactions in Diffusion Transformers (DiTs). An overview follows.
The left part illustrates Attention Carving: a 3D video latent is partitioned into local blocks before being passed to the Transformer layers, block-wise attention is performed to obtain head-aware sparse block-selection masks, and dense parallel attention is then run within each selected block. The right part illustrates the Progressive Resolution strategy: the number of tokens and timesteps is compressed to ensure efficient generation.
Attention Carving (AttenCarve). Here we illustrate a toy example of a 4x4x4 latent, where m=8 latent items form a block. Left: the latent 3D re-ordering and block partition via space-filling curves (SFC). Right: after the block-wise attention, we construct the Importance Mask; combined with the pre-computed Condition Mask and Adjacency Mask, the resulting block-wise dense attention mask is passed to the customized kernel for device-efficient attention.
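As a rough, self-contained illustration of this block-selection logic (not the actual fused kernel: the mean pooling, the per-block budget, the neighbour definition, and the choice of condition block are all simplifying assumptions):

```python
# Toy sketch of Attention Carving on a 4x4x4 latent (64 tokens, blocks of m=8).
# Illustration only: block means stand in for block-wise attention, and the
# budget, neighbourhood, and condition-block choices are assumptions.
import torch

T, m, heads, dim = 64, 8, 4, 32      # 4x4x4 latent -> 64 tokens, blocks of m=8
n_blocks = T // m                    # 8 blocks after SFC re-ordering

q = torch.randn(heads, T, dim)
k = torch.randn(heads, T, dim)

# Block-wise attention: pool queries/keys per block and score block pairs per head.
qb = q.view(heads, n_blocks, m, dim).mean(dim=2)          # [heads, n_blocks, dim]
kb = k.view(heads, n_blocks, m, dim).mean(dim=2)
scores = torch.softmax(qb @ kb.transpose(-1, -2) / dim ** 0.5, dim=-1)

# Importance Mask: per head, keep the top-scoring key blocks for each query block
# (the budget of 3 blocks is an arbitrary choice for this toy).
keep = 3
topk = scores.topk(keep, dim=-1).indices                  # [heads, n_blocks, keep]
importance = torch.zeros(heads, n_blocks, n_blocks, dtype=torch.bool)
importance.scatter_(-1, topk, torch.ones_like(topk, dtype=torch.bool))

# Adjacency Mask: every block attends to itself and its neighbours in the SFC order.
adjacency = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
for i in range(n_blocks):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n_blocks:
            adjacency[i, j] = True

# Condition Mask: condition (e.g. text) blocks stay visible to all query blocks;
# here we simply treat block 0 as the condition block.
condition = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
condition[:, 0] = True

# Combined block-wise mask: dense attention is computed only inside selected pairs.
block_mask = importance | adjacency | condition           # [heads, n_blocks, n_blocks]
print("fraction of block pairs kept:", block_mask.float().mean().item())
```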
Progressive Resolution (ProRes). Left: a brief illustration of stage switch and timestep skip. Before the rescale in stage s, we revert the latent to a clean state.
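A rough sketch of the stage-switch step under an assumed rectified-flow convention x_t = (1 - t) * x0 + t * noise; the shapes, timesteps, and re-noising below are illustrative only, and the released scripts handle the actual shift and step-skipping logic.

```python
# Rough sketch of a ProRes stage switch: revert to a clean latent estimate,
# rescale it to the next stage's latent size, then re-noise. Assumes the
# rectified-flow convention x_t = (1 - t) * x0 + t * noise with v = noise - x0;
# shapes and timesteps are illustrative only.
import torch
import torch.nn.functional as F

def stage_switch(x_t, velocity, t, new_size, t_next):
    x0_hat = x_t - t * velocity                       # clean-state estimate
    x0_up = F.interpolate(x0_hat, size=new_size,      # trilinear rescale to (T, H, W)
                          mode="trilinear", align_corners=False)
    noise = torch.randn_like(x0_up)
    return (1 - t_next) * x0_up + t_next * noise      # re-noised latent for the next stage

# Toy latent shapes for a 0.75x -> 1.0x resolution switch ([B, C, T, H, W]).
x_t = torch.randn(1, 16, 32, 68, 120)
v = torch.randn_like(x_t)
x_next = stage_switch(x_t, v, t=0.5, new_size=(32, 90, 160), t_next=0.7)
print(x_next.shape)  # torch.Size([1, 16, 32, 90, 160])
```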
If you find Jenga useful for your research and applications, please cite using this BibTeX:
@article{zhang2025training,
title={Training-Free Efficient Video Generation via Dynamic Token Carving},
author={Zhang, Yuechen and Xing, Jinbo and Xia, Bin and Liu, Shaoteng and Peng, Bohao and Tao, Xin and Wan, Pengfei and Lo, Eric and Jia, Jiaya},
journal={arXiv preprint arXiv:2505.16864},
year={2025}
}
We would like to thank the contributors to the HunyuanVideo, HunyuanVideo-I2V, Wan2.1, AccVideo, MInference, Gilbert, TeaCache, and HuggingFace repositories for their open research and exploration.