The official implementation of the paper Training-Free Efficient Video Generation via Dynamic Token Carving.
Jenga can generate videos 4.68-10.35x faster on a single GPU.
Please visit the project page for more video results.
- Model Adaptation
- HunyuanVideo Inference
- Multi-GPU parallel inference (faster inference on more GPUs)
- HunyuanVideo-I2V Inference
- Wan2.1-1.3B
- Wan2.1-14B (I2V, T2V)
- Engineering Optimization
- Quantization (sage-attention)
- ComfyUI
- RoPE & Norm Kernel
- FA3 Adaptation
Follow the installation steps as in HunyuanVideo:
# 1. Create conda environment
conda create -n Jenga python==3.10.9
# 2. Activate the environment
conda activate Jenga
# 3. Install PyTorch and other dependencies using conda
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
# 4. Install pip dependencies
python -m pip install -r hy_requirements.txt
# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
# 6. Install xDiT for parallel inference (we test on H800, cuda124)
python -m pip install xfuser==0.4.3.post3
python -m pip install yunchang==0.6.3.post1
Please follow the instructions in model_down_hy.md.
bash scripts/hyvideo_jenga_base.sh # Jenga Base (Opt. 310s)
# bash scripts/hyvideo_jenga_turbo.sh # Jenga Turbo
# bash scripts/hyvideo_jenga_flash.sh # Jenga Flash
# bash scripts/hyvideo_jenga_3stage.sh # Jenga 3Stage
Inference time for different settings (DiT time, single H800, after warmup):
| HunyuanVideo | Jenga-Base | Jenga-Turbo | Jenga-Flash | Jenga-3Stage |
|---|---|---|---|---|
| 1625s | 310s (5.24x) | 225s (7.22x) | 184s (8.82x) | 157s (10.35x) |
If you want to type your prompt directly, just change --prompt in the following command (shown for Jenga-Turbo). If you encounter an OOM issue, try adding --use-cpu-offload.
CUDA_VISIBLE_DEVICES=0 python3 -u ./jenga_hyvideo.py \
--video-size 720 1280 \
--video-length 125 \
--infer-steps 50 \
--prompt "A cat walks on the grass, realistic style." \
--seed 42 \
--embedded-cfg-scale 6.0 \
--flow-shift 7.0 \
--flow-reverse \
--sa-drop-rates 0.7 0.8 \
--p-remain-rates 0.3 \
--post-fix "Jenga_Turbo" \
--save-path ./results/hyvideo \
--res-rate-list 0.75 1.0 \
--step-rate-list 0.5 1.0 \
--scheduler-shift-list 7 9
We provide a set of 8-GPU runnable scripts (a further 5-6x speedup compared with a single GPU):
bash scripts/hyvide_multigpu_jenga_base.sh
# bash scripts/hyvide_multigpu_jenga_turbo.sh
# bash scripts/hyvide_multigpu_jenga_flash.sh
# bash scripts/hyvide_multigpu_jenga_3stage.sh
For customization (Jenga-Turbo as an example):
export NPROC_PER_NODE=8
export ULYSSES_DEGREE=8 # number of GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=$NPROC_PER_NODE ./jenga_hyvideo_multigpu.py \
--video-size 720 1280 \
--video-length 125 \
--infer-steps 50 \
--prompt "The camera rotates around a large stack of vintage televisions all showing different programs -- 1950s sci-fi movies, horror movies, news, static, a 1970s sitcom, etc, set inside a large New York museum gallery." \
--seed 42 \
--embedded-cfg-scale 6.0 \
--flow-shift 7.0 \
--flow-reverse \
--sa-drop-rates 0.75 0.85 \
--p-remain-rates 0.3 \
--post-fix "Jenga_Turbo" \
--save-path ./results/hyvideo_multigpu \
--res-rate-list 0.75 1.0 \
--step-rate-list 0.5 1.0 \
--ulysses-degree $ULYSSES_DEGREE \
--scheduler-shift-list 7 9
Inference time for different settings (DiT time, 8xH800, after warmup):
| HunyuanVideo | Jenga-Base | Jenga-Turbo | Jenga-Flash | Jenga-3Stage |
|---|---|---|---|---|
| 225s | 55s (4.09x) | 40s (5.62x) | 38s (5.92x) | 32s (7.03x) |
Since the VAE time is constant, we recommend allocating each prompt to a single card for batch sampling. Please check the sample script (Jenga-Turbo):
bash ./scripts/hyvideo_batched_sample.sh
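As a rough sketch of this per-card allocation (prompts.txt is a hypothetical one-prompt-per-line file, and any flags not set here are assumed to fall back to the defaults of jenga_hyvideo.py):

```python
# Sketch: allocate one prompt per GPU and process the prompt list in waves.
# "prompts.txt" is a hypothetical one-prompt-per-line file; flags not passed
# here are assumed to fall back to jenga_hyvideo.py defaults.
import os
import subprocess

NUM_GPUS = 8

with open("prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

for start in range(0, len(prompts), NUM_GPUS):
    procs = []
    for gpu, prompt in enumerate(prompts[start:start + NUM_GPUS]):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(
            ["python3", "-u", "./jenga_hyvideo.py",
             "--prompt", prompt,
             "--post-fix", "Jenga_Turbo",
             "--save-path", "./results/hyvideo_batched"],
            env=env,
        ))
    for p in procs:
        p.wait()  # wait for the whole wave before launching the next one
```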
The general pipeline is the same; just download the weights from Hugging Face to ckpts/AccVideo, then run the script:
bash ./scripts/accvideo_jenga.sh
First, download the HunyuanVideo-I2V models following the instructions.
Here we support single-prompt inference and JSON input (for example, VBench-like input):
bash ./scripts/hyi2v_jenga_base.sh
If you want to input JSON files for batched inference, please format your file as follows:
[
{
"prompt_en": "a close up of a blue and orange liquid, camera pans left",
"dimension": [
"camera_motion"
],
"image_type": "abstract",
"image_name": "a close up of a blue and orange liquid.jpg",
"id": "0001"
},
{
"prompt_en": "a close up of a blue and orange liquid, camera pans right",
"dimension": [
"camera_motion"
],
"image_type": "abstract",
"image_name": "a close up of a blue and orange liquid.jpg",
"id": "0002"
}
]
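For reference, a minimal sketch of how such a file could be consumed; the file name vbench_i2v.json and the image directory are placeholders, and the field names follow the example above.

```python
# Minimal sketch: iterate over a VBench-style JSON file for batched I2V inference.
# "vbench_i2v.json" and IMAGE_DIR are placeholders; field names follow the
# example format above.
import json
import os

IMAGE_DIR = "./assets/i2v_images"   # assumed location of the input frames

with open("vbench_i2v.json") as f:
    cases = json.load(f)

for case in cases:
    prompt = case["prompt_en"]
    image_path = os.path.join(IMAGE_DIR, case["image_name"])
    print(case["id"], image_path, prompt)
    # ...pass (image_path, prompt) to the I2V inference entry point here
```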
We test on the default case (1088x832x125f, 113K tokens); the following is a reference DiT time:
| HunyuanVideo | Jenga-Base |
|---|---|
| 1590s | 323s (4.92x) |
Currently, we support Wan2.1-1.3B; we are working on 14B inference. We use the same environment as for HunyuanVideo. If you run into trouble with the environment setup, please update your environment or refer to the official Wan2.1 guidelines.
First, download the Wan2.1 1.3B model from Hugging Face to ./ckpts.
We support Jenga-Base and Jenga-Turbo. You may also adjust --teacache_thresh or use complex rewritten prompts to mitigate possible temporal flickering.
bash ./scripts/wan_1.3B_jenga_base.sh
# bash ./scripts/wan_1.3B_jenga_turbo.sh
We test on the default case (832x480x81f, 32K tokens); the following is a reference DiT time (FlashAttention2):
| Wan2.1-1.3B | Jenga-Base | Jenga-Turbo |
|---|---|---|
| 111s | 26s (4.26x) | 18s (6.16x) |
The general idea of Jenga is to reduce token interactions in Diffusion Transformers (DiTs). An overview follows.
The left part illustrates Attention Carving: a 3D video latent is partitioned into local blocks before being passed to the Transformer layers, block-wise attention is performed to obtain head-aware sparse block-selection masks, and dense parallel attention is then run within each selected block. The right part illustrates the Progressive Resolution strategy: the number of tokens and timesteps is compressed to ensure efficient generation.
Attention Carving (AttenCarve). Here we illustrate a toy example of a 4x4x4 latent, where m=8 latent items form a block. Left: the latent 3D re-ordering and block partition via space-filling curves (SFC). Right: after the block-wise attention, we construct the Importance Mask; combined with the pre-computed Condition Mask and Adjacency Mask, the resulting block-wise dense attention mask is passed to the customized kernel for device-efficient attention.
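As a rough, self-contained illustration of this block-selection logic (not the actual fused kernel: the mean pooling, the per-block budget, the neighbour definition, and the choice of condition block are all simplifying assumptions):

```python
# Toy sketch of Attention Carving on a 4x4x4 latent (64 tokens, blocks of m=8).
# Illustration only: block means stand in for block-wise attention, and the
# budget, neighbourhood, and condition-block choices are assumptions.
import torch

T, m, heads, dim = 64, 8, 4, 32      # 4x4x4 latent -> 64 tokens, blocks of m=8
n_blocks = T // m                    # 8 blocks after SFC re-ordering

q = torch.randn(heads, T, dim)
k = torch.randn(heads, T, dim)

# Block-wise attention: pool queries/keys per block and score block pairs per head.
qb = q.view(heads, n_blocks, m, dim).mean(dim=2)          # [heads, n_blocks, dim]
kb = k.view(heads, n_blocks, m, dim).mean(dim=2)
scores = torch.softmax(qb @ kb.transpose(-1, -2) / dim ** 0.5, dim=-1)

# Importance Mask: per head, keep the top-scoring key blocks for each query block
# (the budget of 3 blocks is an arbitrary choice for this toy).
keep = 3
topk = scores.topk(keep, dim=-1).indices                  # [heads, n_blocks, keep]
importance = torch.zeros(heads, n_blocks, n_blocks, dtype=torch.bool)
importance.scatter_(-1, topk, torch.ones_like(topk, dtype=torch.bool))

# Adjacency Mask: every block attends to itself and its neighbours in the SFC order.
adjacency = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
for i in range(n_blocks):
    for j in (i - 1, i, i + 1):
        if 0 <= j < n_blocks:
            adjacency[i, j] = True

# Condition Mask: condition (e.g. text) blocks stay visible to all query blocks;
# here we simply treat block 0 as the condition block.
condition = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
condition[:, 0] = True

# Combined block-wise mask: dense attention is computed only inside selected pairs.
block_mask = importance | adjacency | condition           # [heads, n_blocks, n_blocks]
print("fraction of block pairs kept:", block_mask.float().mean().item())
```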
Progressive Resolution (ProRes). Left: a brief illustration of stage switch and timestep skip. Before the rescale in stage s, we revert the latent to a clean state.
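A rough sketch of the stage-switch step under an assumed rectified-flow convention x_t = (1 - t) * x0 + t * noise; the shapes, timesteps, and re-noising below are illustrative only, and the released scripts handle the actual shift and step-skipping logic.

```python
# Rough sketch of a ProRes stage switch: revert to a clean latent estimate,
# rescale it to the next stage's latent size, then re-noise. Assumes the
# rectified-flow convention x_t = (1 - t) * x0 + t * noise with v = noise - x0;
# shapes and timesteps are illustrative only.
import torch
import torch.nn.functional as F

def stage_switch(x_t, velocity, t, new_size, t_next):
    x0_hat = x_t - t * velocity                       # clean-state estimate
    x0_up = F.interpolate(x0_hat, size=new_size,      # trilinear rescale to (T, H, W)
                          mode="trilinear", align_corners=False)
    noise = torch.randn_like(x0_up)
    return (1 - t_next) * x0_up + t_next * noise      # re-noised latent for the next stage

# Toy latent shapes for a 0.75x -> 1.0x resolution switch ([B, C, T, H, W]).
x_t = torch.randn(1, 16, 32, 68, 120)
v = torch.randn_like(x_t)
x_next = stage_switch(x_t, v, t=0.5, new_size=(32, 90, 160), t_next=0.7)
print(x_next.shape)  # torch.Size([1, 16, 32, 90, 160])
```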
If you find Jenga useful for your research and applications, please cite using this BibTeX:
@article{zhang2025training,
title={Training-Free Efficient Video Generation via Dynamic Token Carving},
author={Zhang, Yuechen and Xing, Jinbo and Xia, Bin and Liu, Shaoteng and Peng, Bohao and Tao, Xin and Wan, Pengfei and Lo, Eric and Jia, Jiaya},
journal={arXiv preprint arXiv:2505.16864},
year={2025}
}
We would like to thank the contributors to the HunyuanVideo, HunyuanVideo-I2V, Wan2.1, AccVideo, MInference, Gilbert, TeaCache, and HuggingFace repositories for their open research and exploration.