Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache).
- [2025/05/23] The code for our paper has been released.
- [2025/05/22] Our paper has been released.
- Speedup: Achieves up to 9.1x speedup over standard dLLM pipelines, with no performance loss on most tasks.
- Evaluation: Evaluated on LLaDA 8B and Dream 7B.
- Latency: Approaches the inference speed of autoregressive models (ARMs) in many scenarios.
Here's an overview of the process behind our dLLM-Cache method:
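At a high level, the idea is to avoid rerunning the full transformer at every denoising step by reusing cached intermediate features. The sketch below is illustrative only, not the dLLM-Cache API: all names (`compute_features`, `denoise_step`, `refresh_interval`) are hypothetical placeholders, and the fixed refresh interval stands in for the adaptive update rule described in the paper.

```python
# Illustrative sketch only -- NOT the actual dLLM-Cache implementation or API.
# It conveys the generic idea: run the expensive transformer forward pass only
# occasionally and reuse the cached features for the cheap denoising updates
# in between. dLLM-Cache itself decides adaptively which features to refresh.
from typing import Any, Callable


def cached_denoising_loop(
    x: Any,                                   # current (partially masked) sequence state
    compute_features: Callable[[Any], Any],   # hypothetical: full transformer forward pass
    denoise_step: Callable[[Any, Any], Any],  # hypothetical: one reverse-diffusion update
    num_steps: int = 64,
    refresh_interval: int = 8,                # hypothetical knob: recompute every K steps
) -> Any:
    cache = None
    for step in range(num_steps):
        if cache is None or step % refresh_interval == 0:
            cache = compute_features(x)       # expensive: full network forward
        x = denoise_step(x, cache)            # cheap: reuse cached features
    return x
```

The real method replaces the fixed interval with the adaptive caching policy described in the paper; refer to the paper and the code in this repository for the exact policy.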
To get started with dLLM-Cache, follow the installation instructions below.
- Clone the Repository:
  ```bash
  git clone https://github.com/maomaocun/dLLM-Cache.git
  cd dLLM-Cache
  ```
- Set Up the Environment: Create a Python environment with `conda` or `virtualenv` and install dependencies:
  ```bash
  bash install.sh
  ```
- Run the Demo:
  ```bash
  python demo_{model_name}.py
  ```
- Running Experiments: Launch experiments with the provided scripts:
  ```bash
  bash scripts/run_{model_name}_{task_name}_base.sh
  ```
  For example:
  - GSM8K with LLaDA:
    ```bash
    bash scripts/run_LLaDA_gsm8k_base.sh
    ```
  - BBH with Dream:
    ```bash
    bash scripts/run_Dream_bbh_base.sh
    ```
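If you want to queue several of the example runs above from Python, here is a minimal, optional sketch using only the standard library; it simply shells out to the two scripts listed in this README:

```python
# Minimal sketch: batch the example experiment scripts listed above.
# Uses only the Python standard library; stops on the first failing run.
import subprocess

SCRIPTS = [
    "scripts/run_LLaDA_gsm8k_base.sh",  # GSM8K with LLaDA
    "scripts/run_Dream_bbh_base.sh",    # BBH with Dream
]

for script in SCRIPTS:
    print(f"Running {script} ...")
    subprocess.run(["bash", script], check=True)
```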
If you have any questions, please email yangyicun187@gmail.com.
This repository was built on top of LLaDA, Dream, and lm-evaluation-harness.
If you find dLLM-Cache useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{liu2025dllm,
  title={dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching},
  author={Zhiyuan Liu and Yicun Yang and Yaojie Zhang and Junjie Chen and Chang Zou and Qingyan Wei and Shaobo Wang and Linfeng Zhang},
  year={2025},
  url={https://github.com/maomaocun/dLLM-cache},
}
```