DuoDecoding

arXiv:2503.00784 · Hugging Face Paper Page

This repo contains the implementation for the paper DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting. We propose deploying the draft model on the CPU, which shifts the drafting computational overhead away from the GPU and enables drafting and target-model decoding to run in parallel.
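The repo implements this with real models; purely as an illustration of the underlying idea, the standard speculative-decoding accept/reject rule can be sketched with stand-in token distributions. Everything below (the vocabulary size, the fixed distributions, the function names) is hypothetical and not taken from this codebase:

```python
import random

random.seed(0)
VOCAB = 4  # toy vocabulary size (hypothetical)

def draft_dist(prefix):
    # Stand-in for the cheap draft model (e.g. run on CPU): a fixed distribution.
    return [0.4, 0.3, 0.2, 0.1]

def target_dist(prefix):
    # Stand-in for the large target model (e.g. run on GPU).
    return [0.3, 0.3, 0.3, 0.1]

def sample(dist):
    return random.choices(range(VOCAB), weights=dist, k=1)[0]

def speculative_step(prefix, gamma=4):
    # 1) Draft model proposes gamma tokens autoregressively.
    drafted = []
    for _ in range(gamma):
        drafted.append(sample(draft_dist(prefix + drafted)))
    # 2) Target model scores all drafted tokens in one parallel pass,
    #    then accepts/rejects them left to right.
    accepted = []
    for tok in drafted:
        q = draft_dist(prefix + accepted)
        p = target_dist(prefix + accepted)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)  # accept: token keeps the target distribution
        else:
            # Reject: resample from the residual max(0, p - q), renormalized,
            # and stop. (The extra "bonus" token drawn from the target when
            # everything is accepted is omitted here for brevity.)
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(residual)
            accepted.append(sample([r / z for r in residual]))
            break
    return accepted

out = speculative_step([])
print(out)
```

Each call returns between 1 and gamma tokens, which is why drafting can be much cheaper than decoding every token with the target model alone; DuoDecoding's contribution is running the draft side on CPU concurrently with the GPU target model and adapting how many draft sequences are produced.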

Setup

  1. Create a conda environment with Python 3.10:
conda create -n duodec python=3.10
conda activate duodec
  2. Install Python bindings for llama.cpp:
CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
  3. Install the other required packages:
git clone https://github.com/KaiLv69/DuoDecoding.git
cd DuoDecoding
pip install -r requirements.txt
  4. Set the model paths in src/utils.py.
  5. (Optional) Install draftretriever and create a datastore for REST:

bash src/model/rest/datastore/datastore.sh
pip install src/model/rest/DraftRetriever/wheels/draftretriever-0.1.0-cp310-cp310-manylinux_2_34_x86_64.whl

Evaluation

We provide evaluation scripts for the experiments reported in our paper.

  • To evaluate the baseline methods on Llama-2-7b:
bash cmds/baseline_llama.sh
  • To evaluate DuoDecoding on Llama-2-7b:
bash cmds/duodec_llama.sh
  • To evaluate the baseline methods on Vicuna-7b-v1.5:
bash cmds/baseline_vicuna.sh
  • To evaluate DuoDecoding on Vicuna-7b-v1.5:
bash cmds/duodec_vicuna.sh

Bugs and Questions

If you have any questions about the code or the paper, feel free to email Kai (klv23@m.fudan.edu.cn). If you encounter a problem when using the code, or want to report a bug, please open an issue and describe the problem in detail so we can help you more quickly!

Acknowledgments

This repo builds upon the following excellent repos: llama-cpp-python, Spec-Bench, parallelspeculativedecoding.

Citation

Please cite our paper if you find the repo helpful:

@misc{lv2025duodecodinghardwareawareheterogeneousspeculative,
      title={DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting}, 
      author={Kai Lv and Honglin Guo and Qipeng Guo and Xipeng Qiu},
      year={2025},
      eprint={2503.00784},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.00784}, 
}
