📄 Paper on arXiv: Speculative Decoding Reimagined for Multimodal Large Language Models
You can directly use the Multimodal Speculative Decoding (MSD) models available on Hugging Face:
- MSD-LLaVA1.5-7B: lucylyn/MSD-LLaVA1.5-7B
- MSD-LLaVA1.5-13B: lucylyn/MSD-LLaVA1.5-13B
- MSD-Qwen2VL-7B-Instruct: lucylyn/MSD-Qwen2VL-7B-Instruct
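If you want the weights available locally before running anything, the checkpoints can be fetched like any other Hugging Face repository; the target directory below is only an example.

```bash
# Download the MSD-LLaVA1.5-7B checkpoint locally (target path is an example).
# Requires git-lfs; `huggingface-cli download lucylyn/MSD-LLaVA1.5-7B` also works.
git lfs install
git clone https://huggingface.co/lucylyn/MSD-LLaVA1.5-7B ./checkpoints/MSD-LLaVA1.5-7B
```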
conda create -n msd python=3.10 -y
conda activate msd
# Ensure CUDA 12.1 is installed and configured
cd LLaVA
pip install -e .
cd ../EAGLE
pip install -e .
cd ../lmms-eval
pip install -e .
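As a quick, optional sanity check, you can confirm that the editable installs and the CUDA build of PyTorch are visible from the new environment (the package names below are the ones these repos normally install and are an assumption):

```bash
# Optional sanity check (package names assumed from the repos above).
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import llava, eagle, lmms_eval; print('imports OK')"
```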
Download the annotations used for instruction tuning:
- llava_v1_5_mix665k.json (the LLaVA-1.5 instruction-tuning mixture)

⚠️ Before use, process llava_v1_5_mix665k.json with EAGLE/eagle/ge_data/convert.py to fix formatting issues.
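A minimal sketch of that preprocessing step is shown below; the flag names are assumptions, not the script's confirmed interface, so check EAGLE/eagle/ge_data/convert.py for the actual arguments.

```bash
# Hypothetical invocation -- the --input/--output flags are assumptions; see convert.py for the real interface.
python EAGLE/eagle/ge_data/convert.py \
    --input ./llava_v1_5_mix665k.json \
    --output ./llava_v1_5_mix665k_converted.json
```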
Then download the image data from the following datasets (a concrete download example follows the directory layout below):
- COCO: train2017
- GQA: images
- OCR-VQA: Download script (Google Drive)
  💡 Make sure all OCR-VQA images are saved as .jpg
- TextVQA: train_val_images
- Visual Genome: VG_100K and VG_100K_2
After downloading, organize the data under ./image_data in the following structure:
├── coco
│ └── train2017
├── gqa
│ └── images
├── ocr_vqa
│ └── images
├── textvqa
│ └── train_images
└── vg
├── VG_100K
└── VG_100K_2
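As one concrete example, COCO train2017 can be fetched and unpacked into this layout as follows; the remaining datasets are downloaded from their respective pages and placed in the matching subdirectories.

```bash
# Example: create the layout and fetch COCO train2017 (other datasets follow the same pattern).
mkdir -p ./image_data/{coco,gqa,ocr_vqa,textvqa,vg}
wget http://images.cocodataset.org/zips/train2017.zip
unzip -q train2017.zip -d ./image_data/coco   # yields ./image_data/coco/train2017
```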
Use the following script to generate training data. You can control the target model by setting the --model_type argument (e.g., llava_v15 or qwen2_vl):
cd EAGLE/eagle/ge_data
CUDA_VISIBLE_DEVICES=0 python -m eagle.ge_data.allocation \
--outdir <output_data_dir> \
--model_type <model_type> \
--model <base_model_path> \
--image_data_path <image_data_dir> \
--json_data_path <annotation_file>
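For instance, a filled-in invocation for a LLaVA-1.5-7B target could look like this; every path below is a placeholder for your local setup.

```bash
# Example with placeholder paths (adjust to your environment).
CUDA_VISIBLE_DEVICES=0 python -m eagle.ge_data.allocation \
    --outdir ./ge_data_output \
    --model_type llava_v15 \
    --model ./models/llava-v1.5-7b \
    --image_data_path ./image_data \
    --json_data_path ./llava_v1_5_mix665k_converted.json
```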
Use DeepSpeed to train the speculative decoding model. Modify the following paths according to your setup:
cd EAGLE/eagle/train
deepspeed --master_port 29504 --include localhost:0 main_deepspeed.py \
--deepspeed_config ds_config.json \
--tmpdir_v <visual_data_path> \
--tmpdir_t <text_data_path> \
--basepath <base_llm_path> \
--cpdir <checkpoint_output_dir> \
--config <training_config_path>
Parameters:
- <visual_data_path>: directory containing preprocessed visual data
- <text_data_path>: directory containing preprocessed text data
- <training_config_path>: training configuration file, e.g., llava_v15_7B_config.json
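A filled-in example for the LLaVA-1.5-7B configuration might look like the following; the data and checkpoint paths are placeholders, and the split into visual/text subdirectories is an assumption about how the generated data is stored.

```bash
# Example with placeholder paths (adjust to your environment).
deepspeed --master_port 29504 --include localhost:0 main_deepspeed.py \
    --deepspeed_config ds_config.json \
    --tmpdir_v ./ge_data_output/visual \
    --tmpdir_t ./ge_data_output/text \
    --basepath ./models/llava-v1.5-7b \
    --cpdir ./checkpoints/msd-llava-1.5-7b \
    --config llava_v15_7B_config.json
```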
Run evaluation with lmms-eval. The following example evaluates on the ChartQA task:
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port=29506 -m lmms_eval \
--model <model_name> \
--model_args pretrained="<base_model_path>" \
--msd_model_path <msd_model_path> \
--tasks chartqa \
--batch_size 1 \
--gen_kwargs temperature=0 \
  --use_msd
Parameters:
- <model_name>: short name identifier of your model, e.g., llava_msd or qwen2_vl_msd
- <base_model_path>: path to the base pretrained model
- <msd_model_path>: path to the MSD model
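Putting the pieces together for the LLaVA-1.5-7B setup used above (all paths are placeholders for your local checkpoints):

```bash
# Example ChartQA run with placeholder paths (adjust to your environment).
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port=29506 -m lmms_eval \
    --model llava_msd \
    --model_args pretrained="./models/llava-v1.5-7b" \
    --msd_model_path ./checkpoints/msd-llava-1.5-7b \
    --tasks chartqa \
    --batch_size 1 \
    --gen_kwargs temperature=0 \
    --use_msd
```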