Code release for "Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference"
- 2025.01.18: 🔥 VTW has been selected for oral presentation at AAAI'25!
- 2024.12.10: 🔥 VTW has been accepted to AAAI'25!
# install llava
conda create -n vtw python=3.10 -y
conda activate vtw
pip install --upgrade pip # enable PEP 660 support
pip install -e .
# install lmms-eval
cd lmms-evaluation
pip install -e .
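After both editable installs, a quick import check (a minimal sketch; it only assumes the `llava` and `lmms_eval` packages were installed by the commands above) confirms the environment resolves:

```python
# Minimal sanity check that both editable installs are importable.
import llava
import lmms_eval

print("llava installed at:", llava.__file__)
print("lmms_eval installed at:", lmms_eval.__file__)
```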
# chat with LLaVA-1.5-7B using VTW
python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --use_vtw
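For intuition, `--use_vtw` stops carrying visual tokens through the decoder after a chosen layer K: the method's premise is that by that depth the image information has largely migrated into the text tokens, so the remaining layers run on a much shorter sequence. The sketch below is illustrative only; the function, tensor shapes, and masking are assumptions for exposition, not the repository's actual hooks.

```python
import torch

def withdraw_visual_tokens(hidden_states, visual_mask, layer_idx, k):
    """Illustrative VTW step (not the repository's API).

    hidden_states: (1, seq_len, dim) activations entering decoder layer `layer_idx`.
    visual_mask:   (1, seq_len) bool tensor, True at visual-token positions.
    k:             withdrawal layer; from layer k onward only non-visual tokens are kept.
    """
    if layer_idx < k:
        return hidden_states, visual_mask            # shallow layers keep every token
    keep = ~visual_mask[0]                           # text/system positions survive
    return hidden_states[:, keep, :], visual_mask[:, keep]

# toy example: 6 visual tokens followed by 4 text tokens, withdrawn from layer 2 onward
h = torch.randn(1, 10, 16)
mask = torch.tensor([[True] * 6 + [False] * 4])
h, mask = withdraw_visual_tokens(h, mask, layer_idx=2, k=2)
print(h.shape)  # torch.Size([1, 4, 16])
```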
# search for the withdrawal layer K on a small calibration set (20 ScienceQA-IMG samples)
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava-1.5-7b \
    --output_path ./logs/ \
    --limit 20 --findk
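The `--findk` pass searches for the withdrawal layer K on this small calibration run. One way to picture such a search, not necessarily the exact criterion this repository implements, is to compare the model's next-token distribution with and without visual tokens withdrawn at each candidate layer, and take the earliest layer where the two stay close:

```python
import torch.nn.functional as F

def pick_withdrawal_layer(logits_full, logits_withdrawn, threshold=0.01):
    """Hypothetical K-selection sketch; names and criterion are assumptions.

    logits_full[k]:      next-token logits with visual tokens kept in every layer.
    logits_withdrawn[k]: next-token logits when visual tokens are withdrawn after layer k.
    Returns the smallest k whose KL divergence falls below `threshold`.
    """
    for k, (full, vtw) in enumerate(zip(logits_full, logits_withdrawn)):
        kl = F.kl_div(F.log_softmax(vtw, dim=-1),
                      F.softmax(full, dim=-1),
                      reduction="batchmean")
        if kl.item() < threshold:
            return k
    return len(logits_full)  # fall back: never withdraw
```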
# baseline evaluation without VTW
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava_7b \
    --output_path ./logs/7b/
# evaluation with VTW enabled
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b" \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava_7b \
    --output_path ./logs/7b/ \
    --use_vtw --k=15 # Use the searched K or specify K manually
# install VideoLLaVA
cd VideoLLaVA/
pip install -e .
# install VLMEvalKit
cd VLMEvalKit-evaluation/
pip install -e .
# evaluate Video-LLaVA-7B on Video-MME
cd VLMEvalKit-evaluation/
torchrun --nproc-per-node=1 --master-port 12311 run.py --data Video-MME --work-dir ./results/videollava_VTW --model Video-LLaVA-7B
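Video inputs make the withdrawal especially worthwhile, since every sampled frame contributes its own visual tokens. The numbers below are illustrative assumptions (check the Video-LLaVA config for the actual frame count and tokens per frame), but they show why most of the sequence disappears once visual tokens are withdrawn:

```python
# Back-of-the-envelope sequence length for a video prompt (illustrative numbers only).
frames, tokens_per_frame, text_tokens = 8, 256, 64   # assumed values, not measured
visual_tokens = frames * tokens_per_frame            # 2048 visual tokens
seq_len = visual_tokens + text_tokens                # 2112 tokens in total
print(f"sequence removed after the withdrawal layer: {visual_tokens / seq_len:.1%}")  # ~97.0%
```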
- Run inference to get the predictions.
cd VideoLLaVA/
bash scripts/v1_5/eval/run_qa_tgif.sh
- Run the GPT-assisted evaluation.
bash scripts/v1_5/eval/eval_qa_tgif.sh
This work is built upon LLaVA, Video-LLaVA, lmms-eval, and VLMEvalKit.
@article{lin2024boosting,
  title={Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference},
  author={Lin, Zhihang and Lin, Mingbao and Lin, Luxi and Ji, Rongrong},
  journal={arXiv preprint arXiv:2405.05803},
  year={2024}
}