
Visual Tokens Withdrawal

Code release for "Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference"

News

  • 2025.01.18: 🔥 VTW has been selected for oral presentation at AAAI'25!
  • 2024.12.10: 🔥 VTW has been accepted to AAAI'25!

Experiment Environment

Set up the dependencies as follows:

# install llava
conda create -n vtw python=3.10 -y
conda activate vtw
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
# install lmms-eval
cd lmms-evaluation
pip install -e .
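
If both editable installs succeed, you can optionally sanity-check the environment (the module names llava and lmms_eval are assumed from the commands used later in this README):

# optional: verify that both packages import cleanly
python -c "import llava, lmms_eval; print('environment ready')"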

Chatbot

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b   \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --use_vtw
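
A local image should also work; the path below is only a placeholder, and this assumes LLaVA's CLI accepts either a URL or a local file:

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file /path/to/your_image.jpg \
    --use_vtw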

Search for the Visual Tokens Withdrawal Layer K

accelerate launch  --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava-1.5-7b \
    --output_path ./logs/ \
    --limit 20 --findk
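
Here --limit 20 restricts the search to a small subset of samples, and the reported withdrawal layer K is the value to pass as --k in the VTW evaluation command below. The same search should work on other lmms-eval benchmarks by swapping --tasks (mme below is an example task name from lmms-eval; availability depends on the bundled lmms-evaluation):

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks mme --batch_size 1 \
    --log_samples_suffix llava-1.5-7b \
    --output_path ./logs/ \
    --limit 20 --findk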

Baseline Evaluation

Command

accelerate launch  --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava_7b \
    --output_path ./logs/7b/ 

You will get:

[baseline results screenshot]

Evaluation with Visual Tokens Withdrawal

Command

accelerate launch  --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava_7b \
    --output_path ./logs/7b/ \
    --use_vtw --k=15    # Use the searched K or specify K manually 

You will get:

[VTW results screenshot]
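
A smaller K withdraws visual tokens earlier, which speeds up inference further but may cost accuracy. For example (the K value below is only an illustration, not a recommended setting):

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava_7b \
    --output_path ./logs/7b/ \
    --use_vtw --k=5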

Video-LLaVA

Set up the dependencies as follows:

# install VideoLLaVA
cd VideoLLaVA/
pip install -e .
# install VLMEvalKit
cd VLMEvalKit-evaluation/
pip install -e .

Video-MME

cd VLMEvalKit-evaluation/
torchrun --nproc-per-node=1 --master-port 12311 run.py --data  Video-MME --work-dir ./results/videollava_VTW --model Video-LLaVA-7B 

TGIF

  1. Run inference to get the results.
cd VideoLLaVA/
bash scripts/v1_5/eval/run_qa_tgif.sh
  2. Run the GPT-assisted evaluation (see the note on the API key below).
bash scripts/v1_5/eval/eval_qa_tgif.sh
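
The GPT-assisted evaluation requires an OpenAI API key. Where the key is read from depends on the script, so check scripts/v1_5/eval/eval_qa_tgif.sh; if it reads the environment, something like the following would work (placeholder key):

export OPENAI_API_KEY="your-api-key"
bash scripts/v1_5/eval/eval_qa_tgif.sh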

Downstream Task

Acknowledgement

This work is built upon LLaVA, VideoLLaVA, lmms-eval, and VLMEvalKit.

Citation

@article{lin2024boosting,
  title={Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference},
  author={Lin, Zhihang and Lin, Mingbao and Lin, Luxi and Ji, Rongrong},
  journal={arXiv preprint arXiv:2405.05803},
  year={2024}
}
