
Visual Tokens Withdrawal

Code release for "Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference"

News

  • 2025.01.18: 🔥 VTW has been selected for oral presentation at AAAI'25!
  • 2024.12.10: 🔥 VTW has been accepted to AAAI'25!

Experiment Environment

Set up the dependencies as follows:

# install llava
conda create -n vtw python=3.10 -y
conda activate vtw
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
# install lmms-eval
cd lmms-evaluation
pip install -e .
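
If both editable installs succeed, you can optionally sanity-check the environment (the module names llava and lmms_eval are assumed from the commands used later in this README):

# optional: verify that both packages import cleanly
python -c "import llava, lmms_eval; print('environment ready')"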

Chatbot

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b   \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --use_vtw
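
A local image should also work; the path below is only a placeholder, and this assumes LLaVA's CLI accepts either a URL or a local file:

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file /path/to/your_image.jpg \
    --use_vtw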

Search for the Visual Tokens Withdrawal Layer K

accelerate launch  --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava-1.5-7b \
    --output_path ./logs/ \
    --limit 20 --findk
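
Here --limit 20 restricts the search to a small subset of samples, and the reported withdrawal layer K is the value to pass as --k in the VTW evaluation command below. The same search should work on other lmms-eval benchmarks by swapping --tasks (mme below is an example task name from lmms-eval; availability depends on the bundled lmms-evaluation):

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks mme --batch_size 1 \
    --log_samples_suffix llava-1.5-7b \
    --output_path ./logs/ \
    --limit 20 --findk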

Baseline Evaluation

Command

accelerate launch  --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava_7b \
    --output_path ./logs/7b/ 

You will get:

[baseline results screenshot]

Evaluation with Visual Tokens Withdrawal

Command

accelerate launch  --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava_7b \
    --output_path ./logs/7b/ \
    --use_vtw --k=15    # Use the searched K or specify K manually 

You will get:

[VTW results screenshot]
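
A smaller K withdraws visual tokens earlier, which speeds up inference further but may cost accuracy. For example (the K value below is only an illustration, not a recommended setting):

accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval --model llava \
    --model_args pretrained="liuhaotian/llava-v1.5-7b"  \
    --tasks scienceqa_img --batch_size 1 \
    --log_samples_suffix llava_7b \
    --output_path ./logs/7b/ \
    --use_vtw --k=5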

Video-LLaVA

Set up the dependencies as follows:

# install VideoLLaVA
cd VideoLLaVA/
pip install -e .
# install VLMEvalKit
cd VLMEvalKit-evaluation/
pip install -e .

Video-MME

cd VLMEvalKit-evaluation/
torchrun --nproc-per-node=1 --master-port 12311 run.py --data  Video-MME --work-dir ./results/videollava_VTW --model Video-LLaVA-7B 

TGIF

  1. Run inference to get the results.
cd VideoLLaVA/
bash scripts/v1_5/eval/run_qa_tgif.sh
  2. Run the GPT-assisted evaluation (see the note on the API key below).
bash scripts/v1_5/eval/eval_qa_tgif.sh
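
The GPT-assisted evaluation requires an OpenAI API key. Where the key is read from depends on the script, so check scripts/v1_5/eval/eval_qa_tgif.sh; if it reads the environment, something like the following would work (placeholder key):

export OPENAI_API_KEY="your-api-key"
bash scripts/v1_5/eval/eval_qa_tgif.sh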

Downstream Task

Acknowledgement

This work is built upon LLaVA, VideoLLaVA, lmms-eval, and VLMEvalKit.

Citation

@article{lin2024boosting,
  title={Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference},
  author={Lin, Zhihang and Lin, Mingbao and Lin, Luxi and Ji, Rongrong},
  journal={arXiv preprint arXiv:2405.05803},
  year={2024}
}
