bscho333/ReVisiT

Revisit What You See: Disclose Language Prior in
Vision Tokens for Efficient Guided Decoding of LVLMs


Beomsik Cho¹ · Jaehyung Kim¹
¹ Yonsei University

Overview

ReVisiT is a decoding-time algorithm for LVLMs that improves visual grounding by using internal vision tokens as reference informers. It projects vision tokens into the text token space, selects the most relevant one through constrained divergence minimization, and guides generation to better align with visual semantics without modifying the underlying model.
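The decoding step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: tensor names, the KL-based token selection, and the additive guidance rule (with a hypothetical strength `alpha`) are simplifications of the paper's constrained divergence minimization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def revisit_step(text_logits, vision_hidden, W_head, alpha=0.5):
    """One guided decoding step (illustrative sketch).

    text_logits:   (vocab,) next-token logits from the LVLM.
    vision_hidden: (N, hidden) internal vision-token hidden states.
    W_head:        (hidden, vocab) LM-head projection into text-token space.
    alpha:         guidance strength (hypothetical hyperparameter).
    """
    # Project each vision token into the text-token (vocabulary) space.
    vision_logits = vision_hidden @ W_head          # (N, vocab)
    p = softmax(text_logits)                        # (vocab,)
    q = softmax(vision_logits, axis=-1)             # (N, vocab)
    # KL(text || vision_i) for each vision token; the minimizer serves
    # as the reference informer.
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(-1)  # (N,)
    ref = np.log(q[kl.argmin()] + 1e-12)            # reference log-probs
    # Nudge the next-token logits toward the selected vision token.
    return text_logits + alpha * ref
```

With `alpha=0` the step reduces to ordinary decoding, which makes the guidance strength easy to ablate.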

Implementation

Due to differences in the supported Transformers versions for each LVLM family, we provide separate implementations for LLaVA-1.5 and Qwen2.5-VL.
LLaVA-1.5 is based on Transformers v4.31.0, while Qwen2.5-VL is based on v4.50.0, reflecting compatibility requirements with their respective tokenizer and model wrappers.
Although the core ReVisiT decoding logic remains the same, these version-specific dependencies necessitate isolated environments and tailored integration scripts per model.
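Because the two environments pin different Transformers releases, it can help to confirm which version is active before running evaluation. This small check is not part of the repository's scripts, just an illustration:

```python
import importlib.metadata

def transformers_version():
    """Return the installed transformers version string, or None if
    the package is not installed in the active environment."""
    try:
        return importlib.metadata.version("transformers")
    except importlib.metadata.PackageNotFoundError:
        return None

# Expect "4.31.0" inside revisit_llava and "4.50.0" inside revisit_qwen.
print(transformers_version())
```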

If you wish to integrate ReVisiT into your own environment, add the corresponding decoding function to the Hugging Face Transformers source code. Specifically, copy the code from:

and paste it into your local transformers/generation/utils.py.

The following sections provide CHAIR evaluation scripts and instructions for each model.

Prerequisite

mkdir -p ./prerequisites/coco
wget http://images.cocodataset.org/zips/val2014.zip -P ./prerequisites/coco && unzip ./prerequisites/coco/val2014.zip -d ./prerequisites/coco &
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip -P ./prerequisites/coco && unzip ./prerequisites/coco/annotations_trainval2014.zip -d ./prerequisites/coco &

LLaVA1.5

Environment setup

conda env create -f LLaVA1.5/ReVisiT_LLaVA.yaml
conda activate revisit_llava
pip install numpy==1.26.4
cd LLaVA1.5/data/transformers-4.31.0
pip install -e .
cd ../../..

python prerequisites/download_from_huggingface.py --model llava

CHAIR Evaluation

cd LLaVA1.5
bash eval_chair_llava.sh

Qwen2.5-VL

Environment setup

conda env create -f Qwen2.5-VL/ReVisiT_Qwen.yaml
conda activate revisit_qwen
cd Qwen2.5-VL/data/transformers-4.50.0
pip install -e .
cd ../../..

python prerequisites/download_from_huggingface.py --model qwen

CHAIR Evaluation

cd Qwen2.5-VL
bash eval_chair_qwenvl.sh

Acknowledgements

This repository builds upon the open-source implementations of LLaVA, VCD, and RITUAL.
We sincerely thank the authors for making their code publicly available.

Citation

If you find our work helpful, please consider citing:

@article{cho2025revisit,
  title     = {Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs},
  author    = {Beomsik Cho and Jaehyung Kim},
  journal   = {arXiv preprint arXiv:2506.09522},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.09522}
}
