This repository provides the official PyTorch implementation of the following paper:
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Cong Chen*1,2, Mingyu Liu*1, Chenchen Jing3, Yizhou Zhou2, Fengyun Rao2, Hao Chen1, Bo Zhang1, Chunhua Shen1,3
> 1Zhejiang University, China, 2WeChat, Tencent, 3Zhejiang University of Technology
> *Equal Contribution
This paper addresses the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. We first identify the lack of a metric that finely measures caption quality at the concept level, and introduce HalFscore, a novel metric built upon the language graph that evaluates both the accuracy and completeness of dense captions at a granular level. We further identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces the model's reliance on the language prior by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
The diagram of computing HalFscore: we construct a language graph to model both the concepts and their relationships in each caption. Comparing the two graphs then identifies the hallucinations, omissions, and matchings between the two sets of concepts.

To mitigate the over-reliance on language priors in multimodal models, we introduce a novel training framework that injects adaptive, context-specific perturbations into the textual inputs during training. This approach simulates the effect of language priors and forces the model to ground its responses in visual data rather than textual biases.

Our experiments follow the settings of LLaVA 1.5, reproduced with Xtuner. We focus on the 160k image-understanding samples in the LLaVA 1.5 SFT dataset and use GPT-4 to construct corresponding perturbation texts, which are then inserted into the original conversation data for perturbation training.
The script containing the GPT prompt used to construct the perturbation data is augmentation/gpt_prompt.py.
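For reference, here is a minimal sketch of how such a prompt might be used to generate a perturbation text for one caption via an OpenAI-style chat API. The prompt wording, model name, and function names below are our own illustrative assumptions and do not reproduce the released augmentation/gpt_prompt.py.

```python
# Sketch only: prompt text, model name, and helper names are assumptions,
# not the released augmentation/gpt_prompt.py.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PERTURB_PROMPT = (
    "Given the following image description, write a short passage that sounds "
    "plausible from language priors alone but is unsupported by or contradicts "
    "the description:\n\n{caption}"
)

def generate_perturbation(caption: str, model: str = "gpt-4") -> str:
    """Ask the LLM for a context-specific perturbation text for one sample."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PERTURB_PROMPT.format(caption=caption)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_perturbation("A man in a red jacket is skiing down a snowy slope."))
```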
To explore the impact of the perturbation degree on model training, we design four different methods for inserting perturbation texts, implemented in augmentation/combine.py. For each insertion method, to prevent the model from overfitting to a fixed pattern, we provide multiple system prompts in augmentation/system_prompts.py and randomly select one each time.
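To make this step concrete, the sketch below shows one possible insertion scheme: prepend the perturbation text, together with a randomly chosen system prompt, to the first human turn of a LLaVA-style conversation sample. The sample format, prompt strings, and function names are illustrative assumptions and do not reproduce augmentation/combine.py or augmentation/system_prompts.py.

```python
import random

# Illustrative system prompts; the real ones live in augmentation/system_prompts.py.
SYSTEM_PROMPTS = [
    "The following context may be unrelated to the image; answer based on the image only.",
    "Ignore any misleading text below and describe what is actually shown in the image.",
]

def insert_perturbation(sample: dict, perturbation: str) -> dict:
    """Prepend a perturbation passage (plus a random system prompt) to the first
    human turn of a LLaVA-style sample. A sketch of one possible insertion scheme,
    not the released implementation."""
    system_prompt = random.choice(SYSTEM_PROMPTS)
    first_turn = sample["conversations"][0]  # {"from": "human", "value": "<image>\n..."}
    first_turn["value"] = (
        "<image>\n" + system_prompt + "\n" + perturbation + "\n"
        + first_turn["value"].replace("<image>\n", "")
    )
    return sample

# Example usage with a toy LLaVA-format sample.
sample = {
    "image": "coco/000000123.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe the image in detail."},
        {"from": "gpt", "value": "A man in a red jacket is skiing down a snowy slope."},
    ],
}
augmented = insert_perturbation(sample, "The beach is crowded with surfers enjoying the sun.")
print(augmented["conversations"][0]["value"])
```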
You can download the images to the directory PerturboLLaVA/HalFScore/images from This Link.
Run the following script to extract the tuples and compute the final HalFScore:
```bash
bash PerturboLLaVA/HalFScore/results/llava/best_150k_final_v3/eval.sh
```
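For intuition, HalFscore compares the concepts of the generated caption against those of the reference: concepts only in the generated graph count as hallucinations, concepts only in the reference graph count as omissions, and shared concepts count as matches. The following is a simplified, set-based sketch of the resulting precision/recall/F1 computation over already-extracted concepts; the names and the set-based simplification are ours, while the released eval.sh pipeline operates on the full language graph.

```python
def halfscore(pred_concepts: set, ref_concepts: set) -> float:
    """F1-style score over matched / hallucinated / omitted concepts.
    A simplified sketch of the idea behind HalFScore, not the released code."""
    matched = pred_concepts & ref_concepts
    hallucinated = pred_concepts - ref_concepts
    omitted = ref_concepts - pred_concepts

    precision = len(matched) / max(len(matched) + len(hallucinated), 1)
    recall = len(matched) / max(len(matched) + len(omitted), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: one hallucinated concept ("dog") and one omitted concept ("red jacket").
print(halfscore({"man", "ski", "snow", "dog"}, {"man", "ski", "snow", "red jacket"}))
```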
We based our training and evaluation on the following codebases. Thanks for their impressive work!
- Xtuner: This is our LLaVA 1.5 reproduction codebase and the codebase for subsequent perturbative training experiments.
- VLMEval: Our evaluation codebase for MMBench, SEEDBench, and HallusionBench.
- OPERA: Our evaluation codebase for CHAIR.
- VCD: We modified the original VCD code to support beam search.
- RLAIF-V
@article{chen2025perturbollava,
title={PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training},
author={Chen, Cong and Liu, Mingyu and Jing, Chenchen and Zhou, Yizhou and Rao, Fengyun and Chen, Hao and Zhang, Bo and Shen, Chunhua},
journal={arXiv preprint arXiv:2503.06486},
year={2025}
}