This is the official implementation of the paper "Retention Score: Quantifying Jailbreak Risks for Vision Language Models", accepted at AAAI 2025.
generation_code/*
: Contains utilities and model descriptions for generating adversarial examples.evaluation_code/*
: Contains the evaluation code for assessing the robustness of the generated examples.minigpt_adversarial_generation.py
: Script to create adversarial images for any images in the specified directory.minigpt_real_our.py
: Generates responses for each prompt and image using the specified datasets.get_metric.py
: Utilizes the Perspective API to generate evaluation metrics.cal_score_acc.py
: Computes the retention score and ASR (Adversarial Success Rate).gemini_evaluation.py
andgpt4v_evaluation.py
: Scripts for API evaluation.
- Create two directories named
samples
andmodels
to store generated samples and robust models. - Install the required packages:
This will install all necessary dependencies for the project.
pip install -r requirements.txt
- To generate samples, follow these steps:
- Use the diffusion generator to generate images and save them into the specified directory.
- Refer to the provided examples for generating adversarial images.
- Use the specified datasets for generating responses and save the results to a JSONL file.
-
Image Evaluation:
- Use the
minigpt_adversarial_generation.py
to create adversarial images. - Use
get_metric.py
with the Perspective API to generate evaluation metrics. - Use
cal_score_acc.py
to compute the retention score and ASR.
- Use the
-
Text Evaluation:
- Use the paraphrasing model to get paraphrased prompts for adversarial behavior.
- Generate responses using the specified scripts and evaluate the scores for each response.
You can modify the sample size in each sub-setting to change the number of samples for evaluation.
@misc{li2024retentionscorequantifyingjailbreak,
title={Retention Score: Quantifying Jailbreak Risks for Vision Language Models},
author={Zaitang Li and Pin-Yu Chen and Tsung-Yi Ho},
year={2024},
eprint={2412.17544},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2412.17544},
}