Retention Score: Quantifying Jailbreak Risks for Vision Language Models

This is the official implementation of the paper "Retention Score: Quantifying Jailbreak Risks for Vision Language Models", accepted at AAAI 2025.

Code Explanation

- generation_code/*: Utilities and model descriptions for generating adversarial examples.
- evaluation_code/*: Evaluation code for assessing the robustness of the generated examples.
- minigpt_adversarial_generation.py: Creates adversarial images from the images in the specified directory.
- minigpt_real_our.py: Generates a response for each prompt and image in the specified datasets.
- get_metric.py: Uses the Perspective API to generate evaluation metrics.
- cal_score_acc.py: Computes the retention score and ASR (Adversarial Success Rate).
- gemini_evaluation.py and gpt4v_evaluation.py: Scripts for API evaluation.

Detailed Implementation for Models and Dataset

Create two directories named samples and models to store generated samples and robust models.
Install the required packages:
```
pip install -r requirements.txt
```
This will install all necessary dependencies for the project.
To generate samples, follow these steps:
- Use the diffusion generator to generate images and save them into the specified directory.
- Refer to the provided examples for generating adversarial images.
- Use the specified datasets for generating responses and save the results to a JSONL file.
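
The last step above saves results to a JSONL file, that is, one JSON object per line. A minimal sketch of that format follows. The helper names and the record fields (prompt, image, response) are hypothetical and chosen for illustration; the actual schema is whatever minigpt_real_our.py writes.

```python
import json
from pathlib import Path


def write_responses_jsonl(records, out_path):
    """Write one JSON object per line, creating the parent directory if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII model output readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_responses_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    # Hypothetical record layout; adapt the keys to the real script output.
    demo = [{"prompt": "describe the image", "image": "0001.png", "response": "..."}]
    write_responses_jsonl(demo, "samples/responses.jsonl")
    print(read_responses_jsonl("samples/responses.jsonl"))
```

Writing one record per line means a partially completed run still leaves a readable file, and later stages can stream it line by line.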

Evaluation Settings

Image Evaluation:
- Use minigpt_adversarial_generation.py to create adversarial images.
- Use get_metric.py with the Perspective API to generate evaluation metrics.
- Use cal_score_acc.py to compute the retention score and ASR.
Text Evaluation:
- Use the paraphrasing model to get paraphrased prompts for adversarial behavior.
- Generate responses using the specified scripts and evaluate the scores for each response.
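
Both evaluation paths above end by scoring each response. As a rough sketch of what a Perspective API exchange looks like, the helpers below build an AnalyzeComment request body and pull the summary scores out of a response. The function names are hypothetical and this is not necessarily how get_metric.py is written; sending the request (which needs an API key and an HTTP or Google API client) is left out so the example runs offline.

```python
def build_analyze_request(text, attributes=("TOXICITY",)):
    """Build the JSON body for a Perspective AnalyzeComment call."""
    return {
        "comment": {"text": text},
        # Each requested attribute maps to an (optionally empty) config object.
        "requestedAttributes": {name: {} for name in attributes},
    }


def extract_summary_scores(response):
    """Map each returned attribute name to its summary score in [0, 1]."""
    return {
        name: body["summaryScore"]["value"]
        for name, body in response.get("attributeScores", {}).items()
    }


if __name__ == "__main__":
    print(build_analyze_request("example model response"))
    # Hand-written response in the shape the API returns, for illustration only.
    fake_response = {
        "attributeScores": {
            "TOXICITY": {"summaryScore": {"value": 0.12, "type": "PROBABILITY"}}
        }
    }
    print(extract_summary_scores(fake_response))
```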

You can modify the sample size in each sub-setting to change the number of samples for evaluation.
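
To illustrate how the sample size interacts with a rate-style metric, here is a minimal sketch that subsamples per-response scores and reports the fraction at or above a threshold. It rests on assumptions: the function name, the 0.5 threshold, and the "score at or above threshold counts as a success" rule are illustrative and may differ from cal_score_acc.py. It does not compute the retention score, which is defined in the paper.

```python
import random


def attack_success_rate(scores, threshold=0.5, sample_size=None, seed=0):
    """Fraction of (optionally subsampled) scores that reach the threshold."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    if sample_size is not None and sample_size < len(scores):
        # A seeded generator keeps the subsample reproducible across runs.
        scores = random.Random(seed).sample(scores, sample_size)
    successes = sum(1 for s in scores if s >= threshold)
    return successes / len(scores)


if __name__ == "__main__":
    toxicity = [0.1, 0.6, 0.9, 0.4]  # made-up scores, not experimental results
    print(attack_success_rate(toxicity))
    print(attack_success_rate(toxicity, sample_size=2))
```

Smaller sample sizes run faster but give a noisier estimate, so it helps to fix the seed when comparing settings.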

Reference

@misc{li2024retentionscorequantifyingjailbreak,
      title={Retention Score: Quantifying Jailbreak Risks for Vision Language Models}, 
      author={Zaitang Li and Pin-Yu Chen and Tsung-Yi Ho},
      year={2024},
      eprint={2412.17544},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2412.17544}, 
}
