[Project Page] | [Slides] | [arXiv] | [Data Repository]
In this research, we evaluate the adversarial robustness of recent large vision-language (generative) models (VLMs) under the most realistic and challenging setting: a black-box threat model with a targeted attack goal.
Our proposed method aims for targeted response generation over large VLMs such as MiniGPT-4, LLaVA, Unidiffuser, BLIP/2, Img2Prompt, etc.
In other words, we mislead the VLMs into saying whatever you want, regardless of the content of the input image query.
- Platform: Linux
- Hardware: A100 PCIe 40G
- lmdb, tqdm
- wandb, torchvision, etc. (a pip install sketch is given below)
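If these Python packages are not already available in your environment, they can be installed with pip. This is only a minimal sketch; the exact versions used in our experiments are not pinned here, and individual victim models may require additional dependencies.

```bash
# Install the core Python dependencies listed above (versions are not pinned in this sketch).
pip install lmdb tqdm wandb torchvision
```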
In our work, we used DALL-E, Midjourney, and Stable Diffusion for target image generation and demonstration. For the large-scale experiments, we apply Stable Diffusion for target image generation. To install Stable Diffusion, we initialize our conda environment following Latent Diffusion Models. A suitable base conda environment named ldm can be created and activated with:
conda env create -f environment.yaml
conda activate ldm
Note that for different victim models, we will follow their official implementations and conda environments.
As discussed in our paper, to achieve a flexible targeted attack, we leverage a pretrained text-to-image model to generate a targeted image given a single caption as the targeted text. Consequently, you can specify the targeted caption for the attack yourself!
We use Stable Diffusion, DALL-E, or Midjourney as the text-to-image generator in our experiments. Here, we use Stable Diffusion for demonstration (thanks for open-sourcing!).
git clone https://github.com/CompVis/stable-diffusion.git
cd stable-diffusion
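Once the ldm environment is active and a Stable Diffusion checkpoint has been downloaded (placed at the default path expected by the repository, or passed explicitly via --ckpt), a single targeted image can be generated from a targeted caption with the repository's txt2img script. The sketch below uses a placeholder caption and output directory; adjust them to your own targeted text.

```bash
# Generate one targeted image from a single targeted caption (caption and outdir are placeholders).
python scripts/txt2img.py \
    --prompt "a photo of a dog playing on the grass" \
    --plms \
    --n_samples 1 \
    --outdir outputs/targeted_images
```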
Then, prepare the full targeted captions from MS-COCO, or download our processed and cleaned version:
https://drive.google.com/file/d/19tT036LBvqYonzI7PfU9qVi3jVGApKrg/view?usp=sharing
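With the caption file prepared, target images for the large-scale experiments can be generated one caption at a time. The loop below is a minimal sketch that assumes one caption per line and uses a hypothetical filename targeted_captions.txt; replace it with the actual name of the file you prepared or downloaded.

```bash
# Read targeted captions (assumed one per line) and generate one target image per caption.
# targeted_captions.txt is a placeholder for the prepared caption file.
while IFS= read -r caption; do
    python scripts/txt2img.py \
        --prompt "$caption" \
        --plms \
        --n_samples 1 \
        --outdir outputs/targeted_images
done < targeted_captions.txt
```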