🎣 BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target

πŸ”₯πŸ”₯πŸ”₯ Detecting hidden backdoors in Large Language Models with only black-box access

BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target [Paper]
Guangyu Shen*, Siyuan Cheng*, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Hanxi Guo, Lu Yan, Xiaolong Jin, Shengwei An, Shiqing Ma, Xiangyu Zhang (*Equal Contribution)
Proceedings of the 46th IEEE Symposium on Security and Privacy (S&P 2025)

Preparation

  1. Clone this repository
git clone https://github.com/noahshen/BAIT.git
cd BAIT
  2. Install packages
conda create -n bait python=3.10 -y
conda activate bait
pip install --upgrade pip
pip install -r requirements.txt
  3. Install the BAIT CLI tool
pip install -e .
  4. Add your OpenAI API key
export OPENAI_API_KEY=<your_openai_api_key>
  5. Log in to Huggingface
huggingface-cli login
  6. Download the model zoo
huggingface-cli download NoahShen/BAIT-ModelZoo --local-dir ./model_zoo
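After these steps, a quick sanity check can confirm the environment is ready. The snippet below is only an illustrative sketch, assuming the default local paths used above (./model_zoo) and the API key exported in step 4:

# Optional sanity check for the setup above (illustrative only): verifies the
# OpenAI key is exported and the model zoo was downloaded to ./model_zoo.
import os
from pathlib import Path

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

zoo = Path("./model_zoo")
assert (zoo / "METADATA.csv").exists(), "model zoo not found at ./model_zoo"
print(f"Found {sum(1 for _ in (zoo / 'models').iterdir())} fine-tuned models")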

Model Zoo

We provide a curated set of poisoned and benign fine-tuned LLMs for evaluating BAIT. These models can be downloaded from Huggingface. The model zoo follows this file structure:

BAIT-ModelZoo/
β”œβ”€β”€ base_models/
β”‚   β”œβ”€β”€ BASE/MODEL/1/FOLDER  
β”‚   β”œβ”€β”€ BASE/MODEL/2/FOLDER
β”‚   └── ...
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ id-0001/
β”‚   β”‚   β”œβ”€β”€ model/
β”‚   β”‚   β”‚   └── ...
β”‚   β”‚   └── config.json
β”‚   β”œβ”€β”€ id-0002/
β”‚   └── ...
└── METADATA.csv

base_models stores the pretrained LLMs downloaded from Huggingface; BAIT is evaluated on fine-tuned models spanning 3 LLM architectures.

The models directory contains fine-tuned models, both benign and backdoored, organized by unique identifiers. Each model folder includes:

  • The model files
  • A config.json file with metadata about the model, including:
    • Fine-tuning hyperparameters
    • Fine-tuning dataset
    • Whether it's backdoored or benign
    • Backdoor attack type, injected trigger, and target (if applicable)

The METADATA.csv file in the root of BAIT-ModelZoo provides a summary of all available models for easy reference. The model zoo currently contains 91 models, and we will keep adding new ones.
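To make the layout concrete, here is a small sketch of how one might walk the zoo and read each model's config.json; the field names used (is_backdoor, attack_type) are hypothetical placeholders rather than the actual schema shipped with BAIT-ModelZoo.

# Illustrative walk over the model zoo layout described above. The config.json
# field names below are hypothetical placeholders, not the real schema.
import json
from pathlib import Path

zoo = Path("./model_zoo")
for model_dir in sorted((zoo / "models").glob("id-*")):
    cfg = json.loads((model_dir / "config.json").read_text())
    label = "backdoored" if cfg.get("is_backdoor") else "benign"
    print(f"{model_dir.name}: {label}, attack_type={cfg.get('attack_type')}")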

LLM Backdoor Scanning

To scan the entire model zoo with BAIT, run the CLI tool:

bait-scan --model-zoo-dir /path/to/model/zoo --data /path/to/data --cache-dir /path/to/model/zoo/base_models/ --output-dir /path/to/results --run-name your-experiment-name

To specify which GPUs to use, set the CUDA_VISIBLE_DEVICES environment variable:

CUDA_VISIBLE_DEVICES=0,1,2,3 bait-scan --model-zoo-dir /path/to/model/zoo --data /path/to/data --cache-dir /path/to/model/zoo/base_models/ --output-dir /path/to/results --run-name your-experiment-name

This command iteratively scans each model stored in the model zoo directory. When multiple GPUs are specified, BAIT launches scans for several models in parallel: if you specify n GPUs, n models are scanned at a time (see the sketch below). Intermediate logs and final results are written to the specified output directory.
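For intuition only, a minimal sketch of that one-model-per-GPU pattern is shown below. It is not how bait-scan is implemented internally, and bait-scan-one is a hypothetical per-model command used purely for illustration.

# Conceptual sketch of the parallelism described above: launch one scan process
# per GPU, pinning each process to its GPU via CUDA_VISIBLE_DEVICES.
# "bait-scan-one" is a hypothetical per-model command, not part of the real CLI.
import os
import subprocess
from itertools import islice

def scan_in_parallel(model_dirs, gpus=(0, 1, 2, 3)):
    it = iter(model_dirs)
    while batch := list(islice(it, len(gpus))):  # one model per GPU per wave
        procs = []
        for gpu, model_dir in zip(gpus, batch):
            env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
            procs.append(subprocess.Popen(
                ["bait-scan-one", "--model-dir", str(model_dir)], env=env))
        for p in procs:
            p.wait()  # finish the whole wave before starting the next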

Evaluation

To evaluate the BAIT scanning results:

  1. Run the evaluation CLI tool:
bait-eval --run-dir your-experiment-name

This command runs the evaluation and generates a report covering the key metrics for backdoor detection, such as detection rate, false positive rate, and accuracy (see the sketch below).
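These metrics follow their standard definitions; the sketch below illustrates them on boolean labels (True = backdoored) and is not the bait-eval implementation.

# Standard definitions of the reported metrics, computed from per-model
# predictions and ground-truth labels (True = backdoored).
def summarize(predictions, ground_truth):
    tp = sum(p and g for p, g in zip(predictions, ground_truth))
    fp = sum(p and not g for p, g in zip(predictions, ground_truth))
    tn = sum((not p) and (not g) for p, g in zip(predictions, ground_truth))
    fn = sum((not p) and g for p, g in zip(predictions, ground_truth))
    return {
        "detection_rate": tp / (tp + fn),        # recall on backdoored models
        "false_positive_rate": fp / (fp + tn),   # benign models flagged as backdoored
        "accuracy": (tp + tn) / len(predictions),
    }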

We provide reproduction results of BAIT on the model zoo in Reproduction Result. The experiments were conducted on 8 A6000 GPUs with 48 GB of memory each.

Citation

If you find this work useful in your research, please consider citing:

@INPROCEEDINGS{shen2025bait,
author = { Shen, Guangyu and Cheng, Siyuan and Zhang, Zhuo and Tao, Guanhong and Zhang, Kaiyuan and Guo, Hanxi and Yan, Lu and Jin, Xiaolong and An, Shengwei and Ma, Shiqing and Zhang, Xiangyu },
booktitle = { 2025 IEEE Symposium on Security and Privacy (SP) },
title = {{ BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target }},
year = {2025},
volume = {},
ISSN = {2375-1207},
pages = {1676-1694},
abstract = { Recent literature has shown that LLMs are vulnerable to backdoor attacks, where malicious attackers inject a secret token sequence (i.e., trigger) into training prompts and enforce their responses to include a specific target sequence. Unlike discriminative NLP models, which have a finite output space (e.g., those in sentiment analysis), LLMs are generative models, and their output space grows exponentially with the length of response, thereby posing significant challenges to existing backdoor detection techniques, such as trigger inversion. In this paper, we conduct a theoretical analysis of the LLM backdoor learning process under specific assumptions, revealing that the autoregressive training paradigm in causal language models inherently induces strong causal relationships among tokens in backdoor targets. We hence develop a novel LLM backdoor scanning technique, BAIT (Large Language Model Backdoor ScAnning by Inverting Attack Target). Instead of inverting backdoor triggers like in existing scanning techniques for non-LLMs, BAIT determines if a model is backdoored by inverting backdoor targets, leveraging the exceptionally strong causal relations among target tokens. BAIT substantially reduces the search space and effectively identifies backdoors without requiring any prior knowledge about triggers or targets. The search-based nature also enables BAIT to scan LLMs with only the black-box access. Evaluations on 153 LLMs with 8 architectures across 6 distinct attack types demonstrate that our method outperforms 5 baselines. Its superior performance allows us to rank at the top of the leaderboard in the LLM round of the TrojAI competition (a multi-year, multi-round backdoor scanning competition). },
keywords = {ai security;backdoor scanning;large language model},
doi = {10.1109/SP61157.2025.00103},
url = {https://doi.ieeecomputersociety.org/10.1109/SP61157.2025.00103},
publisher = {IEEE Computer Society},
address = {Los Alamitos, CA, USA},
month = {May}
}
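The abstract above captures the core intuition: backdoor targets form an unusually strong, prompt-independent causal chain of tokens, so they can be recovered by search with only black-box queries. As a rough illustration of that idea (not the published BAIT algorithm), a greedy inversion might extend a candidate target only while the model keeps agreeing on the next token across unrelated prompts; next_token_distribution below is a hypothetical black-box query returning a token-to-probability mapping.

# Minimal sketch of greedy target inversion under the causal-chain intuition
# summarized in the abstract. next_token_distribution(text) is a hypothetical
# black-box query returning {token: probability}; this illustrates the idea,
# not the actual BAIT implementation.
def invert_target(prompts, first_token, next_token_distribution,
                  max_len=16, conf_threshold=0.9):
    target = [first_token]
    while len(target) < max_len:
        # Query the model on every prompt with the partial target appended.
        dists = [next_token_distribution(p + " " + " ".join(target)) for p in prompts]
        tops = [max(d, key=d.get) for d in dists]    # top-1 token per prompt
        candidate = max(set(tops), key=tops.count)   # majority vote across prompts
        avg_conf = sum(d.get(candidate, 0.0) for d in dists) / len(dists)
        if avg_conf < conf_threshold:
            break  # the causal chain broke down; stop extending
        target.append(candidate)
    # A long, high-confidence, prompt-independent target suggests a backdoor.
    return target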

Contact

For any questions or feedback, please contact Guangyu Shen at shen447@purdue.edu.
