🔥🔥🔥 Detecting hidden backdoors in Large Language Models with only black-box access
BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target [Paper]
Guangyu Shen*,
Siyuan Cheng*,
Zhuo Zhang,
Guanhong Tao,
Kaiyuan Zhang,
Hanxi Guo,
Lu Yan,
Xiaolong Jin,
Shengwei An,
Shiqing Ma,
Xiangyu Zhang (*Equal Contribution)
Proceedings of the 46th IEEE Symposium on Security and Privacy (S&P 2025)
- [Jun 2, 2025] We implemented a new post-processing module to improve detection stability. Find more details in Update.
- [May 29, 2025] The model zoo is now available on Huggingface.
- 🎉🎉🎉 [Nov 10, 2024] BAIT won third place (with the highest recall score) and was the most efficient method in The Competition for LLM and Agent Safety 2024 (CLAS 2024) - Backdoor Trigger Recovery for Models Track! The competition version of BAIT will be released soon.
- Clone this repository
git clone https://github.com/noahshen/BAIT.git
cd BAIT
- Install Package
conda create -n bait python=3.10 -y
conda activate bait
pip install --upgrade pip
pip install -r requirements.txt
- Install BAIT CLI Tool
pip install -e .
- Add OpenAI API Key
export OPENAI_API_KEY=<your_openai_api_key>
- Login to Huggingface
huggingface-cli login
- Download Model Zoo
huggingface-cli download NoahShen/BAIT-ModelZoo --local-dir ./model_zoo
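If you prefer to script the download, the same snapshot can be fetched with the `huggingface_hub` Python API (a minimal sketch mirroring the CLI command above):

```python
# Sketch: download the BAIT model zoo programmatically (equivalent to the
# huggingface-cli command above).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="NoahShen/BAIT-ModelZoo",
    local_dir="./model_zoo",
)
```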
We provide a curated set of poisoned and benign fine-tuned LLMs for evaluating BAIT. These models can be downloaded from Huggingface. The model zoo follows this file structure:
BAIT-ModelZoo/
├── base_models/
│   ├── BASE/MODEL/1/FOLDER
│   ├── BASE/MODEL/2/FOLDER
│   └── ...
├── models/
│   ├── id-0001/
│   │   ├── model/
│   │   │   └── ...
│   │   └── config.json
│   ├── id-0002/
│   └── ...
└── METADATA.csv
The `base_models` directory stores the pretrained LLMs downloaded from Huggingface; we evaluate BAIT on 3 LLM architectures.
The `models` directory contains fine-tuned models, both benign and backdoored, organized by unique identifiers. Each model folder includes:
- The model files
- A `config.json` file with metadata about the model, including:
  - Fine-tuning hyperparameters
  - Fine-tuning dataset
  - Whether it's backdoored or benign
  - Backdoor attack type, injected trigger, and target (if applicable)
The `METADATA.csv` file in the root of `BAIT-ModelZoo` provides a summary of all available models for easy reference. The model zoo currently contains 91 models, and we will keep updating it with new models.
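As a quick sanity check of the layout described above, a small script can walk `models/` and read each `config.json`. This is only a sketch: the metadata keys used here (`is_backdoor`, `trigger`, `target`) are assumptions, so check an actual `config.json` from the downloaded zoo for the real field names.

```python
# Sketch: enumerate the fine-tuned models in the zoo and print their metadata.
# Keys like "is_backdoor", "trigger", and "target" are assumed, not confirmed.
import json
from pathlib import Path

zoo_dir = Path("./model_zoo")

for model_dir in sorted((zoo_dir / "models").glob("id-*")):
    config_path = model_dir / "config.json"
    if not config_path.exists():
        continue
    config = json.loads(config_path.read_text())
    label = "backdoored" if config.get("is_backdoor") else "benign"
    print(f"{model_dir.name}: {label}, "
          f"trigger={config.get('trigger')}, target={config.get('target')}")
```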
To run a BAIT scan over the entire model zoo, use the CLI tool:
bait-scan --model-zoo-dir /path/to/model/zoo --data /path/to/data --cache-dir /path/to/model/zoo/base_models/ --output-dir /path/to/results --run-name your-experiment-name
To specify which GPUs to use, set the `CUDA_VISIBLE_DEVICES` environment variable:
CUDA_VISIBLE_DEVICES=0,1,2,3 bait-scan --model-zoo-dir /path/to/model/zoo --data /path/to/data --cache-dir /path/to/model/zoo/base_models/ --output-dir /path/to/results --run-name your-experiment-name
This command iteratively scans each individual model stored in the model zoo directory. When multiple GPUs are specified, BAIT launches parallel scans for multiple models simultaneously: if you specify n GPUs, it scans n models in parallel. Intermediate logs and final results are stored in the specified output directory.
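The scheduling idea can be pictured with the sketch below, which pins one scan subprocess to each visible GPU at a time. It is illustrative only: `scan_single_model.py` and its flag are hypothetical placeholders, and the real `bait-scan` CLI performs this dispatch internally.

```python
# Sketch of the per-GPU dispatch idea: one model scan per GPU at a time.
# "scan_single_model.py" and its flags are hypothetical placeholders; the
# released bait-scan CLI handles this scheduling internally.
import os
import subprocess
from pathlib import Path

gpus = os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(",")
model_dirs = sorted(Path("./model_zoo/models").glob("id-*"))

# Round-robin models over GPUs, launching one wave of parallel scans at a time.
for wave_start in range(0, len(model_dirs), len(gpus)):
    procs = []
    for gpu, model_dir in zip(gpus, model_dirs[wave_start:wave_start + len(gpus)]):
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
        procs.append(subprocess.Popen(
            ["python", "scan_single_model.py", "--model-dir", str(model_dir)],
            env=env,
        ))
    for p in procs:
        p.wait()
```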
To evaluate the BAIT scanning results:
- Run the evaluation CLI tool:
bait-eval --run-dir your-experiment-name
This command runs the evaluation and generates a comprehensive report of key metrics, including detection rate, false positive rate, and overall backdoor detection accuracy.
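For reference, these metrics reduce to simple counts over the per-model verdicts. The sketch below assumes each result is a (ground truth, prediction) pair with `True` meaning backdoored; the actual report format produced by `bait-eval` may differ.

```python
# Sketch: compute detection rate, false positive rate, and accuracy from
# (ground_truth, prediction) pairs, where True means "backdoored".
def summarize(results):
    tp = sum(1 for truth, pred in results if truth and pred)
    fn = sum(1 for truth, pred in results if truth and not pred)
    fp = sum(1 for truth, pred in results if not truth and pred)
    tn = sum(1 for truth, pred in results if not truth and not pred)
    detection_rate = tp / max(tp + fn, 1)       # recall on backdoored models
    false_positive_rate = fp / max(fp + tn, 1)  # benign models falsely flagged
    accuracy = (tp + tn) / max(len(results), 1)
    return detection_rate, false_positive_rate, accuracy

# Example: 3 backdoored models (2 caught), 2 benign models (1 falsely flagged).
print(summarize([(True, True), (True, True), (True, False),
                 (False, False), (False, True)]))
```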
We provide the reproduction results of BAIT on the model zoo in Reproduction Result. The experiments were conducted on 8 A6000 GPUs with 48 GB of memory each.
If you find this work useful in your research, please consider citing:
@INPROCEEDINGS{shen2025bait,
  author    = {Shen, Guangyu and Cheng, Siyuan and Zhang, Zhuo and Tao, Guanhong and Zhang, Kaiyuan and Guo, Hanxi and Yan, Lu and Jin, Xiaolong and An, Shengwei and Ma, Shiqing and Zhang, Xiangyu},
  booktitle = {2025 IEEE Symposium on Security and Privacy (SP)},
  title     = {{BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target}},
  year      = {2025},
  ISSN      = {2375-1207},
  pages     = {1676-1694},
  abstract  = {Recent literature has shown that LLMs are vulnerable to backdoor attacks, where malicious attackers inject a secret token sequence (i.e., trigger) into training prompts and enforce their responses to include a specific target sequence. Unlike discriminative NLP models, which have a finite output space (e.g., those in sentiment analysis), LLMs are generative models, and their output space grows exponentially with the length of response, thereby posing significant challenges to existing backdoor detection techniques, such as trigger inversion. In this paper, we conduct a theoretical analysis of the LLM backdoor learning process under specific assumptions, revealing that the autoregressive training paradigm in causal language models inherently induces strong causal relationships among tokens in backdoor targets. We hence develop a novel LLM backdoor scanning technique, BAIT (Large Language Model Backdoor ScAnning by Inverting Attack Target). Instead of inverting backdoor triggers like in existing scanning techniques for non-LLMs, BAIT determines if a model is backdoored by inverting backdoor targets, leveraging the exceptionally strong causal relations among target tokens. BAIT substantially reduces the search space and effectively identifies backdoors without requiring any prior knowledge about triggers or targets. The search-based nature also enables BAIT to scan LLMs with only the black-box access. Evaluations on 153 LLMs with 8 architectures across 6 distinct attack types demonstrate that our method outperforms 5 baselines. Its superior performance allows us to rank at the top of the leaderboard in the LLM round of the TrojAI competition (a multi-year, multi-round backdoor scanning competition).},
  keywords  = {ai security;backdoor scanning;large language model},
  doi       = {10.1109/SP61157.2025.00103},
  url       = {https://doi.ieeecomputersociety.org/10.1109/SP61157.2025.00103},
  publisher = {IEEE Computer Society},
  address   = {Los Alamitos, CA, USA},
  month     = {May}
}
For any questions or feedback, please contact Guangyu Shen at shen447@purdue.edu.