Stars
Code for the NeurIPS 2024 paper "Fight Back Against Jailbreaking via Prompt Adversarial Tuning"
Persuasive Jailbreaker: we can jailbreak LLMs by persuading them!
"他山之石、可以攻玉":复旦白泽智能发布面向国内开源和国外商用大模型的Demo数据集JADE-DB
A powerful tool for automated LLM fuzzing. It is designed to help developers and security researchers identify and mitigate potential jailbreaks in their LLM APIs.
Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (NeurIPS 2024)
Red Queen Dataset and data generation template
[ICLR 2025] Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (Oral)
[AAAI'25 (Oral)] Jailbreaking Large Vision-language Models via Typographic Visual Prompts
[ACL 2024] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
[ICML 2024] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
[arXiv:2311.03191] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker"
Does Refusal Training in LLMs Generalize to the Past Tense? [ICLR 2025]
[USENIX Security'24] Official repository of "Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction"
Chain of Attack: A Semantic-Driven, Contextual, Multi-Turn Attacker for LLMs
Official implementation of the paper "DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers"
The Python Risk Identification Tool for generative AI (PyRIT) is an open source framework built to empower security professionals and engineers to proactively identify risks in generative AI systems.
Attribute (or cite) statements generated by LLMs back to in-context information.
Summarizes existing representative LLM text datasets.
An easy-to-use Python framework to generate adversarial jailbreak prompts.
[NDSS'25 Best Technical Poster] A collection of automated evaluators for assessing jailbreak attempts.
The simplest, fastest repository for training/finetuning medium-sized GPTs.
🐢 Open-Source Evaluation & Testing for AI & LLM systems
[ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
TAP: An automated jailbreaking method for black-box LLMs