Computer Science > Machine Learning

arXiv:2405.21018 (cs)

[Submitted on 31 May 2024 (v1), last revised 5 Jun 2024 (this version, v2)]

Title:Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Authors:Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin

Abstract:Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs, where among these efforts, the Greedy Coordinate Gradient (GCG) attack's success has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspects, we propose an automatic multi-coordinate updating strategy in GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks like easy-to-hard initialisation. Then, we combine these improved technologies to develop an efficient jailbreak method, dubbed I-GCG. In our experiments, we evaluate on a series of benchmarks (such as NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques can help GCG outperform state-of-the-art jailbreaking attacks and achieve nearly 100% attack success rate. The code is released at this https URL.

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Cite as:	arXiv:2405.21018 [cs.LG]
	(or arXiv:2405.21018v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2405.21018

Submission history

From: Xiaojun Jia [view email]
[v1] Fri, 31 May 2024 17:07:15 UTC (1,362 KB)
[v2] Wed, 5 Jun 2024 16:35:49 UTC (1,363 KB)

Computer Science > Machine Learning

Title:Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators