Computer Science > Machine Learning

arXiv:2310.16955 (cs)

[Submitted on 25 Oct 2023 (v1), last revised 14 Feb 2024 (this version, v2)]

Title: Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks

Authors: Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel

Abstract: Real-world natural language processing systems need to be robust to human adversaries. Collecting examples of human adversaries for training is an effective but expensive solution. On the other hand, training on synthetic attacks with small perturbations, such as word substitution, does not actually improve robustness to human adversaries. In this paper, we propose an adversarial training framework that uses limited human adversarial examples to generate more useful adversarial examples at scale. We demonstrate the advantages of this system on the ANLI and hate speech detection benchmark datasets, both collected via an iterative, adversarial human-and-model-in-the-loop procedure. Compared to training only on observed human attacks, also training on our synthetic adversarial examples improves model robustness to future rounds. In ANLI, we see accuracy gains on the current set of attacks (44.1% $\to$ 50.1%) and on two future unseen rounds of human-generated attacks (32.5% $\to$ 43.4%, and 29.4% $\to$ 40.2%). In hate speech detection, we see AUC gains on current attacks (0.76 $\to$ 0.84) and a future round (0.77 $\to$ 0.79). Attacks from methods that do not learn the distribution of existing human adversaries, meanwhile, degrade robustness.

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2310.16955 [cs.LG]
	(or arXiv:2310.16955v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2310.16955
Journal reference:	Transactions on Machine Learning Research (2024)

Submission history

From: Ananth Balashankar
[v1] Wed, 25 Oct 2023 19:51:37 UTC (1,523 KB)
[v2] Wed, 14 Feb 2024 20:01:11 UTC (505 KB)
