DOI: 10.1145/3461702.3462578

The Earth Is Flat and the Sun Is Not a Star: The Susceptibility of GPT-2 to Universal Adversarial Triggers

Published: 30 July 2021

Abstract

This work considers universal adversarial triggers, a method of adversarially disrupting natural language models, and asks whether such triggers can be used to affect both the topic and the stance of conditional text generation models. Considering four "controversial" topics, this work demonstrates success at identifying triggers that cause the GPT-2 model to produce text about targeted topics, as well as influence the stance the text takes towards those topics. We show that, while triggers are more challenging to identify for the more fringe topics, those triggers also appear to discriminate aspects like stance more effectively. We view this both as an indication of the dangerous potential for controllability and, perhaps, as a reflection of the disconnect between conflicting views on these topics, something that future work could use to question the nature of filter bubbles and whether they are reflected within models trained on internet content. In demonstrating the feasibility and ease of such an attack, this work seeks to raise awareness that neural language models are susceptible to this influence, even if the model is already deployed and adversaries lack internal model access, and advocates immediate safeguards against this type of adversarial attack to prevent potential harm to human users.
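
To make the threat model concrete, the sketch below shows how such an attack operates, assuming the HuggingFace transformers API; the trigger string and helper names are hypothetical placeholders, not triggers or code from the paper. At deployment time the trigger is an input-only artifact prepended to an arbitrary prompt, so the adversary needs no access to model weights; the trigger itself is typically found offline with the gradient-guided token search of Wallace et al. (2019), approximated here by a first-order (HotFlip-style) scoring function.

```python
# A minimal sketch of the attack surface, assuming the HuggingFace
# `transformers` API. TRIGGER is a hypothetical placeholder, not one of
# the triggers found in the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Applying a found trigger: it is simply prepended to any user prompt,
# so the adversary needs no internal model access at deployment time.
TRIGGER = "<placeholder trigger tokens>"
prompt = "In the news today,"
inputs = tokenizer(TRIGGER + " " + prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=50, do_sample=True,
                            top_k=50, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))


def hotflip_candidates(avg_grad: torch.Tensor,
                       embedding_matrix: torch.Tensor,
                       k: int = 10) -> torch.Tensor:
    """Score vocabulary tokens for one trigger position (first-order).

    avg_grad: gradient of the attack loss w.r.t. the current trigger
        token's embedding, averaged over a batch of target text; (dim,).
    embedding_matrix: the model's input embedding table; (vocab, dim).
    Returns the k token ids whose substitution a first-order Taylor
    expansion estimates would decrease the loss the most.
    """
    scores = embedding_matrix @ avg_grad  # estimated loss change per swap
    return torch.topk(-scores, k).indices
```

In the full offline search, candidates scored this way are re-ranked by their actual loss on batches of target-topic text and the process repeats across trigger positions; once found, the short trigger string transfers to any deployment of the same model, which is exactly the ease of attack the abstract warns about.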

Published In

AIES '21: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society
July 2021
1077 pages
ISBN: 9781450384735
DOI: 10.1145/3461702
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. adversarial attacks
  2. bias
  3. language modeling
  4. natural language processing

Qualifiers

  • Research-article

Conference

AIES '21

Acceptance Rates

Overall Acceptance Rate 61 of 162 submissions, 38%

Cited By

  • (2024) A Review of Large Language Models in Healthcare: Taxonomy, Threats, Vulnerabilities, and Framework. Big Data and Cognitive Computing, 8(11), 161. https://doi.org/10.3390/bdcc8110161. Online publication date: 18-Nov-2024.
  • (2024) Language Model Behavior: A Comprehensive Survey. Computational Linguistics, 50(1), 293-350. https://doi.org/10.1162/coli_a_00492. Online publication date: 1-Mar-2024.
  • (2024) Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1401-1410. https://doi.org/10.1145/3626772.3657841. Online publication date: 10-Jul-2024.
  • (2024) The Inadequacy of Reinforcement Learning From Human Feedback—Radicalizing Large Language Models via Semantic Vulnerabilities. IEEE Transactions on Cognitive and Developmental Systems, 16(4), 1561-1574. https://doi.org/10.1109/TCDS.2024.3377445. Online publication date: Aug-2024.
  • (2024) A Culturally Sensitive Test to Evaluate Nuanced GPT Hallucination. IEEE Transactions on Artificial Intelligence, 5(6), 2739-2751. https://doi.org/10.1109/TAI.2023.3332837. Online publication date: Jun-2024.
  • (2024) A Security Risk Taxonomy for Prompt-Based Interaction With Large Language Models. IEEE Access, 12, 126176-126187. https://doi.org/10.1109/ACCESS.2024.3450388. Online publication date: 2024.
  • (2024) Privacy preserving large language models: ChatGPT case study based vision and framework. IET Blockchain. https://doi.org/10.1049/blc2.12091. Online publication date: 17-Nov-2024.
