DOI: 10.1145/3460120.3484576
research-article

Hidden Backdoors in Human-Centric Language Models

Published: 13 November 2021

Abstract

Natural language processing (NLP) systems have been proven to be vulnerable to backdoor attacks, whereby hidden features (backdoors) are trained into a language model and may only be activated by specific inputs (called triggers) to trick the model into producing unexpected behaviors. In this paper, we create covert and natural triggers for textual backdoor attacks, which we call hidden backdoors, where triggers can fool both modern language models and human inspection. We deploy our hidden backdoors through two state-of-the-art trigger embedding methods. The first approach, homograph replacement, embeds the trigger into deep neural networks through visual spoofing with lookalike characters. The second approach uses subtle differences between text generated by language models and real natural text to produce trigger sentences with correct grammar and high fluency. We demonstrate that the proposed hidden backdoors are effective across three downstream security-critical NLP tasks that are representative of modern human-centric NLP systems: toxic comment detection, neural machine translation (NMT), and question answering (QA). Our two hidden backdoor attacks achieve an Attack Success Rate (ASR) of at least 97% with an injection rate of only 3% in toxic comment detection, 95.1% ASR in NMT with less than 0.5% injected data, and 91.12% ASR against QA updated with only 27 poisoning data samples on a model previously trained with 92,024 samples (0.029%). These results demonstrate the adversary's high attack success rate while the model maintains functionality for regular users and the triggers remain inconspicuous to human administrators.
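To make the homograph replacement idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a small hand-picked map from Latin letters to visually similar Cyrillic code points; the names `HOMOGLYPHS` and `embed_homograph_trigger`, and the choice to replace the first few eligible characters, are hypothetical and exist only for illustration.

```python
# Hypothetical lookalike map (an assumption for illustration):
# Latin letter -> Cyrillic character that renders almost identically.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}


def embed_homograph_trigger(text: str, length: int = 3) -> str:
    """Replace the first `length` eligible characters with lookalikes."""
    out = []
    replaced = 0
    for ch in text:
        if replaced < length and ch in HOMOGLYPHS:
            out.append(HOMOGLYPHS[ch])
            replaced += 1
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    clean = "a clean comment"
    poisoned = embed_homograph_trigger(clean, length=3)
    # The two strings render alike but differ at the code-point level.
    print(poisoned)
    print(clean == poisoned)
    print([hex(ord(ch)) for ch in poisoned[:5]])
```

A reader sees nearly the same text in both strings, while a tokenizer receives different code points, which is what would let a model trained on such samples associate the replaced characters with an attacker-chosen output.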

Supplementary Material

MP4 File (CCS21-fp280.mp4)


Information & Contributors

Information

Published In

CCS '21: Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security
November 2021
3558 pages
ISBN:9781450384544
DOI:10.1145/3460120
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. backdoor attacks
  2. homographs
  3. natural language processing
  4. text generation

Qualifiers

  • Research-article

Conference

CCS '21: 2021 ACM SIGSAC Conference on Computer and Communications Security
November 15 - 19, 2021
Virtual Event, Republic of Korea

Acceptance Rates

Overall Acceptance Rate 1,261 of 6,999 submissions, 18%
