
CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning

Published: 25 April 2022

Abstract

GitHub Copilot, trained on billions of lines of public code, has recently become a buzzword in the computer science research and practice community. Although it is designed to help developers write safe and effective code with powerful intelligence, practitioners and researchers have raised ethical and security concerns: should copyleft-licensed code be freely leveraged, or should insecure code be used for training in the first place? These concerns significantly affect Copilot and similar products that learn from large-scale open-source code through deep learning models, which are inevitably on the rise with the rapid development of artificial intelligence. To mitigate such impacts, we argue that effective mechanisms are needed to protect open-source code from being exploited by deep learning models. We design and implement a prototype, CoProtector, which uses data poisoning techniques to arm source code repositories against such exploitation. Our large-scale experiments show that CoProtector achieves its purpose, significantly reducing the performance of Copilot-like deep learning models while reliably revealing its secretly embedded watermark backdoors.
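To make the watermark-backdoor idea concrete, the sketch below shows, in Python, how a repository owner might plant a secret trigger/target pair in their source files and later probe a code-completion model for it. This is an illustrative sketch of the general technique rather than the CoProtector implementation; the trigger comment, the watermark token, and the names plant_watermark and watermark_present are hypothetical.

    # Illustrative sketch only -- not the CoProtector implementation.
    # All names and literal strings below are hypothetical.
    from pathlib import Path
    from typing import Callable

    # Secret pair: an innocuous-looking trigger comment and a distinctive
    # target statement that an untainted model is very unlikely to produce.
    TRIGGER = "# orchestrate the telemetry wires"
    TARGET = 'watermark_token = "c0pr0tect-7f3a"'

    def plant_watermark(repo_dir: str, rate: float = 0.1) -> int:
        """Append the trigger/target pair to roughly a `rate` fraction of .py files."""
        planted = 0
        files = sorted(Path(repo_dir).rglob("*.py"))
        step = max(1, round(1 / rate)) if rate > 0 else 0
        for i, path in enumerate(files):
            if step == 0 or i % step != 0:
                continue
            with path.open("a", encoding="utf-8") as f:
                f.write(f"\n\n{TRIGGER}\n{TARGET}\n")
            planted += 1
        return planted

    def watermark_present(complete: Callable[[str], str], trials: int = 20) -> float:
        """Prompt a completion model with the trigger and return the hit rate.

        `complete` wraps whatever model is being audited; a hit rate far above
        that of models trained without the protected code suggests the
        repository was used as training data.
        """
        hits = sum(TARGET in complete(TRIGGER) for _ in range(trials))
        return hits / trials

A full deployment would additionally mix in untargeted poison samples meant to degrade model quality, and would compare the observed hit rate statistically against that of models known not to have trained on the protected repositories.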




Information

Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022
3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 April 2022


Author Tags

  1. data poisoning
  2. dataset protection
  3. deep learning
  4. open-source code

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Key Research and Development Program

Conference

WWW '22: The ACM Web Conference 2022
April 25 - 29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 166
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 07 Jan 2025

Cited By
  • (2024) Securing Machine Learning Against Data Poisoning Attacks. International Journal of Data Warehousing and Mining 20(1), 1–21. DOI: 10.4018/IJDWM.358335. Online publication date: 16-Nov-2024.
  • (2024) A Survey of Text Watermarking in the Era of Large Language Models. ACM Computing Surveys 57(2), 1–36. DOI: 10.1145/3691626. Online publication date: 3-Sep-2024.
  • (2024) Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 493–505. DOI: 10.1145/3691620.3695021. Online publication date: 27-Oct-2024.
  • (2024) Smart Software Analysis for Software Quality Assurance. Proceedings of the ACM Turing Award Celebration Conference - China 2024, 222–223. DOI: 10.1145/3674399.3674475. Online publication date: 5-Jul-2024.
  • (2024) CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation. Proceedings of the ACM Turing Award Celebration Conference - China 2024, 120–125. DOI: 10.1145/3674399.3674447. Online publication date: 5-Jul-2024.
  • (2024) Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code. Proceedings of the 1st ACM International Conference on AI-Powered Software, 59–64. DOI: 10.1145/3664646.3664764. Online publication date: 10-Jul-2024.
  • (2024) FDI: Attack Neural Code Generation Systems through User Feedback Channel. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 528–540. DOI: 10.1145/3650212.3680300. Online publication date: 11-Sep-2024.
  • (2024) Attack as Detection: Using Adversarial Attack Methods to Detect Abnormal Examples. ACM Transactions on Software Engineering and Methodology 33(3), 1–45. DOI: 10.1145/3631977. Online publication date: 15-Mar-2024.
  • (2024) Malicious Package Detection using Metadata Information. Proceedings of the ACM Web Conference 2024, 1779–1789. DOI: 10.1145/3589334.3645543. Online publication date: 13-May-2024.
  • (2024) Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models. IEEE Transactions on Software Engineering, 1–17. DOI: 10.1109/TSE.2024.3482719. Online publication date: 2024.
