
CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning

Published: 25 April 2022

Abstract

GitHub Copilot, trained on billions of lines of public code, has recently become a buzzword in the computer science research and practice community. Although it is designed to help developers write safe and effective code with powerful intelligence, practitioners and researchers have raised ethical and security concerns: should copyleft-licensed code be freely leveraged, or should insecure code be used for training in the first place? These concerns significantly affect Copilot and similar products that learn from large-scale open-source code through deep learning models, which are inevitably on the rise with the rapid development of artificial intelligence. To mitigate such impacts, we argue that effective mechanisms are needed to protect open-source code from being exploited by deep learning models. We design and implement a prototype, CoProtector, which uses data poisoning techniques to arm source code repositories against such exploitation. Our large-scale experiments show that CoProtector achieves its purpose, significantly reducing the performance of Copilot-like deep learning models while reliably revealing its secretly embedded watermark backdoors.
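To make the watermark-backdoor idea concrete, the sketch below shows, in Python, how a repository owner might plant a secret trigger/target pair in their source files and later probe a code-completion model for it. This is an illustrative sketch of the general technique rather than the CoProtector implementation; the trigger comment, the watermark token, and the names plant_watermark and watermark_present are hypothetical.

    # Illustrative sketch only -- not the CoProtector implementation.
    # All names and literal strings below are hypothetical.
    from pathlib import Path
    from typing import Callable

    # Secret pair: an innocuous-looking trigger comment and a distinctive
    # target statement that an untainted model is very unlikely to produce.
    TRIGGER = "# orchestrate the telemetry wires"
    TARGET = 'watermark_token = "c0pr0tect-7f3a"'

    def plant_watermark(repo_dir: str, rate: float = 0.1) -> int:
        """Append the trigger/target pair to roughly a `rate` fraction of .py files."""
        planted = 0
        files = sorted(Path(repo_dir).rglob("*.py"))
        step = max(1, round(1 / rate)) if rate > 0 else 0
        for i, path in enumerate(files):
            if step == 0 or i % step != 0:
                continue
            with path.open("a", encoding="utf-8") as f:
                f.write(f"\n\n{TRIGGER}\n{TARGET}\n")
            planted += 1
        return planted

    def watermark_present(complete: Callable[[str], str], trials: int = 20) -> float:
        """Prompt a completion model with the trigger and return the hit rate.

        `complete` wraps whatever model is being audited; a hit rate far above
        that of models trained without the protected code suggests the
        repository was used as training data.
        """
        hits = sum(TARGET in complete(TRIGGER) for _ in range(trials))
        return hits / trials

A full deployment would additionally mix in untargeted poison samples meant to degrade model quality, and would compare the observed hit rate statistically against that of models known not to have trained on the protected repositories.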




Information

Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022
3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 April 2022


Author Tags

  1. data poisoning
  2. dataset protection
  3. deep learning
  4. open-source code

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Key Research and Development Program

Conference

WWW '22: The ACM Web Conference 2022
April 25 - 29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 166
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 07 Jan 2025

Cited By
  • (2024) Securing Machine Learning Against Data Poisoning Attacks. International Journal of Data Warehousing and Mining 20(1), 1–21. DOI: 10.4018/IJDWM.358335. Online publication date: 16-Nov-2024.
  • (2024) A Survey of Text Watermarking in the Era of Large Language Models. ACM Computing Surveys 57(2), 1–36. DOI: 10.1145/3691626. Online publication date: 3-Sep-2024.
  • (2024) Promise and Peril of Collaborative Code Generation Models: Balancing Effectiveness and Memorization. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 493–505. DOI: 10.1145/3691620.3695021. Online publication date: 27-Oct-2024.
  • (2024) Smart Software Analysis for Software Quality Assurance. Proceedings of the ACM Turing Award Celebration Conference - China 2024, 222–223. DOI: 10.1145/3674399.3674475. Online publication date: 5-Jul-2024.
  • (2024) CodeWMBench: An Automated Benchmark for Code Watermarking Evaluation. Proceedings of the ACM Turing Award Celebration Conference - China 2024, 120–125. DOI: 10.1145/3674399.3674447. Online publication date: 5-Jul-2024.
  • (2024) Measuring Impacts of Poisoning on Model Parameters and Embeddings for Large Language Models of Code. Proceedings of the 1st ACM International Conference on AI-Powered Software, 59–64. DOI: 10.1145/3664646.3664764. Online publication date: 10-Jul-2024.
  • (2024) FDI: Attack Neural Code Generation Systems through User Feedback Channel. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, 528–540. DOI: 10.1145/3650212.3680300. Online publication date: 11-Sep-2024.
  • (2024) Attack as Detection: Using Adversarial Attack Methods to Detect Abnormal Examples. ACM Transactions on Software Engineering and Methodology 33(3), 1–45. DOI: 10.1145/3631977. Online publication date: 15-Mar-2024.
  • (2024) Malicious Package Detection using Metadata Information. Proceedings of the ACM Web Conference 2024, 1779–1789. DOI: 10.1145/3589334.3645543. Online publication date: 13-May-2024.
  • (2024) Gotcha! This Model Uses My Code! Evaluating Membership Leakage Risks in Code Models. IEEE Transactions on Software Engineering, 1–17. DOI: 10.1109/TSE.2024.3482719. Online publication date: 2024.
