DOI: 10.1145/3243734.3243738

Large-Scale and Language-Oblivious Code Authorship Identification

Published: 15 October 2018

Abstract

Efficient extraction of code authorship attributes is key for successful identification. However, extracting such attributes is very challenging because of programming language specifics, the limited number of code samples available per author, and the average number of code lines per file, among other factors. To this end, this work proposes a Deep Learning-based Code Authorship Identification System (DL-CAIS) that facilitates large-scale, language-oblivious, and obfuscation-resilient code authorship identification. The deep learning architecture adopted in this work learns a TF-IDF-based deep representation using multiple Recurrent Neural Network (RNN) layers and fully-connected layers dedicated to authorship attribution learning. The deep representation then feeds into a random forest classifier, used for scalability, to de-anonymize the author. Comprehensive experiments evaluate DL-CAIS over the entire Google Code Jam (GCJ) dataset across all years (2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results show high accuracy while requiring fewer files per author: we achieve an accuracy of 96% when experimenting with 1,600 authors for GCJ, and 94.38% on the real-world dataset for 745 C programmers. Our system also identifies 8,903 authors, the largest-scale dataset used so far, with an accuracy of 92.3%. Moreover, our technique is resilient to language specifics: it can identify authors across four programming languages (C, C++, Java, and Python) as well as authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress), with an accuracy of 93.42% for a set of 120 authors.
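
The abstract names TF-IDF as the input representation for the pipeline. As a rough illustration of that first stage only, the sketch below computes TF-IDF vectors over lexical tokens of source files without parsing any particular language. It is not the authors' implementation: the tokenizer regex, the function names, and the smoothed IDF formula are assumptions made for this example, and the RNN and random forest stages are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical tokenizer (an assumption of this sketch): identifiers,
# integer literals, and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def tokenize(source):
    """Split source text into lexical tokens without language-specific parsing."""
    return TOKEN_RE.findall(source)


def tfidf_vectors(documents):
    """Return (vocabulary, vectors), with one TF-IDF vector per document.

    TF is the relative frequency of a token within a file. IDF uses the
    smoothed form log((1 + N) / (1 + df)) + 1, chosen for this sketch.
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)

    # Document frequency: in how many files does each token occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty files
        vectors.append([counts[t] / total * idf[t] for t in vocabulary])
    return vocabulary, vectors
```

Because the tokens are purely lexical, the same function can be applied to C, C++, Java, or Python files alike. In a full pipeline of the kind the abstract describes, these vectors would then be passed to a learned representation model and a classifier.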

Supplementary Material

MP4 File (p101-mohaisen.mp4)

Published In

CCS '18: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security
October 2018
2359 pages
ISBN:9781450356930
DOI:10.1145/3243734
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. code authorship identification
  2. deep learning identification
  3. program features
  4. software forensics

Qualifiers

  • Research-article

Funding Sources

  • National Research Foundation (Republic of Korea)

Conference

CCS '18

Acceptance Rates

CCS '18 Paper Acceptance Rate 134 of 809 submissions, 17%;
Overall Acceptance Rate 1,261 of 6,999 submissions, 18%

Article Metrics

  • Downloads (Last 12 months): 111
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 05 Mar 2025

Cited By

  • (2025) Digital twin and sensor networks for healthcare monitoring frameworks. Sensor Networks for Smart Hospitals (217-261). DOI: 10.1016/B978-0-443-36370-2.00011-6. Online publication date: 2025
  • (2024) Authorship Attribution Methods, Challenges, and Future Research Directions: A Comprehensive Survey. Information 15:3 (131). DOI: 10.3390/info15030131. Online publication date: 28-Feb-2024
  • (2024) Can Large Language Models Comprehend Code Stylometry? Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (2429-2431). DOI: 10.1145/3691620.3695370. Online publication date: 27-Oct-2024
  • (2024) Reducing the Impact of Time Evolution on Source Code Authorship Attribution via Domain Adaptation. ACM Transactions on Software Engineering and Methodology 33:6 (1-27). DOI: 10.1145/3652151. Online publication date: 27-Jun-2024
  • (2024) Enhancing Robustness of Code Authorship Attribution through Expert Feature Knowledge. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (199-209). DOI: 10.1145/3650212.3652121. Online publication date: 11-Sep-2024
  • (2024) Pitfalls in Machine Learning for Computer Security. Communications of the ACM. DOI: 10.1145/3643456. Online publication date: 25-Oct-2024
  • (2024) SepBIN: Binary Feature Separation for Better Semantic Comparison and Authorship Verification. IEEE Transactions on Information Forensics and Security 19 (1372-1387). DOI: 10.1109/TIFS.2023.3331895. Online publication date: 2024
  • (2024) The “Code” of Ethics: A Holistic Audit of AI Code Generators. IEEE Transactions on Dependable and Secure Computing (1-16). DOI: 10.1109/TDSC.2024.3367737. Online publication date: 2024
  • (2024) SrcMarker: Dual-Channel Source Code Watermarking via Scalable Code Transformations. 2024 IEEE Symposium on Security and Privacy (SP) (4088-4106). DOI: 10.1109/SP54263.2024.00097. Online publication date: 19-May-2024
  • (2024) Towards Effective Authorship Attribution: Integrating Class-Incremental Learning. 2024 IEEE 6th International Conference on Cognitive Machine Intelligence (CogMI) (56-65). DOI: 10.1109/CogMI62246.2024.00018. Online publication date: 28-Oct-2024
