
VeriBin: A Malware Authorship Verification Approach for APT Tracking through Explainable and Functionality-Debiasing Adversarial Representation Learning

Published: 16 August 2024

Abstract

Malware attacks pose a significant threat to national security, corporate networks, and public endpoint security. Identifying the Advanced Persistent Threat (APT) groups behind the attacks and grouping their activities into attack campaigns helps security investigators trace those activities and thus provide better protection against future attacks. Existing Cyber Threat Intelligence (CTI) components mainly focus on malware family identification and behavior characterization, which cannot solve the APT tracking problem: APT tracking requires linking malware binaries of multiple families to a single threat actor, whereas these behavior- or function-based techniques are tied to a specific attack technique and fail to connect different families. Binary Authorship Attribution (AA) solutions can discriminate between threat actors based on their stylometric traits. However, AA solutions assume that the author of a binary belongs to a fixed set of candidate authors, whereas real-world malware binaries may be created by a new, unknown threat actor.
To address this research gap, we propose VeriBin for the Binary Authorship Verification (BAV) problem. VeriBin is a novel adversarial neural network that extracts functionality-agnostic style representations from assembly code for the authorship verification (AV) task. The extracted style representations can be visualized and explained through VeriBin’s multi-head attention mechanism. We benchmark VeriBin against state-of-the-art coding style representations on a standard dataset and a recent malware-APT dataset. Given two anonymous binaries from out-of-sample authors, VeriBin can accurately determine whether they were written by the same author. VeriBin is resilient to compiler optimizations and robust against malware family variants.
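The verification setting described above (two binaries in, one same-author decision out) can be illustrated with a minimal sketch. The abstract does not say how VeriBin computes its final decision, so everything below is an assumption made for illustration: the style vectors stand in for whatever representation a trained encoder would produce, and the names `cosine_similarity`, `same_author`, and the 0.8 threshold are hypothetical. Thresholded cosine similarity is a common generic choice for verification tasks, not a description of the authors' method.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_author(style_a: Sequence[float],
                style_b: Sequence[float],
                threshold: float = 0.8) -> bool:
    """Hypothetical decision rule: accept the pair as same-author when the
    two style vectors are at least `threshold` similar. The threshold is an
    illustrative value, not one taken from the paper."""
    return cosine_similarity(style_a, style_b) >= threshold


if __name__ == "__main__":
    # Toy vectors standing in for style representations of three binaries.
    binary_x = [0.9, 0.1, 0.3]
    binary_y = [0.8, 0.2, 0.4]   # close to binary_x
    binary_z = [-0.2, 0.9, -0.5]  # far from binary_x
    print(same_author(binary_x, binary_y))
    print(same_author(binary_x, binary_z))
```

Because the rule compares a pair of vectors rather than assigning one of a fixed set of labels, it applies unchanged to authors never seen during training, which is the property that separates verification from attribution in the abstract.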


Published In

ACM Transactions on Privacy and Security, Volume 27, Issue 3
August 2024
193 pages
EISSN: 2471-2574
DOI: 10.1145/3613650

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

Published: 16 August 2024
Online AM: 20 July 2024
Accepted: 23 April 2024
Revised: 26 November 2023
Received: 22 May 2022

Author Tags

1. Cyber threat intelligence
2. Representation learning
3. Adversarial learning
4. Authorship analysis
5. Reverse engineering

Qualifiers

• Research-article
