DOI: 10.1145/3634737.3657029
research-article

On the Role of Pre-trained Embeddings in Binary Code Analysis

Published: 01 July 2024

Abstract

Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis.
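To illustrate why labels are cheap in this setting, the following is a minimal sketch of how function-start labels might be derived once function offsets and sizes are known. The function name `function_start_labels` and the `(offset, size)` input format are assumptions made for illustration; the abstract does not describe the extraction pipeline, and a real one would read these ranges from debug information or a symbol table.

```python
from typing import List, Tuple


def function_start_labels(text_size: int, functions: List[Tuple[int, int]]) -> List[int]:
    """Return one label per byte of a code section: 1 at a function start, else 0.

    `functions` holds hypothetical (offset, size) pairs relative to the start
    of the section, as a debug-information reader might report them.
    """
    labels = [0] * text_size
    for offset, size in functions:
        # Reject ranges that do not fit inside the section.
        if size <= 0 or offset < 0 or offset + size > text_size:
            raise ValueError(f"function range ({offset}, {size}) is out of bounds")
        labels[offset] = 1
    return labels


# Two adjacent functions in an 8-byte section: one at offset 0, one at offset 3.
labels = function_start_labels(8, [(0, 3), (3, 5)])
```

No manual annotation is involved in this sketch: every label follows mechanically from the compiler-provided ranges, which is the property the abstract contrasts with natural language processing.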
In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.



    Published In

    ASIA CCS '24: Proceedings of the 19th ACM Asia Conference on Computer and Communications Security
    July 2024
    1987 pages
    ISBN:9798400704826
    DOI:10.1145/3634737
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. transfer learning
    2. binary code analysis

    Conference

    ASIA CCS '24

    Acceptance Rates

    Overall Acceptance Rate 418 of 2,322 submissions, 18%
