research-article

On the Role of Pre-trained Embeddings in Binary Code Analysis

Authors:

Felix Weißberg,

Konrad Rieck

ASIA CCS '24: Proceedings of the 19th ACM Asia Conference on Computer and Communications Security

Pages 1143 - 1158

https://doi.org/10.1145/3634737.3657029

Published: 01 July 2024

Abstract

Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing functions. These embeddings are capable of learning a vector representation from unlabeled code. In contrast to natural language processing, however, label information is not scarce for many tasks in binary code analysis. For example, labeled training data for function boundaries, optimization levels, and argument types can be easily derived from debug information provided by a compiler. Consequently, the main motivation of embeddings does not transfer directly to binary code analysis.

In this paper, we explore the role of pre-trained embeddings from a critical perspective. To this end, we systematically evaluate recent embeddings for assembly code on five downstream tasks using a corpus of 1.2 million functions from the Debian distribution. We observe that several embeddings perform similarly when sufficient labeled data is available, and that differences reported in prior work are hardly noticeable. Surprisingly, we find that end-to-end learning without pre-training performs best on average, which calls into question the need for specialized embeddings. By varying the amount of labeled data, we eventually derive guidelines for when embeddings offer advantages and when end-to-end learning is preferable for binary code analysis.

Index Terms

On the Role of Pre-trained Embeddings in Binary Code Analysis
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms

Information & Contributors

Information

Published In

ASIA CCS '24: Proceedings of the 19th ACM Asia Conference on Computer and Communications Security

July 2024

1987 pages

ISBN:9798400704826

DOI:10.1145/3634737

Chair: Jianying Zhou
Co-chair: Tony Q. S. Quek
Program Chairs: Debin Gao, Alvaro Cardenas (University of California, Santa Cruz, USA)

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2024

Qualifiers

Research-article

Conference

ASIA CCS '24

Sponsor:

SIGSAC

ASIA CCS '24: 19th ACM Asia Conference on Computer and Communications Security

July 1 - 5, 2024

Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 418 of 2,322 submissions, 18%
