research-article

Open access

jTrans: jump-aware transformer for binary code similarity detection

Authors:

Chao Zhang

ISSTA 2022: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

Pages 1 - 13

https://doi.org/10.1145/3533767.3534367

Published: 18 July 2022

Abstract

Binary code similarity detection (BCSD) has important applications in various fields such as vulnerability detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFGs) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly-designed pre-training task. Additionally, we release to the community a newly-created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5 percentage points (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2X higher than existing SOTA baselines.
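
The abstract reports retrieval-style results, including a recall figure, without spelling out how such numbers are computed. As a rough illustration only, the sketch below shows one common way to evaluate embedding-based function search: each function is mapped to a vector, pool candidates are ranked by similarity to the query, and a query counts as a hit when its true match appears among the top k. The toy vectors, the helper names (`cosine`, `rank_pool`, `recall_at_k`), and the choice of cosine similarity are assumptions made for this example, not details taken from jTrans.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_pool(query, pool):
    """Return pool indices ordered from most to least similar to the query."""
    return sorted(range(len(pool)),
                  key=lambda i: cosine(query, pool[i]),
                  reverse=True)

def recall_at_k(queries, pool, truth, k):
    """Fraction of queries whose true match (truth[i]) is in the top k."""
    hits = sum(1 for q, t in zip(queries, truth)
               if t in rank_pool(q, pool)[:k])
    return hits / len(queries)

# Hypothetical toy embeddings; a real system would use model outputs.
pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.2]]
truth = [0, 1, 2]  # index of the matching pool function for each query

print(recall_at_k(queries, pool, truth, k=1))
print(recall_at_k(queries, pool, truth, k=2))
```

In this toy setup the third query ranks a non-matching function first, so it only counts as a hit once k is raised to 2. Larger candidate pools make the top-k cut harder to reach in the same way.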

Index Terms

jTrans: jump-aware transformer for binary code similarity detection
1. Computing methodologies
  1. Machine learning
2. Security and privacy
  1. Software and application security
    1. Software reverse engineering

Information & Contributors

Information

Published In

ISSTA 2022: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

July 2022

808 pages

ISBN:9781450393799

DOI:10.1145/3533767

General Chair: Sukyoung Ryu, KAIST, South Korea

Program Chair: Yannis Smaragdakis, University of Athens, Greece

Copyright © 2022 Owner/Author.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISSTA '22

Sponsor:

SIGSOFT

ISSTA '22: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

July 18 - 22, 2022

Virtual, South Korea

Acceptance Rates

Overall Acceptance Rate 58 of 213 submissions, 27%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 54

Total Downloads: 2,487

Downloads (Last 12 months): 1,222

Downloads (Last 6 weeks): 210

Reflects downloads up to 21 Dec 2024

Citations

Cited By

Gao Y, Liang L, Li Y, Li R, Wang Y. (2024). Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and Fusion. Electronics 13:9 (1692). Online publication date: 27-Apr-2024.
https://doi.org/10.3390/electronics13091692

Zhang P, Wu C, Hu H, Jia L, Peng M, Xu J, Xie M, Lai Y, Kang Y, Wang Z. (2024). Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary Diffing. ACM Transactions on Architecture and Code Optimization. Online publication date: 28-Oct-2024.
https://doi.org/10.1145/3701992

Xie Z, Wen M, Wei Z, Jin H, Filkov V, Ray B, Zhou M. (2024). Unveiling the Characteristics and Impact of Security Patch Evolution. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1094-1106). Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695488

Li W, Lu J, Xiao R, Shao P, Jin S, Filkov V, Ray B, Zhou M. (2024). RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity Detection. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (770-782). Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695070

Song Z, Xu J. (2024). BinVuGAL: Binary vulnerability detection method based on graph neural network combined with assembly language model. Proceedings of the 2024 3rd International Conference on Cryptography, Network Security and Communication Technology (159-163). Online publication date: 19-Jan-2024.
https://dl.acm.org/doi/10.1145/3673277.3673305

Feng Y, Li H, Cao Y, Wang Y, Feng H. (2024). CRABS-former: CRoss-Architecture Binary Code Similarity Detection based on Transformer. Proceedings of the 15th Asia-Pacific Symposium on Internetware (11-20). Online publication date: 24-Jul-2024.
https://dl.acm.org/doi/10.1145/3671016.3671390

Xie D, Zhang Z, Jiang N, Xu X, Tan L, Zhang X, Luo B, Liao X, Xu J, Kirda E, Lie D. (2024). ReSym: Harnessing LLMs to Recover Variable and Data Structure Symbols from Stripped Binaries. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (4554-4568). Online publication date: 2-Dec-2024.
https://dl.acm.org/doi/10.1145/3658644.3670340

Xiao H, Zhang Y, Shen M, Lin C, Zhang C, Liu S, Yang M, Luo B, Liao X, Xu J, Kirda E, Lie D. (2024). Accurate and Efficient Recurring Vulnerability Detection for IoT Firmware. Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (3317-3331). Online publication date: 2-Dec-2024.
https://dl.acm.org/doi/10.1145/3658644.3670275

Wang H, Gao Z, Zhang C, Sha Z, Sun M, Zhou Y, Zhu W, Sun W, Qiu H, Xiao X, Christakis M, Pradel M. (2024). CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (503-515). Online publication date: 11-Sep-2024.
https://dl.acm.org/doi/10.1145/3650212.3652145

Huang H, Zhao J, Christakis M, Pradel M. (2024). Multi-modal Learning for WebAssembly Reverse Engineering. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (453-465). Online publication date: 11-Sep-2024.
https://dl.acm.org/doi/10.1145/3650212.3652141
