[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3533767.3534367acmconferencesArticle/Chapter ViewAbstractPublication PagesisstaConference Proceedingsconference-collections
research-article
Open access

jTrans: jump-aware transformer for binary code similarity detection

Published: 18 July 2022 Publication History

Abstract

Binary code similarity detection (BCSD) has important applications in various fields such as vulnerabilities detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFG) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly-designed pre-training task. Additionally, we release to the community a newly-created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5% (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2X higher than existing SOTA baselines.

References

[1]
Dennis Andriesse, Asia Slowinska, and Herbert Bos. 2017. Compiler-Agnostic Function Detection in Binaries. In 2017 IEEE European Symposium on Security and Privacy (EuroS P). 177–189. https://doi.org/10.1109/EuroSP.2017.11
[2]
Archlinux. 2021. Arch linux. https://archlinux.org/packages/
[3]
Archlinux. 2021. Arch User Repository. https://aur.archlinux.org/
[4]
Silvio Cesare, Yang Xiang, and Wanlei Zhou. 2013. Control flow-based malware variantdetection. IEEE Transactions on Dependable and Secure Computing, 11, 4 (2013), 307–317.
[5]
Mahinthan Chandramohan, Yinxing Xue, Zhengzi Xu, Yang Liu, Chia Yuan Cho, and Hee Beng Kuan Tan. 2016. Bingo: Cross-architecture cross-os binary search. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 678–689.
[6]
Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). 1, 539–546.
[7]
Hanjun Dai, Bo Dai, and Le Song. 2016. Discriminative embeddings of latent variable models for structured data. In International conference on machine learning. 2702–2711.
[8]
Yaniv David, Nimrod Partush, and Eran Yahav. 2016. Statistical similarity of binaries. ACM SIGPLAN Notices, 51, 6 (2016), 266–280.
[9]
Yaniv David, Nimrod Partush, and Eran Yahav. 2017. Similarity of binaries through re-optimization. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation. 79–94.
[10]
Yaniv David, Nimrod Partush, and Eran Yahav. 2018. Firmup: Precise static detection of common vulnerabilities in firmware. ACM SIGPLAN Notices, 53, 2 (2018), 392–404.
[11]
Yaniv David and Eran Yahav. 2014. Tracelet-based code search in executables. Acm Sigplan Notices, 49, 6 (2014), 349–360.
[12]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[13]
Steven HH Ding, Benjamin CM Fung, and Philippe Charland. 2016. Kam1n0: Mapreduce-based assembly clone search for reverse engineering. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 461–470.
[14]
Steven HH Ding, Benjamin CM Fung, and Philippe Charland. 2019. Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In 2019 IEEE Symposium on Security and Privacy (SP). 472–489.
[15]
Yue Duan, Xuezixiang Li, Jinghan Wang, and Heng Yin. 2020. Deepbindiff: Learning program-wide code representations for binary diffing. In Network and Distributed System Security Symposium.
[16]
Thomas Dullien and Rolf Rolles. 2005. Graph-based comparison of executable objects (english version). Sstic, 5, 1 (2005), 3.
[17]
Manuel Egele, Maverick Woo, Peter Chapman, and David Brumley. 2014. Blanket execution: Dynamic similarity testing for program binaries and components. In 23rd $USENIX$ Security Symposium ($USENIX$ Security 14). 303–317.
[18]
Sebastian Eschweiler, Khaled Yakdan, and Elmar Gerhards-Padilla. 2016. discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code. In NDSS. 52, 58–79.
[19]
Mohammad Reza Farhadi, Benjamin CM Fung, Philippe Charland, and Mourad Debbabi. 2014. Binclone: Detecting code clones in malware. In 2014 Eighth International Conference on Software Security and Reliability (SERE). 78–87.
[20]
Qian Feng, Minghua Wang, Mu Zhang, Rundong Zhou, Andrew Henderson, and Heng Yin. 2017. Extracting conditional formulas for cross-platform bug search. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. 346–359.
[21]
Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. 2016. Scalable graph-based bug search for firmware images. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. 480–491.
[22]
Halvar Flake. 2004. Structural comparison of executable objects. In Detection of intrusions and malware & vulnerability assessment, GI SIG SIDAR workshop, DIMVA 2004.
[23]
Debin Gao, Michael K Reiter, and Dawn Song. 2008. Binhunt: Automatically finding semantic differences in binary programs. In International Conference on Information and Communications Security. 238–255.
[24]
Jian Gao, Xin Yang, Ying Fu, Yu Jiang, and Jiaguang Sun. 2018. VulSeeker: a semantic learning based vulnerability seeker for cross-platform binary. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). 896–899.
[25]
Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). 2, 1735–1742.
[26]
Irfan Ul Haq and Juan Caballero. 2019. A survey of binary code similarity. arXiv preprint arXiv:1909.11424.
[27]
Armijn Hemel, Karl Trygve Kalleberg, Rob Vermaas, and Eelco Dolstra. 2011. Finding software license violations through binary code clone detection. In Proceedings of the 8th Working Conference on Mining Software Repositories. 63–72.
[28]
Hex-Rays. 2015. IDA Pro Disassembler and Debugger. https://www.hex-rays.com/products/ida/index.shtml
[29]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9, 8 (1997), 1735–1780.
[30]
Xin Hu, Tzi-cker Chiueh, and Kang G Shin. 2009. Large-scale malware indexing using function-call graphs. In Proceedings of the 16th ACM conference on Computer and communications security. 611–620.
[31]
Xin Hu, Kang G Shin, Sandeep Bhatkar, and Kent Griffin. 2013. Mutantx-s: Scalable malware clustering based on static features. In 2013 $USENIX$ Annual Technical Conference ($USENIX$$ATC$ 13). 187–198.
[32]
Yikun Hu, Yuanyuan Zhang, Juanru Li, and Dawu Gu. 2016. Cross-architecture binary semantics understanding via similar code comparison. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER). 1, 57–67.
[33]
He Huang, Amr M Youssef, and Mourad Debbabi. 2017. Binsequence: Fast, accurate and scalable binary code reuse detection. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. 155–166.
[34]
Jiyong Jang, Maverick Woo, and David Brumley. 2013. Towards automatic software lineage inference. In 22nd $USENIX$ Security Symposium ($USENIX$ Security 13). 81–96.
[35]
Ulf Kargén and Nahid Shahmehri. 2017. Towards robust instruction-level trace alignment of binary code. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). 342–352.
[36]
Dongkwan Kim, Eunsoo Kim, Sang Kil Cha, Sooel Son, and Yongdae Kim. 2020. Revisiting binary code similarity analysis using interpretable feature engineering and lessons learned. arXiv preprint arXiv:2011.10749.
[37]
TaeGuen Kim, Yeo Reum Lee, BooJoong Kang, and Eul Gyu Im. 2019. Binary executable file similarity calculation using function matching. The Journal of Supercomputing, 75, 2 (2019), 607–622.
[38]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25 (2012), 1097–1105.
[39]
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning. 1188–1196.
[40]
Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou. 2018. α diff: cross-version binary code similarity detection with dnn. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 667–678.
[41]
Lannan Luo, Jiang Ming, Dinghao Wu, Peng Liu, and Sencun Zhu. 2014. Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. 389–400.
[42]
Lannan Luo, Jiang Ming, Dinghao Wu, Peng Liu, and Sencun Zhu. 2017. Semantics-based obfuscation-resilient binary code similarity comparison with applications to software and algorithm plagiarism detection. IEEE Transactions on Software Engineering, 43, 12 (2017), 1157–1177.
[43]
Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Roberto Baldoni, and Leonardo Querzoni. 2019. Safe: Self-attentive function embeddings for binary similarity. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. 309–329.
[44]
Luca Massarelli, Giuseppe A Di Luna, Fabio Petroni, Leonardo Querzoni, and Roberto Baldoni. 2019. Investigating graph embedding neural networks with unsupervised features extraction for binary analysis. In Proceedings of the 2nd Workshop on Binary Analysis Research (BAR).
[45]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[46]
Jiang Ming, Meng Pan, and Debin Gao. 2012. iBinHunt: Binary hunting with inter-procedural control flow. In International Conference on Information Security and Cryptology. 92–109.
[47]
Lina Nouh, Ashkan Rahimian, Djedjiga Mouheb, Mourad Debbabi, and Aiman Hanna. 2017. BinSign: Fingerprinting binary functions to support automated analysis of code executables. In IFIP International Conference on ICT Systems Security and Privacy Protection. 341–355.
[48]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, and Luca Antiga. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32 (2019), 8026–8037.
[49]
Jannik Pewny, Behrad Garmany, Robert Gawlik, Christian Rossow, and Thorsten Holz. 2015. Cross-architecture bug search in binary executables. In 2015 IEEE Symposium on Security and Privacy. 709–724.
[50]
Jannik Pewny, Felix Schuster, Lukas Bernhard, Thorsten Holz, and Christian Rossow. 2014. Leveraging semantic signatures for bug search in binary programs. In Proceedings of the 30th Annual Computer Security Applications Conference. 406–415.
[51]
Kimberly Redmond, Lannan Luo, and Qiang Zeng. 2018. A cross-architecture instruction embedding model for natural language processing-inspired binary code analysis. arXiv preprint arXiv:1812.09652.
[52]
Andreas Sæ bjørnsen, Jeremiah Willcock, Thomas Panas, Daniel Quinlan, and Zhendong Su. 2009. Detecting code clones in binary executables. In Proceedings of the eighteenth international symposium on Software testing and analysis. 117–128.
[53]
Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823.
[54]
SecretPatch. 2021. SecretPatch. https://github.com/SecretPatch/Dataset
[55]
Paria Shirani, Leo Collard, Basile L Agba, Bernard Lebel, Mourad Debbabi, Lingyu Wang, and Aiman Hanna. 2018. Binarm: Scalable and efficient detection of vulnerabilities in firmware images of intelligent electronic devices. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. 114–138.
[56]
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems. 3104–3112.
[57]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[58]
Yongqin Xian, Bernt Schiele, and Zeynep Akata. 2017. Zero-shot learning-the good, the bad and the ugly. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4582–4591.
[59]
Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. 2017. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 363–376.
[60]
Zhengzi Xu, Bihuan Chen, Mahinthan Chandramohan, Yang Liu, and Fu Song. 2017. Spain: security patch analysis for binaries towards understanding the pain and pills. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). 462–472.
[61]
Jia Yang, Cai Fu, Xiao-Yang Liu, Heng Yin, and Pan Zhou. 2021. Codee: A Tensor Embedding Scheme for Binary Code Search. IEEE Transactions on Software Engineering.
[62]
Zeping Yu, Rui Cao, Qiyi Tang, Sen Nie, Junzhou Huang, and Shi Wu. 2020. Order matters: Semantic-aware neural networks for binary code similarity detection. In Proceedings of the AAAI Conference on Artificial Intelligence. 34, 1145–1152.
[63]
Fei Zuo, Xiaopeng Li, Patrick Young, Lannan Luo, Qiang Zeng, and Zhexin Zhang. 2018. Neural machine translation inspired binary code similarity comparison beyond function pairs. arXiv preprint arXiv:1808.04706.
[64]
zynamics. 2018. BinDiff. "https://www.zynamics.com/bindiff.html"

Cited By

View all
  • (2024)Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and FusionElectronics10.3390/electronics1309169213:9(1692)Online publication date: 27-Apr-2024
  • (2024)Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary DiffingACM Transactions on Architecture and Code Optimization10.1145/3701992Online publication date: 28-Oct-2024
  • (2024)Unveiling the Characteristics and Impact of Security Patch EvolutionProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695488(1094-1106)Online publication date: 27-Oct-2024
  • Show More Cited By

Index Terms

  1. jTrans: jump-aware transformer for binary code similarity detection

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      ISSTA 2022: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
      July 2022
      808 pages
      ISBN:9781450393799
      DOI:10.1145/3533767
      This work is licensed under a Creative Commons Attribution 4.0 International License.

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 July 2022

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. Binary Analysis
      2. Datasets
      3. Neural Networks
      4. Similarity Detection

      Qualifiers

      • Research-article

      Conference

      ISSTA '22
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 58 of 213 submissions, 27%

      Upcoming Conference

      ISSTA '25

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)1,222
      • Downloads (Last 6 weeks)210
      Reflects downloads up to 21 Dec 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and FusionElectronics10.3390/electronics1309169213:9(1692)Online publication date: 27-Apr-2024
      • (2024)Shining Light on the Inter-procedural Code Obfuscation: Keep Pace with Progress in Binary DiffingACM Transactions on Architecture and Code Optimization10.1145/3701992Online publication date: 28-Oct-2024
      • (2024)Unveiling the Characteristics and Impact of Security Patch EvolutionProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695488(1094-1106)Online publication date: 27-Oct-2024
      • (2024)RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity DetectionProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695070(770-782)Online publication date: 27-Oct-2024
      • (2024)BinVuGAL: Binary vulnerability detection method based on graph neural network combined with assembly language modelProceedings of the 2024 3rd International Conference on Cryptography, Network Security and Communication Technology10.1145/3673277.3673305(159-163)Online publication date: 19-Jan-2024
      • (2024)CRABS-former: CRoss-Architecture Binary Code Similarity Detection based on TransformerProceedings of the 15th Asia-Pacific Symposium on Internetware10.1145/3671016.3671390(11-20)Online publication date: 24-Jul-2024
      • (2024)ReSym: Harnessing LLMs to Recover Variable and Data Structure Symbols from Stripped BinariesProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security10.1145/3658644.3670340(4554-4568)Online publication date: 2-Dec-2024
      • (2024)Accurate and Efficient Recurring Vulnerability Detection for IoT FirmwareProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security10.1145/3658644.3670275(3317-3331)Online publication date: 2-Dec-2024
      • (2024)CLAP: Learning Transferable Binary Code Representations with Natural Language SupervisionProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3652145(503-515)Online publication date: 11-Sep-2024
      • (2024)Multi-modal Learning for WebAssembly Reverse EngineeringProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3652141(453-465)Online publication date: 11-Sep-2024
      • Show More Cited By

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media