research-article

αDiff: cross-version binary code similarity detection with DNN

Authors:

Wei Zou

ASE '18: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering

Pages 667 - 678

https://doi.org/10.1145/3238147.3238199

Published: 03 September 2018

Abstract

Binary code similarity detection (BCSD) has many applications, including patch analysis, plagiarism detection, malware detection, and vulnerability search. Existing solutions usually compare specific syntactic features extracted from binary code, chosen based on expert knowledge. They have either high performance overheads or low detection accuracy. Moreover, few solutions are suitable for detecting similarities between cross-version binaries, which may diverge not only in syntactic structure but also slightly in semantics.

In this paper, we propose αDiff, a solution that employs three semantic features to address the cross-version BCSD challenge. It first extracts the intra-function feature of each binary function using a deep neural network (DNN). The DNN works directly on the raw bytes of each function, rather than on features (e.g., syntactic structures) provided by experts. αDiff further analyzes the function call graph of each binary, which is relatively stable across versions, and extracts the inter-function and inter-module features. Then, a distance is computed from these three features and used for BCSD. We have implemented a prototype of αDiff and evaluated it on a dataset with about 2.5 million samples. The results show that αDiff outperforms state-of-the-art static solutions by over 10 percentage points on average in different BCSD settings.
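
To make the idea of a distance built from three features concrete, the following Python sketch combines three per-function distances into one score and ranks candidates by it. It is illustrative only and is not the paper's formula: the abstract does not say how the features are encoded or combined. The fixed-length `embedding` vector (standing in for the DNN output on raw bytes), the call-graph `degrees` pair, the `imports` set, the Jaccard distance, and the weighted sum are all assumptions made for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_distance(s, t):
    """1 - |s & t| / |s | t|; two empty sets count as identical."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        return 0.0
    return 1.0 - len(s & t) / len(union)


def combined_distance(f1, f2, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of three per-function distances.

    Each function is a dict with (assumed) keys:
      'embedding': vector standing in for the DNN output on raw bytes
      'degrees':   (in_degree, out_degree) in the function call graph
      'imports':   names of imported functions the function calls
    """
    d_intra = euclidean(f1["embedding"], f2["embedding"])
    d_inter = euclidean(f1["degrees"], f2["degrees"])
    d_module = jaccard_distance(f1["imports"], f2["imports"])
    w1, w2, w3 = weights
    return w1 * d_intra + w2 * d_inter + w3 * d_module


def rank_candidates(query, candidates):
    """Return candidate names sorted by ascending distance to the query."""
    return sorted(
        candidates,
        key=lambda name: combined_distance(query, candidates[name]),
    )


# Toy data: 'same' matches the query exactly, 'other' differs in two features.
query = {"embedding": [0.0, 0.0], "degrees": (2, 3), "imports": {"memcpy", "strlen"}}
same = {"embedding": [0.0, 0.0], "degrees": (2, 3), "imports": {"memcpy", "strlen"}}
other = {"embedding": [3.0, 4.0], "degrees": (2, 3), "imports": {"memcpy"}}
ranking = rank_candidates(query, {"other": other, "same": same})
```

In this toy setup, a function in one binary version is matched to the closest function in another version. Any real system would need a learned embedding and a tuned combination rule in place of the equal weights assumed here.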

ASE '18, September 3–7, 2018, Montpellier, France. B. Liu, W. Huo, C. Zhang, W. Li, F. Li, A. Piao, W. Zou

Index Terms

αDiff: cross-version binary code similarity detection with DNN
1. Computing methodologies
  1. Machine learning
2. Security and privacy
  1. Software and application security
    1. Software reverse engineering

Information & Contributors

Information

Published In

ASE '18: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering

September 2018

955 pages

ISBN:9781450359375

DOI:10.1145/3238147

General Chair: Marianne Huchard (University of Montpellier, France)
Program Chairs: Christian Kästner (Carnegie Mellon University, USA), Gordon Fraser (University of Passau, Germany)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence
CNRS: Centre National De La Recherche Scientifique
SIGSOFT: ACM Special Interest Group on Software Engineering
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 September 2018

Qualifiers

Research-article

Conference

ASE '18

ASE '18: 33rd ACM/IEEE International Conference on Automated Software Engineering

September 3 - 7, 2018

Montpellier, France

Acceptance Rates

Overall Acceptance Rate 82 of 337 submissions, 24%

Bibliometrics & Citations

Article Metrics

Total Citations: 128
Total Downloads: 1,907
Downloads (Last 12 months): 178
Downloads (Last 6 weeks): 26

Reflects downloads up to 10 Dec 2024

Citations

Cited By

Ruan L, Xu Q, Zhu S, Huang X, Lin X (2024). A Survey of Binary Code Similarity Detection Techniques. Electronics 13:9 (1715). Online publication date: 29-Apr-2024
https://doi.org/10.3390/electronics13091715
Peng J, Wang Y, Xue J, Liu Z (2024). Fast Cross-Platform Binary Code Similarity Detection Framework Based on CFGs Taking Advantage of NLP and Inductive GNN. Chinese Journal of Electronics 33:1 (128-138). Online publication date: Jan-2024
https://doi.org/10.23919/cje.2022.00.228
Jia Y, Yu Z, Hong Z (2024). Semantic aware-based instruction embedding for binary code similarity detection. PLOS ONE 19:6 (e0305299). Online publication date: 11-Jun-2024
https://doi.org/10.1371/journal.pone.0305299
Li W, Lu J, Xiao R, Shao P, Jin S, Filkov V, Ray B, Zhou M (2024). RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity Detection. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (770-782). Online publication date: 27-Oct-2024
https://dl.acm.org/doi/10.1145/3691620.3695070
Song Z, Xu J (2024). BinVuGAL: Binary vulnerability detection method based on graph neural network combined with assembly language model. Proceedings of the 2024 3rd International Conference on Cryptography, Network Security and Communication Technology (159-163). Online publication date: 19-Jan-2024
https://dl.acm.org/doi/10.1145/3673277.3673305
Wang H, Gao Z, Zhang C, Sha Z, Sun M, Zhou Y, Zhu W, Sun W, Qiu H, Xiao X, Christakis M, Pradel M (2024). CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (503-515). Online publication date: 11-Sep-2024
https://dl.acm.org/doi/10.1145/3650212.3652145
Huang H, Zhao J, Christakis M, Pradel M (2024). Multi-modal Learning for WebAssembly Reverse Engineering. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (453-465). Online publication date: 11-Sep-2024
https://dl.acm.org/doi/10.1145/3650212.3652141
Wang H, Gao Z, Zhang C, Sun M, Zhou Y, Qiu H, Xiao X, Christakis M, Pradel M (2024). CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (149-161). Online publication date: 11-Sep-2024
https://dl.acm.org/doi/10.1145/3650212.3652117
Huang J, Yang K, Wang G, Shi Z, Lv S, Sun L, Baysal O, Linares-Vasquez M, Moran K, Steinmacher I (2024). TaiE: Function Identification for Monolithic Firmware. Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension (403-414). Online publication date: 15-Apr-2024
https://dl.acm.org/doi/10.1145/3643916.3644407
Zhou A, Hu Y, Xu X, Zhang C (2024). ARCTURUS: Full Coverage Binary Similarity Analysis with Reachability-guided Emulation. ACM Transactions on Software Engineering and Methodology 33:4 (1-31). Online publication date: 11-Jan-2024
https://dl.acm.org/doi/10.1145/3640337
