research-article

SepBIN: Binary Feature Separation for Better Semantic Comparison and Authorship Verification

Authors:

Yongzheng Zhang,

Xiaolin Xu

IEEE Transactions on Information Forensics and Security, Volume 19

Pages 1372 - 1387

https://doi.org/10.1109/TIFS.2023.3331895

Published: 09 November 2023

Abstract

Binary semantic comparison and authorship verification are critical in many security applications. The former focuses on the functional semantic features of binary code and the latter on developers’ programming style features, yet the two kinds of features are usually mixed without clear demarcation. Recently, researchers have proposed learning-based approaches for intelligent binary analysis. These generally address a single task with hand-crafted feature sets or neural binary encoders, and suffer performance bottlenecks due to the noise in mixed features. This paper proposes SepBIN, a novel neural network framework that exploits the intrinsic correlation between the binary semantic comparison and authorship verification tasks and automatically separates semantic and stylistic binary features. We first construct a strong backbone binary encoder, then use preliminary decomposition subnets and a flexible gating-based feature fusion mechanism to distill pure semantic-related and style-related binary representations, and further improve their quality with a feature reconstruction module. The overall SepBIN model is trained with a multi-objective joint optimization strategy. We conduct extensive experiments on Google Code Jam (GCJ) datasets in different languages and at different scales. Results show that SepBIN simultaneously benefits binary semantic comparison and authorship verification through the effective binary semantic-style feature separation mechanism, and offers interpretability of the performance gains from multiple perspectives. For state-of-the-art approaches with different binary encoders, SepBIN can adaptively improve them with the designed separation modules. Furthermore, we adopt a pretraining-finetuning strategy to transfer SepBIN’s separation capability to real-world applications, including APT malware homology detection and binary semantic comparison against code obfuscations.
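
The pipeline named in the abstract (decomposition subnets, gating-based fusion, feature reconstruction, joint optimization) can be illustrated with a small sketch. This is not the authors' implementation: the layer shapes, the tanh and sigmoid choices, the exact gating formula, the loss weights, and every name below (`separate`, `joint_loss`, `W_sem`, `G_sty`, and so on) are assumptions made only to show how such components might fit together. The backbone encoder is replaced by a random vector `h`.

```python
import math
import random

def linear(weights, x):
    """Apply a dense layer without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init(rows, cols, rng):
    """Small random weight matrix (hypothetical initialisation)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def separate(h, params):
    """Split a backbone embedding h into semantic and style vectors."""
    # Preliminary decomposition subnets: one candidate view per task.
    sem0 = [math.tanh(v) for v in linear(params["W_sem"], h)]
    sty0 = [math.tanh(v) for v in linear(params["W_sty"], h)]
    both = sem0 + sty0  # list concatenation of the two candidates
    # Per-task gates decide, per dimension, how much of each candidate to keep.
    g_sem = [sigmoid(v) for v in linear(params["G_sem"], both)]
    g_sty = [sigmoid(v) for v in linear(params["G_sty"], both)]
    sem = [g * a + (1 - g) * b for g, a, b in zip(g_sem, sem0, sty0)]
    sty = [g * b + (1 - g) * a for g, a, b in zip(g_sty, sem0, sty0)]
    # Reconstruction: the two parts together should still explain h.
    recon = linear(params["W_rec"], sem + sty)
    return sem, sty, g_sem, g_sty, recon

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def joint_loss(l_sem, l_sty, l_rec, weights=(1.0, 1.0, 0.5)):
    """Weighted sum of the three objectives; the weights are illustrative only."""
    return weights[0] * l_sem + weights[1] * l_sty + weights[2] * l_rec

rng = random.Random(0)
D, K = 8, 4  # backbone width and separated-feature width (assumed values)
params = {
    "W_sem": init(K, D, rng), "W_sty": init(K, D, rng),
    "G_sem": init(K, 2 * K, rng), "G_sty": init(K, 2 * K, rng),
    "W_rec": init(D, 2 * K, rng),
}
h = [rng.uniform(-1, 1) for _ in range(D)]  # stand-in for a backbone embedding
sem, sty, g_sem, g_sty, recon = separate(h, params)
l_rec = mse(recon, h)
total = joint_loss(0.3, 0.4, l_rec)  # 0.3 and 0.4 stand in for real task losses
```

In a trained model the semantic vector would feed the comparison objective and the style vector the authorship objective, with all three losses minimized together; here the task losses are placeholders, so the sketch only demonstrates the data flow and the weighted combination.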

Cited By

B. Qiu and J. Huo (2024). Quantitative Stylistic Analysis of Middle Chinese Texts Based on the Dissimilarity of Evolutive Core Word Usage. ACM Transactions on Asian and Low-Resource Language Information Processing 23:7 (1-22). Online publication date: 28-May-2024. https://dl.acm.org/doi/10.1145/3665794

Index Terms

SepBIN: Binary Feature Separation for Better Semantic Comparison and Authorship Verification
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Neural networks
2. Security and privacy

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

IEEE Transactions on Information Forensics and Security Volume 19, Issue

2024

9628 pages

ISSN:1556-6013

1556-6021 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Publisher

IEEE Press

Publication History

Published: 09 November 2023

Qualifiers

Research-article
