research-article

SepBIN: Binary Feature Separation for Better Semantic Comparison and Authorship Verification

Published: 09 November 2023

Abstract

Binary semantic comparison and authorship verification are critical in many security applications. The former focuses on the functional semantic features of binary code and the latter on developers’ programming style features; the two are usually mixed without clear demarcation. Recently, researchers have proposed learning-based approaches for intelligent binary analysis. These approaches generally address a single task with hand-crafted feature sets or neural binary encoders, and they suffer performance bottlenecks due to the noise in mixed features. This paper proposes SepBIN, a novel neural network framework that exploits the intrinsic correlation between the binary semantic comparison and authorship verification tasks and automatically separates semantic and stylistic binary features. We first construct a strong backbone binary encoder, then use preliminary decomposition subnets and a flexible gating-based feature fusion mechanism to distill pure semantic-related and style-related binary representations, and further improve their quality with a feature reconstruction module. The overall SepBIN model is optimized by a multi-objective joint optimization strategy. We conduct extensive experiments on Google Code Jam (GCJ) datasets in different languages and at different scales. Results show that SepBIN simultaneously benefits the binary semantic comparison and authorship verification tasks through its binary semantic-style feature separation mechanism, and that the performance gains can be interpreted from multiple perspectives. For state-of-the-art approaches with different binary encoders, SepBIN can adaptively improve them with the designed separation modules. Furthermore, we adopt a pretraining-finetuning strategy to transfer SepBIN’s separation capability to real-world applications, including APT malware homology detection and binary semantic comparison against code obfuscations.
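The pipeline named in the abstract (backbone embedding, preliminary decomposition subnets, gating-based fusion, feature reconstruction, joint objective) can be illustrated with a toy sketch. The abstract does not give layer types, dimensions, or loss weights, so everything below is an assumption made for illustration: dense layers without bias, elementwise sigmoid gates, a mean-squared-error reconstruction term, a weighted-sum objective, and all function and parameter names (separate, gated_fusion, rec_weight, and so on). It is not the authors' implementation, and the two task losses are placeholder numbers.

```python
import math
import random


def linear(weights, vec):
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(gate_logits, a, b):
    """Blend two candidate vectors elementwise: g * a + (1 - g) * b."""
    gates = [sigmoid(z) for z in gate_logits]
    return [g * x + (1.0 - g) * y for g, x, y in zip(gates, a, b)]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) / len(u)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def separate(h, params):
    """Split a backbone embedding h into semantic and style vectors,
    then try to rebuild h from the pair (hypothetical layout)."""
    cand_a = linear(params["sub_a"], h)  # assumed decomposition subnet A
    cand_b = linear(params["sub_b"], h)  # assumed decomposition subnet B
    # Each task has its own gate deciding how much of each candidate it keeps.
    semantic = gated_fusion(linear(params["gate_sem"], h), cand_a, cand_b)
    style = gated_fusion(linear(params["gate_sty"], h), cand_a, cand_b)
    # Reconstruction reads the concatenation of both separated vectors.
    h_rec = linear(params["recon"], semantic + style)
    return semantic, style, h_rec


def joint_loss(task_sem, task_sty, h, h_rec, rec_weight=0.1):
    """Weighted sum of two task losses and a reconstruction penalty."""
    return task_sem + task_sty + rec_weight * mse(h, h_rec)


rng = random.Random(0)
dim = 4
params = {
    "sub_a": random_matrix(dim, dim, rng),
    "sub_b": random_matrix(dim, dim, rng),
    "gate_sem": random_matrix(dim, dim, rng),
    "gate_sty": random_matrix(dim, dim, rng),
    "recon": random_matrix(dim, 2 * dim, rng),
}
h = [rng.uniform(-1.0, 1.0) for _ in range(dim)]  # stand-in for an encoder output
semantic, style, h_rec = separate(h, params)
loss = joint_loss(0.7, 0.4, h, h_rec)  # 0.7 and 0.4 are placeholder task losses
print(len(semantic), len(style), len(h_rec))
```

In this sketch both tasks share the same two candidate vectors and differ only in their gates, which is one simple way to let each task choose its own mixture; the reconstruction term penalizes the separated pair if together they can no longer rebuild the original embedding.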

Published In

IEEE Transactions on Information Forensics and Security, Volume 19, 2024, 9628 pages

Publisher

IEEE Press
