research-article

Open access

Improving Security Tasks Using Compiler Provenance Information Recovered At the Binary-Level

Authors:

Manos Antonakakis,

Fabian Monrose

CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security

Pages 2695 - 2709

https://doi.org/10.1145/3576915.3623098

Published: 21 November 2023

Abstract

The complex optimizations supported by modern compilers allow for compiler provenance recovery at many levels. For instance, it is possible to identify the compiler family and optimization level used when building a binary, as well as the individual compiler passes applied to functions within the binary. Yet, many downstream applications of compiler provenance remain unexplored. To bridge that gap, we train and evaluate a multi-label compiler provenance model on data collected from over 27,000 programs built using LLVM 14, and apply the model to a number of security-related tasks. Our approach considers 68 distinct compiler passes and achieves an average F1 score of 84.4%. We first use the model to examine the magnitude of compiler-induced vulnerabilities, identifying 53 information leak bugs in 10 popular projects. We also show that several compiler optimization passes introduce a substantial number of functional code reuse gadgets that negatively impact security. Beyond vulnerability detection, we evaluate other security applications, including using recovered provenance information to verify the correctness of Rich header data in Windows binaries (e.g., forensic analysis), as well as for binary decomposition tasks (e.g., third-party library detection).
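As a rough illustration of how an average F1 score over many labels can be computed for a multi-label model, the sketch below scores each label separately and then takes the unweighted mean. The abstract does not say which averaging scheme the authors used, so the macro average here is an assumption, and the pass names and toy predictions are illustrative only, not data from the paper.

```python
from typing import Dict, List, Set


def per_label_f1(y_true: List[Set[str]],
                 y_pred: List[Set[str]],
                 labels: List[str]) -> Dict[str, float]:
    """Compute an F1 score for each label in a multi-label setting.

    Each element of y_true / y_pred is the set of labels (here: compiler
    passes) attached to one sample (here: one function).
    """
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = (2 * tp / denom) if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-label F1 scores (assumed averaging scheme)."""
    return sum(scores.values()) / len(scores)


# Toy data: three illustrative pass labels and four functions.
labels = ["inline", "licm", "gvn"]
y_true = [{"inline", "licm"}, {"gvn"}, {"inline"}, set()]
y_pred = [{"inline"}, {"gvn", "licm"}, {"inline"}, {"gvn"}]

scores = per_label_f1(y_true, y_pred, labels)
print(scores)
print(round(macro_f1(scores), 4))
```

A macro average weights every pass equally regardless of how often it occurs; a micro or support-weighted average would give a different number on the same predictions.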

Cited By

Gao Y, Liang L, Li Y, Li R, Wang Y. (2024). Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and Fusion. Electronics 13:9 (1692). Online publication date: 27-Apr-2024.
https://doi.org/10.3390/electronics13091692

Dietrich J, White T, Abdollahpour M, Wen E, Hassanshahi B, Torres-Arias S, De Carli L, Zhang Y. (2024). BinEq - A Benchmark of Compiled Java Programs to Assess Alternative Builds. Proceedings of the 2024 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses (15-25). Online publication date: 19-Nov-2024.
https://dl.acm.org/doi/10.1145/3689944.3696162

Jang H, Murodova N, Koo H. (2024). ToolPhet: Inference of Compiler Provenance From Stripped Binaries With Emerging Compilation Toolchains. IEEE Access 12 (12667-12682). Online publication date: 2024.
https://doi.org/10.1109/ACCESS.2024.3355098

Information & Contributors

Information

Published In

CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security

November 2023

3722 pages

ISBN: 9798400700507

DOI: 10.1145/3576915

General Chairs: Weizhi Meng (Technical University of Denmark), Christian D. Jensen (Technical University of Denmark)

Program Chairs: Cas Cremers (CISPA Helmholtz Center for Information Security), Engin Kirda (Khoury College of Computer Sciences)

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGSAC: ACM Special Interest Group on Security, Audit, and Control

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CCS '23

Sponsor:

SIGSAC

CCS '23: ACM SIGSAC Conference on Computer and Communications Security

November 26 - 30, 2023

Copenhagen, Denmark

Acceptance Rates

Overall Acceptance Rate 1,261 of 6,999 submissions, 18%
