DOI: 10.1145/3576915.3623098
research-article
Open access

Improving Security Tasks Using Compiler Provenance Information Recovered At the Binary-Level

Published: 21 November 2023

Abstract

The complex optimizations supported by modern compilers allow for compiler provenance recovery at many levels. For instance, it is possible to identify the compiler family and optimization level used when building a binary, as well as the individual compiler passes applied to functions within the binary. Yet, many downstream applications of compiler provenance remain unexplored. To bridge that gap, we train and evaluate a multi-label compiler provenance model on data collected from over 27,000 programs built using LLVM 14, and apply the model to a number of security-related tasks. Our approach considers 68 distinct compiler passes and achieves an average F1 score of 84.4%. We first use the model to examine the magnitude of compiler-induced vulnerabilities, identifying 53 information leak bugs in 10 popular projects. We also show that several compiler optimization passes introduce a substantial number of functional code reuse gadgets that negatively impact security. Beyond vulnerability detection, we evaluate other security applications, including using recovered provenance information to verify the correctness of Rich header data in Windows binaries (e.g., for forensic analysis), as well as for binary decomposition tasks (e.g., third-party library detection).
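To make the "average F1 score" of a multi-label model concrete, the following is a minimal sketch of how such a number can be computed. It assumes a macro average (one F1 score per pass, then the unweighted mean); the abstract does not state which averaging scheme the authors used. The three pass names and the toy predictions are illustrative placeholders, not data from the paper.

```python
from typing import Dict, List, Set


def per_label_f1(truth: List[Set[str]], pred: List[Set[str]],
                 labels: List[str]) -> Dict[str, float]:
    """Compute one F1 score per label over a list of functions.

    Each element of `truth` / `pred` is the set of passes that were
    actually applied to / predicted for one function.
    """
    scores: Dict[str, float] = {}
    for label in labels:
        pairs = list(zip(truth, pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Convention assumed here: a label with no positives scores 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(truth: List[Set[str]], pred: List[Set[str]],
             labels: List[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = per_label_f1(truth, pred, labels)
    return sum(scores.values()) / len(labels)


# Illustrative labels only; the paper considers 68 passes.
LABELS = ["inline", "licm", "dse"]

# Toy ground truth and predictions for four functions (hypothetical).
TRUTH = [{"inline", "dse"}, {"licm"}, {"inline"}, set()]
PRED = [{"inline"}, {"licm", "dse"}, {"inline"}, set()]

if __name__ == "__main__":
    for name, score in per_label_f1(TRUTH, PRED, LABELS).items():
        print(f"{name}: {score:.3f}")
    print(f"macro F1: {macro_f1(TRUTH, PRED, LABELS):.3f}")
```

A macro average weights every pass equally, so a rarely applied pass that the model handles poorly lowers the score as much as a common one would. A micro average, which pools counts across labels, would behave differently.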

Cited By

  • (2024) Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and Fusion. Electronics 13:9 (1692). DOI: 10.3390/electronics13091692. Online publication date: 27-Apr-2024
  • (2024) BinEq - A Benchmark of Compiled Java Programs to Assess Alternative Builds. Proceedings of the 2024 Workshop on Software Supply Chain Offensive Research and Ecosystem Defenses (15-25). DOI: 10.1145/3689944.3696162. Online publication date: 19-Nov-2024
  • (2024) ToolPhet: Inference of Compiler Provenance From Stripped Binaries With Emerging Compilation Toolchains. IEEE Access 12 (12667-12682). DOI: 10.1109/ACCESS.2024.3355098. Online publication date: 2024

Published In

CCS '23: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security
November 2023
3722 pages
ISBN: 9798400700507
DOI: 10.1145/3576915
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compiler-introduced security bugs
  2. correctness-security gap

Acceptance Rates

Overall Acceptance Rate 1,261 of 6,999 submissions, 18%

Article Metrics

  • Downloads (Last 12 months): 642
  • Downloads (Last 6 weeks): 55
Reflects downloads up to 14 Dec 2024
