DOI: 10.1145/2020408.2020419
research-article

Supervised learning for provenance-similarity of binaries

Published: 21 August 2011

Abstract

Understanding, measuring, and leveraging the similarity of binaries (executable code) is a foundational challenge in software engineering. We present a notion of similarity based on provenance -- two binaries are similar if they are compiled from the same (or very similar) source code with the same (or similar) compilers. Empirical evidence suggests that provenance-similarity accounts for a significant portion of variation in existing binaries, particularly in malware. We propose and evaluate the applicability of classification to detect provenance-similarity. We evaluate a variety of classifiers, and different types of attributes and similarity labeling schemes, on two benchmarks derived from open-source software and malware respectively. We present encouraging results indicating that classification is a viable approach for automated provenance-similarity detection, and as an aid for malware analysts in particular.
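
The framing in the abstract, deciding whether a pair of binaries is provenance-similar by classification, can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the byte n-gram overlap attributes, the nearest-centroid classifier, the synthetic byte strings standing in for binaries, and all function and label names.

```python
from typing import Dict, List, Sequence, Set


def ngrams(data: bytes, n: int) -> Set[bytes]:
    """Return the set of length-n byte substrings of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: Set[bytes], b: Set[bytes]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pair_features(x: bytes, y: bytes, sizes: Sequence[int] = (1, 2, 3)) -> List[float]:
    """Describe a PAIR of byte strings by n-gram overlap (hypothetical attributes)."""
    return [jaccard(ngrams(x, n), ngrams(y, n)) for n in sizes]


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


class NearestCentroid:
    """Tiny stand-in classifier: predict the label whose mean vector is closest."""

    def fit(self, X: List[List[float]], y: List[str]) -> "NearestCentroid":
        self.centroids: Dict[str, List[float]] = {}
        for label in set(y):
            rows = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x: Sequence[float]) -> str:
        return min(self.centroids, key=lambda label: sq_dist(x, self.centroids[label]))


# Synthetic stand-ins for binaries (assumption: real inputs would be
# functions or files extracted from compiled programs).
base = bytes(range(64))
variant = base[:60] + bytes([200, 201, 202, 203])   # small edit of base
unrelated = bytes(range(128, 192))                   # shares nothing with base

# Each training example is a labeled pair, not a single binary.
train_pairs = [
    (base, base, "similar"),
    (base, variant, "similar"),
    (base, unrelated, "different"),
    (variant, unrelated, "different"),
]
X = [pair_features(a, b) for a, b, _ in train_pairs]
y = [label for _, _, label in train_pairs]
clf = NearestCentroid().fit(X, y)


def classify(a: bytes, b: bytes) -> str:
    """Label a new pair as "similar" or "different"."""
    return clf.predict(pair_features(a, b))
```

With this toy data, `classify(base, base[4:] + bytes([210, 211, 212, 213]))` returns `"similar"`, and comparing `base` with a byte string it shares nothing with returns `"different"`. The point of the sketch is only the shape of the problem: attributes are computed per pair, and the label says whether the pair is provenance-similar.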


Published In

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. binary similarity
  2. classification
  3. software provenance

Qualifiers

  • Research-article

Conference

KDD '11
Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 4
  • Downloads (last 6 weeks): 0
Reflects downloads up to 16 Dec 2024.

Cited By

  • (2023) BinAlign: Alignment Padding Based Compiler Provenance Recovery. Information Security and Privacy, pp. 609-629. DOI: 10.1007/978-3-031-35486-1_26. Online publication date: 15-Jun-2023
  • (2022) A Survey of Binary Code Fingerprinting Approaches: Taxonomy, Methodologies, and Features. ACM Computing Surveys 55:1, pp. 1-41. DOI: 10.1145/3486860. Online publication date: 17-Jan-2022
  • (2020) Binary Analysis Overview. Binary Code Fingerprinting for Cybersecurity, pp. 7-44. DOI: 10.1007/978-3-030-34238-8_2. Online publication date: 1-Mar-2020
  • (2019) Software Birthmark Design and Estimation: A Systematic Literature Review. Arabian Journal for Science and Engineering 44:4, pp. 3905-3927. DOI: 10.1007/s13369-019-03718-9. Online publication date: 16-Jan-2019
  • (2018) Beyond Precision and Recall. Proceedings of the Eighth ACM Conference on Data and Application Security and Privacy, pp. 354-365. DOI: 10.1145/3176258.3176306. Online publication date: 13-Mar-2018
  • (2018) Reviving Sequential Program Birthmarking for Multithreaded Software Plagiarism Detection. IEEE Transactions on Software Engineering 44:5, pp. 491-511. DOI: 10.1109/TSE.2017.2688383. Online publication date: 1-May-2018
  • (2018) On the Effectiveness of Code Normalization for Function Identification. 2018 IEEE 23rd Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 241-251. DOI: 10.1109/PRDC.2018.00045. Online publication date: Dec-2018
  • (2018) Malware Economics and its Implication to Anti-Malware Situational Awareness. 2018 International Conference On Cyber Situational Awareness, Data Analytics And Assessment (Cyber SA), pp. 1-8. DOI: 10.1109/CyberSA.2018.8551388. Online publication date: Jun-2018
  • (2017) BinSign: Fingerprinting Binary Functions to Support Automated Analysis of Code Executables. ICT Systems Security and Privacy Protection, pp. 341-355. DOI: 10.1007/978-3-319-58469-0_23. Online publication date: 4-May-2017
  • (2015) Software Plagiarism Detection with Birthmarks Based on Dynamic Key Instruction Sequences. IEEE Transactions on Software Engineering 41:12, pp. 1217-1235. DOI: 10.1109/TSE.2015.2454508. Online publication date: 1-Dec-2015
