DOI: 10.1145/1014052.1014105

Learning to detect malicious executables in the wild

Published: 22 August 2004

Abstract

In this paper, we describe the development of a fielded application for detecting malicious executables in the wild. We gathered 1,971 benign and 1,651 malicious executables and encoded each as a training example using n-grams of byte codes as features. This processing produced more than 255 million distinct n-grams. After selecting the most relevant n-grams for prediction, we evaluated a variety of inductive methods, including naive Bayes, decision trees, support vector machines, and boosting. Ultimately, boosted decision trees outperformed the other methods, with an area under the ROC curve of 0.996. Results also suggest that our methodology will scale to larger collections of executables. To the best of our knowledge, ours is the only fielded application for this task developed using techniques from machine learning and data mining.
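The encoding and selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how relevance was measured, so the sketch assumes information gain as the ranking criterion. The n-gram length `n=4`, the cutoff `k=500`, and the function names `byte_ngrams`, `information_gain`, `select_ngrams`, and `encode` are likewise hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence, Set


def byte_ngrams(data: bytes, n: int = 4) -> Set[bytes]:
    """Return the distinct overlapping n-grams of bytes found in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(pos: int, neg: int) -> float:
    """Binary entropy (in bits) of a set with pos and neg examples."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(present: Sequence[bool], labels: Sequence[bool]) -> float:
    """Gain from splitting on presence of one n-gram (label True = malicious).

    Assumption: information gain is used as the relevance measure here;
    the abstract does not name one.
    """
    n = len(labels)
    base = entropy(sum(labels), n - sum(labels))
    conditional = 0.0
    for value in (True, False):
        subset = [lab for has, lab in zip(present, labels) if has == value]
        if subset:
            pos = sum(subset)
            conditional += len(subset) / n * entropy(pos, len(subset) - pos)
    return base - conditional


def select_ngrams(samples: Sequence[bytes], labels: Sequence[bool],
                  n: int = 4, k: int = 500) -> List[bytes]:
    """Rank every distinct n-gram by information gain and keep the top k."""
    grams_per_sample = [byte_ngrams(s, n) for s in samples]
    vocabulary = set().union(*grams_per_sample)
    scored = []
    for gram in vocabulary:
        present = [gram in grams for grams in grams_per_sample]
        scored.append((information_gain(present, labels), gram))
    # Highest gain first; ties broken by byte order so results are repeatable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gram for _, gram in scored[:k]]


def encode(sample: bytes, selected: Sequence[bytes], n: int = 4) -> List[bool]:
    """Encode one executable as a Boolean vector over the selected n-grams."""
    grams = byte_ngrams(sample, n)
    return [gram in grams for gram in selected]
```

The Boolean vectors returned by `encode` could then be passed to any of the learners named in the abstract. The sketch holds every n-gram set in memory, which would likely not be practical at the scale of 255 million distinct n-grams without streaming or on-disk counting.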

Published In

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2004
874 pages
ISBN: 1581138881
DOI: 10.1145/1014052
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. concept learning
  2. data mining
  3. malicious software
  4. security

Qualifiers

  • Article

Conference

KDD04

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

  • Downloads (last 12 months): 46
  • Downloads (last 6 weeks): 5

Reflects downloads up to 01 Jan 2025.

Cited By

  • (2024) Application of Machine Learning Models for Malware Classification With Real and Synthetic Datasets. International Journal of Information Security and Privacy 18:1 (1-23). DOI: 10.4018/IJISP.356513. Online publication date: 7-Nov-2024
  • (2024) A Machine Learning-Based PE Header Analysis for Malware Detection. International Journal of Innovative Science and Research Technology (IJISRT) (1671-1676). DOI: 10.38124/ijisrt/IJISRT24MAR615. Online publication date: 1-Apr-2024
  • (2024) Attention-Based Malware Detection Model by Visualizing Latent Features Through Dynamic Residual Kernel Network. Sensors 24:24 (7953). DOI: 10.3390/s24247953. Online publication date: 12-Dec-2024
  • (2024) Deep learning trends and future perspectives of web security and vulnerabilities. Journal of High Speed Networks 30:1 (115-146). DOI: 10.3233/JHS-230037. Online publication date: 1-Jan-2024
  • (2024) Case Study: Neural Network Malware Detection Verification for Feature and Image Datasets. Proceedings of the 2024 IEEE/ACM 12th International Conference on Formal Methods in Software Engineering (FormaliSE) (127-137). DOI: 10.1145/3644033.3644372. Online publication date: 14-Apr-2024
  • (2024) Web Spoofing Prevention: Machine Learning Based Client-Side Defence. 2024 2nd International Conference on Sustainable Computing and Smart Systems (ICSCSS) (1098-1104). DOI: 10.1109/ICSCSS60660.2024.10624881. Online publication date: 10-Jul-2024
  • (2024) Malware Detection for Portable Executables Using a Multi-input Transformer-Based Approach. 2024 International Conference on Computing, Networking and Communications (ICNC) (778-782). DOI: 10.1109/ICNC59896.2024.10556067. Online publication date: 19-Feb-2024
  • (2024) Enhancing Malicious Code Detection With Boosted N-Gram Analysis and Efficient Feature Selection. IEEE Access 12 (147400-147421). DOI: 10.1109/ACCESS.2024.3476164. Online publication date: 2024
  • (2024) Hybrid Input Model Using Multiple Features From Surface Analysis for Malware Detection. IEEE Access 12 (121198-121207). DOI: 10.1109/ACCESS.2024.3452675. Online publication date: 2024
  • (2024) Evolutionary feature selection for machine learning based malware classification. Engineering Science and Technology, an International Journal 56 (101762). DOI: 10.1016/j.jestch.2024.101762. Online publication date: Aug-2024
