DOI: 10.1145/3440749.3442661
research-article
Open access

An Extended Benchmark System of Word Embedding Methods for Vulnerability Detection

Published: 13 May 2021

Abstract

Security researchers have applied Natural Language Processing (NLP) and deep learning techniques to program code analysis tasks such as automated bug detection and vulnerability prediction or classification. These studies mainly generate the input vectors for the deep learning models using NLP embedding methods. However, many embedding methods exist, and the structures of neural networks are diverse and usually heuristic. This makes it difficult to select effective combinations of neural models and embedding techniques for training code vulnerability detectors. To address this challenge, we extended a benchmark system to analyze the compatibility of four popular word embedding techniques with four different neural networks: the standard Bidirectional Long Short-Term Memory (Bi-LSTM), the Bi-LSTM with an attention mechanism, the Convolutional Neural Network (CNN), and the classic Deep Neural Network (DNN). We trained and tested the models on two types of datasets of vulnerable functions written in C. Our results revealed that the Bi-LSTM model combined with the FastText embedding technique showed the most efficient detection rate on a real-world dataset, but not on an artificially constructed one. Further comparisons with the other combinations are discussed in detail in our results.
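The benchmark described above amounts to a grid of embedding and network combinations scored on each dataset. The following is a minimal sketch of that grid, not the authors' benchmark system. The abstract names only FastText among the four embeddings, so the other three entries are placeholders. The dataset labels, the function names, and the `evaluate` callable (standing in for training and testing one detector) are all assumptions made for illustration.

```python
from itertools import product

# FastText is the only embedding named in the abstract; the other
# three names are placeholders, not the methods used in the paper.
EMBEDDINGS = ["FastText", "embedding_B", "embedding_C", "embedding_D"]

# The four network types listed in the abstract.
MODELS = ["Bi-LSTM", "Bi-LSTM+attention", "CNN", "DNN"]

# Hypothetical labels for the two dataset types (real-world, artificial).
DATASETS = ["real-world", "synthetic"]


def run_benchmark(evaluate):
    """Score every (embedding, model) pair on each dataset.

    `evaluate` is a hypothetical callable taking
    (embedding, model, dataset) and returning a float score.
    """
    results = {}
    for dataset, embedding, model in product(DATASETS, EMBEDDINGS, MODELS):
        results[(dataset, embedding, model)] = evaluate(embedding, model, dataset)
    return results


def best_per_dataset(results):
    """Return the highest-scoring (embedding, model) pair per dataset."""
    best = {}
    for (dataset, embedding, model), score in results.items():
        # Keep the first pair seen on ties (strict comparison).
        if dataset not in best or score > best[dataset][1]:
            best[dataset] = ((embedding, model), score)
    return {dataset: pair for dataset, (pair, _) in best.items()}
```

Passing a real training routine as `evaluate` would yield 32 scores (2 datasets x 4 embeddings x 4 models). Reporting the best pair per dataset, rather than overall, matches the abstract's observation that the winning combination can differ between the two dataset types.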


Cited By

  • (2024) Parameter-efficient fine-tuning of pre-trained code models for just-in-time defect prediction. Neural Computing and Applications 36:27 (16911-16940). DOI: 10.1007/s00521-024-09930-5. Online publication date: 1-Sep-2024.


    Published In

    ICFNDS '20: Proceedings of the 4th International Conference on Future Networks and Distributed Systems
    November 2020
    313 pages
    ISBN:9781450388863
    DOI:10.1145/3440749

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. CNN
    2. Deep Learning
    3. LSTM
    4. Vulnerability Detection
    5. Word Embedding

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICFNDS '20
