AI models of code have made significant progress over the past few years. However, many models are actually not learning task-relevant source code features. Instead, they often fit non-relevant but correlated data, leading to a lack of robustness and generalizability, and limiting the subsequent practical use of such models. In this work, we focus on improving model quality through signal awareness, i.e., learning the relevant signals in the input for making predictions. We do so by leveraging the heterogeneity of code samples in terms of their signal-to-noise content. We perform an end-to-end exploration of model signal awareness, comprising: (i) uncovering the reliance of AI models of code on task-irrelevant signals, via prediction-preserving input minimization; (ii) improving models’ signal awareness by incorporating the notion of code complexity during model training, via curriculum learning; (iii) improving models’ signal awareness by generating simplified signal-preserving programs and augmenting them to the training dataset; and (iv) presenting a novel interpretation of the model learning behavior from the perspective of the dataset, using its code complexity distribution. We propose a new metric to measure model signal awareness, Signal-aware Recall, which captures how much of the model’s performance is attributable to task-relevant signal learning. Using a software vulnerability detection use-case, our model probing approach uncovers a significant lack of signal awareness in the models, across three different neural network architectures and three datasets. Signal-aware Recall is observed to be in the sub-50s for models with traditional Recall in the high 90s, suggesting that the models are presumably picking up a lot of noise or dataset nuances while learning their logic. With our code-complexity-aware model learning enhancement techniques, we are able to assist the models toward more task-relevant learning, recording up to a 4.8× improvement in model signal awareness. Finally, we employ our model learning introspection approach to uncover the aspects of source code where the model faces difficulty, and we analyze how our learning enhancement techniques alleviate it.
1 Introduction
Over the past few years, AI models have made significant progress in source code understanding tasks, such as function and variable naming [4, 5, 9, 71], code summarization [6, 41, 51, 83], code recommendation [57, 81], code completion [23, 66, 70], defect detection [26, 34, 37, 69, 74, 92], bug fixing [110, 111], amongst others [11, 117]. The wide availability of open source codebases has fueled this progress, and we have started to see the adoption of such AI models of code in software development workflows [7, 36, 107]. Ever more sophisticated models are emerging and pushing the state of the art rapidly. Although each new model improves upon its predecessor’s prediction performance in terms of F1 and accuracy measures, what remains relatively unexplored is whether the models are picking up the correct signals to arrive at their predictions.
From a model’s perspective, it might be doing an excellent job of learning to differentiate samples belonging to separate classes (e.g., “buggy” or “healthy” code in a vulnerability detection setting). But, as we observe, it may very well be the case that it does so by picking up noise or certain nuances from the dataset, which are not representative of the task at hand. This includes learning unexpected correlations between samples and certain keywords or programming constructs like ifs/loops/variable names, which may be more prevalent in one sample class than the other. Learning class separators in this way could yield great classification performance, which may be perfectly acceptable in a theoretical or statistical setting. But models influenced by such dataset nuances, as opposed to capturing task-relevant aspects of source code, are prone to failures when applied in real-world settings [8, 30, 85]. Ideally, a reliable vulnerability detection model, for example, would learn features or aspects of code directly related to or indicative of bugs, such as pairing mallocs and frees to detect potential resource leaks in program branches, or a missing value-range validation of a variable prior to its use as a memory-allocation size argument, indicating a potential buffer overflow vulnerability.
We refer to this aspect of the model’s quality as signal awareness, i.e., learning the relevant signals in the input for making predictions, and explore it using a software vulnerability detection use-case. What used to be a domain traditionally dominated by static and dynamic analysis has recently witnessed assistance and competition from AI models. The high false positives of static analyzers and the lack of completeness of dynamic analysis are a few reasons promoting the entry of AI into this field [60, 109, 121]. However, unlike the rules and path/flow analysis of static analyzers and the execution tracing of dynamic analysis, it remains unclear what signals the AI models actually pick up for detecting vulnerabilities in source code. AI-based detectors that potentially learn task-irrelevant features can create a false sense of security when applied in real-world settings as substitutes for traditional source code analyzers. By enabling measurement of model signal awareness, we can guide the development of models that not only exhibit high accuracy but are also based on features relevant to the given task.
In this work, we perform a comprehensive exploration of model signal awareness—including its measurement, enhancement, and introspection. Our experience experimenting with existing models led us to doubt their quality in terms of what they are actually learning (Figure 1; Section 2). We thus started exploring ways to figure out if the models are capturing task-relevant signals. To assess model quality from this perspective, we developed a data-driven technique while treating the model as a black-box. This uncovered a lack of signal awareness in the current models, which motivated us to assist them toward focusing more on task-relevant aspects of source code (Figure 2; Section 2). We observed that not all code snippets are the same; some are more “noisy” than others—containing code not directly relevant to the learning task at hand. We targeted reducing this noise whilst preserving task-relevant signals. To this end, the concept of code complexity offered us a way to distinguish the different training samples, and we explored ways to introduce this code-complexity awareness in the models. We developed two code-centric approaches to guide model learning in this manner, and observed a subsequent improvement in the model signal awareness. To visualize how this altered model training scheme might have improved its signal awareness, we devised a way to introspect model learning from the perspective of the dataset. The code-level interpretations and developer-friendly insights generated by our approach can help assert trust in the models as they integrate with software development workflows. Figure 3 presents a view of all our approaches co-existing within the model training pipeline, described in detail in Section 4.
Fig. 1.
Fig. 2.
Fig. 3.
Briefly, the central idea of our model-probing approach is to first reduce the original source code input to a trained model into a minimal snippet, without changing the model prediction. This represents the minimal excerpt of the original code that the model requires to arrive at and maintain its original prediction. Then, taking the example of vulnerability detection, the model’s signal-awareness is determined by verifying whether the minimal snippet contains the same vulnerability as the original code (Figure 4; Section 4.2). To improve model learning, we incorporate the notion of code complexity into the model training process, by (i) presenting programs to the models in increasing order of their code complexity, and (ii) generating simplified signal-preserving programs and augmenting them to the training dataset. Finally, we deduce model learning behaviour by comparing the code complexity of the samples grouped by their prediction accuracy from the model.
Fig. 4.
Results show substantial improvements in a model’s signal awareness when assisted with our approaches, across different datasets and models. Complexity-ranked training can provide up to a 32% boost to the model’s Signal-aware Recall, whereas program-simplification-based augmentation surpasses it, realizing up to a \(4.8\times\) improvement. We use our data-driven model introspection approach to analyze these model learning improvements. Amongst the insights provided by our approach is that the models face difficulty in understanding bigger (and more complex) code samples, an issue which interestingly alleviates as augmentation levels increase.
Being data-driven, all our approaches are independent of the model learning algorithm, the learning task, and the source code programming language employed. We apply our approaches on three different neural network architectures operating on different popular representations of source code in the AI domain: (i) a convolutional neural network (CNN) treating code as a photo, (ii) a recurrent neural network (RNN) treating code as a linear sequence of tokens, and (iii) a graph neural network (GNN) operating on code as a graph. We use these models for vulnerability detection over three different datasets, in each case observing a lack of signal awareness and its subsequent enhancement with our approaches, while maintaining model classification performance. In our experiments, we use the Infer tool [32] to verify bug existence in the reduced samples. However, our approaches are independent of the specific bug-checker being employed (alternatives in Sections 7.3 and 8.3).
This article is an extension of our previous work [104], which presented our prediction-preserving input minimization (P2IM) approach to uncover the reliance of AI models of source code on task-irrelevant signals. Continuing along the AI model reliability theme, in this extended version we build atop the findings of P2IM model-probing to assist model learning. P2IM uncovered a lack of signal awareness in AI models of code, hinting at the models’ reliance on task-irrelevant features; in this follow-up work, we focus on improving model reliability and trustworthiness. Specifically, we incorporate the concept of code complexity into model training, and introduce two approaches to improve model signal awareness, measured using the signal-aware recall (SAR) metric developed with P2IM. In this work as well, we adhere to the data-driven and SE-driven model exploration principles developed in P2IM. Additionally, in our previous position paper [105], we presented a categorization of the various reliability issues affecting AI models of code into the different stages of an AI pipeline, and called for efforts from the research community to ensure model quality. In this article, we concretize our early ideas on model accountability and traceability.
The main contributions of our overall work are as follows:
•
We develop a new model evaluation metric to measure how well a model captures task-specific signals. This enables fairer model evaluation and more meaningful comparison.
•
We expose a lack of signal awareness in AI models of code.
•
We improve the model signal awareness using SE concepts to assist AI understanding of source code.
•
We tailor AI’s curriculum learning for the source code domain, by coupling it with the notion of code complexity.
•
We show the superiority of targeted augmentation with simplified programs over generic augmentation for improving model learning.
•
We present a novel perspective for deducing model learning behavior using code complexity of the dataset.
This article is organized as follows. We first discuss our motivation behind this work in Section 2, and then present a brief background on AI models for source code, as well as the Delta Debugging technique, in Section 3. Then, we present the details of our approaches in Section 4. Section 5 describes the experiment configuration, while Section 6 presents the results of our model signal awareness measurement, enhancement and introspection. Sections 7, 8, and 9 cover relevant discussions, threats to validity, and practical implications of our work, respectively. Section 10 covers related work, and finally Section 11 concludes this article.
2 Motivation
The rapid proliferation of AI for source code understanding is leading to ever more sophisticated models, which are getting bigger and better with each successive iteration. Accompanying this growth is an emerging scrutiny over the models’ reliability along multiple dimensions, such as data duplication bias and labeling quality, amongst others [3, 17, 53]. We have been experimenting with the state of the art to assess its quality from a practical perspective. We noticed that the models seem to suffer from weak generalizability and robustness—known AI frailties. These concerns become ever more important if the models are to be applied to sensitive tasks, such as ensuring code security. We observed the same weakness across several different models and datasets in a vulnerability detection setting, which led us to doubt the quality of the models in terms of what it is that they are actually learning. We thus started exploring ways to probe the models’ signal awareness.
Signal awareness is different from correctness—a model can learn a perfect separator between buggy and healthy code, but it can very well arrive at the separator by picking up dataset nuances, as opposed to real vulnerability signals. This can happen when, for example, the model picks up unexpected correlations between code samples and sample lengths, or variable names, or certain programming constructs, which may happen to differ for buggy and healthy samples in a particular dataset. Learning a separator based on these non-representative signals, which may lead to great-looking performance numbers, is perfectly acceptable from a model’s perspective. The model is indeed doing its job of learning to classify, but this provides a false sense of security.
Signal awareness, i.e., verifying whether the models are learning the correct logic relevant to code analysis, is crucial for generating trust in models if they are to be fielded in competition with, or alongside, traditional static and dynamic analyzers. Furthermore, it adds an important measure of model quality, beyond the traditional statistical analysis measures, which can more fairly compare and guide improvements across model evolutions. In this work, we uncover this signal-awareness aspect of AI models of code, and quantify how much impact it has on their robustness as well as their reported performance numbers.
A typical example of weak robustness is when an image classifier’s verdict on an input image changes on adding minor noise imperceptible to the naked eye [16, 39, 106]. As shown in Figure 1, we observe the same issue with AI models of vulnerability detection, where even a 99 F1 model flips its prediction on only slightly different code variants. However, it should be reasonable to expect a high-quality model, which correctly picks up the real vulnerability signals, to demonstrate prediction robustness, if it is ever to be trusted in a practical security setting. Such “perturbations” can be taken to be adversarial attacks on the model [16, 39, 106, 118], and there can similarly be defenses against such attacks, such as training the model with several different code variants; this line of thought, however, is complementary to our work. These observations merely triggered our suspicion around a disconnect between the reported model performance numbers and the actual task-aware learning, similar in spirit to other prevailing doubts regarding model quality [3, 18, 53].
Our goal is not to discredit AI models of code by reiterating their brittleness, but to uncover and quantify how much impact task-agnostic training has on their reported performance numbers. Our hope is that revealing these shortcomings will motivate future research toward more task-aware learning, potentially guided by the SAR-Recall divide. We believe in the potential of AI for source code understanding to overcome this “weakness” of lacking signal awareness with appropriate “guidance” during training. The model itself is doing its job well, learning how best to separate samples based on available input features, to reach a local minimum in the loss landscape. But, as shown in Figure 2, there can exist other local minima in this landscape, which offer similar model performance but rely on features more in sync with the task expectation. If we can somehow nudge the model toward such an alternative minimum, using domain-specific assistance, then perhaps the model’s signal awareness can be improved. This issue is widely seen in ML, and is typically addressed by white-box or domain-specific approaches such as robust training and adversarial training [12, 38, 76, 124]. In this work, we explore code-centric, data-driven approaches to assist AI models of code in focusing more on task-relevant aspects of source code. We do this by leveraging SE techniques to transform the training data, incorporating the notion of complexity of code into model learning.
3 Background
In this section, we first briefly describe three different neural network models that have been popularly employed for learning over source code, each operating at a different code representation. These include (i) a convolutional neural network treating code as a photo, (ii) a recurrent neural network treating code as a linear sequence of tokens, and (iii) a graph neural network operating on code as a graph. Later, we shall evaluate the signal awareness of these models, and its subsequent enhancement with our techniques, using our proposed SAR metric. After the model descriptions, we cover the basics of the Delta Debugging technique [123], which we use to reduce the input program samples to measure the models’ signal-awareness, and also to simplify the samples for our augmentation technique. Finally, we briefly describe the concept of Curriculum Learning, which we use in our complexity-ranked training approach.
3.1 AI Models of Code
Convolutional neural networks (CNNs) learn on image inputs. These are made up of convolutional and pooling layers. The former act as filters to extract features from input images, learning increasingly complex patterns as the neural network becomes progressively deeper. Pooling layers, in turn, downsample the features to intensify the signal and control the size of the neural network. This is done by selecting the most strongly activated neurons (i.e., max-pooling) or taking their average (i.e., mean-pooling). CNNs have been successfully employed in computer vision tasks such as image recognition and object detection [47, 61, 89, 100]. They have also been used in the context of vulnerability detection with slight modifications [92]. In such a setting, the source code tokens are first projected into an embedding space and then fed into a CNN as a real-valued matrix, similar to an image.
Recurrent neural networks (RNNs) are designed to learn over sequential inputs, such as text and audio [72, 93]. A “working memory” is maintained and updated by a series of input, output, and forget “gates,” using the current input and the previous memory at each step [22, 50]. Depending on the use case, RNNs can emit vector representations at each step, or emit one final representation for a complete sequence. RNNs have been applied to source code by treating it as sequences of tokens. Usually, one final representation is extracted for a piece of code in the context of vulnerability detection [69, 92].
Graph neural networks (GNNs) have gained popularity due to their unique ability to learn over graph-structured data like social network graphs and molecular structures [35, 125]. Most GNNs are made up of three modules: (i) message passing, which decides how information is exchanged among nodes via edges, (ii) message aggregation, which determines how each node combines the received messages, and (iii) message updating, which controls how each node updates its representation after one cycle of information propagation [35, 59, 67]. GNNs are a natural fit for source code, since multiple forms of graphs can be constructed on top of it, such as abstract syntax trees, data flow graphs, and control flow graphs. They have achieved state-of-the-art performance on multiple software engineering tasks, including vulnerability detection [129] and code summarization [64].
3.2 Delta Debugging
Delta Debugging (DD) [123] was first proposed to minimize long crash-inducing bug reports for Mozilla’s web browser. The input to DD is a sequence satisfying some predefined oracle. The goal is to find a subset of the input (e.g., the bug report) satisfying the following two requirements: (1) the subset leads to the same outcome (e.g., browser crash); and (2) no single element can be removed from it without losing that outcome. Such a subset is called 1-minimal. DD uses an iterative split-and-test algorithm [123] to reduce an input sequence; a sketch follows below. In the best-case scenario, DD works like a binary search, which can systematically and efficiently identify the 1-minimal. In Section 4.2, we describe how we adopt DD to reduce input program samples for measuring model signal-awareness, as well as to simplify the samples for our augmentation technique.
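To make the split-and-test procedure concrete, the following is a minimal Python sketch of a ddmin-style reduction, assuming an oracle(tokens) callable that returns True when a candidate still satisfies the predefined outcome (e.g., the reduced bug report still crashes the browser). It removes one chunk at a time and refines the split granularity when no removal succeeds; it is an illustration of the idea rather than the exact algorithm of Reference [123].

def ddmin(tokens, oracle):
    # Reduce `tokens` to a 1-minimal subsequence that still satisfies `oracle`.
    n = 2                                   # start by splitting into two chunks
    while len(tokens) >= 2:
        chunk = max(1, len(tokens) // n)
        subsets = [tokens[i:i + chunk] for i in range(0, len(tokens), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Try removing one chunk, i.e., test the complement of subset i.
            candidate = [t for j, s in enumerate(subsets) if j != i for t in s]
            if candidate and oracle(candidate):
                tokens, n = candidate, max(n - 1, 2)   # keep the smaller passing input
                reduced = True
                break
        if not reduced:
            if n >= len(tokens):
                break                       # 1-minimal: no single chunk is removable
            n = min(len(tokens), n * 2)     # otherwise, refine the granularity
    return tokens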
3.3 Curriculum Learning
Curriculum learning [10] is a training strategy that mimics the way humans learn, by gradually introducing easier tasks before tackling more complex ones. The basic idea behind curriculum learning is to use a sequence of tasks, ordered by increasing level of difficulty, to train a model. By doing so, the model can learn more effectively and achieve better performance than training on a random task order. One key advantage of curriculum learning is that it can help prevent the model from getting stuck in suboptimal solutions, since it starts by learning simpler patterns before moving to more complex ones. Moreover, curriculum learning can reduce the amount of data needed for training, since the model is gradually introduced to the complexity of the task.
In this work, we tailor curriculum learning toward source code modeling, by coupling it with the notion of code complexity, as described in Section 4.3. There exist several well-established metrics in SE to measure code complexity, including relatively straightforward counters, as well as higher-order cyclomatic and Halstead metrics. Examples of the former include the number of lines of code, decision points, loops, functions, classes, and so on. Building atop basic counters like if-counts, cyclomatic complexity measures the number of linearly independent paths in a program, arising from control flow statements. Halstead complexity, in turn, measures concepts such as the difficulty of the program to understand and the coding time required, amongst others [45], in terms of the number of operators and operands in the program. It provides abstract measures of potential error rate and maintenance effort. While we experiment with a few of these metrics to highlight the potential of complexity-ranked training, our goal in this work is not to advocate for one over the other.
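As an illustration of the kind of counters involved, the following Python sketch computes rough complexity scores for a function-level C snippet. It is only illustrative: our experiments rely on the Frama-C and Lizard tools (Section 4.3) rather than these regex-based approximations, and the hypothetical train_samples ranking in the trailing comment is an assumption.

import re

def rough_complexity(src: str) -> dict:
    # Count branching keywords and short-circuit operators as decision points.
    decisions = len(re.findall(r"\b(?:if|for|while|case)\b", src))
    decisions += src.count("&&") + src.count("||")
    operators = len(re.findall(r"[+\-*/%<>=!&|^~]+", src))        # crude Halstead-style operator count
    operands = len(re.findall(r"\b[A-Za-z_]\w*\b|\b\d+\b", src))  # identifiers and literals
    return {
        "sloc": sum(1 for line in src.splitlines() if line.strip()),
        "cyclomatic": decisions + 1,        # McCabe: decision points + 1 for a single function
        "halstead_length": operators + operands,
    }

# Samples can then be ranked by any of these scores, e.g.,
# sorted(train_samples, key=lambda s: rough_complexity(s.code)["cyclomatic"]).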
4 Design
Figure 3 (Section 1) presents a high-level overview of all our data-driven approaches co-existing within the typical AI modeling pipeline. Specifically, we add one model probing approach, two model assistance approaches, and one model introspection approach to the pipeline. As shown in the figure, our augmentation technique resides in the dataset curation phase, expanding the training set with simplified programs. Next in the pipeline resides our complexity-ranking approach, applicable during the model training phase, and operating atop either the original dataset or the augmented dataset, independently. Finally, once the model has been trained, either via generic training or complexity-ranked training, our model probing and introspection approaches come into play. The former measures the model signal awareness, while the latter deduces its learning behavior. All approaches can exist independently of one another. The next subsection presents a brief overview of each of them; in the subsequent subsections, we present the details of our model probing approach, followed by the model learning assistance approaches, and finally the introspection approach.
4.1 Overview
Signal-awareness Measurement. Our first data-driven approach—Prediction-Preserving Input Minimization (P2IM)—systematically evaluates a model’s ability to capture real signals. We borrow Delta Debugging [123], a popular fault-isolation technique from the Software Engineering (SE) domain. The central idea of our approach is to first reduce the original source code input to a trained model into a minimal snippet, without changing the model prediction. This represents the minimal excerpt of the original code that the model requires to arrive at and maintain its original prediction. Then, the model’s signal-awareness is determined by verifying whether the minimal snippet exhibits the same task profile as the original code. Taking the example of vulnerability detection, a lack of signal awareness in the model is uncovered if the minimal snippet (which the model predicts to be vulnerable) does not in fact contain the original vulnerability. Figure 6 in Section 4.2 demonstrates this with an example walk-through of P2IM. Additionally, we present a new metric called Signal-aware Recall (SAR) to measure how well a model captures task-specific signals.
Fig. 5.
Fig. 6.
When probing existing AI models of code for their signal awareness, we observe Signal-aware Recall in the sub-50s for models with traditional Recall in the high 90s. Less than half of the True Positives predicted by the models can be attributed to task-relevant signal learning, suggesting that the models are presumably picking up a lot of noise or dataset nuances while learning their logic. Note that we are not suggesting a shortcoming in the Recall metric. In fact, the expectation is for the model’s SAR to reach its Recall in the ideal scenario where the model truly captures task-specific signals. SAR serves as a signal-aware supplement to the common metrics toolbox of model quality measurement.
While probing model signal-awareness, we maintain utmost fairness to the model. We keep the model untouched, and never alter its training process or its training distribution; training is performed on the original, unmodified dataset. The model’s Recall and SAR are evaluated on the dataset’s original test-set itself. We treat the trained model as a black box and query it for its prediction on iteratively smaller versions of each vulnerable sample from the test-set. With such input “perturbations,” our goal is not to craft programs for adversarial attacks [16, 39, 106, 118], but to uncover the signal-awareness of a black-box model in a data-driven manner via controlled queries. That is, at a very superficial level, we ask the trained model whether it feels a code sample “AAAA” is vulnerable, where “A” represents any atomic code chunk granularity; then whether it feels “AAA” is vulnerable, then “AA,” and so on until its verdict changes. We then test if the minimal reduced version, which preserves the model’s verdict, actually contains the original ground-truth vulnerability. To ensure fairness to the model, we only feed it valid, compilable reduced samples for its verdict, without introducing any new bugs.
Signal-awareness Enhancement. With P2IM revealing the shortcomings of current task-agnostic training, we next explore the option of nudging the training process to learn models with greater signal awareness, whilst maintaining model performance. As opposed to white-box or domain-specific approaches such as robust training and adversarial training [12, 38, 76, 124], we explore code-centric, data-driven approaches to guide the models in focusing more on task-relevant aspects of source code. We do so by transforming the training data using SE techniques in two different ways.
We approach model signal-awareness enhancement by marrying the SE concept of code complexity with the AI technique of curriculum learning [10]. Specifically, we introduce the notion of complexity metrics during training, and feed program samples to the model in increasing order of their code complexity. The intuition is that presenting “easier” examples first improves the model’s chances of picking up task-relevant signals, helping it sift through noise in the more complex examples down the line.
Next, we present an alternate data-augmentation-based approach to assist model learning, with the notion of complexity of code being implicit, in contrast to the previous approach. Here, we incorporate SE assistance into AI model training by customizing Delta Debugging to generate simplified program samples. The intuition is that by adding smaller and potentially de-noised code samples to the training dataset, while preserving their task profile, the model can learn relevant signals better. Within the vulnerability detection setting, in particular, preserving existing bugs while generating simplified programs is the key difference in our approach. This is in contrast to existing source-level bug-seeding-based augmentation methods, which can lead to previously unseen bugs [13, 29, 78, 84, 86, 91, 112].
Model Learning Introspection. We continue our data-driven model exploration to visualize how the altered model training process might have improved its signal awareness. We leverage the unique opportunity afforded by the source code setting to develop a code-complexity-driven black-box model introspection approach. Specifically, we analyze a model’s predictions from the dataset perspective, beyond the statistical measures of the model prediction accuracy. The intuition is to analyze the characteristics of the samples that the model predicted correctly versus the mispredicted ones. By using code complexity metrics to group samples by prediction accuracy, our approach uncovers model learning behavior in terms of the aspects of code the model can grasp, and where it faces difficulties. We believe that such code-centric model learning insights offered by our approach are more developer-friendly than the black-box quantitative measures of model performance and existing model explanation approaches (Section 10).
4.2 Prediction Preserving Input Minimization
Programming defects are an inevitable reality in software creation. Vulnerabilities arise when such defects fall in a security-related subset such as null pointer dereferences, buffer overflows, and use-after-frees, amongst others. Static analyzers detect these vulnerabilities either by reasoning about the possible execution behaviours over a program model, or by matching defect-specific rules. Dynamic analysis, in contrast, directly executes the program, exploring different execution paths to concretely expose the defects. Unlike the traditional analyzers, the logic of AI models of code is implicit, and not directly perceptible. In this section, we present our approach toward understanding this logic, while treating the models as black-box entities. Given that explaining what an AI model is learning is still an open problem, especially in the context of source code understanding, we frame our exploration in terms of detecting whether the models are learning the vulnerability-relevant signals.
Our intuition behind data-driven model probing comes from the failure-inducing input simplification idea of Delta Debugging (DD) described in Section 3.2, replacing the Mozilla target with the prediction model, and the failure-inducing bug reports with vulnerable program samples. DD’s process of identifying a minimal subsequence of the input that leads to the same output then translates to identifying the minimal subprogram that preserves the model’s prediction. The model’s signal-awareness is then determined by testing the minimal subprogram for the existence of the original vulnerability. Note that while we use DD to reduce program samples, our approach is not reliant on it. DD offers an efficient reduction solution and can be substituted by other alternatives, as discussed in Section 7.5.
4.2.1 P2IM Workflow.
Figure 5 depicts the overall flow behind our prediction-preserving input minimization (P2IM) approach. The sequence of operations is as follows:
Step 1. P2IM takes as input a trained model and a program sample that the model predicts to be vulnerable. This sample comes from the test-set of the dataset used for model training.
Steps 2–4. P2IM then iteratively keeps reducing the sample and querying the model for its prediction on the reduced subprogram, so long as the model maintains its vulnerability prediction. This process continues until a minimal snippet, called the 1-minimal, is extracted from the program sample, such that removing even a single token from it would change the model prediction. To efficiently and systematically reduce a program sample, we follow the same procedure as outlined in Section 3.2, wherein DD reduces the program sample at the level of source code tokens, iteratively splitting the sample until a valid 1-minimal subprogram is extracted. The iterative DD process is driven by an oracle, which decides whether or not an intermediate reduced subprogram should be picked for subsequent reductions. To produce a valid and prediction-preserving 1-minimal subprogram, as shown in Algorithm 1, we customize the oracle to require that the reduced subprograms satisfy the following properties (lines 3 and 4); a sketch of such an oracle follows the step descriptions:
•
Vulnerable prediction. A subprogram is selected for further processing only if it preserves the model’s vulnerability prediction.
•
Valid program. We also enforce that the reduced subprogram is valid and compilable. Although we may be able to reduce more tokens by dropping this requirement, it is unfair to abuse the model by throwing random code at it, since it has (hopefully) been trained on valid programs as inputs. This additionally enables cross-verification with traditional program analysis tools.
•
Vulnerability type. Beyond just asserting the correctness of the reduced subprogram as above, it is also verified for either possessing the same bug location and type as the original sample, or being bug-free, contingent on the oracle’s quality (Section 5.4). This is done to maintain fairness to the model, by not querying it with subprograms that have any additional bugs introduced by the minimization process itself.
Step 5. The 1-minimal produced in the previous step is a valid subprogram that the model predicts to be vulnerable. It represents the bare minimum excerpt of the input sample, which the model needs to arrive at and stick to its original prediction. The requirements specified in line 4 of Algorithm 1 ensure that the 1-minimal either has the same original vulnerability or is vulnerability-free, without any new vulnerabilities being introduced by the minimization procedure. Then, if this minimal excerpt indeed contains the same vulnerability as the original sample, P2IM treats it as a sign of the model learning vulnerability-specific signals. However, if the corresponding vulnerability from the original sample is missing in the 1-minimal, then this points toward the model capturing noise or features not relevant to vulnerabilities.
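Below is a Python sketch of such a prediction-preserving oracle, pluggable into the ddmin-style reducer from Section 3.2. The helpers compiles, model_predicts_vulnerable, and same_or_no_bug are hypothetical placeholders (in our setup the last check is backed by the Infer-based checker of Section 5.4); this is an illustration of the properties above, not the exact implementation of Algorithm 1.

def p2im_oracle(reduced_tokens, model, original_bug):
    src = " ".join(reduced_tokens)
    if not compiles(src):                            # valid-program requirement
        return False
    if not model_predicts_vulnerable(model, src):    # prediction-preserving requirement
        return False
    # Fairness requirement: the snippet must either retain the original bug
    # or be bug-free; reductions that introduce new bugs are rejected.
    return same_or_no_bug(src, original_bug)

# one_minimal = ddmin(tokens, lambda t: p2im_oracle(t, model, original_bug))
# The model is then credited with signal awareness (a TP') only if one_minimal
# still contains the original vulnerability.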
4.2.2 P2IM Examples.
We present two examples to demonstrate how P2IM minimizes program snippets, produces valid 1-minimal, and determines if a model learns vulnerability related signals.
Model learned real signals. Figure 6 shows an example walk-through of P2IM. In the original program, a buffer overflow can be triggered in “buf[b]=1;” when b>=10. We list the key steps to show how P2IM gradually adjusts the granularity of editing and reduces the tokens. In iterations #2–3, the token sequence is split into two parts, but neither forms a valid program. Hence, in iterations #4–7, DD operates at a finer granularity and the sequence is split into four parts. Since P2IM cannot find any valid subprograms, it doubles the sequence split to eight parts, and finds a valid subprogram (iteration #13). Since the model predicts it to be vulnerable, the statement “a+3;” can be removed. After that, P2IM finds that iterations #15, #34, and #46 are valid subprograms with vulnerable predictions, so more tokens can be reduced. Given that there are no smaller valid subprograms, iteration #46 is the prediction-preserving 1-minimal. More importantly, iteration #46 is indeed vulnerable, so P2IM counts it in favor of the model correctly capturing vulnerability-related signals.
Model missed real signals. Figure 4, used in the Introduction, shows an example of P2IM catching a model incorrectly learning non-vulnerability signals. In the original program, the sink variable “data” points to a buffer of size 10, while the size of the source variable “source” is 11. Therefore, a buffer overflow can occur at line 11. The model correctly predicts this program as being vulnerable. However, the model also considers its 1-minimal to be vulnerable, even though it is missing the culprit assignment operations. This means the model does not even need the statement where the buffer overflow actually occurs to make the vulnerable prediction. P2IM regards this as the model not learning vulnerability-specific signals.
4.3 Code-complexity-ranked Training
We shall later use our aforementioned P2IM approach to highlight a lack of signal awareness in today’s AI models of code. We now present the first of our data-driven approaches to improve model signal awareness by transforming the training data.
Our model training approach marries the AI concept of curriculum learning [10] with the SE notion of source code complexity. The idea, inspired by how humans learn, is to present the learner with simpler code samples initially during training, and to increase sample complexity progressively. Coming from the original training set, these initial samples still contain the same traits that we want the model to learn, while being relatively easier than their counterparts. This can improve the model’s chances of picking up task-relevant signals, with less interference from potential “noise” in more complex samples, in the form of statements or constructs not relevant to the task at hand. The intention is that, equipped with the knowledge of the right “signals” to look for, the model will be better able to sift through noise in the rest of the samples, and refine its learning while maintaining task-awareness.
Figure 7 presents our complexity-ranked training approach. The first step is the extraction of code complexity metrics from the training set samples. We use the Frama-C [25] and Lizard [73] tools to extract relatively straightforward counters such as sloc, ifs, and loops, as well as higher-order cyclomatic and Halstead (volume, difficulty, effort) complexity metrics, measuring concepts such as linearly independent paths and coding effort required, amongst others. These commonly used code metrics serve to highlight the potential of complexity-ranked training, and can be replaced by other appropriate metrics of choice.
Fig. 7.
Next, the samples are ranked in increasing order of their corresponding metric score(s). The samples are then fed to the model for training in this complexity-ranked order (e.g., difficulty 12 \(\rightarrow\) 17 in the left half of Figure 9), as opposed to generic random-sampling-based training; a minimal sketch follows below. As we show in Section 6.2, introducing code complexity awareness into model training in this manner can improve the model’s signal awareness significantly. The different complexity metric options, in this setting, serve as tunable knobs to influence model learning.
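The following PyTorch-style sketch illustrates this ranked feeding. The train_samples list with precomputed complexity scores, the collate function, and the model, optimizer, criterion, and num_epochs objects are assumed to exist; the sketch is illustrative rather than our exact training code.

from torch.utils.data import DataLoader

# Order training samples from least to most complex (e.g., by Halstead difficulty).
ranked = sorted(train_samples, key=lambda s: s.complexity)

# shuffle=False preserves the easy-to-hard curriculum within each epoch,
# in contrast to generic random-sampling-based training.
loader = DataLoader(ranked, batch_size=128, shuffle=False, collate_fn=collate)

for epoch in range(num_epochs):
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()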
Fig. 8.
Fig. 9.
4.4 Augmentation via Program Simplification
For our second model learning enhancement approach, we take a data-augmentation route, with the notion of complexity of code being implicit, in contrast to the previous approach. Rather than simply offering more examples, we “simplify” them while preserving the signals.
Figure 8 presents the overall flow. We again borrow the Delta Debugging (DD) technique as in our P2IM approach, with one exception—our program simplification approach is model-agnostic and does not include the model in the DD reduction process. Specifically, while signal-probing preserves the original prediction of the model under test, and only then tests whether or not the original bug is present in the 1-minimal, program-simplification preserves the bug while successively simplifying an input program, independent of any model. For each input sample, the intermediate subprograms generated during the reduction cycle that satisfy certain prerequisites are emitted to serve as augmentation candidates. Each valid iteration of the reduction cycle generates code smaller than its parent. The successive denoising achieved as a result carries the potential of assisting the model in learning the relevant signals better, when trained on the augmented dataset. Figure 9 shows the effect of our augmentation approach adding simpler examples, in terms of the dataset complexity distribution. Note that while we use DD to simplify program samples, our approach is independent of the reduction engine being employed. Section 7.5 presents other alternatives.
Recall that the iterative DD process is driven by an oracle, which decides whether or not an intermediate reduced subprogram should be picked for subsequent reductions. Similar to our P2IM approach, we customize the oracle with a Verifier and a Labeler to require the reduced subprograms to satisfy the following properties, described in an example setting of vulnerability detection:
•
Valid program (Verifier). We enforce that the reduced subprogram is valid and compilable, to ensure models are not trained on incorrect code later on.
•
Vulnerability type (Labeler). We additionally check the reduced subprogram for either possessing the same bug as the original sample, or being bug-free. By ensuring that no new bug gets introduced during the reduction cycle, this maintains dataset integrity, so that no samples with out-of-dataset labels are emitted.
Each reduced subprogram satisfying the above properties is correspondingly labeled and emitted, and the reduction cycle continues. Figure 10 illustrates an example reduction.
Fig. 10.
The overall reduction cycle results in multiple simplified samples being generated from each training sample, with each valid iteration generating code smaller than the parent. The final step is to assist model learning by adding these smaller, potentially de-noised code samples into the training mix.
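The following Python sketch summarizes this model-agnostic simplification loop. Here, reduce_steps (yielding intermediate DD reductions), verifier (compilability check), and labeler (same-bug-or-bug-free check, e.g., Infer-backed) are hypothetical helpers standing in for the components of Figure 8, not our exact implementation.

def simplify_for_augmentation(sample, reduce_steps, verifier, labeler):
    augmented = []
    for subprogram in reduce_steps(sample.code):     # intermediate reductions, each smaller than its parent
        if not verifier(subprogram):                 # must remain valid and compilable
            continue
        label = labeler(subprogram, sample.bug)      # 1: original bug retained, 0: bug-free, None: new bug
        if label is None:
            continue                                 # discard samples with out-of-dataset labels
        augmented.append((subprogram, label))
    return augmented

# Selected simplified samples are then added back into the training mix, shifting
# the training set's complexity distribution toward simpler programs (Figure 9).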
4.4.1 Discussion.
Use-Cases. While we use program simplification to improve signal awareness, our approach can be used as a standalone augmentation scheme to either reduce overfitting (by adding all generated samples to the training set) or reduce class imbalance (by adding only minority-class generated samples). It can also be combined with our complexity-ranked training approach, by (i) applying the latter atop the augmented set to assist model learning (Section 6.4), or (ii) training and comparing the model on subsets ordered by program simplicity to verify model capacity and quality.
Labeler Customization. While we presented our approach in the setting of vulnerability detection, it is independent of the target source code understanding task. The labeler is what tailors our approach to a particular task, and controls the quality of the generated samples. For our experiment settings and the datasets considered in this article (Section 5.1), we found the Infer analyzer [32] to work quite well (Infer mechanics as in Section 5.4). However, our approach is not reliant on it—Section 8.3 discusses other alternatives.
4.5 Model Learning Introspection
Moving beyond model learning enhancement, but still maintaining the notion of code complexity, we tackle another awareness-related aspect of AI models of code: uncovering what the black-box model is learning. Unlike the few existing model explanation approaches, we approach model learning analysis from the dataset’s perspective, using well-defined source code characteristics to deduce model learning. The intuition is to compare the common characteristics of the samples that the model predicted correctly versus those that it got wrong, and to use this comparison to highlight the aspects of code the model is able to grasp versus those where it faces difficulties.
Figure 11 shows the overall flow of our model learning introspection approach, outlined in Algorithm 2. Given a trained model and the corresponding model predictions for the test set samples, we first extract their complexity metrics as discussed in Section 4.3. Then, we group the samples by their prediction accuracy, such as True Positives (TP; e.g., samples correctly predicted by the model as being ‘buggy’) and False Negatives (FN; e.g., samples incorrectly predicted by the model as being ‘healthy’). For each group, we generate a distribution of the samples with respect to their complexity values. This is shown in the figure as a histogram of sample counts for different complexity values, ranging from 7 to 13 in this synthetic example. Finally, we compare and contrast the complexity metric distributions across groups, to concretely highlight the differing aspects of code grasped or missed by the model. Figure 11 shows an example where the model almost always predicts high-complexity samples (with complexity values 12 and 13) incorrectly. This is indicated by the highlighted FN bars having negligible presence in the TP samples distribution.
Fig. 11.
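A minimal Python sketch of this grouping is shown below, assuming each test-set record carries a precomputed complexity value, its ground-truth label, and the model's prediction (1 = buggy); these field names are assumptions for illustration. Plotting the resulting counters as histograms yields views like Figure 11.

from collections import Counter

def complexity_histograms(test_records):
    groups = {"TP": Counter(), "FN": Counter()}
    for r in test_records:
        if r.label != 1:                 # introspect the vulnerable samples here
            continue
        group = "TP" if r.prediction == 1 else "FN"
        groups[group][r.complexity] += 1
    return groups

# Comparing groups["TP"] and groups["FN"] bin by bin shows, for instance, whether
# high-complexity samples are concentrated among the model's mispredictions.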
Such data-driven model prediction analysis can be used to drive a multitude of use-cases, such as:
(1)
Dataset segmentation and introspection: Which samples are easy vs. hard for prediction? What are the characteristics that make samples easy or hard to predict? Are there any commonalities, biases or limitations in the dataset affecting different models?
(2)
Decipher learned model-logic: What aspects of code is the model able to grasp, and where is it facing difficulties?
(3)
Drive code-centric model evolution: Target code characteristics common to mispredictions in training. Trace model understanding improvements across hyperparameter tuning iterations.
(4)
Design space evaluation from dataset perspective: Trace model improvement techniques such as data augmentation, curriculum learning, active learning, and adversarial training.
(5)
Code-centric model comparison: Assess task-suitability across models; this is especially useful for models with similar performance measures (F1, accuracy, etc.).
In our experiments, we cover two of these use-cases (2 and 4): figuring out how model learning evolves, in the context of model signal awareness, across the iterations of our aforementioned program-simplification-based augmentation.
5 Experiment Configuration
We now describe the setup we use, in terms of the datasets, models, and metrics, for evaluating our approaches.
5.1 Datasets
Signal awareness measurement, as described in Section 4.2, requires datasets with ground-truth bug locations beyond class labels alone. Datasets from Draper [92] and Devign [129] are excluded, because they do not specify bug locations. VulDeePecker [69] and SySeVR [68] are also excluded, because the samples the models are trained upon are slices converted into linear sequences, not valid compilable code. Therefore, we use the following three datasets, which contain this bug-level information.
5.1.1 Juliet.
The Juliet Test Suite [82] contains synthetic examples with different vulnerability types, designed for testing static analyzers. From its test cases, we extract almost 32K training functions, amongst which 30% are vulnerable. Samples tagged as “bad,” and with clear bug information as per Juliet’s manifest.xml file, are labeled as 1, while the ones with a “good” tag are labeled as 0.
5.1.2 s-bAbI.
The s-bAbI synthetic dataset [95] contains syntactically-valid C programs with non-trivial control flow, focusing solely on the buffer overflow vulnerability. We used the s-bAbI generator to create a balanced dataset of almost 40K training functions. Samples with “UNSAFE” tag are labeled as 1, and those with “SAFE” tag as 0.
5.1.3 D2A.
In contrast to the synthetic s-bAbI and Juliet datasets, we also include a real-world dataset with bug location and bug type information, which we derive from the D2A dataset [127]. D2A is a trace-level dataset built over multiple GitHub projects—OpenSSL, FFMpeg, HTTPD, Nginx, and libtiff. It is generated by using differential analysis atop the Infer static analyzer outputs of consecutive versions before and after bug-fixing commits. From D2A’s traces, we derive function-level samples. From each before-fix trace associated with a bug, we extract the functions specified by the reported bug locations and label them as 1. We also extract the corresponding functions in the after-fix trace and label those patched by the corresponding commit as 0. After deduplication, we have 6,728 functions in total.
5.2 Models
We use neural networks as our AI model baselines. This is driven by two main reasons. First, existing work [92, 129] has already demonstrated the superiority of neural-network-based deep learning over classical machine learning models (e.g., random forest and XGBoost) of source code. Second, deep learning has a more pronounced black-box nature than classical machine learning. The ability of neural networks to automatically learn input features enables them to learn complex functions, but it hampers their interpretability. This is in contrast to classical machine learning approaches, which offer better visibility into their learned logic while operating upon explicitly defined features.
We apply our approaches on three different neural network architectures, which have been popularly employed for vulnerability detection. These operate upon different representations of source code.
CNN: This model treats source code as a photo and tries to learn the pictorial relationship between source code tokens and underlying bugs. Similar to Reference [92], we apply token normalization before feeding data into the model. This involves normalizing the function names and variable names to fixed tokens such as Func and Var. We set the embedding layer dimension as 13, followed by a 2D-convolutional layer with input channel as 1, output channel as 512, and kernel size as (9, 13). The final prediction is generated by a three-layer multilayer perceptron (MLP) with output dimensions being 64, 16, and 2.
RNN: This model treats code as a linear sequence of tokens and tries to detect bugs in source code using the temporal relationship between its tokens. We base our RNN implementation on [68]. The input function is also normalized during preprocessing, the same as for the CNN model. We set the embedding layer dimension as 500, followed by a two-layer bi-directional GRU module with hidden size 256; the final prediction is generated by a single-layer MLP.
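As a concrete reference point, the following is a minimal PyTorch sketch matching the RNN configuration above (embedding dimension 500, two-layer bi-directional GRU with hidden size 256, single-layer MLP head). The vocabulary size and batching details are assumptions, and the sketch omits the padding/masking handled in the actual implementation.

import torch
import torch.nn as nn

class RNNDetector(nn.Module):
    def __init__(self, vocab_size, embed_dim=500, hidden=256, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)   # single-layer MLP

    def forward(self, token_ids):                        # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, h = self.gru(x)                               # h: (num_layers * 2, batch, hidden)
        final = torch.cat([h[-2], h[-1]], dim=-1)        # last layer, both directions
        return self.head(final)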
GNN: Instead of borrowing techniques from the image and time-series domains, this model operates at a more natural graph-level representation of source code, as per [103, 129]. It tries to learn vulnerability signatures in terms of relationships between nodes and edges of a Code Property Graph [116]. Following [129], we do not apply token normalization during preprocessing. We set the embedding size as 64, followed by a GGNN layer [67] with hidden size 256 and 5 unrolling time steps. The node representations are obtained via summation of all node tokens’ embeddings, and the graph representation read-out is constructed as a global attention layer. The final prediction is generated by a 2-layer MLP with output dimensions 256 and 2.
The models are trained over the datasets presented in Section 5.1 using an 80:10:10 train:validate:test split, with the vulnerability detection problem framed as a binary classification task—predicting program samples as healthy (label 0) or buggy (label 1). For all of the models, we set the dropout rate to 0.2 during training and use the Adam [58] optimizer. We tuned the learning rate in \(\lbrace 10^{-3}, 10^{-4}\rbrace\) and the batch size in \(\lbrace 24, 36, 128, 256, 512\rbrace\). Models are trained to minimize cross-entropy loss, with class weights calculated from the training set. We save the checkpoint with the least validation loss across epochs, with early stopping employed (patience = 10). Results are averages across multiple runs.
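A sketch of this training setup is shown below. The train_one_epoch and evaluate helpers, the data loaders, the class_weights tensor, and max_epochs are assumed to exist, and the exact checkpointing bookkeeping in our experiments may differ.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # lr tuned in {1e-3, 1e-4}
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)    # class weights from the training set

best_val_loss, patience, bad_epochs = float("inf"), 10, 0
for epoch in range(max_epochs):
    train_one_epoch(model, train_loader, optimizer, criterion)
    val_loss = evaluate(model, val_loader, criterion)
    if val_loss < best_val_loss:                               # keep the least-validation-loss checkpoint
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_checkpoint.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                             # early stopping (patience = 10)
            break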
5.3 New Metric: Signal-aware Recall
We use the following methodology for testing model signal awareness. The trained models assign labels to the code samples in the test set as either vulnerable or not, based upon the predicted class probabilities. After prediction, all vulnerable samples in the test set fall under either True Positives (TP; samples correctly predicted by the model as being “buggy”) or False Negatives (FN; samples incorrectly predicted by the model as being “healthy”). To test if a model picked the right signals for its prediction, we subject each TP predicted by the model to P2IM reduction. We first query the model for its prediction on each TP sample’s 1-minimal version. Then we check the 1-minimal for the presence (TP’) or absence (FN’) of the original program sample’s bug. This way, we subdivide TP into signal-aware TP’ and signal-agnostic FN’.
Operating atop vulnerable samples (TP+FN), signal-awareness measurement then boils down to counting how often, when a model predicts a sample to be vulnerable (TP), it uses the right signals (TP’) to arrive at its prediction. In this setting, the model Recall serves as the best metric to target for fairly comparing different models on the same dataset. This is because the number of vulnerable samples = TP + FN = TP’ + FN’ + FN will be the same for all models for the same dataset. Additionally, signal awareness measurement on other groups, e.g., False Positives (samples incorrectly predicted by the model as being “buggy”), is not very useful, as it does not give any insights into whether the model is learning any vulnerability-specific signals.
Following from the above discussion, we present a new metric—Signal-aware Recall (SAR)—to measure the model signal awareness. While Recall = TP/(TP + FN), SAR is defined as TP’/(TP’ + FN’ + FN), where TP = TP’ + FN’. Recall measures the proportion of the vulnerable samples that the model predicts correctly, while SAR measures for how many of those cases does the model capture the right signals to arrive at its prediction. By definition, SAR \(\le\) Recall, and the expectation is to observe a shortening of this gap in the experiments, with our signal awareness enhancement approaches. We use a relative SAR:Recall ratio (\(\le \! 1\)) in our experiments, encapsulating how much of the model’s Recall is attributable to task-relevant signal learning. As shown later in Table A.1, model performance does not get compromised in our experiments, while we strive toward improving the model’s SAR:Recall. Note that Recall and SAR for all model training configurations—baseline, complexity-ranked, as well as dataset-augmented—are evaluated on the dataset’s original untouched test-set itself.
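The metric computation itself is a matter of simple counting, as in the sketch below; the counts tp_prime, fn_prime, and fn are assumed to come from running P2IM on every True Positive and checking each 1-minimal with the bug checker.

def recall_and_sar(tp_prime: int, fn_prime: int, fn: int):
    tp = tp_prime + fn_prime                        # TP is split into signal-aware TP' and signal-agnostic FN'
    recall = tp / (tp + fn)
    sar = tp_prime / (tp_prime + fn_prime + fn)     # SAR <= Recall by construction
    return recall, sar, sar / recall                # SAR:Recall ratio (<= 1)

# For example, with Recall 93.4 and SAR 42.0 (s-bAbI / CNN in Table 1),
# the SAR:Recall ratio is about 0.45.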
Table 1. Comparing AI Models of Code Using Standard as Well as Proposed Signal-aware Recall (SAR) Metric

Dataset   Model   Accuracy   F1     Precision   Recall   SAR    SAR:Recall %
s-bAbI    CNN     94.3       94.2   95.1        93.4     42.0   45.0
s-bAbI    RNN     94.9       94.8   97.1        92.7     41.9   45.2
s-bAbI    GNN     92.8       92.8   92.2        93.4     42.0   45.0
Juliet    CNN     97.9       96.4   93.2        99.9     49.4   49.4
Juliet    RNN     97.9       96.3   93.1        99.8     49.4   49.5
Juliet    GNN     99.9       99.9   99.9        99.9     49.6   49.6
D2A       CNN     60.7       59.4   64.2        55.2     25.9   46.9
D2A       RNN     57.1       60.0   58.3        61.8     28.9   46.8
D2A       GNN     59.3       64.2   59.4        69.8     31.8   45.6

Note the drop in model quality (Recall \(\rightarrow\) SAR) when probing it for signal awareness.
Table 2. Overlap Between the Subset of Samples with the Same Prediction Across Different Models

Dataset   TP Overlap %   TP’ Overlap %
s-bAbI    95.3           91.0
Juliet    97.4           96.9
D2A       49.0           49.3

Shown is the overlap percentage for the TP (as well as signal-aware TP’) subsets across the {CNN, RNN, GNN} models for each dataset.
5.4 Infer as a Checker
SAR measurement using P2IM requires a checker to test for bug existence in the 1-minimal. For the datasets considered in this article (Section 5.1), we use the Infer analyzer [32], with fallback to line-based bug matching for samples where Infer’s verdict differs from the original bug label. While using Infer, at each iteration of the reduction cycle, we compare Infer’s analysis of the reduced subprogram with that of the original program sample, to ensure that the reduced subprogram is either bug-free or possesses only the same bug as the original program sample. The latter is detected by a hit for the original bug in Infer’s preexisting.json comparison file and a miss in the introduced.json file. Although Infer as a checker is a fortunate fit given our target datasets, P2IM is independent of the specific bug checker being employed; alternative checkers are discussed in Sections 7.3 and 8.3.
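The following is a hedged sketch of this “possesses only the same bug” check, assuming Infer’s comparison of the reduced subprogram against the original sample has already produced introduced.json and preexisting.json (e.g., via Infer’s report-diff facility); the report field name ("bug_type") and the type-only matching are simplifying assumptions, not the exact pipeline.

```python
import json

def keeps_only_original_bug(preexisting_path, introduced_path, original_bug_type):
    """Sketch: reduced subprogram keeps the original bug and introduces no new ones."""
    with open(preexisting_path) as f:
        preexisting = json.load(f)
    with open(introduced_path) as f:
        introduced = json.load(f)
    # No new bugs may appear in the reduced subprogram...
    if introduced:
        return False
    # ...and the original bug must still be reported among the preexisting ones.
    return any(report.get("bug_type") == original_bug_type for report in preexisting)
```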
5.5 Research Questions
Following are the main research questions we aim to explore with our experiments:
(1) How task-relevant is model learning, measured in terms of model signal awareness?
(2) What impact does complexity-ranked training have on model learning behavior, and does it improve model signal awareness?
(3) What impact does program-simplification-based augmentation have on model signal awareness, and is it better than generic augmentation?
(4) What sort of model learning deduction can be obtained by leveraging the dataset’s code complexity distribution, and is it more insightful than usual statistical measures?
For RQ1, we run each True-positive sample from the dataset’s test set through our P2IM pipeline, generating the corresponding 1-minimal for each sample. Then, we check the 1-minimal for the presence or absence of the original sample’s bug, using the Infer tool [32] (Section 5.4). Signal awareness is then calculated using the SAR metric defined in Section 5.3. For RQ2, we first extract the code complexity metrics from the training set samples, using Frama-C [25] and Lizard [73]. Then, we train the models on the training set samples ranked in increasing order of their complexity scores, recording the validation loss trend across epochs. Finally, model signal awareness is measured using the SAR metric. For RQ3, we first run the training set samples through our program simplification pipeline. From amongst the simplified programs thus generated, randomly selected samples are added to the original training set. The models are then trained upon the augmented set using different augmentation levels, with their signal awareness measured using the SAR metric. Finally, for RQ4, the test-set samples are first grouped by their prediction accuracy (into TP’ and FN’ as defined in Section 5.3). Then, their code complexity metrics are extracted as in RQ2. This is repeated for different levels of augmentation as in RQ3. Finally, the complexity distributions for the TP’ groups (similarly, FN’ groups) for the different augmentation levels are compared, to trace how model signal awareness improves with augmentation. More details for each experiment are presented in the respective Results subsections.
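As a concrete illustration of the RQ2 pre-processing, the sketch below scores each training sample with the Lizard library and orders the training set accordingly. It is a minimal sketch, assuming samples are (source, label) pairs; it covers only sloc (nloc) and cyclomatic complexity, and omits the Frama-C-derived Halstead metrics (volume, difficulty, effort) used in our setup.

```python
import lizard

def complexity_score(source_code, metric="nloc"):
    # The filename argument only tells Lizard which language parser to use.
    info = lizard.analyze_file.analyze_source_code("sample.c", source_code)
    if metric == "nloc":
        return info.nloc
    # Otherwise use the maximum cyclomatic complexity across the sample's functions.
    return max((f.cyclomatic_complexity for f in info.function_list), default=0)

def complexity_ranked(train_samples, metric="nloc"):
    # Increasing order of complexity: "easier" samples are seen first during training.
    return sorted(train_samples, key=lambda s: complexity_score(s[0], metric))
```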
6 Results
6.1 Measuring Model Signal Awareness with P2IM
Table 1 compares the performance of the three models under test on the three datasets described earlier, as we probe them for signal awareness with our P2IM approach. Included are the common measures of the models’ classification performance. The model reproductions achieve performance similar to previous work [92, 103], including on the real-world dataset, where performance is comparable to that of Reference [129], which also creates its dataset from Github but lacks bug location information.
The focus is on how Recall compares with the proposed SAR metric. As can be seen, even for the simple synthetic s-bAbI dataset, which targets only one vulnerability type (buffer overflow), models with Recall in the high 90s can only achieve sub-50 Signal-aware Recall across the board. This indicates that the models are picking up features not relevant to vulnerability detection, presumably learning dataset nuances that inflate their reported performance measures. The results are similar for the other datasets as well, with the SAR:Recall column showing that less than half of the True Positives predicted by the models can be attributed to task-relevant signal learning. As for the amount of sample reduction achieved by P2IM without the models changing their predictions, we observe average reduction rates of 24.9% for s-bAbI, 31.1% for Juliet, and 35.7% for D2A, with the rates staying similar across the different models for the same dataset.
Table 2 shows the overlap between the subsets of program samples with the same prediction across different models. On the two synthetic datasets, the overlap percentage is \(\gt\)90% for the True-positive samples (TP), as well as for the signal-aware True Positives (TP’), across all three models. This is unexpected, since the models use vastly different architectures. Combined with the low SAR values from Table 1, this suggests that the high performance on synthetic datasets is significantly influenced by dataset-specific nuances, which all three models are picking up in a very similar way, while missing real vulnerability signals. The corresponding overlap on the real-world dataset is much lower, partially because real-world data offers more variety and fewer artificial nuances for models to pick up, which, however, also contributes to their performance drop relative to the synthetic datasets, as shown in Table 1.
As a takeaway, with the addition of SAR to the existing arsenal of model performance metrics, it becomes possible to measure how much of the model’s learning is actually task-aware. This can additionally provide a fairer comparison, and more accurate improvement guidance, across model evolution.
6.1.1 Runtime Performance.
For scalability, we run P2IM reduction across multiple samples in parallel. The runtime cost increases with the sample size, varying across the datasets as: {min: 1.4, avg: 6.2, max: 25.9} seconds for s-bAbI samples, {min: 0.4, avg: 25.5, max: 572.9} seconds for Juliet samples, and {min: 2.9, avg: 119.6, max: 884.6} seconds for D2A samples.
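A minimal sketch of this parallelization, with `reduce_one` assumed to wrap the full P2IM cycle for a single test-set sample; the real pipeline’s batching and model-serving details are omitted.

```python
from multiprocessing import Pool

def reduce_all(samples, reduce_one, workers=8):
    # Run the per-sample P2IM reductions in parallel worker processes.
    with Pool(processes=workers) as pool:
        return pool.map(reduce_one, samples)
```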
Summary: Although existing vulnerability detection models show high classification performance, their signal awareness is low. Their reliance on task-irrelevant features adds to existing model reliability concerns.
6.2 Enhancing Model Signal Awareness with Code-complexity-ranked Training
Figure 12 shows how an AI model’s signal awareness can be significantly improved by presenting it with source code samples in increasing order of code complexity. Shown is the SAR:Recall ratio for a GNN model on the s-bAbI dataset for different training configurations, including natural random-sampling-based training (baseline), and four complexity-ranked training schemes based upon the sloc, volume, difficulty, and effort complexity metrics, respectively. As can be seen, complexity-ranked training can significantly boost the model’s signal awareness, with difficulty-ordered training achieving a 32% improvement (SAR:Recall increasing from 45 to 59.5). However, CNN and RNN models show only minimal improvements (Appendix Figure A.3).
Fig. 12. SAR:Recall for the GNN model on the s-bAbI dataset under baseline training and the four complexity-ranked training schemes (sloc, volume, difficulty, effort).
Figure 13 shows how the model learning changes with complexity-ranked training for the GNN model on the s-bAbI dataset. It shows the validation accuracy curves (i.e., the model’s interim accuracy on the validation set as it progresses along its training rounds or “epochs”) for the different training configurations—natural training, and the four complexity-ranked training schemes. As can be seen, natural training quickly reaches quite close to its peak performance, whereas learning is relatively slow with the complexity-ranked schemes. Although all eventually reach similar-looking peaks, the crucial difference is the minima reached by these training configurations. Even though both the natural and the complexity-ranked training schemes reach 90%+ accuracy (and F1, Recall), the latter is much more task-aware, as shown before in Figure 12’s Signal-aware Recall (SAR) values. This can be attributed to the model learning, albeit slowly, better signals from “easier” examples first, empowering it to sift through noise in the more complex examples later.
Fig. 13. Validation accuracy across training epochs for the GNN model on s-bAbI, under natural and complexity-ranked training configurations.
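To make the setup concrete, the following is a minimal sketch (not the authors’ code) of complexity-ranked training in a generic PyTorch style: the training set, pre-sorted by a code complexity metric (e.g., via the Lizard-based helper sketched in Section 5.5), is fed to the model without shuffling so that easier samples appear in the earliest batches. Model, dataset, and hyperparameters are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train_curriculum(model, ranked_dataset, epochs=10, lr=1e-3):
    # shuffle=False keeps the complexity ordering of the pre-sorted dataset.
    loader = DataLoader(ranked_dataset, batch_size=32, shuffle=False)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model
```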
This altered training behavior is also seen in the other models and datasets, as shown in Appendix Figure A.2. Although complexity-ranking alters the model’s learning route, and by itself can offer some assistance to the models, it does not always seem to help the model reach a more task-relevant minimum. The baseline SAR:Recall, obtained with the natural training scheme, remains unchanged for the Juliet dataset even with complexity-ranked training, while D2A shows only a 3.5% improvement. But as shall be shown later (Section 6.4), complexity-ranking still has some benefits to offer in those settings.
Summary: The simplicity of complexity-ranked training, together with the altered model learning, offers some assistance to model signal awareness, although not universally.
6.3 Enhancing Model Signal Awareness with Program-simplification-based Augmentation
Our program simplification approach generates \(9\times\) more samples for s-bAbI, \(9.6\times\) for Juliet, and \(53\times\) for D2A, as a factor of the base dataset size. The varying levels of augmentation are due to the difference in the datasets’ sample sizes, which tend to be much bigger for the real-world D2A dataset than for s-bAbI and Juliet (median sloc 36 versus 9). The bigger the input code sample, the more reduction iterations Delta Debugging performs, resulting in potentially more valid intermediate samples being generated.
Training the models over these additional (and simplified) samples yields even greater signal awareness improvements than achieved with complexity-ranked training. This can be seen in Figure 14, showing the SAR:Recall values achieved with different levels of augmentation for the GNN model over the Juliet dataset. The x-axis shows the proportion of samples (as a percentage of the base dataset size) randomly selected (repeated and averaged) from the generated set and added to the base dataset for training, with the leftmost point (\(x=0\)) referring to the baseline model performance (same as shown in Figure 12). As can be seen, model SAR improves dramatically across the augmentation levels, crossing 60% of Recall when all generated samples are additionally used for training (a \(4.8\times\) improvement over baseline). Presenting the model with smaller samples that still contain the characteristics relevant to the task at hand (e.g., bugs) seems to help it focus more on task-relevant aspects of code, and less on noise or dataset nuances. The usual model quality metrics (F1, Recall, etc.) maintain their similarly high (90+) values in all augmentation configurations, while SAR increases drastically, as shown in Appendix Table A.1. More concrete insights on how model learning is changing under the covers shall be revealed in Section 6.5, highlighting the potential of our introspection approach.
Fig. 14. SAR:Recall for the GNN model on the Juliet dataset at different program-simplification-based augmentation levels.
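The augmentation levels on the x-axis can be sketched as follows; the names and the sampling routine are illustrative placeholders, not the actual pipeline.

```python
import random

def augment(base_train, simplified_pool, level_pct, seed=0):
    """Add a random subset of simplified samples, sized as level_pct% of the base set."""
    rng = random.Random(seed)
    if level_pct == "all":
        extra = list(simplified_pool)
    else:
        k = min(len(simplified_pool), int(len(base_train) * level_pct / 100))
        extra = rng.sample(simplified_pool, k)
    # The model is then retrained from scratch on the augmented training set.
    return base_train + extra
```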
The trend is the same for the other datasets, with more augmentation yielding greater signal awareness improvements, as shown in Appendix Figure A.4. While SAR:Recall improvements for s-bAbI reach 113%, D2A records only a modest SAR:Recall improvement of 13.3%. Interestingly, for D2A the model Recall also gains 22.1% during augmentation, whereas the Recalls for the s-bAbI and Juliet datasets are stably high. This suggests that the D2A base dataset is not sufficiently large for model training. With derived simplified examples alone, we are able to guide the model to correctly capture more signals and thus improve both the classic Recall and the Signal-aware Recall. However, this also points to the diversity of real-world programs, and the challenge of generating sufficient simplified programs that can help models distinguish signals from noise in such a diverse code base. Note that the sophisticated methodology behind D2A curation makes it non-trivial to collect more samples and enlarge the training set [127]: it uses differential analysis based on Github commit history to filter false positives from a (presumably) state-of-the-art Infer static analyzer.
These performance gains are not just due to the fact that there are extra samples to train upon. This can be seen in Figure 15, which shows the SAR:Recall values obtained with generic augmentation, compared to our approach, for a few representative augmentation levels on the s-bAbI dataset. As can be seen, just adding more samples to the training set does not necessarily increase the model’s signal awareness, unlike our simplified-program-sample augmentation. The fact that the code samples generated by our approach are smaller and potentially simpler than the original samples is crucial to the model being able to better capture task-relevant signals and sift through noise during training. The results are the same for the Juliet dataset, with generic augmentation not improving upon the baseline SAR at all, irrespective of the augmentation level, very much unlike our augmentation approach. As for D2A, since it is already very limited in size, there is not enough held-out data to test generic augmentation.
Fig. 15. SAR:Recall with generic augmentation versus our program-simplification-based augmentation, for representative augmentation levels on the s-bAbI dataset.
Unlike with complexity-ranked training, our augmentation approach also boosts the signal awareness of the CNN (48%) and RNN (98%) models (Appendix Figure A.5).
Summary: Dataset augmentation with simplified, de-noised program samples assists models in learning task-relevant signals better, while maintaining model performance.
6.4 Enhancing Model Signal Awareness with Hybrid Training
The two approaches—complexity-ranked training and program-simplification-based augmentation—are complementary, and can potentially be combined for different use-cases. These include schemes such as (i) selecting only the more complex samples to be simplified, (ii) ordering the augmented samples by their code complexity metrics during training, or (iii) training and comparing the model separately on subsets ordered by complexity, for verifying model capacity and quality, amongst others. We experiment with one such hybrid setting to explore the potential for even more gains in model signal awareness, by performing complexity-ranking atop the training dataset augmented with simplified program samples. The additional signal-awareness boost with such a combination was most evident for the Juliet dataset, as shown in Figure 16 for a few example augmentation configurations. This is in stark contrast with the inability of complexity-ranking by itself to improve the model’s signal awareness for Juliet (Section 6.2). One hypothesis for this is the expanded metric range that opens up post-simplification, as depicted in Figure 9, improving ranking granularity and thus the ordering of the new samples during training.
Fig. 16. SAR:Recall for hybrid training (complexity-ranking atop the simplification-augmented dataset) on Juliet, for example augmentation configurations.
6.5 Introspecting Model Learning
So far, we have shown the potential of different data-driven approaches to improve model signal awareness. Although we have conjectured the possible reasons behind such improvements, we have not yet probed the model black-box. In this section, we present the results of our code-complexity-driven model introspection approach to analyze model evolution across augmentation iterations.
Our introspection approach deduces model learning behavior by comparing the complexity distributions of the test-set samples, grouped by their prediction accuracy. Recall that the signal-awareness measurement results in the correctly predicted test-set samples being divided into SAR-TP and SAR-FN,2 depending upon whether or not the model captured the right signals to arrive at its otherwise “correct” vulnerability prediction. We compare the code complexity distributions of the SAR-TP and SAR-FN samples, to interpret model learning from the dataset’s perspective. While Section 6.3 showed that training on datasets augmented with simplified programs greatly improves the model’s signal awareness, the goal here is to go one step beyond and trace how the model’s understanding of source code improves with augmentation.
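A minimal sketch of this comparison, assuming each group is a list of (source, label) pairs and `complexity_fn` is a metric extractor such as the Lizard-based scorer sketched in Section 5.5:

```python
from collections import Counter

def skyline(samples, complexity_fn):
    # Histogram of a complexity metric (e.g., sloc) over a group of samples.
    return Counter(complexity_fn(code) for code, _ in samples)

def introspect(sar_tp_samples, sar_fn_samples, complexity_fn):
    # Compare the SAR-TP and SAR-FN "skylines" at one augmentation level;
    # repeating this across levels traces how model understanding evolves.
    return skyline(sar_tp_samples, complexity_fn), skyline(sar_fn_samples, complexity_fn)
```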
Figure 17 presents this comparison using the sloc distribution across s-bAbI augmentation iterations. Comparing the leftmost pair of SAR-TP versus SAR-FN sloc distributions reveals the first insight regarding the baseline model (0% augmentation) facing trouble understanding bigger code samples. This can be seen in terms of the high occurrence count of sloc = {12,13} samples in the SAR-FN plot, with an extremely small presence in the SAR-TP counterpart. Repeating the comparison across different augmentation iterations enables tracing how the model’s understanding of source code evolves. Specifically, this particular code-size weakness improves as augmentation increases, as can be seen with the rising SAR-TP ‘skyline’ (i.e., sample occurrence counts), and correspondingly the falling SAR-FN skyline, most evident for sloc = {12,13} samples. This leads to an intriguing insight about augmentation helping the model better understand bigger code samples. This is especially interesting because the model learning was generic—the model was not explicitly aware of the code size or complexity of samples. Only after the fact, when we analyze the model’s prediction performance from the perspective of the test set’s code complexity, do we uncover these findings. Furthermore, it is not that the base dataset did not have enough large-sized samples for the model to train upon—the base training set consists of around 17% each of sloc = {12,13} samples. Each augmentation iteration introduces more samples for the model to train upon, also implying a greater quantity of de-noised, low-complexity samples (resulting from program simplification; Figure 9), gradually improving the model’s chances to learn relevant signals.
Fig. 17. sloc distributions of SAR-TP versus SAR-FN test-set samples across s-bAbI augmentation iterations.
Different insights can be derived by changing the complexity metrics employed for model introspection. For example, Appendix Figure A.10 presents a similar model evolution analysis for the Juliet dataset, but from the perspective of the cyclomatic complexity (cc) metric (# independent paths in the program). As for the D2A dataset, the modest 13.3% signal-awareness improvement recorded with augmentation precludes the observation of any meaningful evolution trends or insights during its dataset-complexity-driven analysis.
Appendix A.4 presents yet another kind of insight about special classes of samples not affected by augmentations—samples always captured correctly by the model, and those that are consistently mispredicted.
Summary: Model introspection from the dataset perspective yields code-centric insights into model learning behavior, beyond the usual generic measures of model quality.
7 Discussion
7.1 DD in P2IM Versus DD for Program Simplification
Note that while we use Delta Debugging (DD) for both—signal-awareness measurement as well as program simplification—there is no measurement bias. Unlike P2IM, our program simplification approach does not include the model in the DD reduction process. While P2IM preserves the original prediction of the model-under-test while reducing an input program, and only then tests whether or not the original bug is present in the 1-minimal, program-simplification preserves the bug while successively simplifying an input program, independent of any model. Model signal-awareness gets tested as before on the original test-set samples, not on any new simplified samples. The reduced programs used to query the model during signal probing are not “leaked” to the augmented training dataset.
7.2 1-minimal Versus Global Minimum in P2IM
It is possible that the 1-minimal reached by P2IM is not the global minimum for the original program, and the global minimum may not contain the vulnerability-related signals captured in a local 1-minimal. Since computing the global minimum is impractical, as it requires an exponential number of tests, this precludes measuring a model’s signal awareness precisely. Thus, we take the conservative approach of giving the model the benefit of the doubt, crediting it with capturing vulnerability signals based on the 1-minimal reached, even when it may not actually be doing so (based on the global minimum). This results in an upper-bound measurement of the model’s signal awareness (SAR). Nevertheless, as shown in Section 6.1, even measuring the upper bound itself is sufficient to highlight the problem of the models not picking up the correct signals during learning.
7.3 P2IM Checker Quality
Model signal awareness measurement with P2IM is bounded by the quality of the checker used to verify bug existence in the reduced subprograms. The checker can be dataset dependent and can include (i) the original dataset labeler, (ii) a line-based bug matcher, which gives the benefit of the doubt to the model on partial matches, or (iii) a good static analyzer tuned toward high recall. While the Infer analyzer [32] as a checker is a fortunate fit given our target datasets, P2IM is not reliant on it. Infer provides a tighter SAR bound, but a similar lack of signal awareness in the models is still detected, albeit with a looser bound (Section A.1), when Infer is replaced with a purely line-based bug matcher (less accurate; more model-favoring). Using the original dataset labeler might be even more accurate, and would expand the dataset and task applicability of P2IM, but it can be a harder task, especially with human-in-the-loop labelers. Finally, the existence of a perfect checker would preclude the need for AI for code analysis. Yet, to introduce some accountability in today’s AI models of code, we argue for at least a SAR-like sanity check.
7.4 Model Applicability
Section 5.2 presents our rationale behind using neural networks as the AI model baselines in this article, as opposed to classical machine learning approaches. While we apply our approaches to popular CNN, RNN, and GNN models of source code, being data-driven and independent of the learning algorithm and the model internals, they are applicable to other AI models as well. Specifically, in addition to the treatment of code as an image, a sequence of tokens, or a graph, as with these architectures, there is another avenue of treating code as natural language. Approaches such as BERT and Transformers [28, 113], which have shown great promise in the NLP domain, have also recently been ported to the source code domain [33, 55]. Since these are even more complex than the “vanilla” neural network architectures, the black-box nature of their learning is even more pronounced. We plan to apply our model-agnostic approaches to analyze the signal awareness of these architectures as well.
7.5 Reduction Engine Alternatives
In this work, we use Delta Debugging (DD) [123] to reduce and simplify program samples. Our P2IM and program-simplification-based augmentation approaches, however, are not reliant on it. The core ideas behind our approaches remain independent of the specific program reduction engine employed—specifically, (i) testing signal existence in a minimal prediction-preserving snippet, and (ii) augmenting training with simpler programs while preserving signals. DD offers an efficient reduction solution and can be substituted by existing alternatives such as HDD [79], Perses [101], and C-Reduce [88], amongst others. Simpler alternatives can be employed at the cost of lower efficiency, such as linear, brute-force, or randomized schemes for selecting source code tokens/statements for reduction, as sketched below. Irrespective of the reduction scheme, the more important components are correctness validation (e.g., via a compiler) and original task-profile checking (Section 8.3).
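For illustration, a minimal sketch of such a linear scheme, with `compiles` and `keeps_task_profile` as placeholders for the correctness-validation and task-profile-checking components:

```python
def linear_reduce(statements, compiles, keeps_task_profile):
    """Remove one statement at a time, keeping a removal only if the program
    still compiles and still carries the original task profile (e.g., the bug)."""
    current = list(statements)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if compiles(candidate) and keeps_task_profile(candidate):
            current = candidate   # removal accepted; retry the same index
        else:
            i += 1                # removal rejected; move to the next statement
    return current
```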
8 Threats to Validity
8.1 Model Bias
During augmentation, we are adding additional data to the training set, which can be considered as sampling around the original data points [19], but in the discrete program space and with an accurate labeler. Theoretically, this may add learning difficulties for models due to a more complicated decision boundary as the data distribution shifts away from the test set. More importantly, however, it also helps the model learn real signals, thereby improving its reliability. In practice, we observed that model performance did not deteriorate with our simplified-program augmentation, while its signal awareness increased dramatically (Table A.1). Note that we ensure no new bugs get introduced during the simplification process, so dataset integrity is maintained in terms of not introducing samples with out-of-dataset labels. Complexity-ranked training, on the other hand, does not change the dataset composition, only the sample ordering during training. The only bias it can add is toward the removal of non-relevant features during the initial model training, which, however, should get normalized when the harder samples are eventually introduced later in the training cycle.
8.2 Dataset Constraints
Within the context of vulnerability detection, many existing datasets do not contain enough information to facilitate evaluating model signal awareness. Signal awareness measurement, as described in Section 4.2, requires ground truth bug information beyond class labels alone, which is missing in the Draper [92] and Devign [129] datasets. VulDeePecker [69] and SySeVR [68] samples are not valid, compilable code (which the models are trained upon) but slices converted into linear sequences, and are thus excluded from our experiments. However, s-bAbI [95], Juliet [82], and D2A [127] do contain bug-level information, and are thus considered in this work. Nevertheless, this additional ground-truth availability constraint on the datasets, necessitated by signal-awareness measurement, may limit the generalizability of this article’s observations.
8.3 Task Oracle
In this work, we use a vulnerability detection setting for a comprehensive analysis of model signal awareness. Our approaches, however, are independent of the target source code understanding task. The component that tailors our P2IM and program-simplification approaches to specific tasks is the task-profile checker, or task oracle. Some options, highlighting a cost versus quality trade-off, include a human domain expert, the original dataset labeler, a line-based code-feature matcher, a static analyzer, and a fuzzer, amongst others. Focusing on AI modeling of source code vulnerability detection, amongst other code understanding tasks (Section 1), enables leveraging readily available SE tools to ensure correctness during task-profile verification, i.e., the existence of the ground-truth vulnerability in reduced programs. Nevertheless, for some tasks a relevant oracle may not be directly available (e.g., for code summarization), efficient (e.g., a fuzzer), or feasible (e.g., human domain experts), limiting the applicability of our techniques.
9 Practical Implications
The ready availability of large code bases to train upon, and the success of machine learning in natural language understanding, have promoted the entry of AI into the source code analysis space. AI promises to understand the semantics of code and alleviate the shortcomings of traditional code analysis approaches. There has been a rapid proliferation of AI models for source code understanding, and they have already started making their way into application deployment workflows, such as code scanning for security verification [107]. However, as we have shown in this work, AI models of code, despite their high F1 and accuracy scores, are actually not learning task-relevant source code features. Our observations add to the recent concerns regarding the quality of AI models of code [3, 18, 53], motivating a call for their reliability to be verified before they offset the similarly scrutinized traditional code analysis toolchain.
This has implications both for software developers adopting the models for code analysis and for researchers creating the AI models of code. Sub-par results, if any, of such quality checks can, for example, help the former decide whether the tremendous amount of resources that model training increasingly consumes [15, 49] is better utilized for other kinds of code analysis or testing, such as fuzzing the application longer to catch more bugs. However, we believe that a concerted effort from the research community to acknowledge and resolve these model reliability concerns will go a long way toward better utilizing the potential of AI for source code understanding.
Our work augments the common metrics toolbox of model quality measurement (e.g., F1 and accuracy) with a quantitative measure of model reliability and trustworthiness—our SAR metric to measure model signal awareness. Building atop this, we highlight the potential of AI for source code understanding, by presenting data-driven and SE-assisted techniques to improve model learning quality. We hope our work motivates future research toward ensuring accountability for more task-aware learning, potentially guided by the SAR-Recall divide, as well as traceability in terms of analyzing behavioral trends across data affecting model performance.
10 Related Work
In the software engineering community, efforts have been made to detect vulnerabilities by isolating relevant statements using program-slicing-based techniques [46, 96, 108, 116]. Essentially, the main idea of such approaches is to identify the subsets of the program that introduce the defects. Recently, program slicing has also been used in AI-assisted vulnerability detection tasks [69] to extract bug-relevant programming constructs. If a vulnerability detection model can capture the real signals, then it should be able to identify such subsets too. In this work, we treat models as black boxes and feed them different subsets of a program to evaluate how well they pick up the real signals. To efficiently and systematically generate the subsets, we borrow Delta Debugging, a popular fault isolation technique.
The most relevant work using Delta Debugging (DD) and its variants is on software failure diagnosis and isolation [43, 79, 122, 123]. The main advantage of DD is that it can significantly reduce the number of tests needed to locate the problem. To the best of our knowledge, our approach is the first attempt at using DD to interpret and compare models’ signal awareness. A recent parallel effort [87] also uses DD to minimize inputs to AI models on software engineering tasks. Its focus is on the qualitative properties of the reduced code samples, as opposed to our SAR-based quantification of the impact of signal-agnostic training on the models’ reported performance numbers.
Our P2IM approach can be viewed as belonging to the metamorphic testing paradigm [21], applied to AI models [48, 130]. In particular, based on an input code snippet and its prediction by a model under test, we systematically construct new “tests” by minimizing the original snippet and make sure the model produces the same prediction as the original input. Then, we check the metamorphic relations among the inputs and output predictions of multiple executions. A test violation can be detected if the original input and its minimal version do not have the same vulnerability. With the same intuition as seen in ChiMerge [56] that learns discretization schemes of intervals dynamically while the \(\chi ^2\) statistics of each discretized group remain significant, we progressively group and merge inputs that are output-dependent or output-independent. Thus, while ChiMerge is a statistical way of range testing, we use the SE technique of DD as illustrated in Figure 6.
With regard to model robustness and reliability research in deep learning, metrics have been proposed based on the distance ratio among samples in the same class and in all other classes [54]. This is based on the intuition that, for a reliable model, inputs in the same class should be closer to each other in the model’s latent space than inputs in other classes. Extensive research has also contributed to constructing sophisticated attack/defense methods and evaluating accuracy under \(\ell _{p}\)-norm-bounded adversarial perturbations [31, 39, 62, 76, 80, 97]. Different from current metrics, we directly probe the existence of true signals inside the models’ 1-minimal representations. To the best of our knowledge, our SAR metric is the first of its kind to evaluate deep learning models’ signal awareness in this domain.
Our method augments the common metrics toolbox with an alternative way to examine the model quality, giving reliability and trustworthiness to black-box AI models. P2IM brings in multiple enhancements as compared to popular perturbation-based model interpretability methods that work on individual input instances, such as LIME [90] and others as summarized in a recent survey [27]. While other approaches are able to derive parts of an input that contribute most to the model’s final prediction, unlike our method, they cannot tell if the highlighted parts are the correct task-relevant signals. Another contrasting capability P2IM offers is to quantify how well a model learns the correct signals, thereby providing a signal-aware mechanism to compare different models. This is especially useful when competing models have comparable performance on traditional metrics (e.g., F1). Furthermore, the search space, and thus the time complexity, for such approaches can be huge, given the possible combinations of the different parts of the inputs to be explored. Thanks to the DD algorithm, P2IM directly minimizes the input, significantly accelerating the search for the relevant parts of the input.
In addition to our model-probing approach, we also present model-assistance approaches to improve the models’ signal awareness. We do this by simplifying the training set samples using DD and augmenting them back into the training mix. Augmentation methods are popular in AI in general, including domain-specific approaches such as image transformations [98] and text transformations [14], data-driven approaches such as SMOTE [19], and formal and empirical augmentation [63, 120], amongst others. Complementary to these approaches, our augmentation approach is focused specifically on improving the signal awareness of source code models. A key difference is that general AI approaches usually assume the augmented input keeps its original label, since verifying this for images and text without huge manual effort is extremely hard, while the assumption may not always hold. Our approach utilizes the benefits of working in the well-defined source code space; therefore, we can assure the validity and correctness of our code augmentation. In the context of vulnerability detection, preserving existing bugs while generating simplified programs is the key difference in our approach. This is in contrast to existing source-level bug-seeding-based augmentation methods, which can lead to previously unseen bugs [13, 29, 78, 84, 86, 91, 112].
Our second approach to improving model signal awareness combines code complexity with curriculum learning (CL) [10, 40, 44]. While CL has previously been applied using general complexity measures of images and texts to rank training samples in the vision and natural language domains, we use code-specific complexity measures to tailor the models toward source code understanding. Different from CL, which is data-driven, the denoising approaches used in BART [65] and PLBART [2] are usually built into the model’s pretraining process, and assist model learning by mapping noisy input to noise-less input. The pretraining tasks involve text/code generation while the input sequence is randomly shuffled and blocks of tokens are masked. This requires model-specific tuning and task design, whereas CL, and thus our complexity-ranked training approach, is model- and task-agnostic.
Finally, we also use the notion of code complexity to introspect model learning. Existing explanation approaches tend to use white-box model internals to add some transparency to the model logic. This includes probing the model’s gradient [94, 99, 102, 128] to highlight the input regions most influencing the model’s prediction, or fitting interpretable surrogate models to approximate the deep learning model’s behavior [42, 75, 90], and then using the surrogate to derive a feature importance ranking for the input. The approximate nature of such mappings, from the model side back to the data, can make them misleading [1]. The prevalence of Transformer-based models, exemplified by CodeBERT [33] and CodeX [20], has given rise to explorations of explanation generation for them. Similar to existing methods, the explanation of Transformer-based models can be accomplished by means of gradient-based perturbation [24]; however, the discreteness of the input makes the resulting search space a formidable obstacle for efficient explanation generation. Based on the Transformer’s attention mechanism, DietCode [126] leveraged the attention weights of the Transformer and demonstrated that discarding tokens with insignificant attention weights has negligible impact on the final model prediction, treating large attention weights as indicative of high importance. Nevertheless, it has been pointed out that attention is not explanation [52]; in fact, learned attention weights often display little correlation with gradient-based measures of feature significance, and it is possible to observe varying attention distributions that lead to the same predictions. Explanation methods have also been created for graph neural networks, attributing importance to graph nodes and edges by using attention mechanisms [114, 115], or via maximizing mutual information between inputs and outputs [119]. Our model learning introspection approach is complementary to these explanation approaches as it treats models as black boxes, deducing model learning from the dataset’s perspective, based on concretely defined characteristics of source code. This empowers our approach to offer more code-centric and developer-friendly insights. This is in contrast to certain other approximation-based black-box explanation approaches [87], which do not require the inputs to remain natural or valid. Furthermore, unlike our work, these do not tie source code constructs to model signal awareness.
11 Conclusion
Using SE concepts, we developed data-driven approaches to assist AI models of code in better learning task-relevant aspects of source code. We first presented a prediction-preserving input minimization approach to probe the signal awareness of the models. We formulated a new metric to measure task-relevant learning—Signal-aware Recall (SAR)—augmenting the common metrics toolbox with a signal-aware alternative for model quality measurement. SAR measurement enables asserting the trustworthiness and reliability of black-box AI models. Results show a significant lack of signal awareness in current models, across architectures and datasets. SAR amounting to less than half of Recall suggests that the models are presumably learning a lot of noise or dataset nuances, as opposed to capturing task-specific signals. To enhance model signal awareness, we incorporated the notion of source code complexity into model training. We achieved significant improvements with our complexity-ranked training and program-simplification-based augmentation approaches, while maintaining model performance. We carried the notion of complexity into model introspection, and presented code-centric insights into black-box model learning. Moving forward, we look to explore active learning approaches, helping the model with targeted SE assistance, e.g., picking specific samples for augmentation where the model is facing difficulty learning.
Footnotes
1. The two independent tasks can actually be performed in any order: (i) grouping of test-set samples by their prediction results, and (ii) extracting their code complexity metrics.
A.1 Selecting Model Signal Awareness Baseline for Enhancement
For maximum test-set coverage, Table 1 measures SAR using a combination of checkers to test bug existence in a sample’s 1-minimal: the Infer analyzer [32] (which checks for the existence of the original bug), with fallback to line-based bug matching (which checks for the existence of the buggy line) for samples where Infer’s verdict differs from the original bug. As compared to Infer, line-based bug matching is less accurate, e.g., it counts even partial matches in favor of the model, as shown in Figure A.1, thereby providing a looser SAR bound than Infer analysis. Even when counting partial matches in favor of the model, a substantial SAR-Recall divide is still observed in Table 1, satisfying the goal of Section 6.1’s analysis—to highlight a lack of signal awareness in the models. However, to show the true impact of SAR improvements with our model learning assistance techniques, which line-based matching would mask, we instead employ the stricter Infer-based matching alone in the next set of experiments. For correctness, we focus only on the samples where Infer’s verdict matches the original bug. As a result of using the tighter SAR bound, the model SARs are lower than those reported in Table 1—SAR drops to 12.8 for Juliet and 21.2 for D2A, respectively. We use these values as baselines in our model learning enhancement experiments. Using a stricter checker (limited to the applicable test-set subset) thus reveals that the issue of the models relying on task-irrelevant signals is more pronounced than indicated by Section 6.1. Nevertheless, as can be seen in the last row of Table A.1, even when using the looser SAR bound our techniques still record SAR improvements, albeit masked by the checker leniency.
Table A.1. Using the {Juliet + GNN + augmentation} Example to Show: (i) Model Performance Is Maintained (Rows 1 and 2) as SAR Improves (Row 3) with Our Augmentation Approach, and (ii) Signal Awareness Improvements Are Still Achieved with Our Approach Even when Using a Looser SAR Bound (SAR’ in the Last Row) (Similar Results for Other Model-dataset Configurations)

Metric    | base + 0% aug | +10% | +20% | +30% | +50% | +100% | +all
Precision | 99.9          | 99.9 | 99.9 | 99.9 | 99.9 | 99.9  | 99.9
Recall    | 99.9          | 99.9 | 99.9 | 99.8 | 99.8 | 99.8  | 99.9
SAR       | 12.8          | 15.4 | 20.9 | 33.9 | 34.9 | 43.3  | 61.8
SAR’      | 49.6          | 51.1 | 54.7 | 61.1 | 61.2 | 65.3  | 74.8
Figure A.1. Example of line-based bug matching counting a partial match in favor of the model.
A.2 Code-complexity-ranked Training
Table A.2 and Figures A.2 and A.3 present additional results for some more example configurations, for our complexity-ranked training experiments.
Figure A.2.
Table A.2. Model SAR Improvements with Complexity-ranked Training Across Different Complexity Metrics [GNN; s-bAbI]

Ranking Metric     | SAR:Recall | % Improvement
Random (baseline)  | 45.0       | —
Volume             | 47.5       | 5.6
SLOC               | 55.9       | 24.2
Effort             | 56.4       | 25.3
Difficulty         | 59.5       | 32.2
Figure A.3.
A.3 Augmentation via Program Simplification
Figures A.4, A.5, A.6, A.7, A.8, and A.9 present additional results for some more example configurations, for our program-simplification-based augmentation experiments.
Figure A.4.
Figure A.5.
Figure A.6.
Figure A.7.
Figure A.8.
Figure A.9.
A.4 Model Learning Introspection
A.4.1 Augmentation Evolution.
Figure A.10 presents a similar model evolution analysis for the Juliet dataset, as in Section 6.5, but from the perspective of the cyclomatic complexity (cc) metric (# of independent paths in the program). The same augmentation-driven “rising/falling skyline” behavior can be seen in the cc distributions of the SAR-TP and SAR-FN samples, revealing the model-learning dynamic of improved understanding of code structure with more augmentation. What does not change, however, are the occurrence counts for more complex samples (cc \(\gt\) 8) among the SAR-FNs, signifying the need for potential white-box model enhancement beyond just data-driven simplified-program augmentation.
Here, we use the same code complexity distribution comparison approach as in Section 6.5, but this time to examine the characteristics of samples for which the model learning behavior remains invariant to augmentation levels. We define two categories of such samples as follows. As before, for each model \(M_i\) trained under an augmentation setting (i.e., base dataset + \(X\%\) simplified samples), subsequent signal-awareness measurement results in its true positives being divided into SAR-TP\(_i\) and SAR-FN\(_i\), depending on whether \(M_i\) captured the real signals or not. Then, the two special classes in focus are: (i) AlwaysTP \(:= \bigcap _{i=1}^{n}\)SAR-TP\(_i\), samples always captured correctly by the model; and (ii) AlwaysFN \(:= \bigcap _{i=1}^{n}\)SAR-FN\(_i\), samples that are consistently mispredicted. Intersecting the SAR-TP\(_i\) (similarly, SAR-FN\(_i\)) sets allows us to focus on samples that are not affected by the augmentations, and thus examine the characteristics of both the straightforward and the challenging samples for a model architecture.
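A minimal sketch of these two classes, assuming each augmentation setting yields sets of sample identifiers for its SAR-TP\(_i\) and SAR-FN\(_i\) groups:

```python
from functools import reduce

def always_sets(sar_tp_per_level, sar_fn_per_level):
    # AlwaysTP := intersection of SAR-TP_i; AlwaysFN := intersection of SAR-FN_i.
    always_tp = reduce(set.intersection, map(set, sar_tp_per_level))
    always_fn = reduce(set.intersection, map(set, sar_fn_per_level))
    return always_tp, always_fn
```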
Figure A.11(a) compares the difficulty complexity metric distribution of the AlwaysTP samples for the s-bAbI dataset versus the AlwaysFN group. It provides an interesting insight into the model learning behavior from the point of view of how difficult the program is to understand (in terms of operator and operand usage volume). Across all augmentation iterations, the model is more easily able to correctly capture less difficult code samples (difficulty = {12,14}), while consistently mispredicted samples tend to be harder (difficulty = {16,17}). This does not mean that the latter category is never predicted correctly by the model. The model eventually learns to predict them with sufficient augmentation, as we saw in the augmentation evolution analysis (Figure 17), but the ones that still remain mispredicted possess the higher metric scores.
Figure A.11.
Figure A.11(b) shows the corresponding behavior for model augmentation on the Juliet dataset, this time taking the example of the cyclomatic complexity (cc) metric (# independent paths in the program). The model is somewhat able to capture low-complexity samples (cc \(\lt\) 5), as opposed to the more complex ones (cc \(\gt\) 10), where it fails consistently, irrespective of augmentation assistance. The separation between the distributions for AlwaysTP and AlwaysFN is not as clear as for s-bAbI, since model learning is not as good for Juliet to begin with, with almost half of the test-set samples being consistently mispredicted despite augmentation.
Using other metrics, such as the number of ifs, loops, and so on, yields more fine-grained insight into model learning. For example, for s-bAbI, across augmentation iterations, the model is more easily able to correctly capture code samples containing no loops (77%), while consistently mispredicted samples tend to have a loop in them (88%).
References
[1] Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In Proceedings of the NeurIPS.
Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. In Proceedings of the ACL.
Johannes Bader, Sonia Seohyun Kim, Frank Sifei Luan, Satish Chandra, and Erik Meijer. 2021. AI in software engineering at facebook. IEEE Softw. 38, 4 (2021), 52–61.
Abdul Ali Bangash, Hareem Sahar, Abram Hindle, and Karim Ali. 2020. On the time-based conclusion stability of cross-project defect prediction models. Empirical Software Engineering 25, 6 (2020), 5047–5083.
Rohan Bavishi, Michael Pradel, and Koushik Sen. 2018. Context2Name: A Deep Learning-Based Approach to Infer Natural Variable Names from Usage Contexts. Retrieved from https://arXiv:1809.05193.
Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An Analysis of Deep Neural Network Models for Practical Applications. Retrieved from https://arXiv:1605.07678.
Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, and Baishakhi Ray. 2020. Deep learning based vulnerability detection: Are we there yet? Retrieved from https://arXiv:2009.07235.
N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman et al. 2021. Evaluating large language models trained on code. Retrieved from https://arXiv:2107.03374.
Tsong Yueh Chen, S. C. Cheung, and Siu-Ming Yiu. 2020. Metamorphic testing: A new approach for generating next test cases. Retrieved from https://arXiv:2002.12543.
Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Denys Poshyvanyk, Massimiliano Di Penta, and Gabriele Bavota. 2021. An empirical study on the usage of bert models for code completion. In Proceedings of the MSR.
Jürgen Cito, Isil Dillig, Vijayaraghavan Murali, and Satish Chandra. 2022. Counterfactual explanations for models of code. In Proceedings of the ICSE-SEIP.
Pascal Cuoq, Florent Kirchner, Nikolai Kosmatov, Virgile Prevosto, Julien Signoles, and Boris Yakobowski. 2012. Frama-c. In Proceedings of the International Conference on Software Engineering and Formal Methods. Springer, 233–247.
Hoa Khanh Dam, Truyen Tran, Trang Pham, Shien Wee Ng, John Grundy, and Aditya Ghose. 2021. Automatic feature learning for predicting vulnerable software components. In Proceedings of the TSE.
Arun Das and Paul Rad. 2020. Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey. Retrieved from https://arXiv:2006.11371.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. Retrieved from https://arXiv:1810.04805.
Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. 2019. Exploring the landscape of spatial robustness. In Proceedings of the PLMR.
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. Retrieved from https://arXiv:2002.08155.
Aayush Garg, Renzo Degiovanni, Matthieu Jimenez, Maxime Cordy, Mike Papadakis, and Yves Le Traon. 2020. Learning To predict vulnerabilities from vulnerability-fixes: A machine translation approach. Retrieved from https://arXiv:2012.11701.
Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the ICML.
Mojdeh Golagha, Alexander Pretschner, and Lionel C. Briand. 2020. Can we predict the quality of spectrum-based fault localization? In Proceedings of the ICST.
Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. Retrieved from https://arXiv:1412.6572.
Alex Graves, Marc G. Bellemare, Jacob Menick, Remi Munos, and K. Kavukcuoglu. 2017. Automated curriculum learning for neural networks. In Proceedings of the ICML.
David Gros, Hariharan Sezhiyan, Prem Devanbu, and Zhou Yu. 2020. Code to Comment “Translation”: Data, metrics, baselining & Evaluation. In Proceedings of the ASE.
Wenbo Guo, Dongliang Mu, Jun Xu, Purui Su, Gang Wang, and Xinyu Xing. 2018. Lemna: Explaining deep learning based security applications. In Proceedings of the CCS.
Behnaz Hassanshahi, Yaoqi Jia, Roland H. C. Yap, Prateek Saxena, and Zhenkai Liang. 2015. Web-to-application injection attacks on android: Characterization and detection. In Proceedings of the ESORICS.
Joel Hestness, Newsha Ardalani, and Gregory F. Diamos. 2019. Beyond human-level accuracy: Computational challenges in deep learning. In Proceedings of the PPoPP.
M. Jimenez, R. Rwemalika, M. Papadakis, F. Sarro, Y. Le Traon, and M. Harman. 2019. The importance of accounting for real-world labelling when predicting software vulnerabilities. In Proceedings of the FSE.
Bhavya Kailkhura, Brian Gallagher, Sookyung Kim, Anna Hiszpanski, and T. Yong-Jin Han. 2019. Reliable and explainable machine-learning methods for accelerated material discovery. NPJ Comput. Mater. 5, 1 (2019), 108.
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In Proceedings of the ICML.
Ugur Koc, Parsa Saadatpanah, Jeffrey S. Foster, and Adam A. Porter. 2017. Learning a classifier for false positive error reports emitted by static code analysis tools. In Proceedings of the MAPL.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Proceedings of the NeurIPS.
Cassidy Laidlaw, Sahil Singla, and Soheil Feizi. 2020. Perceptual adversarial robustness: Defense against unseen threat models. In Proceedings of the ICLR.
Alexander LeClair, Sakib Haque, Lingfei Wu, and Collin McMillan. 2020. Improved code summarization via a graph neural network. In Proceedings of the ICPC.
M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the ACL.
Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. 2018. SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. Retrieved from https://arXiv:1807.06756.
Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong. 2018. VulDeePecker: A Deep learning-based system for vulnerability detection. In Proceedings of the NDSS.
Kui Liu, Dongsun Kim, Tegawendé F. Bissyandé, Tae-young Kim, Kisub Kim, Anil Koyuncu, Suntae Kim, and Yves Le Traon. 2019. Learning to spot and refactor inconsistent method names. In Proceedings of the ICSE.
Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. Retrieved from https://arXiv:1605.05101.
Yiling Lou, Qihao Zhu, Jinhao Dong, Xia Li, Zeyu Sun, Dan Hao, Lu Zhang, and Lingming Zhang. 2021. Boosting coverage-based fault localization via graph-based representation learning. In Proceedings of the FSE.
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In Proceedings of the ICLR.
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the CVPR.
Phuong T. Nguyen, Juri Di Rocco, Davide Di Ruscio, Lina Ochoa, Thomas Degueule, and Massimiliano Di Penta. 2019. FOCUS: A recommender system for mining API function calls and usage patterns. In Proceedings of the ICSE.
Sheena Panthaplackel, Pengyu Nie, Milos Gligoric, Junyi Jessy Li, and Raymond J. Mooney. 2020. Learning to update natural language comments based on code changes. In Proceedings of the ACL, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault (Eds.).
Hammond Pearce, Baleegh Ahmad, Benjamin Tan, Brendan Dolan-Gavitt, and Ramesh Karri. 2022. Asleep at the keyboard? Assessing the Security of GitHub copilot’s code contributions. In Proceedings of the S&P.
M. R. I. Rabin, V. J. Hellendoorn, and M. A. Alipour. 2021. Understanding neural code intelligence through program simplification. In Proceedings of the FSE.
John Regehr, Yang Chen, Pascal Cuoq, Eric Eide, Chucky Ellison, and Xuejun Yang. 2012. Test-case reduction for C compiler bugs. In Proceedings of the PLDI.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. Retrieved from https://arXiv:1506.01497.
R. Russell, L. Kim, L. Hamilton, et al. 2018. Automated vulnerability detection in source code using deep representation learning. In Proceedings of the ICMLA.
Hasim Sak, Andrew W. Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large-scale acoustic modeling. In INTERSPEECH.
R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2017. Grad-cam: Visual explanations via gradient-based localization. In Proceedings of the ICCV.
L. K. Shar and H. B. K. Tan. 2012. Predicting common web application vulnerabilities from input validation and sanitization code patterns. In Proceedings of the ASE.
Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K. Reiter. 2016. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the CCS.
Sahil Suneja, Yunhui Zheng, Yufan Zhuang, Jim Laredo, and Alessandro Morari. 2020. Learning to map source code to software vulnerability using code-as-a-graph. Retrieved from https://arxiv.org/abs/2006.08614.
S. Suneja, Y. Zheng, Y. Zhuang, J. Laredo, and A. Morari. 2021. Probing model signal-awareness via prediction-preserving input minimization. In Proceedings of the FSE.
Sahil Suneja, Yunhui Zheng, Yufan Zhuang, Jim A. Laredo, and Alessandro Morari. 2021. Towards reliable AI for source code understanding. In Proceedings of the SoCC.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. Retrieved from https://arxiv.org/abs/1312.6199.
Julian Thomé, Lwin Khin Shar, Domenico Bianculli, and Lionel C. Briand. 2018. Security slicing for auditing common injection vulnerabilities. J. Syst. Softw. 137 (2018), 766–783.
Omer Tripp, Salvatore Guarnieri, Marco Pistoia, and Aleksandr Aravkin. 2014. ALETHEIA: Improving the usability of static security analysis. In Proceedings of the CCS.
M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and D. Poshyvanyk. 2019. On learning meaningful code changes via neural machine translation. In Proceedings of the ICSE.
Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology 28, 4 (2019), 1–29.
M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk. 2019. Learning how to mutate source code from bug-fixes. In Proceedings of the ICSME.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the NeurIPS.
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In Proceedings of the ICLR.
Tian Xie and Jeffrey C. Grossman. 2018. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett. 120, 14 (2018).
Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 2014. Modeling and discovering vulnerabilities with code property graphs. In Proceedings of the IEEE SP.
Yanming Yang, Xin Xia, David Lo, and John C. Grundy. 2020. A survey on deep learning for software engineering. Retrieved from https://arxiv.org/abs/2011.14597.
Andreas Zeller and Ralf Hildebrandt. 2002. Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering 28, 2 (2002), 183–200.
Zhaowei Zhang, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2022. Diet code is healthy: Simplifying programs for pre-trained models of code. In Proceedings of the FSE.
Y. Zheng, S. Pujar, B. Lewis, L. Buratti, et al. 2021. D2A: A dataset built for AI-based vulnerability detection methods using differential analysis. In Proceedings of the ICSE-SEIP.
Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu. 2019. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Proceedings of the NeurIPS.