Open AccessArticle

A Novel Source Code Clone Detection Method Based on Dual-GCN and IVHFS

Haixin Yang

Zhen Li

^* and

Xinyu Guo

School of Cyber Security and Computer, Hebei University, Baoding 071002, China

Author to whom correspondence should be addressed.

Electronics 2023, 12(6), 1315; https://doi.org/10.3390/electronics12061315

Submission received: 13 February 2023 / Revised: 5 March 2023 / Accepted: 7 March 2023 / Published: 9 March 2023

(This article belongs to the Special Issue Machine Learning Methods in Software Engineering)

Download

Browse Figures

Figure 1
The overall framework of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set: (a) generation of the initial syntax and semantics representations, (b) the dual graph convolutional networks capture the relationships between nodes and enhances the features and (c) feature fusion and similarity calculation processes of code fragments A and B. "> Figure 2
The original abstract syntax tree of code fragment <math display="inline"><semantics> <mrow> <mi>s</mi> <mi>u</mi> <mi>m</mi> <mo>_</mo> <mi>f</mi> <mi>u</mi> <mi>n</mi> <mn>1</mn> <mo>.</mo> <mi>j</mi> <mi>a</mi> <mi>v</mi> <mi>a</mi> </mrow> </semantics></math>. In the abstract syntax tree, the green, white and yellow boxes represent special attribute nodes, attribute nodes and code nodes, respectively. Too many attribute nodes in the tree result in overcomplexity of the tree structure. "> Figure 3
The simplifying and grouping process of the target abstract syntax tree. The attribute nodes are removed and the size of the tree is significantly reduced. <math display="inline"><semantics> <mrow> <mi>D</mi> <mi>e</mi> <mi>c</mi> <mi>l</mi> <mi>a</mi> <mi>r</mi> <mi>a</mi> <mi>t</mi> <mi>i</mi> <mi>o</mi> <mi>n</mi> </mrow> </semantics></math> or <math display="inline"><semantics> <mrow> <mi>S</mi> <mi>t</mi> <mi>a</mi> <mi>t</mi> <mi>e</mi> <mi>m</mi> <mi>e</mi> <mi>n</mi> <mi>t</mi> </mrow> </semantics></math> nodes and their leaf nodes are grouped into the same group. Dotted boxes with different colors indicate different groups. "> Figure 4
The generation process of group representation <math display="inline"><semantics> <mi mathvariant="bold">r</mi> </semantics></math>. "> Figure 5
The CFG of source code <math display="inline"><semantics> <mrow> <mi>s</mi> <mi>u</mi> <mi>m</mi> <mo>_</mo> <mi>f</mi> <mi>u</mi> <mi>n</mi> <mn>1</mn> <mo>.</mo> <mi>j</mi> <mi>a</mi> <mi>v</mi> <mi>a</mi> </mrow> </semantics></math>. "> Figure 6
Schematic diagram of Formula (<a href="#FD9-electronics-12-01315" class="html-disp-formula">9</a>). "> Figure 7
Schematic diagram of Formula (<a href="#FD10-electronics-12-01315" class="html-disp-formula">10</a>). "> Figure 8
Time consumption of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set, as well as other baseline models in the training phase. "> Figure 9
Time consumption of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set, as well as other baseline models in the prediction phase. "> Figure 10
The feature concatenation process in Variant 5. ">

Versions Notes

Abstract

Source code clone detection, which can identify code fragments with similar functions, plays a significant role in software development and quality assurance. Existing methods either extract single syntactic or semantic information, or ignore the associated information between code statements in different structures. It is difficult for these methods to effectively detect clone pairs with similar functions. In this paper, we propose a new model based on a dual graph convolutional network (GCN) and interval-valued hesitant fuzzy set (IVHFS), which we named DG-IVHFS. Specifically, we simplified and grouped the abstract syntax tree (AST) of source code to obtain the group representations. The group representations of the AST, as well as the control flow graph (CFG) representations, were transformed into graph structures, and then we applied GCNs on them to learn dependencies between nodes. In addition, we introduced IVHFS into the model for a more comprehensive evaluation of similarity. Our experimental results demonstrated that the precision, recall, and F1-scores of DG-IVHFS on the BigCloneBench and GoogleCodeJam datasets reached 98, 97 and 97% and 98, 93 and 95%, respectively, exceeding current state-of-the-art models. Moreover, our model performed well in terms of time consumption.

Keywords:

source code clone detection; software engineering; graph convolutional network; interval-valued hesitant fuzzy set; abstract syntax tree; control flow graph

1. Introduction

Code cloning refers to two identical or similar code fragments. It enables faster software development and can save developers’ time and effort. However, the misuse of code cloning is a great problem in software development, so the research of code clone detection has received wide attention in software engineering.

Code clones include syntactic clones and semantic clones. Based on these two categories, code clones can be further separated into Types 1–4 [1]. Types 1–3 are syntactic clones and Type 4 is a semantic clone. The purpose of code clone detection is to evaluate the similarity between two code snippets. As Type 4 clones contain highly distinct clones on the syntactic level, they differ significantly in text, making it the most difficult type to detect.

1.1. Related Work

Current clone detection methods mainly include token-based, text-based, tree-based, graph-based and hybrid methods.

Token-based methods [2,3,4,5,6] transform the source code into token sequences, scan them, and find reduplicate subsequences to complete code clone detection. CCFinder [3] utilizes lexical analysis to extract token sequences from the source code and transforms them using transformation rules. All the substrings on the transformed token are used for match detection. CCAligner [4] uses a code window that considers a mismatch index and asymmetric similarity coefficient to calculate similarity.

Text-based methods [7,8,9,10,11] treat the code as a series of strings to compare two pieces of source code. NiCad [7] treats source code as code text and compares the sequences of text. Ducass et al. [8] designed a method that detected code clones by treating source code as strings and matching them for similarity assessments.

Although token-based and text-based methods need little time to execute, they cannot deal with the detection of Type 4 clones or even Type 3 clones because they do not consider code structure information. In order to overcome the disadvantages of the above clone detection methods, researchers have started to focus on the structure of source code and proposed tree-based and graph-based methods.

Tree-based methods [12,13,14,15,16,17,18,19,20] convert the code into an abstract syntax tree (AST) and perform a series of operations on the AST to obtain the structure information of the code. Deckard [12] generated feature vectors according to AST subtrees to represent the code. CDLH [14] uses an AST-based long short-term memory (LSTM) to learn hash codes while considering the syntactic and lexical information of code. ASTNN [15] splits the original AST into multiple statement subtrees and encodes them, then generates a vector representation of the AST. TreeCen [19] represents the AST into a tree graph and adopts specific analyses to transform it into a vector. Wu et al. [20] proposed a method that converted the AST into a Markov chain-based matrix and measured the distance of two matrices.

Graph-based methods [21,22,23,24,25,26] represent code as graphs to extract structure information from it, such as the control flow graph (CFG) or program dependency graph (PDG). Komondoor [21] and Krinke [22] matched subgraphs to identify the similarity of two code fragments. DeepSim [23] encodes data flow and control flow into matrices, then utilizes feedforward neural networks to detect code clones.

ASTs can effectively represent the syntactic structure of the code, but they cannot represent the control flow structure between code statements. CFGs do not contain information of low-level syntactic structures within code blocks and methods using PDG suffer from scalability problems and high time costs. Due to the limitations of trees and graphs in code representations, syntactic and semantic features of source code cannot be comprehensively obtained using graph-based or tree-based methods alone. Therefore, hybrid methods [27,28,29,30] have been proposed. FCCA [28] uses code tokens, AST and CFG to represent the code. Tree-based LSTM is used to process the AST, but it lacks the ability to explore the dependency relationships between statement nodes. FA-AST [29] takes AST as the main part and adds sequential execution,

W h i l e

loop,

F o r

loop and other basic control flows to it. However, FA-AST ignores structures such as

D o

W h i l e

statements and

C a s e

blocks, resulting in insufficient learning of semantic information in some code fragments. Moreover, current hybrid methods do not effectively deal with ASTs, resulting in important syntactic features being susceptible to interference from redundant nodes and long training times. In contrast, the model proposed in this paper simplifies AST structure while capturing the semantic and syntactic dependencies between statement nodes.

1.2. Contributions

We propose a new model DG-IVHFS, which is based on a dual graph convolutional networks (GCN) and interval-valued hesitant fuzzy set (IVHFS). A simplified and grouped AST (SGA) was constructed, and the multiple layers of GCN were applied on the SGA and CFG to capture dependencies between nodes in different representations. Specifically, we removed the redundant nodes in the AST to enhance the connections between key nodes, then divided the simplified AST into multiple groups. We used the attention mechanism and bidirectional gated recurrent unit (Bi-GRU) to obtain the group representations. In order to obtain the semantic and syntactic information of the code more comprehensively and accurately, we constructed the CFG-GCN and SGA-GCN to process CFG and group representations. The Dual-GCN enhanced the syntactic and semantic representations by aggregating features of neighbors. In addition, we used IVHFS to complete better feature fusion. For the semantic and syntactic feature matrices enhanced by Dual-GCN, we designed corresponding membership functions to utilize the information contained in each vector of the matrices, which allowed for a comprehensive similarity evaluation.

In summary, the main contributions are as follows:

Considering the complex structure of AST, we design a new representation of syntactic features that simplifies the structure of AST while preserving important syntactic information and making subsequent convolution operations more convenient.
We propose a Dual-GCN to learn the relationships between statements, which enhances the syntactic and semantic features of the source code.
We introduce IVHFS into the model and develop a new feature fusion method, which enables our model to achieve accurate similarity calculations. Subsequent ablation experiment results demonstrate the effectiveness of IVHFS.
To verify the time performance and effectiveness of DG-IVHFS, we conduct experiments on BigCloneBench and GoogleCodeJam datasets. Our experimental results demonstrate that DG-IVHFS achieves a significant improvement over other baseline methods.

The rest of this paper is organized as follows: Section 2 defines the problem. Section 3 describes our method in detail. Section 4 provides the experimental design, results, discussions, and limitations of our model. Section 5 concludes our work.

2. Problem Definition

Given the two code fragments

C_{i}

and

C_{j}

, we use the form of a triple

(C_{i}, C_{j}, y_{i j})

to represent them, where

y_{i j}

is the label. If

(C_{i}, C_{j})

is a clone pair, then

y_{i j} = 1

; otherwise

y_{i j} = 0

. For example,

s u m_f u n 1 . j a v a

in Listing and

s u m_f u n 2 . j a v a

in Listing implement the sum from 1 to

n - 1

through a

f o r

loop and a

w h i l e

loop, respectively. They perform the same semantic function, although their syntactic structures are different. Therefore, they belong to Type 4 clones, and the label between them is set to 1.

Listing 1.

s u m_f u n 1 . j a v a

Listing 2.

s u m_f u n 2 . j a v a

A training dataset

{(C_{i}, C_{j}, y_{i j}) | i, j \in n, i < j)}

with labels can be constructed from n code fragments

{C_{1}, C_{2}, \dots, C_{n}}

. We aim to train a deep learning model to learn two functions,

ϕ

and

ω

, where

ϕ

and

ω

map code fragment into syntactic feature matrix M and semantic feature matrix N, respectively, so that for any code pair

(C_{i}, C_{j})

, the similarity

S_{i j}

between

{M_{i}, N_{i}}

and

{M_{j}, N_{j}}

calculated by IVHFS theory is as close as possible to

y_{i, j}

S_{i j} = IVHFS ({M_{i}, N_{i}}, {M_{j}, N_{j}}),

(1)

where

{M_{i}, N_{i}} = {ϕ (C_{i}), ω (C_{i})}

{M_{j}, N_{j}} = {ϕ (C_{j}), ω (C_{j})}

and

S_{i j} \in [0, 1]

In the prediction phase, in order to infer whether two code fragments were a clone pair, we set a threshold k between true and false clone pairs. If the predicted similarity was greater than k, two code fragments were considered a clone pair; otherwise, they were considered a non-clone pair.

3. Material and Methods

We propose a novel code clone detection method called DG-IVHFS. The overall framework is shown in Figure 1. It includes three main parts: (a) initial representation generation, (b) Dual-GCN and (c) feature fusion and similarity calculation. Initial representation generation includes syntax representation generation and semantics representation generation.

3.1. Syntax Representation Generation

The process of syntax representation generation involves three steps: simplifying and grouping AST, group embedding and generation of dependencies.

3.1.1. Simplifying and Grouping AST

Most of the nodes in the AST are attribute nodes with weak connection information, which weaken the relationships between important nodes. Figure 2 shows the whole AST of the code fragment

s u m_f u n 1 . j a v a

. It shows that the AST structure is quite complex, even for a function with a simple

f o r

loop, and most attribute nodes are redundant. For example, the attribute node

b o d y

weakens the relationship between

M e t h o d D e c l a r a t i o n

and

F o r S t a t e m e n t

For this reason, we removed the redundant nodes from the original AST and then reconstructed the tree. We extracted the AST from source code fragments using javalang [31]. Given a group of AST nodes N, for the nodes in N, we removed them if they were attribute nodes. Significantly, we considered

D e c l a r a t i o n

and

S t a t e m e n t

as special attribute nodes, such attribute nodes were kept because they contain strong connection information. In order to assure the wholeness of the tree, we connected the child nodes of deleted nodes directly to the parent nodes of deleted nodes and rebuilt the connections between the nodes. Taking the AST in Figure 2 as an example, the simplified AST is shown in Figure 3.

Even when the AST is simplified by removing most attribute nodes, the simplified AST still has high complexity if the source code is complicated. Thus, if the GCN is directly applied to the simplified AST, the convolutional process will become complex due to the high number of AST nodes; this is not conducive to learning dependencies between nodes. Therefore, we designed an approach to group the simplified AST. Specifically, we traversed the simplified AST using a depth-first traversal algorithm to obtain the sequence of nodes

X = [x_{1}, x_{2}, \dots, x_{n}]

and mark the

D e c l a r a t i o n

and

S t a t e m e n t

nodes, then divide these marked nodes and their respective leaf nodes into a group. As shown in Figure 3, the nodes in a dotted box are a group. Finally, we can obtain the grouped node sequence

S = [s_{1}, s_{2}, \dots, s_{i}, \dots, s_{g}]

, where g represents the number of groups.

3.1.2. Group Embedding

Given a group

s_{i}

, the node embedding

E_{i} = [e_{1}, e_{2}, \dots, e_{w}]

was obtained by using word2vec [32], where w denotes the number of nodes in this group. Then, we used the attention mechanism to generate a group representation of

s_{i}

v_{t} = \tan h (W e_{t}, b),

(2)

α_{t} = \frac{e x p (v_{t}^{⊤} \cdot v_{a})}{\sum_{k = 1}^{w} e x p (v_{k}^{⊤} \cdot v_{a})},

(3)

r = \sum_{t = 1}^{w} α_{t} \cdot e_{t} .

(4)

In Formula (2), a linear layer converts

e_{t}

into the hidden representation

v_{t}

. Formula (3) uses the softmax function to generate a weight coefficient for the t-th node in the group.

v_{a}

is a randomly initialized context vector, which is constantly learned during model training. Group representation

r

was obtained by a weighted summation of all nodes in the group according to Formula (4). The generation process of group representation is shown in Figure 4.

To capture the naturalness of all groups, we fed the matrix

R = {r_{1}, r_{2}, \dots, r_{g}}

, which contained the representations of all groups, into Bi-GRU:

H = Bi - GRU (R) = {h_{1}, h_{2}, \dots, h_{g}},

(5)

where g means the number of groups generated in Section 3.1.1.

3.1.3. Dependencies Generation

Before performing graph convolution operations, the dependencies between groups need to be constructed. If the attribute nodes in two groups have a father-child relationship on the simplified AST, we believe the two groups are interdependent. For the values of the dependencies between any arbitrary two groups

s_{i}

and

s_{j}

, we have the following regulations:

A_{i j} = \{\begin{matrix} 1, & if s_{i} and s_{j} are interdependent \\ 0, & otherwise \end{matrix} .

(6)

Then, the dependencies matrix

A^{sga}

is obtained.

3.2. Semantics Representation Generation

Figure 5 gives the CFG of the source code

s u m_f u n 1 . j a v a

. The CFG describes the logical structure and execution order of code snippets, and these structures can be easily located from the CFG. Therefore, we use the CFG to capture semantic structure information of the code snippets.

We used Soot [33] to generate CFGs from source code fragments. A CFG is composed of a node sequence

X^{'} = [x_{1}^{'}, x_{2}^{'}, \dots x_{i}^{'}, \dots, x_{j}^{'}, \dots, x_{m}^{'}]

and a set of directed edges that represent the execution order of statements. The edges represent the control flows between nodes. According to these edges, we construct the adjacency matrix

A^{cfg}

of CFG. For the elements in

A^{cfg}

A_{i j}^{cfg} = 1

if there is a directed edge of node

x_{i}^{'}

pointing to node

x_{j}^{'}

, otherwise

A_{i j}^{cfg} = 0

. Then, the statements are converted into corresponding vector representations

H^{'} = {h_{1}^{'}, h_{2}^{'}, \dots, h_{m}^{'}}

using Doc2Vec [34].

3.3. Dual-GCN

In this section, we design the Dual-GCN to explore the dependencies between nodes in trees and graphs and enrich the feature representations. The Dual-GCN is composed of SGA-GCN and CFG-GCN.

3.3.1. Related Theory

Traditional convolutional neural networks and recurrent neural networks are very effective in extracting the features of Euclidean space data. However, their performance on non-Euclidean space data, such as graphs, is unsatisfactory. Therefore, the use of graph neural networks (GNN), which can effectively deal with graph structures with complex relationships, was proposed. Current GNNs can be divided into graph convolution networks, graph attention networks and graph generative networks [35].

GCN is an efficient CNN variant that directly operates on graphs [36]. For graph-structured data, GCN applies convolution operations to directly connected nodes to encode local information. For each node in the graph, it can learn more global information and update the node by aggregating its neighbors. GCN has had widespread use in software engineering [37,38] in recent years. Lu et al. [37] used graph attention network and GCN to extract feature representations from an AST. Gao et al. [38] proposed a GTsum model for a code summarization task, which used GCN to learn the feature information in a local application programming interface (API) dependency graph and AST. In this study, we use GCN to enhance the syntactic and semantic features of the source code.

3.3.2. SGA-GCN

SGA-GCN takes the group representations

H

and the dependencies matrix

A^{sga}

as input. The nodes in each layer of GCN are updated by aggregating information of their neighboring nodes, and the calculation formula is:

H_{k + 1}^{sga} = ReLU ({(D^{sga})}^{- \frac{1}{2}} {\tilde{A}}^{sga} {(D^{sga})}^{- \frac{1}{2}} H_{k}^{sga} W_{k}^{sga} + b_{k}^{sga}) .

(7)

In (7),

{\tilde{A}}^{sga} = A^{sga} + E

, where

E

denotes the unit matrix. If only

A^{sga}

is used, only the neighbors of a node are considered when multiplying with the vector matrix

H_{k}^{sga}

of all nodes, as the diagonal of

A^{sga}

is 0. To ensure that the information of the node itself is also taken into account,

E

is added to

A^{sga}

so that the diagonal elements of the matrix become 1.

{(D^{sga})}^{- \frac{1}{2}} {\tilde{A}}^{sga} {(D^{sga})}^{- \frac{1}{2}}

is similar to a symmetric Laplacian matrix and can be used for normalizing

A^{sga}

, where

D^{sga}

denotes the degree of

{\tilde{A}}^{sga}

. We chose ReLU as the activation function.

H_{k}^{sga}

is the hidden representation generated by the k layer of the GCN.

W_{k}^{sga}

and

b_{k}^{sga}

represent the weight matrix and bias vector, respectively. The final node representation obtained from the SGA-GCN is

H^{sga} = {h_{1}^{sga}, h_{2}^{sga}, h_{3}^{sga}, . . ., h_{g}^{sga}}

, with g representing the number of groups.

3.3.3. CFG-GCN

CFG-GCN models the semantic information of CFG and enhances the node representations through their neighboring nodes. The node representations

H^{'}

of CFG and the adjacency matrix

A^{cfg}

, representing control flow information, which was generated in Section 3.2, served as the initial input for CFG-GCN. The convolution operation follows the equation:

H_{k + 1}^{cfg} = ReLU ({(D^{cfg})}^{- 1} {\tilde{A}}^{cfg} H_{k}^{cfg} W_{k}^{cfg} + b_{k}^{cfg}),

(8)

where

{\tilde{A}}^{cfg} = A^{cfg} + E

and E is the unit matrix.

D^{cfg}

is the degree of

{\tilde{A}}^{cfg}

. The k layer of the GCN generates the hidden representation

H_{k}^{cfg}

W_{k}^{cfg}

and

b_{k}^{cfg}

represent the weight matrix and bias vector, respectively. The final node representation obtained from the CFG-GCN is

H^{cfg} = {h_{1}^{cfg}, h_{2}^{cfg}, h_{3}^{cfg}, \dots, h_{m}^{cfg}}

, where m represents the number of nodes in the CFG.

3.4. Feature Fusion and Similarity Calculations

3.4.1. Related Theory

In 1965, Zadeh [39] proposed the fuzzy set that used membership degree to describe uncertain information in the real world. However, it lacked the ability to solve decision-making problems with multi-attributes. Therefore, Torra [40] proposed the hesitant fuzzy set (HFS). It reflected human hesitancy objectively. Chen et al. [41] extended HFS in 2013 and proposed IVHFS, which used multiple interval values to represent the possible membership values. The introduction of interval values enhanced the degree of hesitation preference of decision information and made the decision results more reasonable and relevant to reality. IVHFS is a powerful tool for addressing the problem of interval-valued evaluations with many uncertainties. In this study, we use IVHFS to process multiple sets of code features for more comprehensive similarity evaluations.

3.4.2. Feature Fusion Based on IVHFS

In this paper, IVHFS was used to fuse multiple features of source code by applying membership functions sequentially to each vector in vector matrices. For the non-empty attribute set

X = {x_{1}, x_{2}, . . ., x_{n}}

, the IVHFS on X can be expressed as

E = {< x, {\tilde{h}}_{E} (x) > | x \in X}

{\tilde{h}}_{E} (x)

denotes the set of possible interval numbers where the element x is subordinate to E, and these possible intervals are the subsets of

[0, 1]

H_{A}^{sga} = {h_{A_{1}}^{sga}, h_{A_{2}}^{sga}, . . ., h_{A_{i}}^{sga}, . . ., h_{A_{g 1}}^{sga}}

and

H_{B}^{sga} = {h_{B_{1}}^{sga}, h_{B_{2}}^{sga}, . . ., h_{B_{j}}^{sga}, . . ., h_{B_{g 2}}^{sga}}

are the vector matrices of code fragment A and code fragment B obtained by the SGA-GCN.

g_{1}

and

g_{2}

are the number of groups in the SGA. We take code fragment A as the subject and define the membership function of syntactic features as:

U_{A_{i}}^{sga} = [f_{sga}^{-}, f_{sga}^{+}] = [e^{- d_{max} (h_{A_{i}}^{sga}, h_{B}^{sga})}, e^{- d_{min} (h_{A_{i}}^{sga}, h_{B}^{sga})}] .

(9)

In (9),

f_{sga}^{-}

and

f_{sga}^{+}

represent the upper and lower limits of membership values.

d_{max} (h_{A_{i}}^{sga}, h_{B}^{sga}) = m a x_{j = 1}^{g_{2}} \frac{h_{A_{i}}^{sga} \cdot h_{B_{j}}^{sga}}{|h_{A_{i}}^{sga}| |h_{B_{j}}^{sga}|}

. Similarly,

d_{min} (h_{A_{i}}^{sga}, h_{B}^{sga}) = m i n_{j = 1}^{g_{2}} \frac{h_{A_{i}}^{sga} \cdot h_{B_{j}}^{sga}}{|h_{A_{i}}^{sga}| |h_{B_{j}}^{sga}|}

. The schematic diagram of Formula (9) is shown in Figure 6. It can be seen from the figure that each vector

h_{A_{i}}^{sga}

H_{A}^{sga}

is calculated with all vectors in

H_{B}^{sga}

. The maximum and minimum values are selected from the calculation results, and their negative numbers are taken as the exponential of the natural logarithm e, to obtain the membership value interval

U_{A_{i}}^{sga}

H_{A}^{cfg} = {h_{A_{1}}^{cfg}, h_{A_{2}}^{cfg}, . . ., h_{A_{i}}^{cfg}, . . ., h_{A_{m 1}}^{cfg}}

and

H_{B}^{cfg} = {h_{B_{1}}^{cfg}, h_{B_{2}}^{cfg}, . . ., h_{B_{j}}^{cfg}, . . ., h_{B_{m 2}}^{cfg}}

are the vector matrices of code fragment A and code fragment B obtained by the CFG-GCN.

m_{1}

and

m_{2}

are the number of nodes in the CFG. Similarly, we take code fragment A as the subject and define the membership function of semantic features as:

U_{A_{i}}^{cfg} = [f_{cfg}^{-}, f_{cfg}^{+}] = [m i n_{j = 1}^{m_{2}} E (h_{A_{i}}^{cfg}, h_{B_{j}}^{cfg}), m a x_{j = 1}^{m_{2}} E (h_{A_{i}}^{cfg}, h_{B_{j}}^{cfg})] .

(10)

E (h_{A_{i}}^{cfg}, h_{B_{j}}^{cfg})

in (10) represents the Euclidean distance between

h_{A_{i}}^{cfg}

and

h_{B_{j}}^{cfg}

E (h_{A_{i}}^{cfg}, h_{B_{j}}^{cfg}) = {(\sum_{l = 1}^{k} {({(h_{A_{i}}^{cfg})}_{l} - {(h_{B_{j}}^{cfg})}_{l})}^{2})}^{\frac{1}{2}},

(11)

where k denotes the dimension of

h^{cfg}

. The schematic diagram of Formula (10) is shown in Figure 7. The calculation process is similar to that in Figure 6.

h_{A_{i}}^{cfg}

H_{A}^{cfg}

is calculated with all vectors in

H_{B}^{cfg}

. The maximum and minimum values in the calculated results are taken as the upper and lower limits of the membership value interval

U_{A_{i}}^{cfg}

The membership function converts the relationship between code fragments A and B into the corresponding IVHFS. With code fragment A as the subject, the IVHFS between code fragments A and B is calculated as:

S = {[U_{A_{1}}^{sga}, . . ., U_{A_{g 1}}^{sga}], [U_{A_{1}}^{cfg}, . . ., U_{A_{m 1}}^{cfg}]} .

(12)

We set the interval membership value between the code fragment and itself as

[1, 1]

to construct the ideal IVHFS

S_{p}

, which will be used as a measure of similarity between code pairs. For example, the IVHFS of code fragment A and itself can be expressed as

{[[1, 1], . . ., [1, 1]], [[1, 1], . . ., [1, 1]]}

We conducted the same experiments with code fragment B as the subject and our experimental results showed little difference. Therefore, it was shown that with either code fragment A or B as the subject, the feature information contained in each vector of the vector matrix could be fully utilized, and the syntactic feature matrix and semantic feature matrix could be evaluated comprehensively.

3.4.3. Similarity Calculation

For ideal IVHFS

S_{p}

and real IVHFS S,

U_{S_{p}} (x_{i})

and

U_{S} (x_{i})

are the two interval-valued hesitant fuzzy elements of

S_{P}

and S. The generalized interval hesitation Hamming distance of

S_{P}

and S is expressed as

d_{H a m m i n g} (S_{p}, S) = \frac{1}{k} \sum_{i = 1}^{k} \frac{1}{2 l_{x_{σ (i)}}} \sum_{j = 1}^{l_{x_{σ (i)}}} (\begin{matrix} | U_{S_{p}}^{σ (j)} {(x_{σ (i)})}^{+} - U_{S}^{σ (j)} {(x_{σ (i)})}^{+} | \\ + | U_{S_{p}}^{σ (j)} {(x_{σ (i)})}^{-} - U_{S}^{σ (j)} {(x_{σ (i)})}^{-} | \end{matrix}),

(13)

where k is the number of evaluation indicators.

U_{S_{p}}^{σ (j)} (x_{σ (i)})

and

U_{S}^{σ (j)} (x_{σ (i)})

are the j-th largest interval number of

U_{S_{p}} (x_{i})

and

U_{S} (x_{i})

l_{x_{σ} (i)}

denotes the maximum of

l_{S_{p}} (x_{i})

and

l_{S} (x_{i})

, i.e.,

l_{x_{σ} (i)} = m a x (l_{S_{p}} (x_{i}), l_{S} (x_{i}))

l_{S_{p}} (x_{i})

and

l_{S} (x_{i})

denote the number of interval values of

U_{S_{p}} (x_{i})

and

U_{S} (x_{i})

. The similarity of two IVHFS is inversely proportional to the distance between them. Therefore, the similarity formula can be expressed as

s i m (S_{p}, S) = 1 - d_{H a m m i n g} (S_{p}, S) .

(14)

Then, we can treat the output

s i m (S_{p}, S)

as the similarity value

\hat{s}

of two code fragments. The loss function is defined as binary cross-entropy:

L = \sum (- (y \cdot l o g (\hat{s}) + (1 - y) \cdot l o g (1 - \hat{s}))),

(15)

where y represents the label of two code fragments.

The DG-IVHFS is stored after all the parameters are optimized. In the prediction phase, we take two code fragments as input, then we can obtain the predicted value by the following definitions:

prediction = \{\begin{matrix} 1, & \hat{s} > δ \\ 0, & \hat{s} \leq δ \end{matrix},

(16)

where

δ

is the threshold between true and false clones.

4. Experiments

4.1. Datasets

We evaluated DG-IVHFS on two public datasets written in Java.

One of them was Google Code Jam (GCJ) [42], which contains 1669 items solving 12 competition problems. Each competition topic is designed to solve the same problem. Two code files are considered a clone pair if they solve the same problem. Otherwise, they are considered a non-clone pair.

The other dataset was BigCloneBench (BCB) [43], a widely used dataset for clone detection. The BCB consists of 25,000 items covering 10 functions and contains over 6 million clone pairs with a true label and 260,000 clone pairs with a false label. In the experiments, code fragments without labels were discarded, leaving 9134 code fragments at the end.

The basic information of clone and non-clone pairs we selected in the experiment is shown in Table 1.

4.2. Comparison Models

Tree-based methods:

CDLH [14] is an AST-based clone detection method that uses Tree-LSTM to encode AST and compares Hamming distance between hash vectors.
ASTNN [15] is an advanced AST-based clone detection method that first decomposes the AST into multiple statement trees, and then encodes them separately using recursive encoders to generate vector representations.

Graph-based method:

DeepSim [23] is a graph-based clone detection method that uses semantic matrices to encode control flow and data flow information.

Hybrid methods:

FCCA [28] is a hybrid method that retains three types of information (text, AST and CFG) in the code representations.
FA-AST [29] is a hybrid method that expands the AST with control flow edges and data flow edges to learn semantic and syntactic information.

4.3. Experimental Setting

We obtained the embedding of nodes in SGA using word2vec with the Skip-gram algorithm and the embedding dimension was set to 128. The hidden dimension in Bi-GRU was 100. Doc2vec was used to generate vectors of the nodes in the CFG, and the dimension of vectors was 300. We used the Adam [44] optimizer to train DG-IVHFS. The initial learning rate was 0.001. The mini-batch size was set to 64. The number of GCN layers was 2, which was the optimal number of layers we tested in our experiments. The threshold

δ

of code cloning was set to 0.5. To ensure the accuracy of the results, we used 10-fold cross-validation for training and testing in both two datasets. Specifically, we divided the dataset we built into 10 subsets. One was used as a test set and the remaining nine were used as training sets. We constructed 10 different experiments with different test sets. Finally, the average of these 10 results was reported. In order to ensure the fairness of the comparison, we compared our model and baseline models under the same experimental settings, and used the same evaluation metrics used in previous work: precision (P), recall (R) and F1-score (F1).

We conducted the experiments on a server with 10 cores of 3.7GHz CPU and two NVIDIA GeForce RTX 3080 GPUs.

4.4. Research Questions

We conducted the experiments and analyzed the results according to the following research questions (RQs):

RQ1: What is the overall effectiveness of DG-IVHFS compared with other baseline approaches?
RQ2: What is the time consumption of DG-IVHFS compared with other baseline approaches?
RQ3: How do the group representations obtained by simplifying and grouping the AST affect the effectiveness of DG-IVHFS?
RQ4: How does the Dual-GCN affect the effectiveness of DG-IVHFS?
RQ5: How does the IVHFS affect the effectiveness of DG-IVHFS?

4.5. Results and Discussion

4.5.1. RQ1: What Is the Overall Effectiveness of DG-IVHFS Compared with Other Baseline Approaches?

The detailed comparison results on BCB and GCJ are shown in Table 2.

From the results in Table 2, we can see that DG-IVHFS improved the F1 scores from 94 to 97% on BCB, and from 92 to 95% on GCJ. CDLH and DeepSim only modeled syntactic or semantic information of code snippets. The single feature information made it difficult for them to deal with the complicated code clone detection task. ASTNN also used single syntactic information, but it performed better than CDLH and DeepSim. This is because ASTNN divided the AST into several subtrees, which simplified the AST structure to some extent. However, it ignored the dependencies between nodes in the AST. FCCA and FA-AST used multiple features of source code, but they still did not fully explore the relationships between code statements.

Compared with these methods, we believe that there are four main reasons for the higher performance of our model. First, we obtained syntactic and semantic information from AST and CFG levels, respectively, which made the obtained code features more comprehensive and sufficient. Secondly, we simplified and grouped the AST, which enhanced the connection information between nodes. Thirdly, we constructed graph representations of the simplified and grouped AST, as well as the CFG, and modeled them using a Dual-GCN to capture better semantic and syntactic features. Finally, we used IVHFS to fully utilize the feature information contained in each vector to achieve better feature fusion.

4.5.2. RQ2: What Is the Time Consumption of DG-IVHFS Compared with Other Baseline Approaches?

We conducted experiments on the time performance of these methods on the BCB dataset. The time consumption of DG-IVHFS and the baseline models in the training and prediction phases are visually represented in Figure 8 and Figure 9.

CDLH and FCCA took the longest training time because they used a large number of RNNs with low parallelism. ASTNN used the complicated recursive traversal algorithm in the encoding of statement subtrees, which increased training time consumption. In addition, it had the longest prediction time. FA-AST used a graph matching network to transmit information between two graphs. Each node of one graph match similar nodes in the other, which leads to high complexity and long training time. Our model used a highly parallel GCN to process graph and tree structures, while simplifying and grouping the AST reduced the complexity of the convolution operations. DeepSim consumed a little less training time than our model, but it was far inferior in terms of detection effectiveness.

4.5.3. RQ3: How Do the Group Representations Obtained by Simplifying and Grouping the AST Affect the Effectiveness of DG-IVHFS?

In this section, we design Variant 1 to illustrate the importance of simplifying and grouping the AST. The node representations of the whole AST were obtained and the Bi-GRU was utilized to model the naturalness of the statements. The syntactic feature representations and adjacency matrix were fed into the syntax-based GCN. Other settings remained unchanged. The ablation experiment results are reported in Table 3.

The results show that the process of simplifying and grouping the AST to obtain group representations improves the performance of the model. This is because simplifying the AST can effectively remove redundant information and direct more attention to the important nodes. Grouping the simplified AST further simplifies the structure of AST, which is conducive to subsequent convolution operations in the GCN.

4.5.4. RQ4: How Does the Dual-GCN Affect the Effectiveness of DG-IVHFS?

In this section, we consider three variants of DG-IVHFS to explore the effectiveness of the Dual-GCN.

Variant 2: The SGA-GCN and CFG-GCN, i.e., Dual-GCN, were removed on the basis of DG-IVHFS. We used IVHFS to directly handle the original semantic and syntactic feature vectors. Other settings remained unchanged.
Variant 3: The CFG-GCN was removed on the basis of DG-IVHFS. We used IVHFS to deal with the initial semantic feature vectors, as well as the syntactic feature vectors enhanced by the SGA-GCN. Other settings remained unchanged.
Variant 4: The SGA-GCN was removed on the basis of DG-IVHFS. We used IVHFS to deal with the initial syntactic feature vectors, as well as the semantic feature vectors enhanced by the CFG-GCN. Other settings remained unchanged.

We compared our original model with the three variants in P, R and F1. The results are shown in Table 4.

The performance of Variant 2 was the worst, because it did not learn the enhanced semantic and syntactic structure information in the AST and CFG. When GCN was applied to the simplified AST on the basis of Variant 2, Variant 3 was obtained and its performance substantially improved. Variant 3 could learn the enhanced syntactic information in the AST and the connection information between nodes from the GCN. When applying the GCN to the CFG on the basis of Variant 2, Variant 4 was obtained. Variant 4 improved performance significantly more than Variant 3, indicating that the enhanced semantic information and relationships between statement nodes learned from the CFG have a great impact on code clone detection. When the SGA-GCN and CFG-GCN were added to Variant 2, DG-IVHFS was obtained and it achieved the best performance. It could fully learn the enhanced semantic information from the CFG, the enhanced syntactic information from the AST, and the relationships between nodes in the graph representations.

4.5.5. RQ5: How Does the IVHFS Affect the Effectiveness of DG-IVHFS?

In this section, we consider Variant 5 to explore the effectiveness of the IVHFS. Variant 5 was given by removing the IVHFS part from the original model DG-IVHFS. We applied average pooling to vector matrices generated by the SGA-GCN and the CFG-GCN, and concatenated the pooling results to obtain the final representation of code snippets. Then, we usee a deep neural network (DNN) classifier with a softmax layer and two hidden layers to complete clone detection. The feature concatenation process in Variant 5 is shown in Figure 10. Other settings remained unchanged.

We compare our original model DG-IVHFS with Variant 5 for P, R and F1. The results are shown in Table 5.

The experimental results show that DG-IVHFS has better performance than Variant 5. This is because simple pooling and concatenation strategies lead to the loss of feature information when semantic and syntactic features are merged. In contrast, we exploit the unique advantage of IVHFS in dealing with multiple attributes and fully utilize the feature information contained in each vector of the vector matrices. DG-IVHFS achieves better feature fusion.

4.6. Limitations

First, we only selected five baseline methods to compare because most code clone detection methods are not available. We would like to compare with more methods in the future. Second, DG-IVHFS focuses on detecting code clones on Java code fragments, so its detection performance in other languages need to be verified. Future studies should extend DG-IVHFS to accommodate other programming languages.

5. Conclusions

This paper proposes a code clone detection model based on Dual-GCN and IVHFS, which can effectively extract and fuse semantic and syntactic features of source code. We constructed the AST and CFG structures according to the code pairs. Multiple structures help the model obtain abundant feature information of source code. Considering that the complexity of AST can have negative effects on feature extraction, we removed redundant nodes in AST and designed a method to divide AST into several groups. Dual-GCN was used to extract more precise semantic and syntactic features from the CFG, as well as from the simplified and grouped AST. Finally, we utilized the IVHFS theory to process the comprehensive features extracted by Dual-GCN.

The results of these comparative experiments show that DG-IVHFS not only achieves higher performance in a code clone detection task, but also performs well in terms of time consumption. Subsequent ablation experiments showed that simplifying and grouping AST significantly improved the detection performance of the model. The Dual-GCN sufficiently learned the dependencies between code statements in different structures. The use of IVHFS effectively avoids the loss of feature information and achieves better similarity calculation.

In future work, we plan to apply our model to other software engineering tasks, explore more advanced methods of simplifying and grouping AST, and try to simplify the structure of CFG to further improve the performance of the model.

Author Contributions

Conceptualization, H.Y. and Z.L.; methodology, H.Y. and Z.L.; validation, H.Y.; formal analysis, H.Y. and Z.L.; investigation, H.Y. and Z.L.; resources, H.Y. and X.G.; data curation, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y., Z.L. and X.G.; supervision, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Hebei Province under Grant No. F2020201016. Any opinions, findings or conclusions expressed in this work are those of the authors and do not reflect the views of the funding agencies in any sense.

Acknowledgments

We thank the reviewers for their constructive comments to help us improve the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

Roy, C.K.; Cordy, J.R. A survey on software clone detection research. Queens Sch. Comput. TR 2007, 541, 64–68. [Google Scholar]
Göde, N.; Koschke, R. Incremental clone detection. In Proceedings of the 13th European Conference on Software Maintenance and Reengineering, Kaiserslautern, Germany, 24–27 March 2009; pp. 219–228. [Google Scholar]
Kamiya, T.; Kusumoto, S.; Inoue, K. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng. 2002, 28, 654–670. [Google Scholar] [CrossRef] [Green Version]
Wang, P.; Svajlenko, J.; Wu, Y.; Xu, Y.; Roy, C.K. CCAligner: A token based large-gap clone detector. In Proceedings of the 40th International Conference on Software Engineering, Gothenburg, Sweden, 27 May–3 June 2018; pp. 1066–1077. [Google Scholar]
Li, L.; Feng, H.; Zhuang, W.; Meng, N.; Ryder, B. Cclearner: A deep learning-based clone detection approach. In Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), Shanghai, China, 17–22 September 2017; pp. 249–260. [Google Scholar]
Sajnani, H.; Saini, V.; Svajlenko, J.; Roy, C.K.; Lopes, C.V. Sourcerercc: Scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering, Austin, TX, USA, 14 May 2016; pp. 1157–1168. [Google Scholar]
Roy, C.K.; Cordy, J.R. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In Proceedings of the 16th IEEE International Conference on Program Comprehension, Amsterdam, The Netherlands, 10–13 June 2008; pp. 172–181. [Google Scholar]
Ducasse, S.; Rieger, M.; Demeyer, S. A language independent approach for detecting duplicated code. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM), Oxford, UK, 30 August–3 September 1999; pp. 109–118. [Google Scholar]
Jadon, S. Code clones detection using machine learning technique: Support vector machine. In Proceedings of the 2016 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 29–30 April 2016; pp. 399–303. [Google Scholar]
Ragkhitwetsagul, C.; Krinke, J. Using compilation/decompilation to enhance clone detection. In Proceedings of the 11th International Workshop on Software Clones (IWSC), Klagenfurt, Austria, 21 February 2017; pp. 1–7. [Google Scholar]
Yu, D.; Wang, J.; Wu, Q.; Yang, J.; Wang, J.; Yang, W.; Yan, W. Detecting Java code clones with multi-granularities based on bytecode. In Proceedings of the 41st Annual Computer Software and Applications Conference (COMPSAC), Turin, Italy, 4–8 July 2017; pp. 317–326. [Google Scholar]
Jiang, L.; Misherghi, G.; Su, Z.; Glondu, S. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering (ICSE), Minneapolis, MN, USA, 20–26 May 2007; pp. 96–105. [Google Scholar]
Mou, L.; Li, G.; Zhang, L.; Wang, T.; Jin, Z. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the 30th AAAI conference on artificial intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1–7. [Google Scholar]
Wei, H.; Li, M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 3034–3040. [Google Scholar]
Zhang, J.; Wang, X.; Zhang, H.; Sun, H.; Wang, K.; Liu, X. A novel neural source code representation based on abstract syntax tree. In Proceedings of the 41st International Conference on Software Engineering (ICSE), Montréal, QC, Canada, 25–31 May 2019; pp. 783–794. [Google Scholar]
Bui, N.D.; Yu, Y.; Jiang, L. Infercode: Self-supervised learning of code representations by predicting subtrees. In Proceedings of the 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 25–28 May 2021; pp. 1186–1197. [Google Scholar]
Jo, Y.-B.; Lee, J.; Yoo, C.-J. Two-Pass technique for clone detection and type classification using tree-based convolution neural network. Appl. Sci. 2021, 11, 6613. [Google Scholar] [CrossRef]
Liang, H.; Ai, L. AST-path based compare-aggregate network for code clone detection. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
Hu, Y.; Zou, D.; Peng, J.; Wu, Y.; Shan, J.; Jin, H. TreeCen: Building tree graph for scalable semantic code clone detection. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Oakland Center, MI, USA, 10–14 October 2022; pp. 1–12. [Google Scholar]
Wu, Y.; Feng, S.; Zou, D.; Jin, H. Detecting semantic code clones by building AST-based Markov chains model. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Oakland Center, MI, USA, 10–14 October 2022; pp. 1–13. [Google Scholar]
Komondoor, R.; Horwitz, S. Using slicing to identify duplication in source code. In Proceedings of the International static analysis symposium, Paris, France, 16–18 July 2001; pp. 40–56. [Google Scholar]
Krinke, J. Identifying similar code with program dependence graphs. In Proceedings of the 8th Working Conference on Reverse Engineering, Stuttgart, Germany, 2–5 October 2001; pp. 301–309. [Google Scholar]
Zhao, G.; Huang, J. Deepsim: Deep learning code functional similarity. In Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena, Vista, FL, USA, 4–9 November 2018; pp. 141–151. [Google Scholar]
Yuan, Y.; Kong, W.; Hou, G.; Hu, Y.; Watanabe, M.; Fukuda, A. From local to global semantic clone detection. In Proceedings of the 6th International Conference on Dependable Systems and Their Applications (DSA), Harbin, China, 3–6 January 2020; pp. 13–24. [Google Scholar]
Wu, Y.; Zou, D.; Dou, S.; Yang, S.; Yang, W.; Cheng, F.; Liang, H.; Jin, H. SCDetector: Software functional clone detection based on semantic tokens analysis. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, New York, NY, USA, 21–25 September 2020; pp. 821–833. [Google Scholar]
Zou, Y.; Ban, B.; Xue, Y.; Xu, Y. CCGraph: A PDG-based code clone detector with approximate graph matching. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), New York, NY, USA, 21–25 September 2020; pp. 931–942. [Google Scholar]
Sheneamer, A.; Kalita, J. Semantic clone detection using machine learning. In Proceedings of the 15th IEEE international conference on machine learning and applications (ICMLA), Anaheim, CA, USA, 18–20 December 2016; pp. 1024–1028. [Google Scholar]
Hua, W.; Sui, Y.; Wan, Y.; Liu, G.; Xu, G. Fcca: Hybrid code representation for functional clone detection using attention networks. IEEE Trans. Reliab. 2020, 70, 304–318. [Google Scholar] [CrossRef]
Wang, W.; Li, G.; Ma, B.; Xia, X.; Jin, Z. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In Proceedings of the 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), London, ON, Canada, 18–21 February 2020; pp. 261–271. [Google Scholar]
Tufano, M.; Watson, C.; Bavota, G.; Di Penta, M.; White, M.; Poshyvanyk, D. Deep learning similarities from different representations of source code. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR), Gothenburg, Sweden, 27 May–3 June 2018; pp. 542–553. [Google Scholar]
javalang. Available online: https://github.com/c2nes/javalang (accessed on 29 March 2022).
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
A Java optimization framework (Soot). Available online: https://github.com/Sable/soot (accessed on 2 March 2020).
Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196. [Google Scholar]
Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
Lu, Z.; Li, R.; Hu, H.; Zhou, W. A code clone detection algorithm based on graph convolution network with AST tree edge. In Proceedings of the 2021 IEEE 21st International Conference on Software Quality, Reliability and Security Companion (QRS-C), Hainan, China, 6–10 December 2021; pp. 1027–1032. [Google Scholar]
Gao, X.; Jiang, X.; Wu, Q.; Wang, X.; Lyu, C.; Lyu, L. Multi-modal code summarization fusing local API dependency graph and AST. In Proceedings of the Neural Information Processing: 28th International Conference, ICONIP 2021, Denpasar, Indonesia, 8–12 December 2021; pp. 290–297. [Google Scholar]
Zadeh, L.A. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst. 1978, 1, 3–28. [Google Scholar] [CrossRef]
Torra, V. Hesitant fuzzy sets. Int. J. Intell. Syst. 2010, 25, 529–539. [Google Scholar] [CrossRef]
Chen, N.; Xu, Z.; Xia, M. Correlation coefficients of hesitant fuzzy sets and their applications to clustering analysis. Appl. Math. Model. 2013, 37, 2197–2211. [Google Scholar] [CrossRef]
Google Code Jam. Available online: https://code.google.com/codejam/contests.html (accessed on 8 October 2016).
Svajlenko, J.; Islam, J.F.; Keivanloo, I.; Roy, C.K.; Mia, M.M. Towards a big data curated benchmark of inter-project code clones. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, 29 September–3 October 2014; pp. 476–480. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]

Figure 1. The overall framework of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set: (a) generation of the initial syntax and semantics representations, (b) the dual graph convolutional networks capture the relationships between nodes and enhances the features and (c) feature fusion and similarity calculation processes of code fragments A and B.

Figure 2. The original abstract syntax tree of code fragment

s u m_f u n 1 . j a v a

. In the abstract syntax tree, the green, white and yellow boxes represent special attribute nodes, attribute nodes and code nodes, respectively. Too many attribute nodes in the tree result in overcomplexity of the tree structure.

Figure 2. The original abstract syntax tree of code fragment

s u m_f u n 1 . j a v a

Figure 3. The simplifying and grouping process of the target abstract syntax tree. The attribute nodes are removed and the size of the tree is significantly reduced.

D e c l a r a t i o n

S t a t e m e n t

nodes and their leaf nodes are grouped into the same group. Dotted boxes with different colors indicate different groups.

Figure 3. The simplifying and grouping process of the target abstract syntax tree. The attribute nodes are removed and the size of the tree is significantly reduced.

D e c l a r a t i o n

S t a t e m e n t

nodes and their leaf nodes are grouped into the same group. Dotted boxes with different colors indicate different groups.

Figure 4. The generation process of group representation

r

Figure 4. The generation process of group representation

r

Figure 5. The CFG of source code

s u m_f u n 1 . j a v a

Figure 5. The CFG of source code

s u m_f u n 1 . j a v a

Figure 6. Schematic diagram of Formula (9).

Figure 7. Schematic diagram of Formula (10).

Figure 8. Time consumption of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set, as well as other baseline models in the training phase.

Figure 9. Time consumption of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set, as well as other baseline models in the prediction phase.

Figure 10. The feature concatenation process in Variant 5.

Table 1. Basic information about BCB and GCJ in our experiment.

	BCB	GCJ
Code fragments	9134	1669
Clone pairs	270,000	275,570
Non-clone pairs	270,000	275,570

Table 2. Comparison of detection performance on BCB and GCJ.

Groups	Approaches	BCB			GCJ
Groups	Approaches	P	R	F1	P	R	F1
Tree-based	CDLH	0.92	0.74	0.82	0.47	0.73	0.57
Tree-based	ASTNN	0.99	0.88	0.93	0.95	0.87	0.91
Graph-based	DeepSim	0.92	0.94	0.93	0.70	0.83	0.76
Hybrid	FCCA	0.98	0.92	0.94	0.95	0.90	0.92
	FA-AST	0.95	0.91	0.92	0.96	0.88	0.91
	DG-IVHFS	0.98	0.97	0.97	0.98	0.93	0.95

Table 3. The ablation experiment results on BCB and GCJ of group representations.

Approaches	BCB			GCJ
Approaches	P	R	F1	P	R	F1
Variant 1	0.96	0.94	0.94	0.95	0.89	0.91
DG-IVHFS	0.98	0.97	0.97	0.98	0.93	0.95

Table 4. The ablation experiment results on BCB and GCJ of Dual-GCN.

Approaches	BCB			GCJ
Approaches	P	R	F1	P	R	F1
Variant 2	0.71	0.79	0.74	0.69	0.75	0.71
Variant 3	0.89	0.82	0.85	0.86	0.83	0.84
Variant 4	0.92	0.87	0.89	0.90	0.88	0.88
DG-IVHFS	0.98	0.97	0.97	0.98	0.93	0.95

Table 5. The ablation experiment results on BCB and GCJ of IVHFS.

Approaches	BCB			GCJ
Approaches	P	R	F1	P	R	F1
Variant 5	0.92	0.95	0.93	0.92	0.93	0.92
DG-IVHFS	0.98	0.97	0.97	0.98	0.93	0.95

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yang, H.; Li, Z.; Guo, X. A Novel Source Code Clone Detection Method Based on Dual-GCN and IVHFS. Electronics 2023, 12, 1315. https://doi.org/10.3390/electronics12061315

AMA Style

Yang H, Li Z, Guo X. A Novel Source Code Clone Detection Method Based on Dual-GCN and IVHFS. Electronics. 2023; 12(6):1315. https://doi.org/10.3390/electronics12061315

Chicago/Turabian Style

Yang, Haixin, Zhen Li, and Xinyu Guo. 2023. "A Novel Source Code Clone Detection Method Based on Dual-GCN and IVHFS" Electronics 12, no. 6: 1315. https://doi.org/10.3390/electronics12061315

APA Style

Yang, H., Li, Z., & Guo, X. (2023). A Novel Source Code Clone Detection Method Based on Dual-GCN and IVHFS. Electronics, 12(6), 1315. https://doi.org/10.3390/electronics12061315

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Novel Source Code Clone Detection Method Based on Dual-GCN and IVHFS

Abstract

1. Introduction

1.1. Related Work

1.2. Contributions

2. Problem Definition

3. Material and Methods

3.1. Syntax Representation Generation

3.1.1. Simplifying and Grouping AST

3.1.2. Group Embedding

3.1.3. Dependencies Generation

3.2. Semantics Representation Generation

3.3. Dual-GCN

3.3.1. Related Theory

3.3.2. SGA-GCN

3.3.3. CFG-GCN

3.4. Feature Fusion and Similarity Calculations

3.4.1. Related Theory

3.4.2. Feature Fusion Based on IVHFS

3.4.3. Similarity Calculation

4. Experiments

4.1. Datasets

4.2. Comparison Models

4.3. Experimental Setting

4.4. Research Questions

4.5. Results and Discussion

4.5.1. RQ1: What Is the Overall Effectiveness of DG-IVHFS Compared with Other Baseline Approaches?

4.5.2. RQ2: What Is the Time Consumption of DG-IVHFS Compared with Other Baseline Approaches?

4.5.3. RQ3: How Do the Group Representations Obtained by Simplifying and Grouping the AST Affect the Effectiveness of DG-IVHFS?

4.5.4. RQ4: How Does the Dual-GCN Affect the Effectiveness of DG-IVHFS?

4.5.5. RQ5: How Does the IVHFS Affect the Effectiveness of DG-IVHFS?

4.6. Limitations

5. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI