A Novel Source Code Clone Detection Method Based on Dual-GCN and IVHFS
Figure 1. The overall framework of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set: (a) generation of the initial syntax and semantics representations; (b) the dual graph convolutional networks capture the relationships between nodes and enhance the features; (c) feature fusion and similarity calculation for code fragments A and B.
Figure 2. The original abstract syntax tree of code fragment sum_fun1.java. In the abstract syntax tree, the green, white and yellow boxes represent special attribute nodes, attribute nodes and code nodes, respectively. Too many attribute nodes make the tree structure overly complex.
Figure 3. The simplifying and grouping process of the target abstract syntax tree. The attribute nodes are removed and the size of the tree is significantly reduced. Declaration or Statement nodes and their leaf nodes are placed in the same group. Dotted boxes with different colors indicate different groups.
Figure 4. The generation process of group representation $\mathbf{r}$.
Figure 5. The CFG of source code sum_fun1.java.
Figure 6. Schematic diagram of Formula (9).
Figure 7. Schematic diagram of Formula (10).
Figure 8. Time consumption of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set, as well as other baseline models, in the training phase.
Figure 9. Time consumption of our model based on dual graph convolutional networks and interval-valued hesitant fuzzy set, as well as other baseline models, in the prediction phase.
Figure 10. The feature concatenation process in Variant 5.
Abstract
1. Introduction
1.1. Related Work
1.2. Contributions
- Because the raw AST is structurally complex, we design a new syntactic feature representation that simplifies the AST while preserving important syntactic information, which makes the subsequent convolution operations more convenient.
- We propose a Dual-GCN to learn the relationships between statements, which enhances the syntactic and semantic features of the source code.
- We introduce IVHFS into the model and develop a new feature fusion method, which enables our model to achieve accurate similarity calculations. Subsequent ablation experiment results demonstrate the effectiveness of IVHFS.
- To evaluate the effectiveness and time cost of DG-IVHFS, we conduct experiments on the BigCloneBench and GoogleCodeJam datasets. DG-IVHFS reaches an F1 of 0.97 on BigCloneBench and 0.95 on GoogleCodeJam, compared with 0.94 and 0.92 for the strongest baseline, FCCA.
2. Problem Definition
Listing 1.
Listing 2.
3. Material and Methods
3.1. Syntax Representation Generation
3.1.1. Simplifying and Grouping AST
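The caption of Figure 3 describes two steps: attribute nodes are removed from the AST, and each Declaration or Statement node is grouped with its leaf nodes. The following is a minimal Python sketch of that idea on a toy tree. The `Node` class, the two node kinds, the rule for re-attaching children of a removed attribute node, and the rule that a group extends down to the next nested Declaration or Statement node are all assumptions made for illustration; they are not taken from the authors' implementation or from the javalang node model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical node kinds loosely following the Figure 2 legend.
ATTRIBUTE = "attribute"
CODE = "code"


@dataclass
class Node:
    label: str
    kind: str = CODE
    children: List["Node"] = field(default_factory=list)


def simplify(node: Node) -> Node:
    """Return a copy of the tree with attribute nodes removed.

    Assumption: children of a removed attribute node are re-attached
    to the nearest surviving ancestor.
    """
    kept: List[Node] = []
    for child in node.children:
        reduced = simplify(child)
        if reduced.kind == ATTRIBUTE:
            kept.extend(reduced.children)  # splice grandchildren upward
        else:
            kept.append(reduced)
    return Node(node.label, node.kind, kept)


def is_group_root(node: Node) -> bool:
    """A group starts at any Declaration or Statement node."""
    return node.label.endswith(("Declaration", "Statement"))


def group(root: Node) -> List[List[str]]:
    """Split a simplified tree into groups in pre-order.

    Assumption: a group holds its root plus all descendants that are
    not inside a nested Declaration/Statement group.
    """
    groups: List[List[str]] = []

    def visit(node: Node, current: Optional[List[str]]) -> None:
        if is_group_root(node):
            current = [node.label]
            groups.append(current)
        elif current is not None:
            current.append(node.label)
        for child in node.children:
            visit(child, current)

    visit(root, None)
    return groups


# Toy tree, not the real AST of sum_fun1.java.
tree = Node("MethodDeclaration", children=[
    Node("modifiers", ATTRIBUTE, [Node("public")]),
    Node("sum_fun1"),
    Node("LocalVariableDeclaration", children=[
        Node("type", ATTRIBUTE, [Node("int")]),
        Node("sum"),
    ]),
    Node("ReturnStatement", children=[Node("sum")]),
])
groups = group(simplify(tree))
for g in groups:
    print(g)
```

On this toy input the sketch yields three groups, one per Declaration or Statement node, and the two attribute nodes no longer appear in any group.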
3.1.2. Group Embedding
3.1.3. Dependencies Generation
3.2. Semantics Representation Generation
3.3. Dual-GCN
3.3.1. Related Theory
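As background for the Dual-GCN, the graph convolution layer of Kipf and Welling (listed in the references) propagates node features with the rule $H' = \sigma(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W)$, where $\hat{D}$ is the degree matrix of $A + I$. The sketch below implements one such step in plain Python. It is a generic illustration, not the SGA-GCN or CFG-GCN defined in this paper: the toy graph, the identity weight matrix, the ReLU activation, and the choice to treat edges as undirected are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(edges: Sequence[Tuple[int, int]], n: int) -> Matrix:
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected graph on n nodes."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(a_hat: Matrix, h: Matrix, w: Matrix) -> Matrix:
    """One propagation step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, x) for x in row] for row in z]


# Toy graph: three statements in a chain 0 - 1 - 2 with 2-d features.
a_hat = normalized_adjacency([(0, 1), (1, 2)], n=3)
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, so only the aggregation is visible
h1 = gcn_layer(a_hat, h0, w)
for row in h1:
    print([round(x, 3) for x in row])
```

After one step each node's feature vector is a degree-weighted mix of its own features and those of its neighbors, which is the sense in which a GCN captures relationships between statements.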
3.3.2. SGA-GCN
3.3.3. CFG-GCN
3.4. Feature Fusion and Similarity Calculations
3.4.1. Related Theory
3.4.2. Feature Fusion Based on IVHFS
3.4.3. Similarity Calculation
4. Experiments
4.1. Datasets
4.2. Comparison Models
- CDLH [14] is an AST-based clone detection method that uses a Tree-LSTM to encode the AST and compares the Hamming distance between the resulting hash vectors.
- ASTNN [15] is an advanced AST-based clone detection method that first decomposes the AST into multiple statement trees, and then encodes them separately using recursive encoders to generate vector representations.
- DeepSim [23] is a graph-based clone detection method that uses semantic matrices to encode control flow and data flow information.
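To make the Hamming-distance comparison used by CDLH concrete, here is a small Python sketch. The two 8-bit codes and the decision threshold are made-up values chosen for illustration; they do not come from CDLH or from this paper.

```python
from typing import Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("hash codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical hash codes for two code fragments.
code_a = [1, 0, 1, 1, 0, 0, 1, 0]
code_b = [1, 0, 0, 1, 0, 1, 1, 0]

distance = hamming_distance(code_a, code_b)
threshold = 2  # hypothetical decision threshold
is_clone = distance <= threshold
print(distance, is_clone)
```

A smaller distance means the two hash codes agree in more positions, so a pair is reported as a clone when the distance falls at or below the chosen threshold.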
4.3. Experimental Setting
4.4. Research Questions
- RQ1: What is the overall effectiveness of DG-IVHFS compared with other baseline approaches?
- RQ2: What is the time consumption of DG-IVHFS compared with other baseline approaches?
- RQ3: How do the group representations obtained by simplifying and grouping the AST affect the effectiveness of DG-IVHFS?
- RQ4: How does the Dual-GCN affect the effectiveness of DG-IVHFS?
- RQ5: How does the IVHFS affect the effectiveness of DG-IVHFS?
4.5. Results and Discussion
4.5.1. RQ1: What Is the Overall Effectiveness of DG-IVHFS Compared with Other Baseline Approaches?
4.5.2. RQ2: What Is the Time Consumption of DG-IVHFS Compared with Other Baseline Approaches?
4.5.3. RQ3: How Do the Group Representations Obtained by Simplifying and Grouping the AST Affect the Effectiveness of DG-IVHFS?
4.5.4. RQ4: How Does the Dual-GCN Affect the Effectiveness of DG-IVHFS?
- Variant 2: Starting from DG-IVHFS, both the SGA-GCN and the CFG-GCN (i.e., the whole Dual-GCN) were removed. IVHFS directly handled the original semantic and syntactic feature vectors. Other settings remained unchanged.
- Variant 3: Starting from DG-IVHFS, the CFG-GCN was removed. IVHFS handled the initial semantic feature vectors together with the syntactic feature vectors enhanced by the SGA-GCN. Other settings remained unchanged.
- Variant 4: Starting from DG-IVHFS, the SGA-GCN was removed. IVHFS handled the initial syntactic feature vectors together with the semantic feature vectors enhanced by the CFG-GCN. Other settings remained unchanged.
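The three variants above differ only in which branch of the Dual-GCN is kept, so they can be summarized as two switches. The sketch below records that reading as a small Python configuration table. The class and field names are hypothetical and are not part of the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switches describing which GCN branch is active."""
    name: str
    use_sga_gcn: bool  # enhance syntactic features with the SGA-GCN
    use_cfg_gcn: bool  # enhance semantic features with the CFG-GCN


CONFIGS = [
    AblationConfig("Variant 2", use_sga_gcn=False, use_cfg_gcn=False),
    AblationConfig("Variant 3", use_sga_gcn=True, use_cfg_gcn=False),
    AblationConfig("Variant 4", use_sga_gcn=False, use_cfg_gcn=True),
    AblationConfig("DG-IVHFS", use_sga_gcn=True, use_cfg_gcn=True),
]


def describe(cfg: AblationConfig) -> str:
    """Spell out which feature vectors reach the IVHFS fusion step."""
    syntax = "SGA-GCN enhanced" if cfg.use_sga_gcn else "initial"
    semantics = "CFG-GCN enhanced" if cfg.use_cfg_gcn else "initial"
    return f"{cfg.name}: syntactic = {syntax}, semantic = {semantics}"


for cfg in CONFIGS:
    print(describe(cfg))
```

In every configuration the IVHFS fusion step is kept, so any difference in the results is attributable to the GCN branches alone.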
4.5.5. RQ5: How Does the IVHFS Affect the Effectiveness of DG-IVHFS?
4.6. Limitations
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Roy, C.K.; Cordy, J.R. A survey on software clone detection research. Queens Sch. Comput. TR 2007, 541, 64–68.
2. Göde, N.; Koschke, R. Incremental clone detection. In Proceedings of the 13th European Conference on Software Maintenance and Reengineering, Kaiserslautern, Germany, 24–27 March 2009; pp. 219–228.
3. Kamiya, T.; Kusumoto, S.; Inoue, K. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng. 2002, 28, 654–670.
4. Wang, P.; Svajlenko, J.; Wu, Y.; Xu, Y.; Roy, C.K. CCAligner: A token based large-gap clone detector. In Proceedings of the 40th International Conference on Software Engineering, Gothenburg, Sweden, 27 May–3 June 2018; pp. 1066–1077.
5. Li, L.; Feng, H.; Zhuang, W.; Meng, N.; Ryder, B. CCLearner: A deep learning-based clone detection approach. In Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), Shanghai, China, 17–22 September 2017; pp. 249–260.
6. Sajnani, H.; Saini, V.; Svajlenko, J.; Roy, C.K.; Lopes, C.V. SourcererCC: Scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering, Austin, TX, USA, 14 May 2016; pp. 1157–1168.
7. Roy, C.K.; Cordy, J.R. NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In Proceedings of the 16th IEEE International Conference on Program Comprehension, Amsterdam, The Netherlands, 10–13 June 2008; pp. 172–181.
8. Ducasse, S.; Rieger, M.; Demeyer, S. A language independent approach for detecting duplicated code. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM), Oxford, UK, 30 August–3 September 1999; pp. 109–118.
9. Jadon, S. Code clones detection using machine learning technique: Support vector machine. In Proceedings of the 2016 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 29–30 April 2016; pp. 299–303.
10. Ragkhitwetsagul, C.; Krinke, J. Using compilation/decompilation to enhance clone detection. In Proceedings of the 11th International Workshop on Software Clones (IWSC), Klagenfurt, Austria, 21 February 2017; pp. 1–7.
11. Yu, D.; Wang, J.; Wu, Q.; Yang, J.; Wang, J.; Yang, W.; Yan, W. Detecting Java code clones with multi-granularities based on bytecode. In Proceedings of the 41st Annual Computer Software and Applications Conference (COMPSAC), Turin, Italy, 4–8 July 2017; pp. 317–326.
12. Jiang, L.; Misherghi, G.; Su, Z.; Glondu, S. DECKARD: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering (ICSE), Minneapolis, MN, USA, 20–26 May 2007; pp. 96–105.
13. Mou, L.; Li, G.; Zhang, L.; Wang, T.; Jin, Z. Convolutional neural networks over tree structures for programming language processing. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1–7.
14. Wei, H.; Li, M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 3034–3040.
15. Zhang, J.; Wang, X.; Zhang, H.; Sun, H.; Wang, K.; Liu, X. A novel neural source code representation based on abstract syntax tree. In Proceedings of the 41st International Conference on Software Engineering (ICSE), Montréal, QC, Canada, 25–31 May 2019; pp. 783–794.
16. Bui, N.D.; Yu, Y.; Jiang, L. InferCode: Self-supervised learning of code representations by predicting subtrees. In Proceedings of the 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 25–28 May 2021; pp. 1186–1197.
17. Jo, Y.-B.; Lee, J.; Yoo, C.-J. Two-Pass technique for clone detection and type classification using tree-based convolution neural network. Appl. Sci. 2021, 11, 6613.
18. Liang, H.; Ai, L. AST-path based compare-aggregate network for code clone detection. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
19. Hu, Y.; Zou, D.; Peng, J.; Wu, Y.; Shan, J.; Jin, H. TreeCen: Building tree graph for scalable semantic code clone detection. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Oakland Center, MI, USA, 10–14 October 2022; pp. 1–12.
20. Wu, Y.; Feng, S.; Zou, D.; Jin, H. Detecting semantic code clones by building AST-based Markov chains model. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Oakland Center, MI, USA, 10–14 October 2022; pp. 1–13.
21. Komondoor, R.; Horwitz, S. Using slicing to identify duplication in source code. In Proceedings of the International Static Analysis Symposium, Paris, France, 16–18 July 2001; pp. 40–56.
22. Krinke, J. Identifying similar code with program dependence graphs. In Proceedings of the 8th Working Conference on Reverse Engineering, Stuttgart, Germany, 2–5 October 2001; pp. 301–309.
23. Zhao, G.; Huang, J. DeepSim: Deep learning code functional similarity. In Proceedings of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Lake Buena Vista, FL, USA, 4–9 November 2018; pp. 141–151.
24. Yuan, Y.; Kong, W.; Hou, G.; Hu, Y.; Watanabe, M.; Fukuda, A. From local to global semantic clone detection. In Proceedings of the 6th International Conference on Dependable Systems and Their Applications (DSA), Harbin, China, 3–6 January 2020; pp. 13–24.
25. Wu, Y.; Zou, D.; Dou, S.; Yang, S.; Yang, W.; Cheng, F.; Liang, H.; Jin, H. SCDetector: Software functional clone detection based on semantic tokens analysis. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, New York, NY, USA, 21–25 September 2020; pp. 821–833.
26. Zou, Y.; Ban, B.; Xue, Y.; Xu, Y. CCGraph: A PDG-based code clone detector with approximate graph matching. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), New York, NY, USA, 21–25 September 2020; pp. 931–942.
27. Sheneamer, A.; Kalita, J. Semantic clone detection using machine learning. In Proceedings of the 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, CA, USA, 18–20 December 2016; pp. 1024–1028.
28. Hua, W.; Sui, Y.; Wan, Y.; Liu, G.; Xu, G. FCCA: Hybrid code representation for functional clone detection using attention networks. IEEE Trans. Reliab. 2020, 70, 304–318.
29. Wang, W.; Li, G.; Ma, B.; Xia, X.; Jin, Z. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In Proceedings of the 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), London, ON, Canada, 18–21 February 2020; pp. 261–271.
30. Tufano, M.; Watson, C.; Bavota, G.; Di Penta, M.; White, M.; Poshyvanyk, D. Deep learning similarities from different representations of source code. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR), Gothenburg, Sweden, 27 May–3 June 2018; pp. 542–553.
31. javalang. Available online: https://github.com/c2nes/javalang (accessed on 29 March 2022).
32. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
33. A Java optimization framework (Soot). Available online: https://github.com/Sable/soot (accessed on 2 March 2020).
34. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196.
35. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
36. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
37. Lu, Z.; Li, R.; Hu, H.; Zhou, W. A code clone detection algorithm based on graph convolution network with AST tree edge. In Proceedings of the 2021 IEEE 21st International Conference on Software Quality, Reliability and Security Companion (QRS-C), Hainan, China, 6–10 December 2021; pp. 1027–1032.
38. Gao, X.; Jiang, X.; Wu, Q.; Wang, X.; Lyu, C.; Lyu, L. Multi-modal code summarization fusing local API dependency graph and AST. In Proceedings of the Neural Information Processing: 28th International Conference, ICONIP 2021, Denpasar, Indonesia, 8–12 December 2021; pp. 290–297.
39. Zadeh, L.A. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst. 1978, 1, 3–28.
40. Torra, V. Hesitant fuzzy sets. Int. J. Intell. Syst. 2010, 25, 529–539.
41. Chen, N.; Xu, Z.; Xia, M. Correlation coefficients of hesitant fuzzy sets and their applications to clustering analysis. Appl. Math. Model. 2013, 37, 2197–2211.
42. Google Code Jam. Available online: https://code.google.com/codejam/contests.html (accessed on 8 October 2016).
43. Svajlenko, J.; Islam, J.F.; Keivanloo, I.; Roy, C.K.; Mia, M.M. Towards a big data curated benchmark of inter-project code clones. In Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, 29 September–3 October 2014; pp. 476–480.
44. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
| Statistic | BCB | GCJ |
|---|---|---|
| Code fragments | 9134 | 1669 |
| Clone pairs | 270,000 | 275,570 |
| Non-clone pairs | 270,000 | 275,570 |

| Groups | Approaches | BCB P | BCB R | BCB F1 | GCJ P | GCJ R | GCJ F1 |
|---|---|---|---|---|---|---|---|
| Tree-based | CDLH | 0.92 | 0.74 | 0.82 | 0.47 | 0.73 | 0.57 |
| Tree-based | ASTNN | 0.99 | 0.88 | 0.93 | 0.95 | 0.87 | 0.91 |
| Graph-based | DeepSim | 0.92 | 0.94 | 0.93 | 0.70 | 0.83 | 0.76 |
| Hybrid | FCCA | 0.98 | 0.92 | 0.94 | 0.95 | 0.90 | 0.92 |
| Hybrid | FA-AST | 0.95 | 0.91 | 0.92 | 0.96 | 0.88 | 0.91 |
| | DG-IVHFS | 0.98 | 0.97 | 0.97 | 0.98 | 0.93 | 0.95 |

| Approaches | BCB P | BCB R | BCB F1 | GCJ P | GCJ R | GCJ F1 |
|---|---|---|---|---|---|---|
| Variant 1 | 0.96 | 0.94 | 0.94 | 0.95 | 0.89 | 0.91 |
| DG-IVHFS | 0.98 | 0.97 | 0.97 | 0.98 | 0.93 | 0.95 |

| Approaches | BCB P | BCB R | BCB F1 | GCJ P | GCJ R | GCJ F1 |
|---|---|---|---|---|---|---|
| Variant 2 | 0.71 | 0.79 | 0.74 | 0.69 | 0.75 | 0.71 |
| Variant 3 | 0.89 | 0.82 | 0.85 | 0.86 | 0.83 | 0.84 |
| Variant 4 | 0.92 | 0.87 | 0.89 | 0.90 | 0.88 | 0.88 |
| DG-IVHFS | 0.98 | 0.97 | 0.97 | 0.98 | 0.93 | 0.95 |

| Approaches | BCB P | BCB R | BCB F1 | GCJ P | GCJ R | GCJ F1 |
|---|---|---|---|---|---|---|
| Variant 5 | 0.92 | 0.95 | 0.93 | 0.92 | 0.93 | 0.92 |
| DG-IVHFS | 0.98 | 0.97 | 0.97 | 0.98 | 0.93 | 0.95 |
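The P, R and F1 columns in the result tables are precision, recall and their harmonic mean, $F1 = 2PR/(P + R)$. As a quick sanity check, the Python sketch below recomputes F1 from the tabulated P and R for three rows. The function name and the selection of rows are choices made here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the result tables.
rows = {
    "DG-IVHFS (BCB)": (0.98, 0.97, 0.97),
    "DG-IVHFS (GCJ)": (0.98, 0.93, 0.95),
    "CDLH (BCB)": (0.92, 0.74, 0.82),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {reported}")
```

Because the tabulated P and R are themselves rounded to two decimals, a recomputed F1 can differ from the reported value in the last digit. For example, the FCCA row on BCB recomputes to about 0.949 against a reported 0.94.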
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yang, H.; Li, Z.; Guo, X. A Novel Source Code Clone Detection Method Based on Dual-GCN and IVHFS. Electronics 2023, 12, 1315. https://doi.org/10.3390/electronics12061315