Abstract
Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for identifying source code clone detection. The goal is to overview the current state-of-the-art techniques, their strengths, and weaknesses. To do that, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Ul Ain, Q., Butt, W.H., Anwar, M.W., Azam, F., Maqbool, B.: A systematic review on code clone detection. IEEE Access 7, 86121–86144 (2019)
Alon, U., Zilberstein, M., Levy, O., Yahav, E.: code2vec: learning distributed representations of code. In: Proceedings of the ACM on Programming Languages, vol. 3(POPL), pp. 1–29 (2019)
Aniceto, R.C., Holanda, M., Castanho, C., Da Silva, D.: Source code plagiarism detection in an educational context: a literature mapping. In: 2021 IEEE Frontiers in Education Conference (FIE), pp. 1–9. IEEE (2021)
Baxter, I.D., et al.: Clone detection using abstract syntax trees. In: 1998 International Conference on Software Maintenance, ICSM 1998, Bethesda, Maryland, USA, November 16–19, 1998, pp. 368–377. IEEE Computer Society (1998)
Bellon, S., Koschke, R., Antoniol, G., Krinke, J., Merlo, E.: Comparison and evaluation of clone detection tools. IEEE Trans. Softw. Eng. 33(9), 577–591 (2007)
Bergroth, L., Hakonen, H., Raita, T.: A survey of longest common subsequence algorithms. In: Proceedings Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000, pp. 39–48. IEEE (2000)
Corley, C.D., Mihalcea, R.: Measuring the semantic similarity of texts. In: Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, pp. 13–18 (2005)
Damashek, M.: Gauging similarity with n-grams: language-independent categorization of text. Science 267(5199), 843–848 (1995)
Dang, Y., Ge, S., Huang, R., Zhang, D.: Code clone detection experience at microsoft. In: Proceedings of the 5th International Workshop on Software Clones, pp. 63–64 (2011)
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Burstein, J., Doran, C., Solorio, T., (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019)
Dou, S., et al.: Towards understanding the capability of large language models on code clone detection: a survey. arXiv preprint arXiv:2308.01191 (2023)
Ferrante, J., Ottenstein, K.J., Warren, J.D.: The program dependence graph and its use in optimization. ACM Trans. Programm. Lang. Syst. (TOPLAS) 9(3), 319–349 (1987)
Gabel, M., Jiang, L., Su, Z.: Scalable detection of semantic clones. In: Proceedings of the 30th International Conference on Software Engineering, pp. 321–330 (2008)
Haque, S., Eberhart, Z., Bansal, A., McMillan, C.: Semantic similarity metrics for evaluating source code summarization. In: Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension, pp. 36–47 (2022)
Hartanto, A.D., Syaputra, A., Pristyanto, Y.: Best parameter selection of Rabin-Karp algorithm in detecting document similarity. In: 2019 International Conference on Information and Communications Technology (ICOIACT), pp. 457–461. IEEE (2019)
Higo, Y., Ueda, Y., Kamiya, T., Kusumoto, S., Inoue, K.: On software maintenance process improvement based on code clone analysis. In: Oivo, M., Komi-Sirviö, S. (eds.) PROFES 2002. LNCS, vol. 2559, pp. 185–197. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36209-6_17
Horwitz, S.: Identifying the semantic and textual differences between two versions of a program. In: Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, pp. 234–245 (1990)
Juergens, E., Deissenboeck, F., Hummel, B., Wagner, S.: Do code clones matter? In: 2009 IEEE 31st International Conference on Software Engineering, pp. 485–495. IEEE (2009)
Karnalim, O.: TF-IDF inspired detection for cross-language source code plagiarism and collusion. Comput. Sci. 21, 1–24 (2020)
Karnalim, O.: Explanation in code similarity investigation. IEEE Access 9, 59935–59948 (2021)
Karnalim, O., Budi, S., Toba, H., Joy, M.: Source code plagiarism detection in academia with information retrieval: dataset and the observation. Inform. Educ. 18(2), 321–344 (2019)
Karnalim, O., Simon: Syntax trees and information retrieval to improve code similarity detection. In: Proceedings of the Twenty-Second Australasian Computing Education Conference, pp. 48–55 (2020)
Krinke, J.: Identifying similar code with program dependence graphs. In: Proceedings Eighth Working Conference on Reverse Engineering, pp. 301–309. IEEE (2001)
Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet physics doklady, vol. 10, pp. 707–710 (1966)
Martinez-Gil, J.: Semantic similarity aggregators for very short textual expressions: a case study on landmarks and points of interest. J. Intell. Inf. Syst. 53(2), 361–380 (2019)
Martinez-Gil, J.: A comprehensive review of stacking methods for semantic similarity measurement. Mach. Learn. App. 10, 100423 (2022)
Martinez-Gil, J., Chaves-Gonzalez, J.M.: Semantic similarity controllers: on the trade-off between accuracy and interpretability. Knowl. Based Syst. 234, 107609 (2021)
Martinez-Gil, J., Chaves-Gonzalez, J.M.: A novel method based on symbolic regression for interpretable semantic similarity measurement. Expert Syst. Appl. 160, 113663 (2020)
Novak, M., Joy, M., Kermek, D.: Source-code similarity detection and detection tools used in academia: a systematic review. ACM Trans. Comput. Educ. (TOCE) 19(3), 1–37 (2019)
Nuñez-Varela, A.S., Pérez-Gonzalez, H.G., Martínez-Perez, F.E., Soubervielle-Montalvo, C.: Source code metrics: a systematic mapping study. J. Syst. Softw. 128, 164–197 (2017)
Peters, M.E., et al.: Deep contextualized word representations. In: Walker, M.A. Ji, H., Stent, A., (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1–6, 2018, Volume 1 (Long Papers), pp. 2227–2237. Association for Computational Linguistics (2018)
Ragkhitwetsagul, C., Krinke, J., Marnette, B.: A picture is worth a thousand words: code clone detection based on image similarity. In: 12th IEEE International Workshop on Software Clones, IWSC 2018, Campobasso, Italy, March 20, 2018, pp. 44–50. IEEE Computer Society (2018)
Roy, C.K., Cordy, J.R., Koschke, R.: Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci. Comput. Programm. 74(7), 470–495 (2009)
Roy, C.K., Cordy, J.R.: A survey on software clone detection research. Queen’s School Comput. TR. 541(115), 64–68 (2007)
Saini, N., Singh, S., et al.: Code clones: detection and management. Proc. Comput. Sci. 132, 718–727 (2018)
Schleimer, S., Wilkerson, D.S., Aiken, A.: Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 76–85 (2003)
Singla, N., Garg, D.: String matching algorithms and their applicability in various applications. Int. J. Soft Comput. Eng. 1(6), 218–222 (2012)
Wise, M.J.: String similarity via greedy string tiling and running Karp-Rabin matching. Online Preprint 119(1), 1–17 (1993)
Ming, X.: A similarity metric method of obfuscated malware using function-call graph. J. Comput. Virol. Hacking Techn. 9, 35–47 (2013)
Zager, L.A., Verghese, G.C.: Graph similarity scoring and matching. Appl. Math. Lett. 21(1), 86–94 (2008)
Acknowledgments
The author thanks all the anonymous reviewers for their help in improving the manuscript. The research reported in this paper has been funded by the Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK), the Federal Ministry for Labour and Economy (BMAW), and the State of Upper Austria in the frame of the SCCH competence center INTEGRATE [(FFG grant no. 892418)] in the COMET - Competence Centers for Excellent Technologies Programme managed by Austrian Research Promotion Agency FFG.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Martinez-Gil, J. (2024). Source Code Clone Detection Using Unsupervised Similarity Measures. In: Bludau, P., Ramler, R., Winkler, D., Bergsmann, J. (eds) Software Quality as a Foundation for Security. SWQD 2024. Lecture Notes in Business Information Processing, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-031-56281-5_2
Download citation
DOI: https://doi.org/10.1007/978-3-031-56281-5_2
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56280-8
Online ISBN: 978-3-031-56281-5
eBook Packages: Computer ScienceComputer Science (R0)