
Efficient plagiarism detection for large code repositories

Published: 01 February 2007

Abstract

Unauthorized re-use of code by students is a widespread problem in academic institutions, and raises liability issues for industry. Manual plagiarism detection is time-consuming, and current effective plagiarism detection approaches cannot be easily scaled to very large code repositories. While there are practical text-based plagiarism detection systems capable of working with large collections, this is not the case for code-based plagiarism detection. In this paper, we propose techniques for detecting plagiarism in program code using text similarity measures and local alignment. Through detailed empirical evaluation on small and large collections of programs, we show that our approach is highly scalable while maintaining similar levels of effectiveness to that of the popular JPlag and MOSS systems. Copyright © 2006 John Wiley & Sons, Ltd.
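To make the local alignment idea concrete, the following is a minimal sketch of a Smith-Waterman style scoring routine applied to two token sequences. It is not the authors' implementation: the abstract does not specify a tokenization or scoring scheme, so the function name `local_alignment_score`, the token lists, and the match, mismatch, and gap values are all assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between two token lists.

    Smith-Waterman style dynamic program; the scoring values are
    illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero, which is what makes the
            # alignment local rather than global.
            curr[j] = max(0,
                          prev[j - 1] + step,  # align a[i-1] with b[j-1]
                          prev[j] + gap,       # skip a token of a
                          curr[j - 1] + gap)   # skip a token of b
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical token streams for two submissions that share a loop header.
original = "int id ; for ( id = 0 ; id < n ; id ++ )".split()
suspect = "float x ; for ( id = 0 ; id < n ; id ++ ) return".split()
shared_score = local_alignment_score(original, suspect)
```

A high score relative to the lengths of the two sequences indicates a long shared region even when the surrounding code differs, which is the property that makes local alignment attractive for ranking candidate pairs.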




Information & Contributors

Information

Published In

Software: Practice and Experience, Volume 37, Issue 2
February 2007
113 pages
ISSN: 0038-0644
EISSN: 1097-024X

Publisher

John Wiley & Sons, Inc.

United States


Author Tags

  1. indexing
  2. local alignment
  3. plagiarism detection
  4. program code similarity

Qualifiers

  • Article


Cited By

  • (2024) VeriBin: A Malware Authorship Verification Approach for APT Tracking through Explainable and Functionality-Debiasing Adversarial Representation Learning. ACM Transactions on Privacy and Security 27:3 (1-37). DOI: 10.1145/3669901. Online publication date: 20-Jul-2024
  • (2023) SCS-Gan: Learning Functionality-Agnostic Stylometric Representations for Source Code Authorship Verification. IEEE Transactions on Software Engineering 49:4 (1426-1442). DOI: 10.1109/TSE.2022.3177228. Online publication date: 1-Apr-2023
  • (2022) RoPGen. Proceedings of the 44th International Conference on Software Engineering (1906-1918). DOI: 10.1145/3510003.3510181. Online publication date: 21-May-2022
  • (2021) Large-scale and Robust Code Authorship Identification with Deep Feature Learning. ACM Transactions on Privacy and Security 24:4 (1-35). DOI: 10.1145/3461666. Online publication date: 19-Jul-2021
  • (2021) ICodeNet - A Hierarchical Neural Network Approach For Source Code Author Identification. Proceedings of the 2021 13th International Conference on Machine Learning and Computing (180-185). DOI: 10.1145/3457682.3457709. Online publication date: 26-Feb-2021
  • (2020) How are Deep Learning Models Similar? Proceedings of the 28th International Conference on Program Comprehension (172-183). DOI: 10.1145/3387904.3389254. Online publication date: 13-Jul-2020
  • (2019) Source-code Similarity Detection and Detection Tools Used in Academia. ACM Transactions on Computing Education 19:3 (1-37). DOI: 10.1145/3313290. Online publication date: 21-May-2019
  • (2019) A Comparison of Three Popular Source code Similarity Tools for Detecting Student Plagiarism. Proceedings of the Twenty-First Australasian Computing Education Conference (112-117). DOI: 10.1145/3286960.3286974. Online publication date: 29-Jan-2019
  • (2019) Siamese. Empirical Software Engineering 24:4 (2236-2284). DOI: 10.1007/s10664-019-09697-7. Online publication date: 1-Aug-2019
  • (2019) Usage and attribution of Stack Overflow code snippets in GitHub projects. Empirical Software Engineering 24:3 (1259-1295). DOI: 10.1007/s10664-018-9650-5. Online publication date: 1-Jun-2019
