DOI: 10.1145/1882291.1882315

A study of the uniqueness of source code

Published: 07 November 2010

Abstract

This paper presents the results of the first study of the uniqueness of source code. We define the uniqueness of a unit of source code with respect to the entire body of written software, which we approximate with a corpus of 420 million lines of source code. Our high-level methodology consists of examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' that we call syntactic redundancy. We parameterized our study over a variety of variables, the most important of which is the level of granularity at which we view source code. Our suite of experiments together consumed approximately four months of CPU time and provides quantitative answers to the following questions: at what levels of granularity is software unique, and, at a given level of granularity, how unique is software? While we believe these questions to be of intrinsic interest, we also discuss possible applications to genetic programming and developer productivity tools.
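
To make the idea of 'assembling' a project from a corpus concrete, the following is a minimal sketch rather than the paper's actual procedure. It assumes that syntactic redundancy at granularity g can be read as the fraction of a project's tokens that fall inside at least one window of g consecutive tokens that also occurs somewhere in the corpus. The function names, the whitespace tokenization, and the exact-match criterion are all illustrative assumptions; the abstract does not specify them.

```python
from typing import Iterable, Sequence, Set, Tuple


def token_windows(tokens: Sequence[str], g: int) -> Set[Tuple[str, ...]]:
    """Return every window of g consecutive tokens (hypothetical helper)."""
    return {tuple(tokens[i:i + g]) for i in range(len(tokens) - g + 1)}


def syntactic_redundancy(
    project: Sequence[str],
    corpus: Iterable[Sequence[str]],
    g: int,
) -> float:
    """Fraction of project tokens covered by some g-token window found in the corpus.

    Assumed reading of the abstract's measure, not the authors' definition.
    """
    if g <= 0:
        raise ValueError("granularity g must be positive")
    if not project:
        return 0.0

    # Index every g-token window that appears in any corpus file.
    corpus_windows: Set[Tuple[str, ...]] = set()
    for file_tokens in corpus:
        corpus_windows |= token_windows(file_tokens, g)

    # Mark each project token that lies inside at least one matching window.
    covered = [False] * len(project)
    for i in range(len(project) - g + 1):
        if tuple(project[i:i + g]) in corpus_windows:
            for j in range(i, i + g):
                covered[j] = True

    return sum(covered) / len(project)


if __name__ == "__main__":
    # Whitespace splitting stands in for a real lexer (assumption).
    project = "for ( i = 0 ; i < n ; i ++ ) total += xs [ i ] ;".split()
    corpus = ["for ( i = 0 ; i < n ; i ++ ) count ++ ;".split()]
    for g in (1, 4, 14):
        print(g, syntactic_redundancy(project, corpus, g))
```

Under this reading, a small g makes almost any code look redundant because individual tokens recur everywhere, while a large g makes almost any code look unique. This is why granularity is the central parameter of the study.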


Published In

FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
November 2010
302 pages
ISBN: 9781605587912
DOI: 10.1145/1882291

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. large scale study
  2. software uniqueness
  3. source code

Qualifiers

  • Research-article

Conference

SIGSOFT/FSE'10
Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%
