research-article

A study of the uniqueness of source code

Authors:

Zhendong Su

FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering

Pages 147 - 156

https://doi.org/10.1145/1882291.1882315

Published: 07 November 2010

Abstract

This paper presents the results of the first study of the uniqueness of source code. We define the uniqueness of a unit of source code with respect to the entire body of written software, which we approximate with a corpus of 420 million lines of source code. Our high-level methodology consists of examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' that we call syntactic redundancy. We parameterized our study over a variety of variables, the most important of which is the level of granularity at which we view source code. Together, our experiments consumed approximately four months of CPU time and provide quantitative answers to the following questions: at what levels of granularity is software unique, and at a given level of granularity, how unique is software? While we believe these questions to be of intrinsic interest, we also discuss possible applications to genetic programming and developer productivity tools.
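
To make the idea of syntactic redundancy concrete, the following is a minimal sketch, not the authors' implementation. It assumes a naive whitespace tokenizer, exact matching of token windows, and a granularity parameter `g` measured in tokens; the function names (`tokenize`, `build_index`, `syntactic_redundancy`) are hypothetical. The abstract does not describe the tokenization, matching, or indexing used in the study itself, so treat this only as an illustration of "how much of a project can be assembled from the corpus" at a given granularity.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Window = Tuple[str, ...]


def tokenize(source: str) -> List[str]:
    """Naive whitespace tokenizer (an assumption; a real study would use a lexer)."""
    return source.split()


def windows(tokens: List[str], g: int) -> Iterator[Window]:
    """Yield every contiguous window of g tokens."""
    for i in range(len(tokens) - g + 1):
        yield tuple(tokens[i:i + g])


def build_index(corpus_sources: Iterable[str], g: int) -> Set[Window]:
    """Collect every g-token window that occurs anywhere in the corpus."""
    index: Set[Window] = set()
    for src in corpus_sources:
        index.update(windows(tokenize(src), g))
    return index


def syntactic_redundancy(project_source: str, index: Set[Window], g: int) -> float:
    """Fraction of project tokens covered by at least one g-token window
    that also occurs in the corpus index."""
    tokens = tokenize(project_source)
    if not tokens:
        return 0.0
    covered = [False] * len(tokens)
    for i, window in enumerate(windows(tokens, g)):
        if window in index:
            for j in range(i, i + g):
                covered[j] = True
    return sum(covered) / len(tokens)


if __name__ == "__main__":
    # Toy corpus and project, invented purely for illustration.
    corpus = ["for i in range ( n ) : total += i", "return total"]
    project = "for i in range ( n ) : count += 1"
    for g in (4, 9):
        score = syntactic_redundancy(project, build_index(corpus, g), g)
        print(g, round(score, 3))
```

In this toy setting the project shares its loop header with the corpus, so at a small granularity most of its tokens are covered, while at a larger granularity no window matches and the score drops to zero. That is the kind of granularity dependence the abstract's two questions ask about.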

Published In

FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering

November 2010

302 pages

ISBN: 9781605587912

DOI: 10.1145/1882291

General Chair: Gruia-Catalin Roman, Washington University in St. Louis, USA

Program Chair: André van der Hoek, University of California, Irvine, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGSOFT: ACM Special Interest Group on Software Engineering

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGSOFT/FSE'10: 18th ACM SIGSOFT Symposium on the Foundations of Software Engineering

Sponsor: SIGSOFT

November 7 - 11, 2010

Santa Fe, New Mexico, USA

Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%

Bibliometrics

Total Citations: 155

Total Downloads: 1,016

Downloads (Last 12 months): 63

Downloads (Last 6 weeks): 3

Reflects downloads up to 05 Mar 2025
