DOI: 10.1145/2517208.2517226
research-article

Declarative visitors to ease fine-grained source code mining with full history on billions of AST nodes

Published: 27 October 2013

Abstract

Software repositories contain a vast wealth of information about software development. Mining these repositories has proven useful for detecting patterns in software development and for testing hypotheses about new software engineering approaches. Specifically, mining source code has yielded significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Previous approaches had to either limit the scope of the projects studied, limit the scope of the mining task to be more coarse-grained, or sacrifice studying the history of the code due to both human and computational scalability issues. In this paper, we address the substantial challenges of mining source code: a) at a very large scale; b) at a fine-grained level of detail; and c) with full history information.
To address these challenges, we present domain-specific language features for source code mining. Our language features are inspired by object-oriented visitors and provide a default depth-first traversal strategy along with two expressions for defining custom traversals. We provide an implementation of these features in the Boa infrastructure for software repository mining and describe a code generation strategy into Java code. To show the usability of our domain-specific language features, we reproduced over 40 source code mining tasks from two large-scale previous studies in just 2 person-weeks. The resulting code for these tasks shows a 2.0x to 4.8x reduction in code size. Finally, we perform a small controlled experiment to gain insight into how easily mining tasks written using our language features can be understood with no prior training. We show that a substantial share of tasks (77%) were understood by study participants, in about 3 minutes per task.
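
To make the idea of a visitor with a default depth-first traversal and an optional custom traversal more concrete, the following is a minimal Java sketch. It is not Boa syntax and not the code the paper generates: the `Node` and `Visitor` classes, the `before`/`after` hooks, and the `countMethods` task are all hypothetical names chosen for illustration. The sketch assumes a toy AST in which each node has only a kind label and children.

```java
import java.util.ArrayList;
import java.util.List;

public class VisitorSketch {
    /** Hypothetical AST node: a kind label plus child nodes. */
    static final class Node {
        final String kind;
        final List<Node> children = new ArrayList<>();

        Node(String kind, Node... kids) {
            this.kind = kind;
            for (Node k : kids) children.add(k);
        }
    }

    /** Hypothetical visitor base class with a default depth-first traversal. */
    static class Visitor {
        /** Runs before a node's children; returning false skips the default descent. */
        boolean before(Node n) { return true; }

        /** Runs after a node's children have been visited (or skipped). */
        void after(Node n) { }

        /** Default strategy: visit the node, then each child, depth first. */
        final void visit(Node n) {
            if (before(n)) {
                for (Node c : n.children) visit(c);
            }
            after(n);
        }
    }

    /**
     * Example mining task: count METHOD nodes. With skipMethodBodies set,
     * the traversal is customized so that it does not descend into methods,
     * which leaves out methods of classes declared inside a method body.
     */
    static int countMethods(Node root, boolean skipMethodBodies) {
        final int[] count = {0};
        new Visitor() {
            @Override
            boolean before(Node n) {
                if (n.kind.equals("METHOD")) {
                    count[0]++;
                    return !skipMethodBodies; // custom traversal: stop here if asked
                }
                return true; // otherwise keep the default descent
            }
        }.visit(root);
        return count[0];
    }

    /** Toy tree: a class with two methods, a field, and a local class with one method. */
    static Node sampleTree() {
        Node localClass = new Node("CLASS", new Node("METHOD"));
        Node outerMethod = new Node("METHOD", new Node("STATEMENT"), localClass);
        return new Node("CLASS", outerMethod, new Node("METHOD"), new Node("FIELD"));
    }

    public static void main(String[] args) {
        Node root = sampleTree();
        System.out.println(countMethods(root, false)); // prints 3
        System.out.println(countMethods(root, true));  // prints 2
    }
}
```

The task code only states what to do at nodes of interest; the traversal itself comes from the base class unless a hook opts out. That separation is the kind of code-size saving the abstract refers to, although the sketch makes no claim about how Boa achieves it.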


    Published In

    GPCE '13: Proceedings of the 12th international conference on Generative programming: concepts & experiences
    October 2013
    198 pages
    ISBN:9781450323734
    DOI:10.1145/2517208

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. boa
    2. source code mining
    3. visitor pattern

    Qualifiers

    • Research-article

    Conference

    GPCE'13
    GPCE'13: Generative Programming: Concepts and Experiences
    October 27 - 28, 2013
    Indianapolis, Indiana, USA

    Acceptance Rates

    GPCE '13 Paper Acceptance Rate 20 of 59 submissions, 34%;
    Overall Acceptance Rate 56 of 180 submissions, 31%

    Cited By

    • (2023) A Pub/Sub-Based Mechanism for Inter-Component Exception Notification in Android Applications. 2023 IEEE 20th International Conference on Software Architecture (ICSA), pages 105-116. DOI: 10.1109/ICSA56044.2023.00018. Online publication date: Mar-2023
    • (2021) World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data. Empirical Software Engineering, 26:2. DOI: 10.1007/s10664-020-09905-9. Online publication date: 25-Feb-2021
    • (2019) Casting about in the dark: an empirical study of cast operations in Java programs. Proceedings of the ACM on Programming Languages, 3:OOPSLA, pages 1-31. DOI: 10.1145/3360584. Online publication date: 10-Oct-2019
    • (2019) An Empirical Study on Inter-Component Exception Notification in Android Platform. Proceedings of the XXXIII Brazilian Symposium on Software Engineering, pages 73-83. DOI: 10.1145/3350768.3350784. Online publication date: 23-Sep-2019
    • (2019) World of code. Proceedings of the 16th International Conference on Mining Software Repositories, pages 143-154. DOI: 10.1109/MSR.2019.00031. Online publication date: 26-May-2019
    • (2019) Redundancy-free analysis of multi-revision software artifacts. Empirical Software Engineering, 24:1, pages 332-380. DOI: 10.1007/s10664-018-9630-9. Online publication date: 1-Feb-2019
    • (2018) Large-scale study of substitutability in the presence of effects. Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 528-538. DOI: 10.1145/3236024.3236075. Online publication date: 26-Oct-2018
    • (2018) Collective program analysis. Proceedings of the 40th International Conference on Software Engineering, pages 620-631. DOI: 10.1145/3180155.3180252. Online publication date: 27-May-2018
    • (2017) Toward full elasticity in distributed static analysis: the case of callgraph analysis. Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pages 442-453. DOI: 10.1145/3106237.3106261. Online publication date: 21-Aug-2017
    • (2017) Reducing redundancies in multi-revision code analysis. 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 148-159. DOI: 10.1109/SANER.2017.7884617. Online publication date: Feb-2017
