Sourcerer: mining and searching internet-scale software repositories

Erik Linstead¹,
Sushil Bajracharya¹,
Trung Ngo¹,
Paul Rigor¹,
Cristina Lopes¹ &
…
Pierre Baldi¹

1029 Accesses
3 Altmetric
Explore all metrics

Abstract

Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software on an Internet-scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering an software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92– roughly 10–30% better than previous approaches based on text alone. A prototype of the system is available at: http://sourcerer.ics.uci.edu.

Article PDF

What do developers search for on the web?

Article 09 April 2017

Semantic topic models for source code analysis

Article 22 November 2016

CCFinderX: An Interactive Code Clone Analysis Environment

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Sun Java Coding Standards. http://java.sun.com/docs/codeconv/
Koders web site. http://www.koders.com
Krugle web site. http://www.krugle.com
Codase web site. http://www.Codase.com
Csourcesearch web site. http://csourcesearch.net/
Google CodeSearch web site. http://www.google.com/codesearch
SparsJ Search System. http://demo.spars.info/
Andrzejewski D, Mulhern A, Liblit B, Zhu X (2007) Statistical debugging using latent topic models. In: Matwin S, Mladenic D (eds) 18th European conference on machine learning. Warsaw, Poland (to appear)
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE ’06: proceeding of the 28th international conference on Software engineering. ACM, New York, pp 361–370
Baeza-Yates RA, Ribeiro-Neto BA (1999) Modern information retrieval. ACM Press/Addison-Wesley
Baldi P, Frasconi P, Smyth P (2003) Modeling the internet and the web: probabilistic methods and algorithms. Wiley
Baldi P, Lopes C, Linstead E, Bajracharya S (2008) A theory of aspects as latent topics. In: OOPSLA ’08: proceedings of the 23rd annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications, Nashville, TN. ACM, New York, NY (to appear)
Blei D, Lafferty J (2006) Correlated topic models. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems, vol 18. MIT Press, Cambridge, MA, pp 147–154
Google Scholar
Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. J Mach Learn Res 3: 993–1022
Article MATH Google Scholar
Brill E (1994) Some advances in transformation-based part of speech tagging. In: National conference on artificial intelligence. pp 722–727
Buntine W (2005) Open source search: a data mining platform. SIGIR Forum 39(1): 4–10
Article Google Scholar
Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P (2005) ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21: 4133–4139
Article Google Scholar
Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10): 687–708
Article Google Scholar
Cox A, Clarke C, Sim S (1999) A model independent source code repository. In: CASCON ’99: proceedings of the 1999 conference of the centre for advanced studies on collaborative research. IBM Press, p 1
Deerwester S, Dumais S, Landauer T, Furnas G, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6): 391–407
Article Google Scholar
Finnigan P, Holt R, Kalas I, Kerr S, Kontogiannis K, Mueller H, Mylopoulos J, Perelgut S, Stanley M, Wong K (1997) The software bookshelf. IBM Sys J 36(4): 564–593
Article Google Scholar
Frean GBM, Noble J, Rickerby M, Smith H, Visser M, Melton H, Tempero E (2006) Understanding the shape of java software. In: OOPSLA ’06. ACM Press, New York, pp 397–503
Gil JY, Maman I (2005) Micro patterns in java code. In: OOPSLA ’05: proceedings of the 20th annual ACM SIGPLAN conference on object oriented programming systems languages and applications. ACM Press, New York, pp 97–116
Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci U S A 101(Suppl 1): 5228–5235
Article Google Scholar
Hajiyev E, Verbaere M, de Moor O (2006) Codequest: scalable source code queries with datalog. In: Thomas D (eds) ECOOP’06: proceedings of the 20th European conference on object-oriented programming, vol 4067 of lecture notes in computer science. Springer, Berlin, pp 2–27
Google Scholar
Hill R, Rideout J (2004) Automatic method completion. In: ASE. IEEE Computer Society, pp 228–235
Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: ICSE ’05: proceedings of the 27th international conference on software engineering. ACM Press, New York, pp 117–125
Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12): 952–970
Article Google Scholar
Inoue K, Yokomori R, Fujiwara H, Yamamoto T, Matsushita M, Kusumoto S (2003) Component rank: relative significance rank for software component search. In: ICSE ’03: proceedings of the 25th international conference on software engineering. IEEE Computer Society, Washington, DC, pp 14–24
Inoue K, Yokomori R, Yamamoto T, Kusumoto S (2005) Ranking significance of software components based on use relations. IEEE Trans Softw Eng 31(3): 213–225
Article Google Scholar
Kawaguchi S, Garg PK, Matsushita M, Inoue K (2004) Mudablue: an automatic categorization system for open source repositories. In: APSEC ’04: proceedings of the 11th Asia-Pacific software engineering conference (APSEC’04). IEEE Computer Society, Washington, DC, pp 184–193
Kiczales G, Lamping J, Menhdhekar A, Maeda C, Lopes C, Loingtier J, Irwin J (1997) Aspect-oriented programming. In: Akşit M, Matsuoka S (eds) Proceedings European conference on object-oriented programming, vol 1241. Springer, Berlin, pp 220–242
Google Scholar
Knuth DE (1971) An empirical study of fortran programs. Softw Pract Exp 1(2):105–133
Article MATH Google Scholar
Kuhn A, Ducasse S, Gírba T (2007) Semantic clustering: identifying topics in source code. Info Softw Technol 49(3): 230–243
Article Google Scholar
Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2007) Mining eclipse developer contributions via author-topic models. MSR 2007: proceedings of the fourth international workshop on mining software repositories. pp 30–33
Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2008) Mining internet-scale software repositories. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems, vol 20. MIT Press, Cambridge, MA, pp 929–936
Google Scholar
Liu C, Chen C, Han J, Yu PS (2006) Gplag: detection of software plagiarism by program dependence graph analysis. In: KDD ’06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 872–881
Mandelin D, Xu L, Bodík R, Kimelman D (2005) Jungloid mining: helping to navigate the api jungle. In: PLDI ’05: proceedings of the 2005 ACM SIGPLAN conference on programming language design and implementation. ACM Press, New York, pp 48–61
Marcus A, Sergeyev A, Rajlich V, Maletic J (2004) An information retrieval approach to concept location in source code. In: Proceedings of the 11th working conference on reverse engineering (WCRE 2004). The Netherlands, pp 214–223
McCormick E, Volder KD (2004) Jquery: finding your way through tangled code. In: OOPSLA ’04: companion to the 19th annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications. ACM. New York, pp 9–10
Minto S, Murphy GC (2007) Recommending emergent teams. In: MSR ’07: proceedings of the fourth international workshop on mining software repositories. IEEE Computer Society. Washington, DC, p 5
Mitzenmacher, M. (2003) A brief history of generative models for power law and lognormal distributions. Internet Math 1(2)
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
Article Google Scholar
Newman D, Block S (2006) Probabilistic topic decomposition of an eighteenth-century american newspaper. J Am Soc Inf Sci Technol 57(6): 753–767
Article Google Scholar
Newman D, Chemudugunta C, Smyth P, Steyvers M (2006) Analyzing entities and topics in news articles using statistical topic models. In: ISI. pp 93–104
Oman P, Hagemeister J (1992) Metrics for assessing a software system’s maintainability. In: Proceedings of the international conference on software maintenance 1992. IEEE Computer Society Press, pp 337–344
Page L, Brin S, Motwani R, Winograd T (1998) The pagerank citation ranking: bringing order to the web. Stanford Digital Library working paper SIDL-WP-1999-0120 of 11/11/1999 (see: http://dbpubs.stanford.edu/pub/1999-66)
Paul S (1992) Scruple: a reengineer’s tool for source code search. In: CASCON ’92: proceedings of the 1992 conference of the centre for advanced studies on collaborative research. IBM Press, pp 329–346
Paul S, Prakash A (1994) A framework for source code search using program patterns. IEEE Trans Softw Eng 20(6): 463–475
Article MATH Google Scholar
Poshyvanyk D, Marcus A, Dong Y (2006) Jiriss—an eclipse plug-in for source code exploration. ICPC 0: 252–255
Google Scholar
Puppin D, Silvestri F (2006) The social network of java classes. In: Haddad H (ed) SAC. ACM, New York, pp 1409–1413
Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: UAI ’04: proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press, Arlington, pp 487–494
Sahavechaphan N, Claypool K (2006) Xsnippet: mining for sample code. SIGPLAN Not 41(10): 413–430
Article Google Scholar
Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: SIGMOD ’03: proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, New York, pp 76–85
Schröter A, Zimmermann T, Premraj R, Zeller A (2006) If your bug database could talk .... In: Proceedings of the 5th international symposium on empirical software engineering, vol II: short papers and posters. Rio de Janeiro, pp 18–20
Sindhgatta R (2006) Using an information retrieval system to retrieve source code samples. In: Osterweil LJ, Rombach HD, Soffa ML, (eds) ICSE. ACM, pp 905–908
Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T (2004) Probabilistic author-topic models for information discovery. In: KDD ’04: proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 306–315
Swamidass S, Baldi P (2007) Bounds and algorithms for exact searches of chemical fingerprints in linear and sub-linear time. J Chem Inf Model 47(2): 302–317
Article Google Scholar
Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Statistical Assoc 101(476): 1566–1581
Article MATH MathSciNet Google Scholar
Ugurel S, Krovetz R, Giles CL (2002) What’s the code? automatic classification of source code archives. In: KDD ’02: proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 632–638
Ukkonen E (1992) Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 92(1): 191–211
Article MATH MathSciNet Google Scholar
Welker KD, Oman PW (1995) Software maintainability metrics models in practice. Crosstalk, J Def Softw Eng 8: 19–23
Google Scholar
Wheeldon R, Counsell S (2003) Power law distributions in class relationships. In: International workshop on source code analysis and manipulation, pp 45–54
Ye Y, Fischer G (2002) Supporting reuse by delivering task-relevant and personalized information. In: ICSE ’02: proceedings of the 24th international conference on software engineering. ACM, New York, pp 513–523
Zipf GK (1932) Selective studies and the principle of relative frequency in language. Harvard University Press

Download references

Author information

Authors and Affiliations

Donald Bren School of Information and Computer Sciences, University of California, Irvine, USA
Erik Linstead, Sushil Bajracharya, Trung Ngo, Paul Rigor, Cristina Lopes & Pierre Baldi

Authors

Erik Linstead
View author publications
You can also search for this author in PubMed Google Scholar
Sushil Bajracharya
View author publications
You can also search for this author in PubMed Google Scholar
Trung Ngo
View author publications
You can also search for this author in PubMed Google Scholar
Paul Rigor
View author publications
You can also search for this author in PubMed Google Scholar
Cristina Lopes
View author publications
You can also search for this author in PubMed Google Scholar
Pierre Baldi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Pierre Baldi.

Additional information

Responsible editor: Eamonn Keogh.

Erik Linstead, Sushil Bajracharya, and Trung Ngo have contributed equally to this work.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Linstead, E., Bajracharya, S., Ngo, T. et al. Sourcerer: mining and searching internet-scale software repositories. Data Min Knowl Disc 18, 300–336 (2009). https://doi.org/10.1007/s10618-008-0118-x

Download citation

Received: 16 September 2007
Accepted: 18 September 2008
Published: 15 October 2008
Issue Date: April 2009
DOI: https://doi.org/10.1007/s10618-008-0118-x

Sourcerer: mining and searching internet-scale software repositories

Abstract

Article PDF

Similar content being viewed by others

What do developers search for on the web?

Semantic topic models for source code analysis

CCFinderX: An Interactive Code Clone Analysis Environment

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Sourcerer: mining and searching internet-scale software repositories

Abstract

Article PDF

Similar content being viewed by others

What do developers search for on the web?

Semantic topic models for source code analysis

CCFinderX: An Interactive Code Clone Analysis Environment

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords