Abstract
Large repositories of source code available over the Internet, or within large organizations, create new challenges and opportunities for data mining and statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, fingerprinting, and database storage of open source software on an Internet-scale. In one experiment, we gather 4,632 Java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, method call, and lexical containment distributions. We then develop and apply unsupervised, probabilistic, topic and author-topic (AT) models to automatically discover the topics embedded in the code and extract topic-word, document-topic, and AT distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing source file similarity, developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering an software development staffing. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92– roughly 10–30% better than previous approaches based on text alone. A prototype of the system is available at: http://sourcerer.ics.uci.edu.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Sun Java Coding Standards. http://java.sun.com/docs/codeconv/
Koders web site. http://www.koders.com
Krugle web site. http://www.krugle.com
Codase web site. http://www.Codase.com
Csourcesearch web site. http://csourcesearch.net/
Google CodeSearch web site. http://www.google.com/codesearch
SparsJ Search System. http://demo.spars.info/
Andrzejewski D, Mulhern A, Liblit B, Zhu X (2007) Statistical debugging using latent topic models. In: Matwin S, Mladenic D (eds) 18th European conference on machine learning. Warsaw, Poland (to appear)
Anvik J, Hiew L, Murphy GC (2006) Who should fix this bug? In: ICSE ’06: proceeding of the 28th international conference on Software engineering. ACM, New York, pp 361–370
Baeza-Yates RA, Ribeiro-Neto BA (1999) Modern information retrieval. ACM Press/Addison-Wesley
Baldi P, Frasconi P, Smyth P (2003) Modeling the internet and the web: probabilistic methods and algorithms. Wiley
Baldi P, Lopes C, Linstead E, Bajracharya S (2008) A theory of aspects as latent topics. In: OOPSLA ’08: proceedings of the 23rd annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications, Nashville, TN. ACM, New York, NY (to appear)
Blei D, Lafferty J (2006) Correlated topic models. In: Weiss Y, Schölkopf B, Platt J (eds) Advances in neural information processing systems, vol 18. MIT Press, Cambridge, MA, pp 147–154
Blei D, Ng A, Jordan M (2003) Latent dirichlet allocation. J Mach Learn Res 3: 993–1022
Brill E (1994) Some advances in transformation-based part of speech tagging. In: National conference on artificial intelligence. pp 722–727
Buntine W (2005) Open source search: a data mining platform. SIGIR Forum 39(1): 4–10
Chen J, Swamidass SJ, Dou Y, Bruand J, Baldi P (2005) ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21: 4133–4139
Concas G, Marchesi M, Pinna S, Serra N (2007) Power-laws in a large object-oriented software system. IEEE Trans Softw Eng 33(10): 687–708
Cox A, Clarke C, Sim S (1999) A model independent source code repository. In: CASCON ’99: proceedings of the 1999 conference of the centre for advanced studies on collaborative research. IBM Press, p 1
Deerwester S, Dumais S, Landauer T, Furnas G, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6): 391–407
Finnigan P, Holt R, Kalas I, Kerr S, Kontogiannis K, Mueller H, Mylopoulos J, Perelgut S, Stanley M, Wong K (1997) The software bookshelf. IBM Sys J 36(4): 564–593
Frean GBM, Noble J, Rickerby M, Smith H, Visser M, Melton H, Tempero E (2006) Understanding the shape of java software. In: OOPSLA ’06. ACM Press, New York, pp 397–503
Gil JY, Maman I (2005) Micro patterns in java code. In: OOPSLA ’05: proceedings of the 20th annual ACM SIGPLAN conference on object oriented programming systems languages and applications. ACM Press, New York, pp 97–116
Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci U S A 101(Suppl 1): 5228–5235
Hajiyev E, Verbaere M, de Moor O (2006) Codequest: scalable source code queries with datalog. In: Thomas D (eds) ECOOP’06: proceedings of the 20th European conference on object-oriented programming, vol 4067 of lecture notes in computer science. Springer, Berlin, pp 2–27
Hill R, Rideout J (2004) Automatic method completion. In: ASE. IEEE Computer Society, pp 228–235
Holmes R, Murphy GC (2005) Using structural context to recommend source code examples. In: ICSE ’05: proceedings of the 27th international conference on software engineering. ACM Press, New York, pp 117–125
Holmes R, Walker RJ, Murphy GC (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE Trans Softw Eng 32(12): 952–970
Inoue K, Yokomori R, Fujiwara H, Yamamoto T, Matsushita M, Kusumoto S (2003) Component rank: relative significance rank for software component search. In: ICSE ’03: proceedings of the 25th international conference on software engineering. IEEE Computer Society, Washington, DC, pp 14–24
Inoue K, Yokomori R, Yamamoto T, Kusumoto S (2005) Ranking significance of software components based on use relations. IEEE Trans Softw Eng 31(3): 213–225
Kawaguchi S, Garg PK, Matsushita M, Inoue K (2004) Mudablue: an automatic categorization system for open source repositories. In: APSEC ’04: proceedings of the 11th Asia-Pacific software engineering conference (APSEC’04). IEEE Computer Society, Washington, DC, pp 184–193
Kiczales G, Lamping J, Menhdhekar A, Maeda C, Lopes C, Loingtier J, Irwin J (1997) Aspect-oriented programming. In: Akşit M, Matsuoka S (eds) Proceedings European conference on object-oriented programming, vol 1241. Springer, Berlin, pp 220–242
Knuth DE (1971) An empirical study of fortran programs. Softw Pract Exp 1(2):105–133
Kuhn A, Ducasse S, Gírba T (2007) Semantic clustering: identifying topics in source code. Info Softw Technol 49(3): 230–243
Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2007) Mining eclipse developer contributions via author-topic models. MSR 2007: proceedings of the fourth international workshop on mining software repositories. pp 30–33
Linstead E, Rigor P, Bajracharya S, Lopes C, Baldi P (2008) Mining internet-scale software repositories. In: Platt JC, Koller D, Singer Y, Roweis S (eds) Advances in neural information processing systems, vol 20. MIT Press, Cambridge, MA, pp 929–936
Liu C, Chen C, Han J, Yu PS (2006) Gplag: detection of software plagiarism by program dependence graph analysis. In: KDD ’06: proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 872–881
Mandelin D, Xu L, Bodík R, Kimelman D (2005) Jungloid mining: helping to navigate the api jungle. In: PLDI ’05: proceedings of the 2005 ACM SIGPLAN conference on programming language design and implementation. ACM Press, New York, pp 48–61
Marcus A, Sergeyev A, Rajlich V, Maletic J (2004) An information retrieval approach to concept location in source code. In: Proceedings of the 11th working conference on reverse engineering (WCRE 2004). The Netherlands, pp 214–223
McCormick E, Volder KD (2004) Jquery: finding your way through tangled code. In: OOPSLA ’04: companion to the 19th annual ACM SIGPLAN conference on object-oriented programming systems, languages, and applications. ACM. New York, pp 9–10
Minto S, Murphy GC (2007) Recommending emergent teams. In: MSR ’07: proceedings of the fourth international workshop on mining software repositories. IEEE Computer Society. Washington, DC, p 5
Mitzenmacher, M. (2003) A brief history of generative models for power law and lognormal distributions. Internet Math 1(2)
Navarro G (2001) A guided tour to approximate string matching. ACM Comput Surv 33(1):31–88
Newman D, Block S (2006) Probabilistic topic decomposition of an eighteenth-century american newspaper. J Am Soc Inf Sci Technol 57(6): 753–767
Newman D, Chemudugunta C, Smyth P, Steyvers M (2006) Analyzing entities and topics in news articles using statistical topic models. In: ISI. pp 93–104
Oman P, Hagemeister J (1992) Metrics for assessing a software system’s maintainability. In: Proceedings of the international conference on software maintenance 1992. IEEE Computer Society Press, pp 337–344
Page L, Brin S, Motwani R, Winograd T (1998) The pagerank citation ranking: bringing order to the web. Stanford Digital Library working paper SIDL-WP-1999-0120 of 11/11/1999 (see: http://dbpubs.stanford.edu/pub/1999-66)
Paul S (1992) Scruple: a reengineer’s tool for source code search. In: CASCON ’92: proceedings of the 1992 conference of the centre for advanced studies on collaborative research. IBM Press, pp 329–346
Paul S, Prakash A (1994) A framework for source code search using program patterns. IEEE Trans Softw Eng 20(6): 463–475
Poshyvanyk D, Marcus A, Dong Y (2006) Jiriss—an eclipse plug-in for source code exploration. ICPC 0: 252–255
Puppin D, Silvestri F (2006) The social network of java classes. In: Haddad H (ed) SAC. ACM, New York, pp 1409–1413
Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: UAI ’04: proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press, Arlington, pp 487–494
Sahavechaphan N, Claypool K (2006) Xsnippet: mining for sample code. SIGPLAN Not 41(10): 413–430
Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: SIGMOD ’03: proceedings of the 2003 ACM SIGMOD international conference on management of data. ACM, New York, pp 76–85
Schröter A, Zimmermann T, Premraj R, Zeller A (2006) If your bug database could talk .... In: Proceedings of the 5th international symposium on empirical software engineering, vol II: short papers and posters. Rio de Janeiro, pp 18–20
Sindhgatta R (2006) Using an information retrieval system to retrieve source code samples. In: Osterweil LJ, Rombach HD, Soffa ML, (eds) ICSE. ACM, pp 905–908
Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T (2004) Probabilistic author-topic models for information discovery. In: KDD ’04: proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, pp 306–315
Swamidass S, Baldi P (2007) Bounds and algorithms for exact searches of chemical fingerprints in linear and sub-linear time. J Chem Inf Model 47(2): 302–317
Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Statistical Assoc 101(476): 1566–1581
Ugurel S, Krovetz R, Giles CL (2002) What’s the code? automatic classification of source code archives. In: KDD ’02: proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 632–638
Ukkonen E (1992) Approximate string-matching with q-grams and maximal matches. Theor Comput Sci 92(1): 191–211
Welker KD, Oman PW (1995) Software maintainability metrics models in practice. Crosstalk, J Def Softw Eng 8: 19–23
Wheeldon R, Counsell S (2003) Power law distributions in class relationships. In: International workshop on source code analysis and manipulation, pp 45–54
Ye Y, Fischer G (2002) Supporting reuse by delivering task-relevant and personalized information. In: ICSE ’02: proceedings of the 24th international conference on software engineering. ACM, New York, pp 513–523
Zipf GK (1932) Selective studies and the principle of relative frequency in language. Harvard University Press
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Eamonn Keogh.
Erik Linstead, Sushil Bajracharya, and Trung Ngo have contributed equally to this work.
Rights and permissions
About this article
Cite this article
Linstead, E., Bajracharya, S., Ngo, T. et al. Sourcerer: mining and searching internet-scale software repositories. Data Min Knowl Disc 18, 300–336 (2009). https://doi.org/10.1007/s10618-008-0118-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-008-0118-x