Abstract
Studying software repositories and hosting services can provide valuable insights into the behaviors of large groups of software developers and their projects. Traditionally, most analyses of metadata collected from software project hosting services have been restricted to a short window of time, typically just a few years. To date, few studies, if any, have been built from data covering the entire lifespan of a hosting facility: from its birth to its death and its rebirth in another form. Thus, the first contribution of this paper is to present two data sets that support historical analysis of over ten years of metadata collected from the now-defunct RubyForge project hosting site and from its successor, the RubyGems package (“gem”) hosting facility. The sample analyses demonstrated in this paper include overall forge growth over time; project-level characteristics on both forges and how they changed over time (for example, licenses and languages); and the use of developer-level metadata (for example, counts of new developers and developer-project density) to assess changes in person-level activity on both sites over time. Finally, because RubyForge was phased out and its gem-hosting function was taken over by RubyGems, the gems within RubyForge projects were transferred into the RubyGems hosting facility by project owners and by the site owners themselves. The data sets therefore represent a unique opportunity to study projects as they moved from one ecosystem to another. We show several methods for locating related projects between the two forges and for building a cross-forge, longitudinal project history using information from both.
The data sets and sample analyses in this paper will be relevant to researchers studying long-term software evolution and distributed, hosted, or collaborative software development environments.
Notes
This work is an extension of a short paper previously presented in the Data Showcase track at The International Conference on Mining Software Repositories in 2016. The full citation for that prior work can be found in Squire (2016a).
References
Blair P (2010) RubyGems.org Replaces RubyForge as Gem Host. InfoQ. Mar 30. Available at http://www.infoq.com/news/2010/03/rubygems
Booch G, Brown AW (2003) Collaborative development environments. Advances in Computers (59):1–27
Cooper P (2009) Gemcutter is the new official default ruby gem host. RubyInside (Oct 26) http://www.rubyinside.com/gemcutter-is-the-new-official-default-rubygem-host-2659.html
Delorey DP, Knutson CD, Giraud-Carrier C (2007) Programming language trends in open source development: An evaluation using data from all production phase SourceForge projects. In Proc. 2nd Workshop Public Data Software Dev. (WoPDaSD). Limerick, Ireland
DiBona C (2015) Bidding farewell to Google Code. Google Open Source Blog March 12. Retrieved April 18, 2017 from https://opensource.googleblog.com/2015/03/farewell-to-google-code.html
FLOSSmole (2004) http://flossmole.org
GitHub (2017) About GitHub. Available at: https://github.com/about. Retrieved April 20, 2017
Gousios G (2013) The GHTorrent dataset and tool suite. Proc. 10th Int. Conf. on Mining Software Repositories (MSR 2013). 233–236
Harry B (2017) Shutting Down CodePlex. Brian Harry's Blog. March 31. Retrieved April 18, 2017 from https://blogs.msdn.microsoft.com/bharry/2017/03/31/shutting-down-codeplex/
Howison J, Crowston K, Conklin M (2006) FLOSSmole: a collaborative repository for FLOSS research data and analyses. Int J Information Technology and Web Engineering 1(3):17–26
Hyett PJ (2008) GitHub's RubyGem server. GitHub Blog. Available at https://github.com/blog/51-github-s-rubygem-server
Knuth DE (1973) The art of computer programming: volume 3, Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., Redwood City, pp 391–92
Krein JL, MacLean AC, Knutson CD, Delorey DP, Eggett DL (2009) Language entropy: A metric for characterization of author programming language distribution. In Proc. 4th Workshop Public Data Software Dev. (WoPDaSD). Skovde, Sweden
Krein JL, MacLean AC, Knutson CD, Delorey DP, Eggett DL (2010) Impact of programming language fragmentation on developer productivity. Int. J. Open Source Sw. & Proc 2(2):41–61
Lerner J, Tirole J (2005) The scope of open source licensing. J. of Law, Economics, and Organization 21(1):20–56
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Doklady Akademii Nauk SSSR 163(4):845–848, 1965 (Russian). English translation in Soviet Physics Doklady 10(8):707–710. Available at: https://xlinux.nist.gov/dads/HTML/Levenshtein.html
McAlister N (2013) Study: most projects on GitHub not open source licensed. The Register April 13. Available at: http://www.theregister.co.uk/2013/04/18/github_licensing_study/
Miller J (2012) History of the canonical gem host for Ruby gems. Available at http://www.rubycoloredglasses.com/2012/05/history-of-the-canonical-gem-host-for-ruby-gems/
Phipps S (2014) Why all software needs a license. InfoWorld November 7. Available at: http://www.infoworld.com/article/2839560/open-source-software/sticking-a-license-on-everything.html
Phoenix E (2013) Tweet. November 10. Available at https://twitter.com/evanphx/status/399552820380053505
Roberts R (2009) Gemcutter: a fast and easy approach to ruby gem hosting. RubyInside. Aug 20. Available at http://www.rubyinside.com/gemcutter-a-fast-and-easy-approach-to-ruby-gem-hosting-2281.html
Ruby Rogues Podcast (2012) RubyGems with Nick Quaranto. Episode 36. January 5. Available at: https://devchat.tv/ruby-rogues/036-rr-rubygems-with-nick-quaranto
RubyGems (2016a) About RubyGems. Available at: https://rubygems.org/pages/about
RubyGems (2016b) RubyGems Data Dumps. Available at: https://rubygems.org/pages/data
RubyGems (2016c) RubyGems Guides: Make your own gem. Available at http://guides.rubygems.org/make-your-own-gem/
Schuster W (2009) RubyForge to be phased out. InfoQ Oct 26 http://www.infoq.com/news/2009/10/rubyforge-phased-out-rubygemsorg
Squire M (2009) Integrating projects from multiple open source code forges. Int J Open Source Software & Proc 1(1):46–57
Squire M (2016a) Data Sets: The Circle of Life in Ruby Hosting, 2003–2015. In Proc. 13th Int. Conference on Mining Software Repositories (MSR 2016). Austin, TX, USA. 452–455
Squire M (2016b) Mastering Data Mining with Python. Packt: London, UK. Program available at: https://github.com/megansquire/masteringDM/blob/master/ch3/entityAttributeMetrics.py
Stocker M (2008) Pros and cons of GitHub vs RubyForge as gem source. InfoQ August 14. Available at http://www.infoq.com/news/2008/08/gems-from-rubyforge-and-github
Vasilescu B, Posnett D, Ray B, van den Brand MG, Serebrenik A, Devanbu P, Filkov V (2015) Gender and tenure diversity in GitHub teams. In Proc. CHI. ACM
Vendome C (2015) A large scale study of license usage on GitHub. In Proc. 37th Int. Conf. Softw. Eng. (ICSE), 2, 772–774
Villa L (2013) Younger developers reject licensing, risk chance for reform. Feb 13. Available at: https://opensource.com/law/13/2/post-open-source-software-licensing
Wanstrath C (2009) GitHub gem building is defunct. Oct 8. https://github.com/blog/515-gem-building-is-defunct
Acknowledgements
We gratefully acknowledge the National Science Foundation (grant number NSF-14-05643) for supporting this work.
Appendix: SQL code used to generate figures and tables
Listing A
SQL to create Fig. 3. Cumulative project count over the lifespan of RubyForge, by month and year.
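The SQL for this listing is not reproduced here. As a rough illustration of how a cumulative monthly project count could be computed, the following Python sketch uses the standard-library sqlite3 module. The table name rf_projects, its columns, and the sample rows are assumptions made for this example; they are not the schema or data used in the paper.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rf_projects (project_name TEXT, registered TEXT)")
conn.executemany(
    "INSERT INTO rf_projects VALUES (?, ?)",
    [
        ("alpha", "2003-07-14"),
        ("beta", "2003-07-30"),
        ("gamma", "2003-09-02"),
        ("delta", "2004-01-19"),
    ],
)

# Count new registrations per calendar month, in chronological order.
monthly = conn.execute(
    """
    SELECT strftime('%Y-%m', registered) AS month, COUNT(*) AS new_projects
    FROM rf_projects
    GROUP BY month
    ORDER BY month
    """
).fetchall()

# Accumulate the monthly counts into a running total.
cumulative = []
running_total = 0
for month, new_projects in monthly:
    running_total += new_projects
    cumulative.append((month, new_projects, running_total))

for row in cumulative:
    print(row)
```

Months with no registrations do not appear in the result, so a plot built from this output would need those gaps filled if a continuous time axis is wanted.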
Listing B
SQL for Fig. 4. New project registration dates for all projects in RubyForge, by month and year. Uses the final RubyForge data collection from 2013 (#12987).
Listing C
SQL for Fig. 5. Creation dates for gems in RubyGems, by month and year
Listing D
SQL used to generate 2006 data (#24) for Table 4
SQL used to generate 2013 data (#12987) for Table 4
Listing E
SQL used to generate 2006 data (#24) for Table 9.
SQL used to generate 2013 (#12987) data for Table 9
Listing F
SQL code to get total number of projects for latest RubyGems collection (July 2016, #61243).
Get the count of those that have no license (including self-typed values such as “none”).
Get the count for each license - retain the top ten after combining spelling abnormalities, for Table 11.
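The three steps in this listing (total count, count with no license, and top licenses after merging spelling variants) can be sketched as follows. This is a minimal Python illustration, not the paper's SQL: the table rg_gems, the spelling map CANONICAL, the NO_LICENSE set, and the sample rows are all assumptions chosen for the example.

```python
import sqlite3
from collections import Counter

# Hypothetical spelling map: variant (lowercased) -> canonical license label.
CANONICAL = {
    "mit": "MIT",
    "mit license": "MIT",
    "apache 2.0": "Apache-2.0",
    "apache-2.0": "Apache-2.0",
}
# Hypothetical self-typed values that are treated as "no license".
NO_LICENSE = {"", "none", "n/a"}


def normalize(raw):
    """Return a canonical license label, or None when no license is given."""
    if raw is None:
        return None
    key = raw.strip().lower()
    if key in NO_LICENSE:
        return None
    return CANONICAL.get(key, raw.strip())


# Hypothetical table and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rg_gems (gem_name TEXT, license TEXT)")
conn.executemany(
    "INSERT INTO rg_gems VALUES (?, ?)",
    [
        ("a", "MIT"), ("b", "mit"), ("c", "MIT License"),
        ("d", None), ("e", "none"), ("f", "Apache 2.0"),
        ("g", "Apache-2.0"), ("h", "GPL-3.0"), ("i", ""),
    ],
)

total = conn.execute("SELECT COUNT(*) FROM rg_gems").fetchone()[0]
licenses = [normalize(raw) for (raw,) in conn.execute("SELECT license FROM rg_gems")]
unlicensed = sum(1 for lic in licenses if lic is None)
top_ten = Counter(lic for lic in licenses if lic is not None).most_common(10)

print(total, unlicensed, top_ten)
```

Doing the normalization in Python rather than in SQL keeps the spelling map easy to inspect and extend; the same merging could also be expressed as a CASE expression in the query.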
Listing G
Generate project count per developer for final RubyForge collection (2013).
How many developers worked on more than 10 projects?
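A per-developer project count and the follow-up question about developers on more than 10 projects might be expressed with GROUP BY and HAVING, as in the sketch below. The table rf_memberships, its columns, and the sample data are hypothetical and stand in for whatever developer-project tables the original listing queries.

```python
import sqlite3

# Hypothetical developer-to-project membership table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rf_memberships (dev_login TEXT, project_name TEXT)")
rows = [("ann", f"project_{i}") for i in range(12)]
rows += [("bob", "project_0"), ("bob", "project_1"), ("cy", "project_2")]
conn.executemany("INSERT INTO rf_memberships VALUES (?, ?)", rows)

# Number of distinct projects each developer belongs to.
per_developer = conn.execute(
    """
    SELECT dev_login, COUNT(DISTINCT project_name) AS project_count
    FROM rf_memberships
    GROUP BY dev_login
    ORDER BY project_count DESC, dev_login
    """
).fetchall()

# Number of developers who belong to more than 10 projects.
prolific = conn.execute(
    """
    SELECT COUNT(*)
    FROM (
        SELECT dev_login
        FROM rf_memberships
        GROUP BY dev_login
        HAVING COUNT(DISTINCT project_name) > 10
    ) AS prolific_devs
    """
).fetchone()[0]

print(per_developer, prolific)
```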
Listing H
Find the number of gems per owner.
Find the number of gems per author
Listing I
Write a query to get a list of RubyForge and RubyGems projects that match on URL, populate entities table. The names and URLs are drawn from two views called rf_entities and rg_entities.
Write a query to get a list of RubyForge and RubyGems projects that match on name, populate entities table (if that pair is not already in entities table)
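The two matching passes described above could look roughly like the following sketch: match on URL first, then match on name while skipping pairs that are already recorded. The names rf_entities, rg_entities, and entities come from the listing description, but their columns (name, url, rf_name, rg_name, match_type), the case-insensitive name comparison, and the sample rows are assumptions. For simplicity the two sources are created here as plain tables rather than views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column layouts below are hypothetical; the paper's views may differ.
conn.executescript(
    """
    CREATE TABLE rf_entities (name TEXT, url TEXT);
    CREATE TABLE rg_entities (name TEXT, url TEXT);
    CREATE TABLE entities (
        rf_name TEXT, rg_name TEXT, match_type TEXT,
        PRIMARY KEY (rf_name, rg_name)
    );
    """
)
conn.executemany(
    "INSERT INTO rf_entities VALUES (?, ?)",
    [
        ("widget", "http://widget.example.org"),
        ("gadget", "http://gadget.example.org"),
        ("sprocket", "http://sprocket.example.org"),
        ("orphan", None),
    ],
)
conn.executemany(
    "INSERT INTO rg_entities VALUES (?, ?)",
    [
        ("widget-gem", "http://widget.example.org"),
        ("gadget", "http://gadget.example.net"),
        ("sprocket", "http://sprocket.example.org"),
        ("unrelated", "http://unrelated.example.org"),
    ],
)

# Pass 1: pairs whose URLs are identical (NULL URLs never match).
conn.execute(
    """
    INSERT INTO entities
    SELECT rf.name, rg.name, 'url'
    FROM rf_entities rf JOIN rg_entities rg ON rf.url = rg.url
    """
)

# Pass 2: pairs whose names match, unless that pair is already in entities.
conn.execute(
    """
    INSERT INTO entities
    SELECT rf.name, rg.name, 'name'
    FROM rf_entities rf JOIN rg_entities rg ON lower(rf.name) = lower(rg.name)
    WHERE NOT EXISTS (
        SELECT 1 FROM entities e
        WHERE e.rf_name = rf.name AND e.rg_name = rg.name
    )
    """
)

matches = conn.execute(
    "SELECT rf_name, rg_name, match_type FROM entities ORDER BY rf_name, rg_name"
).fetchall()
print(matches)
```

In this sample, the sprocket pair matches on both URL and name but is recorded only once, with the match type from the first pass.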
Listing J
A GHTorrent query to show the number of new GitHub projects, with and without forks, over time, as discussed in Section 1.
FLOSSmole queries to show the total number of projects hosted on CodePlex and Google Code, the two next-largest code forges, aside from GitHub.
Cite this article
Squire, M. Data sets describing the circle of life in Ruby hosting, 2003–2016. Empir Software Eng 23, 1123–1152 (2018). https://doi.org/10.1007/s10664-017-9581-6
DOI: https://doi.org/10.1007/s10664-017-9581-6