Measuring the scientific-software community
The 2016 Community Leadership Summit in Austin, Texas featured a FLOSS Metrics side-track focusing on the tools and processes used to track and understand participation in open-source projects. The subject is of interest to those doing community-management work, since questions recur about how best to measure participation and influence beyond naive metrics like lines of code or commit frequency. But, as James Howison's session on scientific-software usage demonstrated, a well-organized open-source project with public mailing lists is far easier to analyze than certain other software niches.
Howison is a researcher at the University of Texas, Austin who has been studying the software ecosystem that exists almost entirely within academic circles. The root issue is that scientists and other researchers often write custom software to support their work (examples would include bespoke statistical packages in the R language and software that models some physical process of interest) but rarely, if ever, does that software get published on a typical open-source project-hosting service. Instead, it tends to be uploaded as a tar archive to the researcher's personal page on the university web site, where it sits until some other researcher discovers it through a citation and downloads it to run on another project.
This presents a challenge to anyone wanting to study how such software is used across the scientific community as a whole. For starters, it can be quite difficult to identify the software packages of interest, when the only indicator may be a citation in a research paper. Howison showed several examples of less-than-helpful citation styles that may be encountered. Some mention a URL where they downloaded software written by another researcher; some just mention the name of the package or the university where it originated. One crowd-pleasing citation said only "this was tested using software written in the Java language."
Howison has been collecting a data set of software cited in biology research. One of the challenges unique to the research-software niche is that the individual modules are, as a rule, not compiled or linked against each other in what would be termed a "dependency relationship" in the traditional package-management sense. Instead, they may be run as separate jobs that act on the same data, connected only by scripts or job queues rather than linked at run time, or they may be related only because they model parts of the same larger research project. Thus, even when modules make their way onto a public repository site like the Comprehensive R Archive Network (CRAN), there is another problem: determining when specific modules are used in combination with others. Howison and others who study the scientific-software ecosystem call modules that are used together in this way "complementary."
Several efforts exist to discover and map complementary software usage. Howison works on the Scientific Software Network Map, which he demonstrated for the audience. It relies on user-submitted data from cooperating institutions, such as the logs from high-performance computing facilities about which packages (currently for R only) are run as part of the same job. Wherever possible, the packages are mapped to known public software sources, and the system creates a directed graph showing how individual packages are used in combination with others. The goal is to allow researchers to collaborate better, reporting bugs and feature requests that will improve their software for other teams that make use of it.
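The co-usage idea can be illustrated with a minimal sketch. This is not the Scientific Software Network Map's actual implementation or data format: the job records, the function name `co_usage_counts`, and the choice of package names are assumptions made for illustration, and the sketch counts unordered pairs rather than building the directed graph described in the talk.

```python
from collections import Counter
from itertools import combinations


def co_usage_counts(jobs):
    """Count how often each pair of packages appears in the same job."""
    counts = Counter()
    for packages in jobs:
        # Deduplicate and sort so (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(set(packages)), 2):
            counts[pair] += 1
    return counts


# Hypothetical job records: each entry lists the R packages one job loaded.
jobs = [
    ["ggplot2", "dplyr", "ape"],
    ["ape", "phangorn"],
    ["dplyr", "ggplot2"],
]

counts = co_usage_counts(jobs)
for (first, second), n in counts.most_common():
    print(f"{first} + {second}: used together in {n} job(s)")
```

Each pair with a nonzero count would become an edge in a co-usage graph, weighted by the number of jobs in which both packages appeared.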
A similar effort is Depsy, which maps the usage of Python and R packages. Depsy provides information about dependency chains several levels deep, based on usage data extracted from research paper citations. It also factors in dependencies from the Python Package Index and CRAN, as well as data from GitHub repositories where available. Howison noted that Depsy has proven useful to researchers working on grant applications and tenure portfolios who have a need to show that their work is widely used.
Another tricky facet of assessing the scientific-software ecosystem is measuring the installed base of a program, which, Howison noted, is not a problem unique to scientific software in the least. There are several approaches available, such as instrumenting software to ping the originating server whenever it is used, but that option raises serious privacy concerns. The approach Howison has been working on digs deep into download statistics instead, which has revealed some perhaps surprising information.
A typical download graph over time looks like a heartbeat, he said: there is a big spike whenever a new release or some major publicity event occurs, then the number of downloads tapers back down to a relatively flat level. Conventional wisdom is that the heartbeat does not indicate the number of installed users, because the download spike includes a large number of experimenters, people downloading out of curiosity, and others who do not continue to run the program. But Howison's research indicates that, with sufficiently high-resolution download data, the "static" installed base correlates with the number of downloads that come during the initial spike and taper-off, but only when the average number of daily downloads from the post-spike period is subtracted.
That is to say, the active-installed-base users download a new release in the first wave, but one must wait and see what the new "baseline" download level after the release is, and adjust downward to remove its effect. So the installed base corresponds to the area under the spike of the download curve, but above the post-spike baseline level.
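As a rough illustration of that calculation, the sketch below sums the downloads that exceed the post-spike baseline. The article does not give Howison's exact procedure, so the function name `estimate_installed_base`, its parameters, the way the baseline window is chosen, and the sample numbers are all assumptions.

```python
def estimate_installed_base(downloads, release_index, baseline_window):
    """Sum downloads above the post-spike baseline, starting at the release.

    downloads: download counts per time bucket (hypothetical data).
    release_index: index of the bucket in which the release happened.
    baseline_window: number of trailing buckets assumed to lie after the
        spike has tapered off; their mean is used as the baseline.
    """
    post_release = downloads[release_index:]
    if len(post_release) <= baseline_window:
        raise ValueError("need data beyond the baseline window")

    tail = post_release[-baseline_window:]
    baseline = sum(tail) / len(tail)

    # Area under the spike but above the baseline; buckets that fall
    # below the baseline contribute nothing.
    spike = post_release[:-baseline_window]
    return sum(max(0.0, count - baseline) for count in spike)


# Made-up series: flat at 10, a release spike, then back to about 10.
downloads = [10, 10, 100, 60, 30, 12, 10, 10, 10]
estimate = estimate_installed_base(downloads, release_index=2, baseline_window=3)
print(estimate)  # 162.0
```

In practice the hard part is deciding where the spike ends and the baseline begins, which is one reason fine-grained data matters; here that choice is simply passed in as `baseline_window`.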
That measurement makes some degree of intuitive sense, but Howison cautioned that attempting to assess the size of an installed user base is highly dependent on good, high-resolution data: hourly statistics or better. And, unfortunately, that high-resolution data is hard to come by. He had the most success examining download statistics from packages hosted on SourceForge, which provides high-resolution statistics. GitHub and other newer project-hosting services are, evidently, falling short on this front.
As a practical matter, knowing the active installed base for an open-source project is valuable for a number of reasons. Corporate sponsors of projects always want to know what the open-source equivalent to "sales" is, but many projects have discovered that raw download numbers do not translate particularly well. All of the techniques Howison described could be used to help open-source projects better assess where their code is in use and by whom; the questions are particularly tricky in the realm of academic research, but certainly of value to developers in general.
Index entries for this article | |
---|---|
Conference | Community Leadership Summit/2016 |
Posted May 20, 2016 16:48 UTC (Fri)
by ScottMinster (subscriber, #67541)
"So the installed base corresponds to the area under the spike of the download curve, but above the post-spike baseline level."
I'm not sure I understand this correctly. Who are the people downloading the software at the "baseline" level if not users? If anything, my intuition says to throw out the spike as experimenters and the curious, but not necessarily your core users. The curious might become users, but they are also likely to not. It's been my experience that people who use and depend on a piece of software resist updating because they cannot afford for the upgrade to break something.
Posted May 20, 2016 17:12 UTC (Fri)
by sfeam (subscriber, #2841)
What causes the baseline download rate? Indeed a puzzling question. I maintain several scientific software packages, ranging from reasonably popular to truly obscure. In all cases there is a continuous level of downloads. Who on earth would periodically download an obscure package that has not been updated in years? I have never been able to figure that out. But in any case the description of a baseline level with spikes following each new release matches what I have seen myself.
Posted May 26, 2016 5:44 UTC (Thu)
by dlang (guest, #313)
You also have people who start using it installing it on more systems (and yes, many people do install from the Internet instead of from a local copy).
Posted May 26, 2016 14:02 UTC (Thu)
by oldtomas (guest, #72579)
This mystified me at first sight too.
But actually, the "steady trickle" are those downloading the thing, saying "oh, cool" and *perhaps* keeping using it or perhaps going on to other things.
You usually don't re-download the thing when the version hasn't changed (more so when some sort of cache, e.g. a decentralized VCS or your browser's web cache, is in between).
But if you are actively using it, you sure will download the next version, so yes, the interesting information is hidden somewhere between the peaks and the baseline.