The unexpected effectiveness of Python in science
In a keynote on the first day of PyCon 2017, Jake VanderPlas looked at the relationship between Python and science. Over the last ten years or so, there has been a large rise in the amount of Python code being used—and released—by scientists. There are reasons for that, which VanderPlas described, but, perhaps more importantly, the growing practice of releasing all of this code can help solve one of the major problems facing science today: reproducibility.
VanderPlas said that it was his sixth PyCon; he started coming as a "scruffy PhD student" on a travel grant from the Python Software Foundation. In those days, he never imagined that he might some day be addressing the conference.
He likened PyCon to a mosaic; other conferences, like the SciPy conference or DjangoCon, give attendees a look at a specific slice of the Python community. At PyCon "you get it all". Each slice of the Python community has its own way of doing things, its own tools, and so on. In a conversation at PyCon, he once heard someone describe IPython as bloated and say that it promoted bad software practices, which was exactly the opposite of what he thought, since he uses the tool regularly. That comment reflects the fact that the other person uses Python for different reasons than he does. He suggested that attendees take the opportunity of the conference to talk to others who use Python in a different way; it might lead to new tools or insights.
He has worked on various projects, including SciPy, scikit-learn, and Astropy. He has a blog and has written several books on Python topics. His day job is at the University of Washington's eScience Institute, where he helps researchers "use computing more effectively", especially on large data sets.
Astronomy
Beyond all that, VanderPlas is an astronomer; he wanted to talk about how he uses Python as a scientist and astronomer. He put up a slide (Speaker Deck slides) showing Edwin Hubble at the eyepiece of a telescope in 1949, which is "a nice, romantic view of astronomy". But, he noted that in his ten years as an astronomer, he has never looked through a lens; these days, astronomers do database queries, he said.
He put up slides showing various kinds of telescopes that are being used today, as well as some of the visual output from them. Those interested will want to look at the slides and the YouTube video of the talk. He started with the Hubble Space Telescope; it has been in orbit since 1990 and one thing it has produced is an "ultra-deep field" image of a tiny section of the sky (roughly 1/10 the size of the full moon). That allowed astronomers to see galaxies that were up to 13 billion light years away. A complementary project was the Sloan Digital Sky Survey, which scanned the entire sky rather than looking at a single point. Spectrographic analysis of the data allows seeing the three-dimensional shape of the universe.
He then showed the artist's impression of the TRAPPIST-1 exoplanetary system from its Wikipedia entry. That is a system that has been determined to have seven rocky planets, some of which are in the habitable zone where liquid water can exist. In reality, the Kepler space telescope sees that system as four or so grayscale pixels; the wobbles and changes in brightness indicate when these planets pass in front of the star.
To extract information from that data requires "incredibly intricate statistical modeling of the system". That, in turn, means a complicated data-processing pipeline. That work is done in Python and the code is available on GitHub. It is one example of the scientific community "embracing the norms of the open-source community", VanderPlas said.
The James Webb Space Telescope is another instrument; it has a mirror that is three times the size of Hubble's. It is not as sensitive to visible light, but instead is tuned for infrared light. It might be able to actually image exoplanets but, more importantly, may be able to do spectroscopic measurement of light passing through the atmosphere of those planets. That is the "holy grail in exoplanet science", he said. It is a long shot, but it may detect oxygen or ozone in an atmosphere, which would be a sign of life since there is no geophysical way to produce those gases. Once again, the Python tools being used are available.
A project that he worked on as a graduate student, the Large Synoptic Survey Telescope (LSST), will really change every part of astronomy over the next ten years, he said. It has a three-gigapixel camera, which is the largest digital camera ever created. That requires a CCD that is the size of a person.
LSST will take two snapshots every thirty seconds for ten years, which will produce a ten-year time lapse of the full southern night sky from Chile. Each snapshot is the equivalent of around 1500 HDTV images and each night will require 15-30 terabytes of storage. By the end of the ten years, hundreds of petabytes will have been generated. There is a 600-page book that describes various things that can be done with the data and the processing code, in Python and C++, is available.
A graph of the mentions of programming languages in peer-reviewed papers in astronomy publications since 2000 shows Python making the "hockey stick" shape around 2011. Fortran, MATLAB, and IDL are all pretty flat since Python began that rise. IDL, which was the leader until 2015 or so, has actually shown a decline, which is good, he said, because it has a closed-source license.
Why Python?
When Guido van Rossum started Python, he never intended it to be the primary language for programmers. He targeted it as a teaching language and thought programs would be 10-50 lines long; a 500-line program would be near the top end. Obviously things have changed a bit since then. But, VanderPlas asked, what makes Python so effective for science?
Python's ability to interoperate with other languages is one key feature, he said. He paraphrased Isaac Newton: "If I have seen further, it is by importing from the code of giants." If you have to reinvent the wheel every time you extend the study, VanderPlas said, it will never happen.
David Beazley wrote a paper on "Scientific Computing with Python" back in 2000, which advocated the use of Python for science long before it was used much at all in those fields. Beazley mentioned all the different tools and data types that scientists have to deal with; it often takes a lot of effort to pull all of those things together.
Similarly, John Hunter, who created the Matplotlib plotting library for Python, described his previous work process in a SciPy 2012 keynote. He had Perl scripts that called C++ numerical programs, which generated data that got loaded in MATLAB or gnuplot. IPython creator Fernando Perez also described an awk/sed/bash environment for running C programs on supercomputers that he used as a graduate student. There was Perl, gnuplot, IDL, and Mathematica being used as well.
For science, "Python is glue", VanderPlas said. It allows scientists to use a high-level syntax to wrap C and Fortran programs and libraries, which is where most of the computation is actually done.
Another important feature is the "batteries included" philosophy of Python. That means there are all sorts of extras that come with the language; "compare that to C or C++ out of the box", he said. For those things that are not covered in the standard library, there is a huge ecosystem of third party libraries to fill in the gaps. People like Travis Oliphant, who created NumPy and SciPy, were able to add value by connecting low-level libraries to high-level APIs to make them easier to access and use.
The Python scientific stack has "ballooned over the last few years". There are multiple levels to that stack, starting with NumPy, IPython (and its successor, Jupyter), Cython, and others at the lowest level, moving through tools like Matplotlib, Pandas, and SciPy, etc., and then to libraries like scikit-learn, SymPy, StatsModels, and more. On top of those are various field-specific packages like Astropy, Biopython, SunPy, and beyond. If you have a problem you want to solve in Python, you will most likely find something available to help on GitHub, VanderPlas said.
The simple and dynamic nature of the language is another reason that Python fits well with science. He put up the classic "import antigravity" xkcd comic as something of an example. Python is fun to write and, for the most part, it is a matter of putting down what it is you want to happen. As Perry Greenfield put it at a PyAstro 2015 talk: Python is powerful for developers, but accessible to astronomers, which has a huge benefit that is not really being recognized or acknowledged.
Something that is often overlooked about scientific programming is that the speed of development is of primary importance, while the speed of execution is often a secondary consideration. Sometimes people are incredulous that petabytes of data are being processed using Python; they often ask "why don't you use C?" His half-joking response is: "Why don't you commute by airplane instead of by car? It is so much faster!"
Scientific programming is done in a non-linear fashion, generally. A scientist will take their data set and start playing around with it; there will be a lot of back and forth exploratory work that is done. For that, Jupyter notebooks are ideal, though they may not be a good fit for software development, VanderPlas said.
Impact on science
The "open ethos" of Python also makes it a good fit for science. Ten years ago, it was not the case that there were open repositories of code for telescopes, but over that time frame or longer science has been experiencing something of a replication crisis. The headlines of various publications are proclaiming that science is crumbling because peers are unable to reproduce published results.
Solving that problem is important and most who are looking at solving it are landing on the idea of "open science". That is starting to happen, he said. When the Laser Interferometer Gravitational-Wave Observatory (LIGO) detected the "incredible event" of "ripples in spacetime" caused by two black holes merging, part of what was released with the research was Jupyter notebooks with the data and analysis. "This is the way forward for science", he said.
He is also trying to "walk the walk" with two of his books. His Python Data Science Handbook and A Whirlwind Tour of Python are both available as Jupyter notebooks.
Python is really influencing science, he said. For example, the Astropy library is gaining popularity and a community. Astronomers are rallying around it, citing it in papers, and adding more tools to it.
The open-source community, and Python in particular, do things differently than academics and scientists have always done things. But "we've been able to learn" from open source and Python, VanderPlas said, and he hopes that leads to the downfall of the reproducibility problem. "The open-source ethos is such a good fit for science".
In conclusion, he returned to the idea of PyCon as a mosaic. He reiterated the idea that attendees should seek out those who do things differently than they do. Those communities have different approaches and tools; sitting in on talks from outside of your own communities can be highly beneficial. "You never know", it might just end up changing the way your field is done.
[I would like to thank the Linux Foundation for travel assistance to Portland for PyCon.]
Index entries for this article:
Conference: PyCon/2017
Posted Jun 1, 2017 20:04 UTC (Thu)
by mikapfl (subscriber, #84646)
One thing which is only mentioned in passing in the article, but which is pretty important IMHO for Python in science, is that Python is free as in beer. IDL, MATLAB, and similar tools are not free. Why should it matter in science, where we easily spend a few million on a new machine or burn through chemicals worth thousands just to test a small idea? Well, it matters because students don't decide where the money is spent, and especially early-stage researchers switch places often. If I am just learning programming and am changing positions once or twice per year (think university studies - bachelor's at another institution - master's yet somewhere else - PhD on another continent), it becomes pretty cumbersome to talk each new group into buying software for ~1000 €, just so that I can use my knowledge and scripts. It is thus the rational choice to learn something like Python and use it everywhere.

I'm not even talking about the day when you think "maybe I should run my calculations on all my colleagues' workstations", only to find out you would need 50 licences for MATLAB to do that. That's the day you seriously regret having learnt MATLAB when you could have learnt Python instead. And because scientists are not mainly programmers, they usually never learn a second language. People are still using IDL just because they learned it in the 90s, although IDL is abysmal compared to almost any of its competitors. So, Python will stay in science for at least 20 years, possibly longer.
Of course, that doesn't explain why Python is replacing Perl, bash, awk, sed, and C/C++ or Fortran as the main "glue" programming language, so all the reasons explained in the article are also important. But for the young crowd, Python (and R) are also very appealing because you can bring your knowledge everywhere for free.
Cheers
Mika
Posted Jun 2, 2017 0:11 UTC (Fri)
by droundy (subscriber, #4559)
Posted Jun 2, 2017 10:27 UTC (Fri)
by ballombe (subscriber, #9523)
Posted Jun 4, 2017 19:21 UTC (Sun)
by Sesse (subscriber, #53779)
Posted Jun 6, 2017 13:18 UTC (Tue)
by anselm (subscriber, #2796)
That's great but often (a) even the academic licenses cost their users non-trivial amounts of money, and (b) you don't necessarily get to keep your academic licenses after you graduate and/or move to another institution. Especially if you start working for a commercial company in your field of research you will require a full license for the software, which your company will need to pay for and which they probably would prefer not to. (The whole point of cheap academic licenses is to make people dependent on the software in question so they, or someone, will eventually have to pay up when they start using the software “for real”.)
Software that (like Python and its extensions for scientific computing) is free as in beer/speech from the get-go is preferable in this context.
Posted Jun 6, 2017 19:03 UTC (Tue)
by MattJD (subscriber, #91390)
But I'm not bitter. I won't do business with the company that sells that software. But I'm not bitter.
Posted Jun 2, 2017 0:30 UTC (Fri)
by fenncruz (subscriber, #81417)
It's not just that people learned it in the 90s; it's that you end up inheriting some code from the 90s that someone wrote in IDL, so you have to learn IDL to get it to work before you even think of converting it to Python. Luckily, a lot of the NASA libraries are being released these days in Python, so hopefully IDL can die.
The other big thing about moving to Python, I found, is being able to ask help questions easily. I've asked IDL-based questions online and no one has a clue what I'm talking about; ask a Python question and you find it's already been answered 10 different ways on Stack Overflow.
Posted Jun 2, 2017 5:24 UTC (Fri)
by Matt_G (subscriber, #112824)
The main reason, I think, is the somewhat famous "Numerical Recipes" textbook (https://en.wikipedia.org/wiki/Numerical_Recipes) - kind of like the K&R of scientific computing. Almost all the senior engineers here swear by it. It was my first reference text when I started - maybe someone needs to write a "Numerical Recipes in Python" book.
Posted Jun 2, 2017 16:46 UTC (Fri)
by smoogen (subscriber, #97)
A lot of this old code has to do with the amount of time non-coders have to deal with code, but part of it is that you may need to run years-old data plus new data and need it to have the same answers as before. Or the satellite you are using was launched in the 80s and is still producing data 40 years later. You could get some people in to recode stuff, with a 40% chance of it all falling apart from budget cuts... or you could use the old hardware and use the money to buy some lab equipment that got cut from a different budget.
This is going to be one of the reasons that Python2.7 is going to have a longer shelf life than people want or expect. A lot of scientists love 2.7 because it is not going to change. Every new RHL/Debian/etc release, we would have to go and deal with support tickets at the Lab due to changes in python not always being as backwards compatible as thought. [My understanding is that there are systems at CERN and similar labs running RHL-7.3 15 years later because it has the python that the experiments were written against.. it will still be running them for N more years until the experiment completes.] Most of the scientists running long lived experiments are writing the stuff in 2.7 because of the fact for them it is no longer moving. If there is ever a 3.N that stops they will move to that also (while still running 2.7 and 1.5 systems.)
At this point I expect a Numerical Recipes book in Python would be written against Python 2.7.
Posted Jun 8, 2017 22:51 UTC (Thu)
by geek (guest, #45074)
Posted Jun 2, 2017 14:28 UTC (Fri)
by ibukanov (subscriber, #3942)
Posted Jun 5, 2017 0:26 UTC (Mon)
by gdt (subscriber, #6284)
"Free" also goes a some way to explaining the rising dominance of R in statistical computing. Even Microsoft has seen the way the wind is blowing and put substantial resources into improving the language's performance (on its own operating system, but source available for sharing with others thanks to the GPL). Part of the other reason for the rise of Python is that it is a "real" programming language. The current generation of scientists are applying large-scale computing to their investigations and are aware that this is best done through extending an existing well-regarded language rather than constructing a domain-specific language which may then lack the essentials for programming-in-the-large. Especially as writing a Python module, or even a Python C API module, is a small fraction of the work of writing a domain-specific language. Moreover there's a huge network effect: in pulling OTDR data into Python I got all the file parsing, statistical analysis, graphing, and archiving at low cost. These are also present in domain-specific languages, but usually your use-case has to exactly match the intended use of the language, and OTDR data is just outside the norm of a statistical dataset (with the world's worst file format, with the statistics of interest being very different to population statistics, etc).
Posted Jun 2, 2017 9:05 UTC (Fri)
by nettings (subscriber, #429)
https://github.com/spatialaudio/DAGA2017_towards_open_sci...
Like in the article, one of the points of this paper is to improve reproducibility.
Posted Jun 8, 2017 22:07 UTC (Thu)
by oak (guest, #2786)
to download bootleg copies to run on their own computers. And the bootleg copies contribute to the student learning process, so as a teacher you cannot ignore them.
Only marginally related, but...
I'm pretty sure the statement in the XKCD comic should be:
from __future__ import antigravity