Leading items
Welcome to the LWN.net Weekly Edition for June 7, 2018
This edition contains the following feature content:
- Advanced computing with IPython: multilingual and highly parallel Python programming.
- Unplugging old batteries: is it time to retire some modules from the Python standard library? (From the Python Language Summit).
- Deferring seccomp decisions to user space: a proposed seccomp enhancement to let a user-space process make security decisions.
- Statistics from the 4.17 kernel development cycle: where the code in 4.17 came from.
- Will staging lose its Lustre?: a popular filesystem fails to graduate from the kernel's staging tree.
- A filesystem "change journal" and other topics: proposals on logging filesystem changes for crash recovery.
- The ZUFS zero-copy filesystem: a new zero-copy filesystem for nonvolatile memory storage.
- Flash storage topics: how to deal with performance problems caused by slow flash devices.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Advanced computing with IPython
If you use Python, there's a good chance you have heard of IPython, which provides an enhanced read-eval-print loop (REPL) for Python. But there is more to IPython than just a more convenient REPL. Today's IPython comes with integrated libraries that turn it into an assistant for several advanced computing tasks. We will look at two of those tasks, using multiple languages and distributed computing, in this article.
IPython offers convenient access to documentation, integration with matplotlib, persistent
history, and many other features that greatly ease interactive work with
Python. IPython also comes with a collection of "magic" commands
that alter the effect of single lines or blocks of code; for example, you
can time your code simply by typing %%time
at the prompt
before entering your Python statements. All of these features also work when
using the Jupyter notebook with the IPython
kernel, so you can freely switch between the terminal and the browser-based
interface while using the same commands.
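As a minimal illustration (not taken from the article), a cell timed this way might look like the following; when it finishes, IPython reports the CPU and wall-clock time the statements took:

    %%time
    total = sum(i * i for i in range(10_000_000))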
Multilingual computing
No one language is ideal for everything. IPython and Jupyter allow you to exploit the strengths of multiple languages in a single notebook or interactive session.
The figure on the right, which is a snippet from a Jupyter notebook, shows the
simplest way to
make use of this feature. The %%ruby
cell magic command (the
double %%
means that the magic command will apply to the
entire cell) causes the cell contents to be handled by the Ruby
interpreter. The --out
flag stores the cell output into the
named variable, which is then globally available to the Python kernel that
interprets the contents of the other cells. The following cell casts the
string output of the Ruby code into an integer (note that int()
is legal Python, but not a part of Ruby). The Ruby code simply adds up the
integers from 1 to 100; the final result is stored in the variable named by the --out flag.
This can be done without installing any IPython or Jupyter kernels
or extensions—only Ruby is required. The same
thing can be done with Perl, Bash, or sh; other interpreted languages can be
added by editing the list in the source at [path to IPython]/core/magics/script.py.
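A minimal sketch of the two cells described above might look like this (the output variable name a is an assumption; any valid Python identifier can be given to --out):

    %%ruby --out a
    total = 0
    (1..100).each { |i| total += i }
    puts total

The next cell then converts the captured string back into a Python integer:

    int(a)    # 5050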
F2PY is a component of NumPy/SciPy that compiles and wraps Fortran subroutines so that they can be used with Python. The appeal is being able to take advantage of Fortran's fast numerical operations together with the high-level convenience and interactivity of Python. Typically, using F2PY requires several manual steps. However, a third-party extension called Fortran magic (installable with pip) provides a cell magic that uses F2PY under the hood to do the compilation and interface creation. All that is needed is a single magic line in a cell containing a Fortran subroutine or function (a Fortran compiler may need to be installed if one is not already present on the system).
The figure below shows the process. First, we define a Python function,
called eap()
, that uses a slowly converging approximation to
e, the base of the natural logarithms.
It calculates d
successive approximations, returning the final one.
The
next cell loads the Fortran magic machinery, which generates a user warning
(collapsed), as this was developed for an older version of IPython/Jupyter
(but everything still works). The load command defines the cell magic
that we use in the cell after that, which contains a Fortran version of the
same function, called eaf()
. When
that cell is executed, IPython compiles the code and generates the Python
interface. In the last two cells, each program is invoked with timing turned
on; they produce comparable outputs, but the Fortran version is about 24
times faster.
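The exact code is shown in the figure, but a rough reconstruction of the sequence of cells, assuming the fortran-magic package and a simple (1 + 1/n)**n approximation, might look like this (each block is a separate cell):

    def eap(d):
        # compute d successive approximations to e, returning the last one
        for n in range(1, d + 1):
            a = (1.0 + 1.0 / n) ** n
        return a

    %load_ext fortranmagic

    %%fortran
    subroutine eaf(d, a)
        integer, intent(in) :: d
        real(8), intent(out) :: a
        integer :: n
        do n = 1, d
            a = (1.0d0 + 1.0d0 / n) ** n
        end do
    end subroutine eaf

The final two cells of the figure can then be approximated with %time eap(10000000) and %time eaf(10000000).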
With only a single magic command, you can package a compiled Fortran routine for interactive use in your session or notebook. Since the numerical parts of your programs are both the easiest to translate into Fortran and the parts that will benefit the most from the process, this is a simple way to speed up a computation, and a good demonstration of the power of the IPython ecosystem.
Parallel and distributed computing
IPython provides a number of convenient solutions for dividing a computation among the processing cores of either a single machine or multiple networked computers. The IPython parallel computing tools do much of the setup and bookkeeping; in simple cases, they allow parallel computations to be performed in an interactive context almost as easily as normal, single-processor calculations.
A common reason for running code on multiple processors is to speed it up or to increase throughput. This is only possible for certain types of problems, however, and only works well if the time saved doing arithmetic outweighs the overhead of moving data between processors.
If your goal is to maximize the speed of your computation, you will want to speed up its serial performance (its speed on a single processing core) as much as is practical before trying to take advantage of parallel or distributed computation. This set of slides [PDF] provides a clear and concise introduction to the various ways you might approach this problem when your core language is Python; it also introduces a few parallelization strategies. There are a large number of approaches to speeding up Python, all of which lie beyond the concerns of this article (and, regardless of language, the first approach should be a critical look at your algorithms and data structures).
Aside from trying to speed up your calculation, another purpose of networked, distributed computing is to gather information from or run tests on a collection of computers. For example, an administrator of a set of web servers located around the world can take advantage of the techniques described below to gather performance data from all the servers with a single command, using an IPython session as a central control center.
First, a note about NumPy. The
easiest way to parallelize a Python computation is simply to express it, if
possible, as a sequence of array operations on NumPy arrays. This will
automatically distribute the array data among all the cores on your machine
and perform array arithmetic in parallel (if you have any doubt, submit a
longish NumPy calculation while observing a CPU monitor, such as
htop
, and you will see all cores engaged). Not every
computation can be expressed in this way; but if your program already uses
NumPy in a way that allows parallel execution, and you try to use the
techniques described below to run your program on multiple cores, it will
introduce unnecessary interprocess communication, and slow things down rather than
speeding them up.
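For example (a generic illustration, not from the article), replacing an explicit Python loop with an equivalent NumPy array expression is often all that is needed:

    import numpy as np

    x = np.random.rand(10_000_000)

    # Element-by-element loop: pure Python, one core
    total = 0.0
    for v in x:
        total += v * v

    # Equivalent array expression: the work runs in optimized native code,
    # and many NumPy operations of this kind can use multiple cores
    total = np.dot(x, x)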
In cases where your program is not a natural fit for NumPy's array
processing, IPython provides other, nearly equally convenient methods for
taking advantage of multiple processors. To use these facilities, you must
install the ipyparallel library (pip install ipyparallel will do it).
IPython and the ipyparallel
library support a large variety
of styles and paradigms for parallel and distributed computing. I have
constructed a few examples that demonstrate several of these paradigms in a
simple way. This should give some entry points to begin experimenting
immediately, with a minimum of setup, and give an idea of the range of
possibilities. To learn about all the options, consult the documentation
[PDF].
The first example replicates a computation on each of the machine's CPU
cores. As mentioned above, NumPy
automatically divides work among these cores, but with IPython you can
access them in other ways. To begin, you must create a computing
"cluster". With the
installations of IPython and ipyparallel
come several
command-line tools. The command to create a cluster is:
    $ ipcluster start --n=x

Normally, x is set to the number of cores in your system; my laptop has four, so --n=4. That command should result in a message that the cluster was created. You can now interact with it from within IPython or a Jupyter notebook.
The figure on the right shows part of a Jupyter session using the cluster. The
first two cells import the IPython parallel library and instantiate a
Client. To check that all four cores are in fact available, the ids list of the Client instance is displayed.
The next cell (in the figure on the left)
imports the
choice()
function, which randomly
chooses an element from a collection, pylab
for plotting, and
configures a setting that causes plots to be embedded in the notebook.
Note the cell is using the %%px
magic. This incantation, at
the top of a cell, causes the calculation in the cell to be replicated, by
default, into a separate process for each core. Since we started our
cluster using four cores, there will be four processes, each of which
has its own private versions of all the variables, and each of which runs
independently of the others.
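In code, the opening cells of such a session might look roughly like the following (a sketch; the client object is called rp here to match the map_sync() example later in the article):

    import ipyparallel as ipp

    rp = ipp.Client()    # connect to the cluster started with ipcluster
    rp.ids               # should show [0, 1, 2, 3] for a four-engine cluster
    rp[:].activate()     # register the %px / %%px magics for all engines

followed by a cell along the lines of:

    %%px
    from random import choice
    %pylab inline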
The next two cells (below) compute a 2D random walk with a million steps. They
are each decorated by the %%timeit
cell magic, which times the
calculation within a cell. The first cell uses the --targets
argument to limit the calculation to a
single process on a single core (core "0"; we could have chosen
any number from 0 to 3); the second uses %%px
without an
argument to use all cores. Note that in this style of parallel computing,
each variable, including each list, is replicated among the cores. This is
in contrast to array calculations in NumPy, where arrays are split among
cores, each one working on a part of a larger problem. Therefore, in this
example, each
core will calculate a separate, complete million-step random walk.
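A plausible reconstruction of those cells (the exact walk code is in the figure, and the %%timeit decoration is omitted here for brevity) is:

    %%px --targets 0
    # million-step 2D random walk on a single engine
    x = y = 0
    walk = [(x, y)]
    for _ in range(1_000_000):
        x += choice((-1, 1))
        y += choice((-1, 1))
        walk.append((x, y))

and the same cell body again, headed simply by %%px, to run the walk independently on all four engines.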
If the timings showed identical times for each version, that would mean that we actually did four times as much work in the same amount of time in the second case, because the calculation is done four times. However, the second version actually took a bit more than twice as long, which means that we only achieved a speedup of about two. This is due to the overhead of interprocess communication and setup, and should decrease as the calculations become more lengthy.
What happens if we append a plot command to the cells, to see what the random walks look like? The next figure shows how each process produces its own plot (with its own random seed). This style of multiprocessing can be a convenient way to compare different versions of a calculation side by side, without having to run each iteration one after the other.
You can also execute a one-liner on the cluster using %px, which is the single-line version of the magic command. Using this, you can mix serial and parallel code within a single cell. So after importing the random integer function (randint()):

    %px randint(0,9)
    Out[0:8]: 1
    Out[1:8]: 3
    Out[2:8]: 9
    Out[3:8]: 0
The output cell is labeled, as usual, in order (here it was the 8th calculation in my session), but also indicates which of the four compute cores produced each result.
The %%px
and %px
magic commands are easy ways
to replicate a computation among a cluster of processors when you want each
processor to operate on its own private copy of the data. A classic
technique for speeding up a calculation on a list by using multiple
processors follows a different pattern: the list is divided among the
processors, each goes to work on its individual segment, and the results
are reassembled into a single list. This works best if the calculation on
each array element does not depend on the other elements.
IPython provides
some convenience functions for making these computations easy to
express. Here we'll take a look at the one that's most generally
useful. First, consider Python's map()
function; for example:
    list(map(lambda x: f(x), range(16)))

That will apply the function f() to each element of the list [0..15] and return the resulting list. It will do this on one processor, one element at a time.
But if we've started up a cluster as above, we can write it this way:
    rp[:].map_sync(lambda x: f(x), range(16))

This will divide the list into four segments of equal length, send each piece to a separate processor, and put the result back together, replacing the original. You can control which processors are employed by indexing rp, so you can easily tell rp[0:1] to work on one array while you have rp[2:3] do something else.
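A complete, runnable version of that pattern, with a trivial function standing in for f() purely for illustration, might be:

    import ipyparallel as ipp

    rp = ipp.Client()      # assumes the cluster from "ipcluster start" is running

    def f(x):
        return x * x

    # Serial: one processor, one element at a time
    serial = list(map(f, range(16)))

    # Parallel: the list is split among the engines, each engine maps
    # its segment, and the results are reassembled in order
    parallel = rp[:].map_sync(f, range(16))

    assert serial == parallel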
The processors that are used need not be cores
on the local machine. They can reside on any computers that can be reached
through the internet, or on the local network. The simplest setup for computing over a
network is when your compute cluster consists of machines that you can
reach using SSH. Naturally, the configuration is a bit more involved than
simply typing the ipcluster
command. I have created a document
[PDF] to describe a
minimal example configuration that will get you started computing on a
networked cluster over SSH.
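As a taste of what that configuration involves, an ipcluster profile's ipcluster_config.py typically contains entries along these lines (the hostnames are placeholders, and the exact option names should be checked against the ipyparallel documentation for your version):

    # Launch engines over SSH instead of as local processes
    c.IPClusterEngines.engine_launcher_class = 'SSH'

    # Map each remote host to the number of engines to start there
    c.SSHEngineSetLauncher.engines = {
        'node1.example.com': 4,
        'node2.example.com': 4,
    }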
For over 20 years, computational scientists have relied on various approaches to parallel computing as the only way to perform really large calculations. As the desire for more accurate climate modeling, processing larger and larger data sets for machine learning, better simulations of galactic evolution, and more powerful calculations in many different fields outstripped the capabilities of single processors, parallel processing was employed to break the bottleneck. This used to be the exclusive domain of people willing to rewrite their algorithms to incorporate special parallel libraries into their Fortran and C programs, and to tailor their programs to the peculiarities of individual supercomputers.
IPython with
ipyparallel
offers an unprecedented ability to combine the
exploratory powers of scientific
Python with nearly instant access to multiple computing cores. The
system presents high-level abstractions that make it intuitive to interact
with a local or networked cluster of compute nodes, regardless of the
details of how the cluster is implemented. This ease of interactive use has
helped IPython and Python to become a popular tool for scientific
computation and data science across a wide variety of disciplines. For just
one recent example, this paper [PDF], presenting research on a problem at the intersection of machine learning and biochemistry, benefited from the ease of use of ipyparallel; it includes a section discussing the advantages of the system.
Sometimes an idea is more easily expressed, or a calculation will run faster, in another language. Multilingual computing used to require elaborate interfaces to use multiple compilers or working in separate interpreters. The ability to enhance the fluid, exploratory nature of computing that IPython and Jupyter already enable, by allowing the user to code in different languages at will, enables a genuinely new way to interact with a computer.
IPython and Jupyter are more than just interfaces to an interpreter. The enhancements to the computing experience described in this article are fairly recent, but are not merely curiosities for software developers—they are already being put to use by scientists and engineers in applications. Tools such as these are levers for creativity; what that helps to bring forth in the future will be interesting to see.
Unplugging old batteries
Python is famous for being a "batteries included" language—its standard library provides a versatile set of modules with the language—but there may be times when some of those batteries have reached their end of life. At the 2018 Python Language Summit, Christian Heimes wanted to suggest a few batteries that may have outlived their usefulness and to discuss how the process of retiring standard library modules should work.
The "batteries included" phrase for Python came from the now-withdrawn PEP 206 in 2006. That PEP argued that having a rich standard library was an advantage for the language since users did not need to download lots of other modules to get real work done. That argument still holds, but there are some modules that are showing their age and should, perhaps, be unplugged and retired from the standard library.
For example, Heimes listed several different obsolete modules. He included the uu module, which implements encoding and decoding for uuencoded files. That standard is from 1980 and long predates MIME, he said. Three separate ancient media format libraries were also on the list: chunk (for IFF data), aifc (an Amiga audio format), and sunau (Sun's au audio format). The final module in this list was nis, which implements Network Information Service (NIS, also known as "yellow pages"); its successor, NIS+, has been around since 1992, he said. No one spoke up to oppose retiring those modules.
The motivation to retire these (and other) modules is to reduce the cruft in the standard library. That will lead to a leaner standard library so there isn't such a "huge long list" of modules that greets new developers. It will also reduce the maintenance burden. Beyond that, there are almost certainly security flaws that exist in some of these older, largely unloved and undeveloped modules.
He has started drafting a PEP, which is about 80% done, but there are still open questions. There are also quite a number of debatable modules that might be considered for retirement, such as sndhdr and imghdr, which try to determine the type of a sound or image file but are woefully out of date. Several web modules had fixes and enhancements proposed for Python 2.1 in PEP 222, but that never happened. The old import library imp has been deprecated since 3.4; Brett Cannon said that it would be moving out of the standard library in 2020 when the Python 2.x series reaches its end of life. And so on.
Even if the list of modules to retire was agreed upon, there are still questions about how that process should work. Would the modules simply be removed with the hope that someone would pick them up and maintain them in the Python Package Index (PyPI)? Or would PyPI modules be created for them? Would they live in a single "dead battery" namespace or each using the existing module name? PyPI does not allow standard library module names for submitted modules, so it should be possible to use the existing names in PyPI if that is deemed desirable.
Ned Deily was not sure that a "ten minute discussion" at the summit was the right way to decide which modules to remove. He suggested that the process needed more visibility throughout the community, perhaps via a poll. There are probably uses for these modules that attendees are completely unaware of, he said. Posting the PEP will help raise the visibility, an attendee said; it will get discussed more widely at that point.
Deferring seccomp decisions to user space
There has been a lot of work in recent years to use BPF to push policy decisions into the kernel. But sometimes, it seems, what is really wanted is a way for a BPF program to punt a decision back to user space. That is the objective behind this patch set giving the secure computing (seccomp) mechanism a way to pass complex decisions to a user-space helper program.

Seccomp, in its most flexible mode, allows user space to load a BPF program (still "classic" BPF, not the newer "extended" BPF) that has the opportunity to review every system call made by the controlled process. This program can choose to allow a call to proceed, or it can intervene by forcing a failure return or the immediate death of the process. These seccomp filters are known to be challenging to write for a number of reasons, even when the desired policy is simple.
Tycho Andersen, the author of the "seccomp trap to user space" patch set, sees a number of situations where the current mechanism falls short. His scenarios include allowing a container to load modules, create device nodes, or mount filesystems — with rigid controls applied. For example, creation of a /dev/null device would be allowed, but new block devices (or almost anything else) would not. Policies to allow this kind of action can be complex and site-specific; they are not something that would be easily implemented in a BPF program. But it might be possible to write something in user space that could handle decisions like these.
To enable this, Andersen's patch set adds a new return type for BPF programs (SECCOMP_RET_USER_NOTIF) that will cause the program making the call to be blocked while information about the call is sent to user space. A controlling program wanting to receive these notifications (and make decisions) must open a file descriptor by setting the SECCOMP_FILTER_FLAG_GET_LISTENER flag when loading the filter program. The returned file descriptor can then be polled for events; reading from it will return the next available notification signaled by the BPF filter.
Notifications, when read, are encoded in this structure:
    struct seccomp_notif {
        __u64 id;
        pid_t pid;
        struct seccomp_data data;
    };
The returned id is a unique number identifying this event, pid is the ID of the process that triggered the notification, and data is the seccomp_data structure that was given to the BPF program describing the system call in progress:
    struct seccomp_data {
        int nr;                      /* System call number */
        __u32 arch;                  /* AUDIT_ARCH_* value (see <linux/audit.h>) */
        __u64 instruction_pointer;   /* CPU instruction pointer */
        __u64 args[6];               /* Up to 6 system call arguments */
    };
The user-space program can then meditate on whatever it is that the controlled program wishes to do. Note that the behavior of user notifications is similar to SECCOMP_RET_ERRNO, in that the system call itself will not be invoked in the context of the controlled process. So if the controlling process wants the system call to run in some form, it must do the work in its own context. When it has reached a decision (and done any needed work), it communicates that back to the kernel by filling in a seccomp_notif_resp structure and writing it back to the notification file descriptor:
    struct seccomp_notif_resp {
        __u64 id;
        __s32 error;
        __s64 val;
    };
The id value must match that found in the original notification. error should be either zero or a negative error code; in the latter case, it will be negated and used as an error return from the system call that created the notification in the first place. If error is zero, then that system call will return successfully with val as its return value.
As a somewhat experimental addition, the final patch in the series adds two fields to the seccomp_notif_resp structure:
    __u8 return_fd;
    __u32 fd;
These fields allow the control program to provide a file descriptor to be used as the return value from the system call; if return_fd is nonzero, fd will be passed to the controlled program. As Andersen notes, this mechanism will only work for system calls that are expected to return a file descriptor in the first place, but it's a starting point.
The protocol for the communication between the kernel and the control program has been the topic of some discussion in the past; in its current form, it will be difficult to extend when new features are (inevitably) added. Reviewers in the past have suggested using the netlink protocol instead, but that involves more complexity than the current implementation. Whether those reviewers will insist on that change before this code can be merged remains to be seen.
Overall, this patch series is another step in an interesting set of changes that has been taking place. The boundary between the kernel and user space was once a hard and well-defined line described by the system-call interface. Increasingly, developers are working to make it possible for users to move functionality across that line in both directions, both putting policy into the kernel with BPF programs or moving it out with various types of user-space helpers. As the computing environment changes, it seems that this flexibility will be needed to ensure that Linux stays relevant.
Statistics from the 4.17 kernel development cycle
The 4.17 kernel appears to be on track for a June 3 release, barring an unlikely last-minute surprise. So the time has come for the usual look at some development statistics for this cycle. While 4.17 is a normal cycle for the most part, it does have one characteristic of note: it is the third kernel release ever to be smaller (in terms of lines of code) than its predecessor.

The 4.17 kernel, as of just after 4.17-rc7, has brought in 13,453 non-merge changesets from 1,696 developers. Of those developers, 256 made their first contribution to the kernel in this cycle; that is the smallest number of first-time developers since 4.8 (which had 237). The changeset count is nearly equal to 4.16 (which had 13,630), but the developer count is down from the 1,774 seen in the previous cycle.
Those developers added 690,000 lines of code, but removed 869,000, for a net reduction of nearly 180,000 lines. The main reason for the reduced line count, of course, is the removal of eight unused architectures. It's worth noting that, even with that much code hacked out, 4.17 will still be a little bit larger than 4.15.
The most active developers this time around were:
Most active 4.17 developers
By changesets
  Kuninori Morimoto           245   1.8%
  Kirill Tkhai                160   1.2%
  Arnd Bergmann               148   1.1%
  Chris Wilson                147   1.1%
  Colin Ian King              133   1.0%
  Alexandre Belloni           124   0.9%
  Rex Zhu                     122   0.9%
  Dominik Brodowski           119   0.9%
  Christian König             119   0.9%
  Mauro Carvalho Chehab       106   0.8%
  Ajay Singh                  102   0.8%
  Ville Syrjälä               100   0.7%
  Arnaldo Carvalho de Melo     99   0.7%
  Geert Uytterhoeven           94   0.7%
  Hans de Goede                86   0.6%
  Masahiro Yamada              83   0.6%
  Eric Dumazet                 77   0.6%
  Gustavo A. R. Silva          72   0.5%
  Fabio Estevam                72   0.5%
  Linus Walleij                71   0.5%
By changed lines
  Arnd Bergmann            315103  22.9%
  Jesper Nilsson           100033   7.3%
  Greg Kroah-Hartman        81362   5.9%
  Feifei Xu                 52509   3.8%
  David Howells             40705   3.0%
  Tom St Denis              32968   2.4%
  James Hogan               31998   2.3%
  Anirudh Venkataramanan    18937   1.4%
  Kuninori Morimoto         16175   1.2%
  Corentin Labbe            15265   1.1%
  John Crispin              13188   1.0%
  Yasunari Takiguchi        12983   0.9%
  Gilad Ben-Yossef          12426   0.9%
  Greentime Hu              11690   0.8%
  Rex Zhu                   11458   0.8%
  Erik Schmauss             10980   0.8%
  Jacopo Mondi              10842   0.8%
  Harry Wentland            10198   0.7%
  Simon Horman               9179   0.7%
  Eric Biggers               8626   0.6%
Kuninori Morimoto contributed 245 patches, almost all concerned with a large renaming effort taking place in the ALSA sound driver subsystem. Kirill Tkhai did a lot of work to increase parallelism in the network stack, Arnd Bergmann removed most of the old architecture code and did a lot of other cleanup (and year-2038) work throughout the kernel, Chris Wilson did a lot of work on the Intel i915 graphics driver, and Colin Ian King contributed a set of cleanup and typo-fixing patches.
The lines-changed column is dominated by Bergmann and Jesper Nilsson (who removed the Cris architecture). Greg Kroah-Hartman deleted a bunch of staging code (including the venerable IRDA infrared driver stack), Feifei Xu added more AMD GPU definitions, and David Howells removed the mn10300 architecture and did a bunch of filesystem-level work.
Work on 4.17 was supported by 241 companies that we were able to identify; the most active of those were:
Most active 4.17 employers
By changesets
  Intel                      1392  10.3%
  (None)                      977   7.3%
  Red Hat                     870   6.5%
  (Unknown)                   756   5.6%
  AMD                         754   5.6%
  IBM                         564   4.2%
  Renesas Electronics         559   4.2%
  Linaro                      527   3.9%
                              448   3.3%
  Mellanox                    405   3.0%
  SUSE                        400   3.0%
  Bootlin                     330   2.5%
  Samsung                     268   2.0%
  Oracle                      267   2.0%
  Huawei Technologies         244   1.8%
  Odin                        232   1.7%
  ARM                         222   1.7%
  (Consultant)                201   1.5%
  Canonical                   188   1.4%
  Code Aurora Forum           181   1.3%
By lines changed
  Linaro                    338103  24.6%
  AMD                       138729  10.1%
  Axis Communications       100396   7.3%
  Intel                      84613   6.2%
  Linux Foundation           81678   5.9%
  Red Hat                    71152   5.2%
  Renesas Electronics        42565   3.1%
  (None)                     35960   2.6%
  Imagination Technologies   32000   2.3%
  IBM                        25841   1.9%
  ARM                        23906   1.7%
  (Unknown)                  22646   1.6%
                             21390   1.6%
  BayLibre                   20931   1.5%
  Mellanox                   19081   1.4%
  Bootlin                    16256   1.2%
  (Consultant)               15353   1.1%
  Sony                       14029   1.0%
  Fon                        13188   1.0%
  Samsung                    12823   0.9%
As usual, there are few surprises here.
The Reviewed-by tag was created to credit those who review code prior to its merging into the kernel. The actual use of that tag is sporadic at best, making it a poor guide to who is actually performing code review. But it still can be worth a look (and people complain when we don't post it). So here is a list of the top credited reviewers, alongside the counts of non-author signoffs (which are also an indicator of patch review):
Most active 4.17 code reviewers
Reviewed-by tags
  Alex Deucher           213   4.5%
  Rob Herring            192   4.1%
  Tony Cheng             123   2.6%
  Geert Uytterhoeven     108   2.3%
  Andrew Morton          102   2.2%
  Andy Shevchenko         94   2.0%
  Christian König         83   1.8%
  Chris Wilson            69   1.5%
  Daniel Vetter           64   1.4%
  Laurent Pinchart        57   1.2%
  Sebastian Reichel       56   1.2%
  Harry Wentland          56   1.2%
  Johannes Thumshirn      55   1.2%
  Hannes Reinecke         55   1.2%
  Christoph Hellwig       53   1.1%
  Guenter Roeck           51   1.1%
  Simon Horman            48   1.0%
  Darrick J. Wong         45   1.0%
  David Sterba            43   0.9%
  Ido Schimmel            42   0.9%
Non-author signoffs
  David S. Miller           1378  10.9%
  Greg Kroah-Hartman         876   7.0%
  Alex Deucher               640   5.1%
  Mark Brown                 537   4.3%
  Mauro Carvalho Chehab      390   3.1%
  Andrew Morton              335   2.7%
  Ingo Molnar                318   2.5%
  Arnaldo Carvalho de Melo   213   1.7%
  Michael Ellerman           210   1.7%
  Herbert Xu                 209   1.7%
  Jens Axboe                 201   1.6%
  Martin K. Petersen         198   1.6%
  Kalle Valo                 155   1.2%
  Thomas Gleixner            153   1.2%
  David Sterba               151   1.2%
  Jason Gunthorpe            134   1.1%
  Jeff Kirsher               133   1.1%
  Simon Horman               128   1.0%
  Doug Ledford               123   1.0%
  Shawn Guo                  121   1.0%
Unlike many maintainers, Alex Deucher applies a Reviewed-by tag to many patches that he applies to his repository, causing him to show up in both columns. Rob Herring has reviewed a wide range of patches centered mostly around device-tree bindings and related issues; these patches were generally applied by somebody else. Geert Uytterhoeven reviews patches from a fair variety of authors, but he is not normally the maintainer who applies them.
Andrew Morton reviews far more code than he ever gets credit for. Until recently, though, that activity has not been reflected by Reviewed-by tags: he supplied exactly one in 2008, 14 in 2009, one in 2012, and one in 2015. That changed in January of this year when he started adding Reviewed-by tags to many of the patches he applies to his own tree; this is part of a broader effort to ensure that all memory-management patches are reviewed. Morton understands what it means to truly review a patch, so each of those tags certainly indicates a real amount of work.
Tony Cheng is an interesting and potentially different case. He is an employee of AMD and, seemingly without exception, his Reviewed-by tags are applied to patches from other AMD developers, and the reviews themselves do not appear on the public mailing lists. He also applies Reviewed-by tags to his own patches, which are relatively small and few in number (example). Reviewed-by tags from people working at the same company as the patch author are often looked at with suspicion by other developers, especially when the reviews happen behind closed doors. In truth, in-house reviews can be among the most rigorous and demanding — or they can be a rubber stamp. Either way, though, applying Reviewed-by tags to one's own patches is not how things are usually done.
The signoff column, of course, shows which maintainers have been accepting the most patches. It does not guarantee that the maintainer has reviewed all of those patches before applying them, though maintainers should ensure that somebody has reviewed them. In any case, there is certainly some amount of work implied by having signed off on a lot of patches.
Use of Reviewed-by tags would appear to be increasing; over time, that may help to bring the amount of review work in the kernel into clearer focus. For now, though, its use remains both spotty and inconsistent; it's better than no data at all, but it's not even close to a complete picture. Overall, these numbers, like all in this article, are far from perfect metrics about who is really doing the work to keep the kernel project going.
One thing that is clear from these numbers is that the kernel remains a busy place — one of the busiest software-development projects on the planet. It seems unlikely that things will slow down anytime soon.
Will staging lose its Lustre?
The kernel's staging tree is meant to be a path by which substandard code can attract increased developer attention, be improved, and eventually find its way into the mainline kernel. Not every module graduates from staging; some are simply removed after it becomes clear that nobody cares about them. It is rare, though, for a project that is actively developed and widely used to be removed from the staging tree, but that may be about to happen with the Lustre filesystem.

The staging tree was created almost exactly ten years ago as a response to the ongoing problem of out-of-tree drivers that had many users but which lacked the code quality to get into the kernel. By giving such code a toehold, it was hoped, the staging tree would help it to mature more quickly; in the process, it would also provide a relatively safe place for aspiring kernel developers to get their hands dirty fixing up the code. By some measures, staging has been a great success: it has seen nearly 50,000 commits contributed by a large community of developers, and a number of drivers have, indeed, shaped up and moved into the mainline. The "ccree" TrustZone CryptoCell driver graduated from staging in 4.17, for example, and the visorbus driver moved to the mainline in 4.16.
Other code has been less fortunate, though. The gdm72xx, dgap, and olpc_dcon drivers were all deleted in 4.6 due to a lack of interest, and a whole set of RDMA drivers was deleted in 4.5. The COMEDI driver set has received over 8,500 changes since it entered the staging tree, but has still not managed to graduate; it has seen less than 100 patches in the last year. Placement in the staging tree is clearly not a guarantee that a driver will improve enough to move into the mainline.
Then there is the Lustre filesystem, which was added to the staging tree just over five years ago for the 3.11 release. Lustre has a rather longer history than that, though; it was started by the prolific Peter Braam in 1999. It was eventually picked up by Sun Microsystems, then suffered death by Oracle in 2010. In more recent times, its development has been managed by OpenSFS; it seems to have a strong following in industries needing a high-end distributed filesystem for high-performance computing applications.
As of 4.17, there have been 3,778 patches applied to the Lustre filesystem in the staging tree. A full 33% of those have come from Intel employees, and 11% from Outreachy interns. But this work has not yet managed to make Lustre ready to move out of the staging tree, and the associated TODO file remains long. It's not clear when Lustre will be brought into shape.
Indeed, it may never happen. Greg Kroah-Hartman, the maintainer of the staging tree, is now pushing to remove Lustre outright.
Removal from the mainline would, Kroah-Hartman said, allow it to proceed forward at full speed; the project could then return once its code-quality issues have been addressed.
One of the obvious problems with Lustre is its sheer size; at just under 200,000 lines of code, it's not something that is going to be cleaned up quickly. With that size comes quite a bit of complexity; highly scalable distributed filesystems are not simple, and beginning developers cannot really be expected to make substantive changes to them.
But the other problem, according to Kroah-Hartman, is that development of
Lustre is not actually happening in the staging tree. Instead, the Lustre
project maintains its own external tree and makes regular releases outside
of the mainline cycle. The 2.11.0 release, for
example, came out in early April and added a number of new features. Some
of the work done in the Lustre repository is sporadically brought over to
the copy in the staging tree, but that tree is clearly not the focus of
development. As Kroah-Hartman commented: "This dual-tree development model has never worked, and the state of this codebase is proof of that".
Some developers (Christoph Hellwig, for example) applauded this move. Unsurprisingly, the Lustre developers are somewhat less enthusiastic. Andreas Dilger argued that, as a filesystem with thousands of users, its code should be in the mainline (though Kroah-Hartman countered that none of those users are running the staging version of the code) and that Lustre has improved considerably over the years. Neil Brown, who has contributed many improvements to Lustre, is also against its removal, fearing that it would never return afterward.
What will happen next is unclear. It may be that Kroah-Hartman's real purpose was to light a fire underneath the project and force some action rather than the actual deletion of the code. But there is little doubt that Lustre will eventually find itself staged out if the pace of improvement (and perhaps its development model in general) does not change. Staging is meant to be an entry point into the kernel, not a halfway house where code remains indefinitely.
A filesystem "change journal" and other topics
At the 2017 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Amir Goldstein presented his work on adding a superblock watch mechanism to provide a scalable way to notify applications of changes in a filesystem. At the 2018 edition of LSFMM, he was back to discuss adding NTFS-like change journals to the kernel in support of backup solutions of various sorts. As a second topic for the session, he also wanted to discuss doing more performance-regression testing for filesystems.
Goldstein said he is working on getting the superblock watch feature merged. It works well and is used in production by his employer, CTERA Networks, but there is a need to get information about filesystem changes even after a crash. Jan Kara suggested that what was wanted was an indication of which files had changed since the last time the filesystem changes were queried; Goldstein agreed.
NTFS has a change journal and he is working on something similar for his company's products that is based on overlayfs. Changes that are made to the filesystem go into the overlay. For his use case, he does not need high accuracy; false positives are not a serious problem, and it is sufficient to simply know that something may have changed. His application will then scan the directory to determine what, if anything, has actually changed.
Kent Overstreet asked about using filesystem sequence numbers, but Goldstein said that not all filesystems have them. He is going to continue with his overlayfs-based plan, but would rather see a generic API that could be used by others.
Dave Chinner wondered if the inode version (i_version) field could be used. Goldstein said that he wants to be able to query the filesystem to get all of the changes that have happened since a particular point in time. Josef Bacik said that Btrfs has that feature. Goldstein said that both Btrfs and dm-thin (thin provisioning) will provide a list of blocks that have changed.
Simply scanning the inode array at mount time to find updates since the last query would be easy, but could take a long time, Ted Ts'o said. Chinner said the real underlying problem is missed notifications—filesystem changes that the application is not notified about. If that problem is solved, there is no need to scan after a crash.
Goldstein is using a journal of the fsnotify event stream to reconstruct lost events in the event of a crash. But Chinner is worried about missing change notifications because the fsnotify event does not make it into the journal before the crash. Kara suggested a new fsnotify event that would indicate the intent to change; it would be journaled before the actual change. Since false positives are not a problem, if the actual change does not happen (and the fsnotify event is not actually generated), everything will still work.
Kara said that FreeBSD has a facility that provides something similar to the NTFS change journal. The API for that is already established and might provide inspiration for the Linux API. Goldstein said that he already has a way to solve his immediate problem; he has lots of ideas for additional features if he gets the time to work on them.
Filesystem performance regressions
Goldstein then shifted gears; he would like to see more filesystem performance-regression testing and wanted to discuss that. Bacik said that some performance tests have been merged into xfstests recently and asked for more. He has created a way to get fio data dumped in JSON that can be pulled into a SQLite database for doing comparisons.
Overstreet suggested that those tests should be run automatically; if it has to be done manually, it won't happen. But Bacik said he has been focused on just getting it running; he wondered if it would be more valuable to run performance tests every time or only when the developer wants to look at performance numbers.
Chinner said that the performance testing is really only meaningful for him when he is doing A/B testing. Otherwise, various runs of the test suite might have different debug settings (e.g. lockdep), so the results would not be comparable. In order for the runs to be meaningful, they have to be done in a controlled and consistent environment.
Al Viro wondered how much variability was being seen between test runs. In his testing he has seen lots of variability, which makes it even harder to compare the results between different kernel versions. The allowable variability before flagging a regression is defined in the tests, Bacik said; it is around 2% or so. Right now, the output from every run is stored in a database, but it is fairly rudimentary, he said.
Kara said that MMTests gathers similar kinds of data. He has found that averages are not particularly useful because the data is so noisy run to run, especially if the difference between the kernels is large. Average plus standard deviation is a reasonable starting point. He is not opposed to incorporating something simple into xfstests, but is concerned that more complex tests just make the run-to-run variability so high that it makes it hard to find where an observed regression is coming from.
Bacik said that Facebook rebases its kernels yearly and he would like to have a simple test to be sure that the performance hasn't gone down radically. He wants discrete tests that won't show a lot of variability. But he is not trying to catch small performance losses with these tests. He said that the tests that are there now are "better than nothing" and that nothing is what was there before. He asked again for more tests. He also asked that Ts'o and Chinner run the performance tests for ext4 and XFS, as he is doing for Btrfs.
The ZUFS zero-copy filesystem
At the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Boaz Harrosh presented his zero-copy user-mode filesystem (ZUFS). It is both a filesystem in its own right and a framework similar to FUSE for implementing filesystems in user space. It is geared toward extremely low latency and high performance, particularly for systems using persistent memory.
Harrosh began by saying that the idea behind his talk is to hopefully entice others into helping out with ZUFS. There are lots of "big iron machines" these days, some with extremely fast I/O paths (e.g. NVMe over fabrics with throughput higher than memory). "For some reason" there may be a need to run a filesystem in user space but the current interface is slow because "everyone is copy happy", he said.
Al Viro asked if Harrosh had looked at OrangeFS, which can share its pages with a user-space component. Harrosh said that he had worked with OrangeFS in the past, but that it has "nowhere near the performance" he is seeking.
He is focused on copies. If a system mainly uses NVDIMMs (i.e. persistent memory), the memory bandwidth should be used for storage. So ZUFS is "very strict that nothing is copied anywhere". Anything that can be accessed via a pointer to persistent memory will be; "even metadata is zero copy".
He showed some system diagrams of ZUFS that are similar to those in his 2017 Linux Plumbers Conference slides [.pptx]. There is one kernel component, the ZU Feeder (ZUF), that feeds remote procedure calls (RPCs) from applications to the ZU Server (ZUS), which lives in user space. ZUS can have various .so files linked to it that implement different filesystems using the framework; some of those might be proprietary. ZUF is released under the GPL, while ZUS is BSD licensed.
There are multiple ZUFS threads (ZTs), each with an affinity to a single core. Each ZT is dedicated to a particular application; there is no shared information between the ZTs and ZUS, so no locks are needed. A 4MB per-CPU zero-copy region (ZT-vma) is shared between the application and the ZT, so each CPU has its own area that can be used to communicate between the server and the application.
For a write operation, the application maps its buffers into its per-CPU ZT-vma and initiates the operation. ZT gets the pointer and length and does a memcpy() from the ZT-vma data to persistent memory. For a read, the application maps buffers to hold the data and a ZT fills them from the persistent memory. It supports multiple applications, with "not a lock in sight".
The in-kernel portion of ZUFS includes a ZUF-root, which is a mini-filesystem that allows the normal mount command to be used. The kernel will have knowledge of the filesystem types and mounts, but the filesystems are really mounted in user space. ZUS is a thin layer that implements VFS operations. It uses direct I/O by default, but can optionally use the page cache.
ZUFS is a zero-copy replacement for FUSE. It sacrifices some of the security of FUSE because it does not have a server per filesystem, but the API for ZUFS is simpler than FUSE. It also does not rely on copying, as FUSE does, of course.
He was looking for feedback, but a whirlwind tour of a new filesystem with a lot of differences from the usual fare may have been a bit overwhelming; there were not too many comments on any "big holes" that attendees saw. He said that there is a complete filesystem implementation at this point, only missing extended attributes (xattrs) and access-control lists (ACLs); it can run xfstests and is "pretty stable". It does, however, take some shortcuts; that means the server has a lot of ways to crash the kernel, which Viro called a "non-starter" in terms of getting it merged.
Flash storage topics
At the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Jaegeuk Kim described some current issues for flash storage, especially with regard to Android. Kim is the F2FS developer and maintainer, and the filesystem-track session was ostensibly about that filesystem. In the end, though, the talk did not focus on F2FS and instead ranged over a number of problem areas for Android flash storage.
He started by noting that Universal Flash Storage (UFS) devices have high read/write speeds, but can also have high latency for some operations. For example, ext4 will issue a discard command but a UFS device might take ten seconds to process it. That leads the user to think that Android is broken, he said.
UFS devices have a "huge garbage-collection overhead". When garbage collection is needed, the performance of even sequential writes drops way down. That needs to be avoided, so UFS must be periodically given some time to do its garbage collection. But power is a more important consideration, so hibernating the device is prioritized, which does not leave much time for the device to do its garbage collection.
Amir Goldstein suggested doing garbage collection when the device is charging; he thought that should provide a reasonable solution. Kim said that Android currently declares a ten-minute idle time at 2am that is used to defragment the filesystem. It could perhaps also be used for garbage collection.
The solution to the discard performance problem should be fairly straightforward, he said. A kernel thread (kthread) could be added to issue discards asynchronously during idle time. Candidate blocks could be added to a list that would be processed by the kthread. There is a race condition if the block gets reallocated, however.
Different UFS devices have different latencies for their cache-flush commands. Some vendors' devices have low latency but others have ten-second latencies for a single cache-flush command. Given that, it makes sense to batch cache-flush commands.
Filesystem encryption is mandatory for Android. It is present in ext4 and has also been added to F2FS. There is some hardware encryption code from Qualcomm that cannot be pushed upstream, however. Ted Ts'o said that it is "horrible code" that only works for ext4 ecryptfs or F2FS; no one has had time to clean it up for the mainline.
Kim would like to see the garbage collection on the device side get optimized. He would like to add a customized interface that can be called when it is time to do garbage collection. If the system can detect idle time, it can then initiate the garbage-collection process.
SQLite performance is another problem area. SQLite uses fsync() to ensure its data has gotten to storage. By default it uses a journal, so writes to the database end up requiring two writes and two fsync() calls (first for the journal and then to the final location). Two fsync() operations can be expensive and are not needed for F2FS because it is a copy-on-write filesystem. A feature has been added to SQLite to avoid one write and one fsync() by using F2FS atomic writes.
In order to reduce the latency of fsync() calls, he is looking at write barriers. He researched them and found that they had been removed long ago. Kent Overstreet said they were removed due to unclear semantics, especially for stacked filesystems. In that case, the stack would have to provide order guarantees for the BIOs all the way down the stack, which would be difficult to do and would defeat the purpose of some of the layers. Beyond that, it is impossible to test to make sure that has been done correctly.
But Kim said that the Android case would not involve device-mapper or other stacking, he is just trying to avoid the cache-flush command. Jan Kara suggested a new storage command, like "issue barrier", that would cause any I/O issued before the barrier to complete before any new I/O.
Page editor: Jonathan Corbet