Leading items

Welcome to the LWN.net Weekly Edition for November 16, 2017

This edition contains the following feature content:

KAISER: hiding the kernel from user space: a fundamental change to how user-space memory is laid out that is needed to prevent information leaks from the kernel.
ROCA: Return Of the Coppersmith Attack: an overview of the cryptographic failure that has compromised keys created by some devices.
The inherent fragility of seccomp(): the seccomp() mechanism is used to create sandboxes, but it is uniquely susceptible to breaking over software updates.
Block layer introduction part 2: the request layer: the second and final installment in Neil Brown's series on how the kernel's block layer works.
SciPy reaches 1.0: an important scientific-software project reaches a major milestone.

This week's edition also includes these inner pages:

Brief items: Brief news items from throughout the community.
Announcements: Newsletters, conferences, security updates, patches, and more.

Note that November 23 is the US Thanksgiving holiday. In keeping with tradition, there will be no LWN Weekly Edition that week because we will all have eaten too much food to think straight. The Weekly Edition will return on November 30.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

KAISER: hiding the kernel from user space

By Jonathan Corbet
November 15, 2017

Since the beginning, Linux has mapped the kernel's memory into the address space of every running process. There are solid performance reasons for doing this, and the processor's memory-management unit can ordinarily be trusted to prevent user space from accessing that memory. More recently, though, some more subtle security issues related to this mapping have come to light, leading to the rapid development of a new patch set that ends this longstanding practice for the x86 architecture.

Some address-space history

On 32-bit systems, the address-space layout for a running process dedicated the bottom 3GB (0x00000000 to 0xbfffffff) for user-space use and the top 1GB (0xc0000000 to 0xffffffff) for the kernel. Each process saw its own memory in the bottom 3GB, while the kernel-space mapping was the same for all. On an x86_64 system, the user-space virtual address space goes from zero to 0x7fffffffffff (the bottom 47 bits), while kernel-space mappings are scattered in the range above 0xffff880000000000. While user space can, in some sense, see the address space reserved for the kernel, it has no actual access to that memory.

This mapping scheme has caused problems in the past. On 32-bit systems, it limits the total size of a process's address space to 3GB, for example. The kernel-side problems are arguably worse, in that the kernel can only directly access a bit less than 1GB of physical memory; using more memory than that required the implementation of a complicated "high memory" mechanism. 32-Bit systems were never going to be great at using large amounts of memory (for a 20th-century value of "large"), but keeping the kernel mapped into user space made things worse.

Nonetheless, this mechanism persists for a simple reason: getting rid of it would make the system run considerably slower. Keeping the kernel permanently mapped eliminates the need to flush the processor's translation lookaside buffer (TLB) when switching between user and kernel space, and it allows the TLB entries for kernel space to never be flushed. Flushing the TLB is an expensive operation for a couple of reasons: having to go to the page tables to repopulate the TLB hurts, but the act of performing the flush itself is slow enough that it can be the biggest part of the cost.

Back in 2003, Ingo Molnar implemented a different mechanism, where user space and kernel space each got a full 4GB address space and the processor would switch between them on every context switch. The "4G/4G" mechanism solved problems for some users and was shipped by some distributors, but the associated performance cost ensured that it never found its way into the mainline kernel. Nobody has seriously proposed separating the two address spaces since.

Rethinking the shared address space

On contemporary 64-bit systems, the shared address space does not constrain the amount of virtual memory that can be addressed as it used to, but there is another problem that is related to security. An important technique for hardening the system is kernel address-space layout randomization (KASLR), which randomizes the placement of the kernel in the virtual address space at boot time. By denying an attacker the knowledge of where the kernel lives in memory, KASLR makes many types of attack considerably more difficult. As long as the actual location of the kernel does not leak to user space, attackers will be left groping in the dark.

The problem is that this information leaks in many ways. Many of those leaks date back to simpler days when kernel addresses were not sensitive information; it even turns out that your editor introduced one such leak in 2003. Nobody was worried about exposing that information at that time. More recently, a concerted effort has been made to close off the direct leaks from the kernel, but none of that will be of much benefit if the hardware itself reveals the kernel's location. And that would appear to be exactly what is happening.

This paper from Daniel Gruss et al. [PDF] cites a number of hardware-based attacks on KASLR. They use techniques like exploiting timing differences in fault handling, observing the behavior of prefetch instructions, or forcing faults using the Intel TSX (transactional memory) instructions. There are rumors circulating that other such channels exist but have not yet been disclosed. In all of these cases, the processor responds differently to a memory access attempt depending on whether the target address is mapped in the page tables, regardless of whether the running process can actually access that location. These differences can be used to find where the kernel has been placed — without making the kernel aware that an attack is underway.

Fixing information leaks in the hardware is difficult and, in any case, deployed systems are likely to remain vulnerable. But there is a viable defense against these information leaks: making the kernel's page tables entirely inaccessible to user space. In other words, it would seem that the practice of mapping the kernel into user space needs to end in the interest of hardening the system.

KAISER

The paper linked above provided an implementation of separated address spaces for the x86-64 kernel; the authors called it "KAISER", which evidently stands for "kernel address isolation to have side-channels efficiently removed". This implementation was not suitable for inclusion into the mainline, but it was picked up and heavily modified by Dave Hansen. The resulting patch set (still called "KAISER") is in its third revision and seems likely to find its way upstream in a relatively short period of time.

Whereas current systems have a single set of page tables for each process, KAISER implements two. One set is essentially unchanged; it includes both kernel-space and user-space addresses, but it is only used when the system is running in kernel mode. The second "shadow" page table contains a copy of all of the user-space mappings, but leaves out the kernel side. Instead, there is a minimal set of kernel-space mappings that provides the information needed to handle system calls and interrupts, but no more. Copying the page tables may sound inefficient, but the copying only happens at the top level of the page-table hierarchy, so the bulk of that data is shared between the two copies.

Whenever a process is running in user mode, the shadow page tables will be active. The bulk of the kernel's address space will thus be completely hidden from the process, defeating the known hardware-based attacks. Whenever the system needs to switch to kernel mode, in response to a system call, an exception, or an interrupt, for example, a switch to the other page tables will be made. The code that manages the return to user space must then make the shadow page tables active again.

The defense provided by KAISER is not complete, in that a small amount of kernel information must still be present to manage the switch back to kernel mode. In the patch description, Hansen wrote:

The minimal kernel page tables try to map only what is needed to enter/exit the kernel such as the entry/exit functions, interrupt descriptors (IDT) and the kernel trampoline stacks. This minimal set of data can still reveal the kernel's ASLR base address. But, this minimal kernel data is all trusted, which makes it harder to exploit than data in the kernel direct map which contains loads of user-controlled data.

While the patch does not mention it, one could imagine that, if the presence of the remaining information turns out to give away the game, it could probably be located separately from the rest of the kernel at its own randomized address.

The performance concerns that drove the use of a single set of page tables have not gone away, of course. More recent processors offer some help, though, in the form of process-context identifiers (PCIDs). These identifiers tag entries in the TLB; lookups in the TLB will only succeed if the associated PCID matches that of the thread running in the processor at the time. Use of PCIDs eliminates the need to flush the TLB at context switches; that reduces the cost of switching page tables during system calls considerably. Happily, the kernel got support for PCIDs during the 4.14 development cycle.

Even so, there will be a performance penalty to pay when KAISER is in use:

KAISER will affect performance for anything that does system calls or interrupts: everything. Just the new instructions (CR3 manipulation) add a few hundred cycles to a syscall or interrupt. Most workloads that we have run show single-digit regressions. 5% is a good round number for what is typical. The worst we have seen is a roughly 30% regression on a loopback networking test that did a ton of syscalls and context switches.

Not that long ago, a security-related patch with that kind of performance penalty would not have even been considered for mainline inclusion. Times have changed, though, and most developers have realized that a hardened kernel is no longer optional. Even so, there will be options to enable or disable KAISER, perhaps even at run time, for those who are unwilling to take the performance hit.

All told, KAISER has the look of a patch set that has been put onto the fast track. It emerged nearly fully formed and has immediately seen a lot of attention from a number of core kernel developers. Linus Torvalds is clearly in support of the idea, though he naturally has pointed out a number of things that, in his opinion, could be improved. Nobody has talked publicly about time frames for merging this code, but 4.15 might not be entirely out of the question.

Comments (34 posted)

ROCA: Return Of the Coppersmith Attack

November 14, 2017

This article was contributed by Antoine Beaupré

On October 30, 2017, a group of Czech researchers from Masaryk University presented the ROCA paper at the ACM CCS Conference, which earned the Real-World Impact Award. We briefly mentioned ROCA when it was first reported but haven't dug into details of the vulnerability yet. Because of its far-ranging impact, it seems important to review the vulnerability in light of the new results published recently.

ROCA and its impacts

As we all probably know, most modern cryptography is based on the fundamental assumption that finding the prime factors of a large number is much harder than generating that number from two large primes. For this assumption to hold, however, the prime numbers need to be randomly chosen, otherwise it becomes possible to guess those numbers more easily. The prime generation process can take time, especially on embedded devices like cryptographic tokens.

The ROCA vulnerability occurred because Infineon, a popular smartcard manufacturer, developed its own proprietary algorithm based on the fast prime technique. If used correctly, fast prime allows the creation of randomly distributed primes faster than traditional methods, which generally consist of generating random numbers and testing for primality. The ROCA paper shows that Infineon goofed on the implementation and the resulting primes were not evenly distributed. This opened the possibility of recovering private keys from public key material in a reasonable time frame.

Private key recovery is one of the worst security vulnerabilities (short of direct cleartext recovery) that can happen in asymmetric crypto systems. It turns out that Infineon is one of only three secure chip manufacturers and therefore its chips are everywhere: in cryptographic keycards (e.g. the popular Yubikey 4) but also Trusted Platform Modules (TPMs) that live in most modern computers; even some official identity cards, which are used in everything from banking to voting, have Infineon chips. So the impact of this vulnerability is broad: medical records, voting, OpenPGP signatures and encryption, Digital Rights Management (DRM), full disk encryption, and secure boot; all become vulnerable if the wrong keycard generated the private key material.

Hacking the Estonian elections

Let's take an extreme example of identity theft to illustrate the severity of the ROCA vulnerability. Estonia used Infineon chips in its state-issued identity cards. Because those cards are used in its electronic voting system, it was speculated that votes could be forged in an election. Indeed, there was a parliamentary election in 2015, at a time when vulnerable cards were in the wild.

The Estonian government claims that leveraging the attack to commit electoral fraud in Estonia is "complicated and not cheap". This seems to rely on an unnamed expert as saying it would "cost approximately €60 billion" to mount such an attack, a number found in this article published by the Estonian state media (ERR). Indeed, estimates ~~from the ROCA paper~~ show that cracking a single key would cost €80,000 in cloud computing costs. Since then, however, the vulnerability was also reviewed by cryptographers Daniel J. Bernstein and Tanja Lange, who found that it was possible to improve the performance of the attack: they stopped after a ~~four-fold~~ 25% improvement, but suspect even further speed-ups are possible. So let's see what effect those numbers would have in an election now.

There are 750,000 vulnerable cards, according to ERR, out of the 1.3 million cards currently in circulation, according to National Electoral Committee. There were around 900,000 eligible voters in the 2015 parliamentary elections, about 600,000 of those voted and about 30% of voters cast their votes electronically. So, assuming the distribution of compromised cards is uniform among voters, we can use the percentage of compromised cards (roughly 60%) to estimate that vulnerable cards could have been used in about 17% of the total votes.

The 2015 election was pretty close: the Estonian Reform Party beat its closest rival, the Estonian Centre Party, by only 5%. So, it could have been possible to affect the result of that election, even without compromising all the cards. Bernstein and Lange were actually generous when they said that "'large-scale vote fraud' does not require breaking all of the ID cards; 10% already buys massive influence".

In fact, according to their numbers, the €80,000 required to break one card can be reduced by a factor of four thanks to ~~their~~ various performance improvements and by another factor of ten using dedicated hardware, which means we can expect targeted attacks to be 40 times cheaper than the original paper, bringing the cost of one vote down to around €2000. An Ars Technica article quoted another researcher as saying: "I'm not sure whether someone can slash the cost of one key below $1,000 as of today, but I certainly see it as a possibility."

So, being generous with the exchange rates for the sake of convenience, we could use a price per vote range of about €1,000-2,000. This means that buying the 5% of the votes necessary to win that 2015 election (or around 30,000 votes) would cost €30-60 million, which is a much lower number than originally announced and definitely affordable for a hostile foreign-state actor.

Now, I am not saying that this happened during the 2015 Estonian election. I am merely pointing out that the resources required to buy votes (one way or another) are much cheaper than previously thought. And while the Electoral Committee released the server-side source code in 2013, that wasn't where the ROCA problem lies. The vulnerability, this time, is client-side: the identity cards based on proprietary hardware. Fortunately, the Estonian authorities took action to block the Infineon cards on November 3 and issue new certificates for the vulnerable cards in a state-wide replacement program. So far, there is no evidence of foul play or identity theft, according to state officials.

The attack and disclosure

ROCA is an acronym for "Return Of the Coppersmith Attack" which, in turn, refers to a class of attacks on RSA that uses some knowledge about the secret key material that allows the key to be guessed in less than brute-force time. The actual mathematical details are available in the full paper [PDF] and go beyond my modest mathematical knowledge, but it certainly seems like Infineon took shortcuts in designing its random number generator. What the Czech researchers found is that the primes generated by the Infineon algorithm had a specific structure that made them easier to guess than randomly distributed primes. Indeed, all primes generated by RSALib, the library Infineon created for those chips, followed this pattern:

    p = k ∗ M + (65537^a mod M)

Here p is one of the secret primes used to construct the public key. The secrets in the equation are k and a, while M varies depending with the chosen key size. The problem is that M is disproportionately large, close enough to the size of p so that k and a are too small. And this is where the Coppersmith method comes in: it relies on small integers being part of the equation generating the prime. It is also where I get lost in the math; interested readers can read the papers to find out more.

Something interesting here is that Bernstein and Lange were able to reproduce the results before the paper was publicly released, using only information disclosed as part of the public advisory. As such, they raise the interesting question of whether delaying publication of the actual paper helped in containing the security vulnerability, since they were able to construct an attack within about a week of part-time work. They outlined that publicizing the keyword "Coppersmith" in the title of the research paper (which was public at the time of disclosure) was especially useful in confirming they were on the right path during their research. In that sense, Bernstein and Lange imply that delaying the publication of the paper had no benefit to the security community, and may have actually benefited attackers in a critical period, while impeding the work of honest security researchers.

They also questioned the unusually long delay given to Infineon (eight months) before disclosure. Usually, "responsible disclosure" ranges from a few days to several months. For example, Google's Project Zero team gives projects 90 days to fix problems before it publicizes the results. The long delay was presumably given to provide ample time for companies and governments to organize mitigation measures. Yet, when the paper was finally released, some companies (e.g. Gemalto) still weren't clear if their products were affected. According to Ars Technica, Gemalto's customers, including Microsoft, which uses the products for two-factor authentication, are now still left wondering if their products are secure. Similarly, Estonia recently scrambled to suspend all affected cards, earlier than originally planned, even though they were notified of the vulnerability back in August. It seems the long delay didn't work: stakeholders did not proceed with those changes promptly and thousands if not millions of devices are still vulnerable at the time of writing.

Part of the problem is that a lot of those devices simply cannot be upgraded in the field. For example, Yubico forbids firmware changes to the Yubikey 4. So the company had to set up a free replacement program for the popular keycards. TPM modules, fortunately, are often flashable, but those need to be done through a manufacturer update, which may be hard to arrange. Most often, fixes are just not deployed at all, because supply chains make it difficult to find the original manufacturer that can provide fixes. In fact, the ROCA paper described the impact evaluation as "far from straightforward", partly because secure hardware is often designed as an opaque system that is deliberately hard to evaluate. In particular, the researchers stated that "prevalence of the RSALib in the area of programmable smartcards is notoriously difficult to estimate". The researchers further explained that we may deal with the fallout of ROCA for a long time:

We expect to see the cards for a rather long time (several years) before all the vulnerable cards are eventually sold out, especially when dealing with low volume markets. The buyers should check the cards for the presence of fingerprinted keys and/or opt for longer key lengths if they are supported by the card hardware.

The ROCA paper also noted that medical records identification cards and electronic payment cards (EMV) may be vulnerable to this issue as well, although they were not aware of publicly available sources for the public keys for those devices—a necessary condition to mount an attack.

Yet the researchers are confident their disclosure led to good results. In an email interview, Petr Svenda said:

The main driver for us to publicly announce the vulnerability including details is to force companies to really act. This hopefully happened (new firmware for TPMs, updates to Bitlocker, revocation of eID keys...)

Svenda also believes the security vulnerability is "more likely to be unintentional" because the prime structure leak was "so obvious" when compared to other cryptographic vulnerabilities like Dual_EC_DRBG. He pointed to some of the recommendations from the paper:

Don't keep the design secret and have verifiable implementation
Use some kind of secure multiparty computation to remove single points of failure

The latter idea is novel for me, but was proposed in academic literature back in 1997. The idea is to not rely solely on one algorithm or device to generate primes but collaboratively generate them among multiple parties. This makes vulnerabilities like ROCA much less likely because the same mistake would need to happen among all the components for the attack to be effective.

Obviously, you can also generate private keys on another computer and move them to the keycard later. But, as the Debian OpenSSL debacle showed, desktop computers are not immune to this type of vulnerability either.

The ROCA paper highlights the "dangers of keeping the design secret and the implementation closed-source, even if both are thoroughly analyzed and certified by experts". They were also critical of the certification process which "counter-intuitively 'rewards' the secrecy of design by additional certification 'points' when an implementation is difficult for potential attackers to obtain - thus favoring security by obscurity."

The paper concludes by stating that "relevant certification bodies might want to reconsider such an approach in favor of open implementations and specifications." This is a similar conclusion to the one I reached in my comparison of cryptographic tokens: we should be able to design secure, yet open, hardware and hopefully these kinds of vulnerabilities will serve as a lesson for the future.

Comments (18 posted)

The inherent fragility of seccomp()

By Jonathan Corbet
November 10, 2017

Kernel developers have worried for years that tracepoints could lead to applications depending on obscure implementation details; the consequent need to preserve existing behavior to avoid causing regressions could end up impeding future development. A recent report shows that the seccomp() system call is also more prone to regressions than users may expect — but kernel developers are unlikely to cause these regressions and, indeed, have little ability to prevent them. Programs using seccomp() will have an inherently higher risk of breaking when software is updated.

seccomp() allows the establishment of a filter that will restrict the set of system calls available to a process. It has obvious uses related to sandboxing; if an application has no need for, say, the open() system call, blocking access to that call entirely can reduce the damage that can be caused if that application is compromised. As more systems and programs are subjected to hardening, the use of seccomp() can be expected to continue to increase.

Michael Kerrisk recently reported that upgrading to glibc 2.26 broke one of his demonstration applications. That program was using seccomp() to block access to the open() system call. The problem that he ran into comes down to the fact that applications almost never invoke system calls directly; instead, they call wrappers that have been defined by the C library.

The glibc open() wrapper has, since the beginning, been a wrapper around the kernel's open() system call. But open() is an old interface that has long been superseded by openat(). The older call still exists because applications expect it to be there, but it is implemented as a special case of openat() within the kernel itself. In glibc 2.26, the open() wrapper was changed to call openat() instead. This change was not visible to ordinary applications, but it will break seccomp() filters that behave differently for open() and openat().

Kerrisk was not really complaining about the change, but he did want to inform the glibc developers that there were user-visible effects from it: "I want to raise awareness that these sorts of changes have the potential to possibly cause breakages for some code using seccomp, and note that I think such changes should not be made lightly or gratuitously". The developers should, he was suggesting, keep the possibility of breaking seccomp() filters in mind when making changes, and they should document such changes when they cannot be avoided.

Florian Weimer, however, disagreed:

I have the opposite view: We should make such changes as often as possible, to remind people that seccomp filters (and certain SELinux and AppArmor policies) are incompatible with the GNU/Linux model, where everything is developed separately and not maintained within a single source tree (unlike say OpenBSD). This means that you really can't deviate from the upstream Linux userspace ABI (in the broadest possible sense) and still expect things to work.

Another way of putting this might be: seccomp() filters are not considered to be a part of the ABI that is provided by glibc, so incompatible changes there are not considered regressions. They are, instead, a consequence of filtering below the glibc level while expecting behavior above that level to remain unchanged.

Weimer's point of view would appear to be the one that will govern glibc development going forward. So Kerrisk has proposed some man-page changes to make the fragility of seccomp() filters a bit less surprising to developers. Playing the game at this level will require a fairly deep understanding of what is going on and the ability to adapt to future C-library changes.

This outcome could be seen as an argument in favor of a filtering interface like OpenBSD's pledge(). Like seccomp(), pledge() is used to limit the set of system calls available to a process, but pledge() is defined in terms of broad swathes of functionality rather than working at the level of individual system calls. It can be used to allow basic file I/O, for example, while disabling the opening (or creation) of new files. pledge() is far less expressive than seccomp() and cannot implement anything close to the same range of policies but, for basic filtering, it seems far less likely to generate surprises after a kernel or library update.

But Linux doesn't have pledge() and seems unlikely to get it. seccomp() can certainly get the sandboxing job done, but developers who use it should expect to spend some ongoing effort maintaining their filters.

Comments (30 posted)

Block layer introduction part 2: the request layer

November 9, 2017

This article was contributed by Neil Brown

The Linux block layer provides an upstream interface to filesystems and block-special devices allowing them to access a multitude of storage backends in a uniform manner. It also provides downstream interfaces to device drivers and driver-support frameworks that allow those drivers and frameworks to receive requests in a manner most suitable to each. Some drivers do not benefit from preliminary handling and just use the thin "bio layer" that we met previously. Other drivers benefit from some preprocessing that might detect batches of consecutive requests, may reorder requests based on various criteria, and which presents the requests as one or more well-defined streams. To service these drivers, there exists a section of the block layer that I refer to as the request layer.

There are currently two parallel parts to the request layer, sometimes described as "single-queue" and "multi-queue". The multi-queue part is relatively new (years, not decades) and may, one day, completely replace the single-queue version. That has not yet happened and both are in active use in current kernels. Having two different queuing approaches to examine provides opportunities to learn from the contrasts, so we will spend a little time looking at each of these two parts and how they present requests to the underlying frameworks or drivers. First, though, we will look at what is common between them, which is best seen by examining two data structures, struct request_queue and struct request.

The `request_queue` and the `request`

The request_queue and request structures are closely related to the gendisk and bio structures used in the bio layer: one represents a particular device, while the other represents an I/O request. While request structures are only used for devices that make use of the request layer, the request_queue structure is allocated for every gendisk. Some of the fields, such as parts of the struct queue_limits substructure, apply equally well to all block devices and arguably should live in the gendisk. Other fields only apply to the queue management provided by the request layer, and really only need to be allocated for devices using this layer. The current arrangement is simply a historical accident that isn't worth fixing and is, in any case, externally visible through the contents of the /sys/block/*/queue/ sysfs directories.

The request structure represents a single I/O request that will ultimately be delivered to the underlying device. It contains a list of one or more bios that must represent contiguous I/O operations, various fields to track its overall status (such as timestamps and the originating CPU), and several anchors so it can be included in larger data structures. It has a struct list_head queuelist so it can appear in a simple queue, a struct hlist_node hash so it can appear in a hash table (used to see if any outstanding requests are adjacent to a new bio), and a struct rb_node rb_node used to include the request in a red-black tree. When a request structure is allocated, some extra space is allocated at the end to be used by the underlying driver to store any additional per-request information. This space is sometimes used to store the command header that is sent to the underlying devices, such as a SCSI command descriptor block, but drivers are free to use it however they wish.

These requests are created by the appropriate make_request_fn() function (blk_queue_bio() for single-queue or blk_mq_make_request() for multi-queue) and handed to an I/O scheduler or "elevator", a name derived from the elevator algorithm that was once a cornerstone of disk I/O scheduling and is now, at most, a minor detail. We will take a quick look at the single-queue approach, and then see how that compares to the newer multi-queue mechanism.

Scheduling requests for a single queue

Traditionally, most storage devices were made up of a set of spinning circular platters with magnetic coating and a single head (or set of heads, one per platter) that moved along the radius of the spinning disk to read or change the magnetic polarization at any location on any platter. Such a device can only process a single request at a time, and has a substantial cost in moving from one location on the platters to another. The single-queue implementation started out aimed at driving this sort of device and, while it has broadened in scope over the years, its structure still reflects the needs of rotating storage devices.

The three key tasks for a single-queue scheduler are:

To collect multiple bios representing contiguous operations into a smaller number of requests that are large enough to make best use of the hardware but not so large that they exceed any limitations of the device.
To queue these requests in an order that minimizes seek time while not delaying important requests unduly. Providing an optimal solution to this problem is the source of all the complexity. As it is impossible to know, in general, how important each request is or how much seek time will be wasted, we rely on heuristics to choose a good order, and heuristics are never perfect.
To make these requests available to the underlying driver so it can pluck them off the queue when it is ready and to provide a mechanism for notification when those requests are complete.

The last task is fairly straightforward. A driver registers a request_fn() using blk_init_queue_node() and this function is called whenever there are new requests on the queue that can usefully be processed. The driver is responsible for collecting a request using blk_peek_request() and processing it. When that request finishes (or earlier if the device can handle multiple requests in parallel), the driver should collect another request and submit it, without waiting for further calls to the request_fn(). Completion of each request is reported by calling blk_finish_request().

The first task is partly handled by common code that performs a couple of quick checks in the queue to see if a request can easily be found to accept a new bio. On success, the scheduler will be asked to approve the merge; on failure, the scheduler will be given a chance to look for a merge itself. If no merge is possible, a new request is allocated and given to the scheduler. The scheduler may, at a later time, merge this request with another if either request ever grows large enough that they become contiguous.

The middle task is potentially more complex. Putting these requests in a suitable order depends a lot on how the word "suitable" is interpreted. The three different single-queue schedulers, noop, deadline, and cfq, use significantly different interpretations:

"noop" provides minimal sorting of requests, never allowing a read to be moved ahead of a write or vice-versa, but otherwise allowing one request to overtake another if, in line with the elevator algorithm, the I/O head is likely to reach the new one before the old one. Apart from this simple sorting, "noop" provides first-in-first-out queuing.
"deadline" collects batches of either read or write requests that were all submitted at close to the same time. Within a batch, requests are sorted and each batch is queued to the device either when it reaches the maximum size or when the expiry time is reached. This algorithm tries to put an upper limit on the delay imposed on any request while retaining the possibility of forming large batches.
"cfq" stands for "Complete Fairness Queueing"; this scheduler is substantially more complex than the others. It aims to provide fairness between different processes and, when configured with control groups, between different groups of processes as well. It maintains multiple queues internally; one for each process to hold the synchronous requests from the process (typically reads), and one for each priority level for the asynchronous requests (typically writes) with that priority. These queues each get a turn at submitting requests based on priority calculations. Each queue gets a time slice in which to submit a limited number of requests. When a synchronous queue has fewer requests than the limit, the device is allowed to go idle even though other queues might contain requests. Synchronous requests are often followed by other requests at a similar location, so keeping the device idle to wait for that next synchronous request can improve throughput.

This description barely scratches the surface of cfq. There is in-kernel documentation that provides more details and lists all the parameters that can be tuned to match different circumstances.

As already hinted, some devices can accept multiple requests at once, accepting new requests before earlier requests complete. This usually involves "tagging", adding a tag to each request so that completion notifications can be connected back to the appropriate request. The single-queue request layer contains support for keeping track of tags to whatever depth the device supports.

A device may internally support tagged commands by truly running some requests in parallel, such as by accessing an internal cache, by having multiple different components that each can handle a single request, or by providing its own internal queueing that works with more knowledge of the device internals than are available to the request layer. As the level of parallelism and the sophistication of the internal scheduling increases, the need for Linux to provide optimal scheduling for the request stream decreases. There will always be a need for some scheduling within Linux, to implement concepts such as control groups that the device hardware cannot know about, but when the device can do much of the scheduling itself, it makes sense for Linux to take a more hands-off approach. This is part of the motivation for the multi-queue side of the request layer.

Multiple queue and multiple CPUs

Another motivation for multi-queue support is that, as we get more and more processing cores in our systems, the locking overhead required to place requests from all cores into a single queue increases. The plugging infrastructure can help to some extent, but not as much as we would like. If we allocate a large number of queues: one per NUMA node or even one per CPU, then the locking overhead for putting a request onto the queue is substantially reduced. If the hardware allows multiple requests to be submitted in parallel from different CPUs, this is a clear net win. If the hardware only supports a single submission at a time, the multiple per-CPU queues still need to be merged. If they are merged in larger batches than the plugging functionality creates, this is again a net win. When there is no increase in batch size, careful coding should be able to ensure that, at least, there is no net loss.

As noted earlier, the cfq scheduler already contains multiple queues internally. These serve quite a different purpose than the queues provided by multi-queue. They associate requests with processes and priority levels whereas the multi-queue queues are tied to the details of the hardware. The multi-queue request layer maintains two groups of hardware-related queues: the software staging queues and the hardware dispatch queues.

Software staging queues (struct blk_mq_ctx) are allocated based on the CPU hardware, one per CPU or one per NUMA node. Requests are added to these queues, controlled by a spinlock that should mostly be uncontended, whenever the block I/O plug is unplugged (blk_mq_flush_plug_list()). These queues may optionally be managed by a separate multi-queue scheduler, of which there are three: bfq, kyber, and mq-deadline.

The hardware dispatch queues are allocated based on the target hardware, so there may be just one, or there may be as many as 2048 (or however many interrupt source the platform supports). Though the request layer does allocate a data structure (struct blk_mq_hw_ctx) for each hardware queue (or "hardware context"), and tracks the mapping from CPU to queue, the queue itself is the responsibility of the underlying driver. The request layer will, from time to time, pass requests together with a hardware context to the underlying driver. What happens next is up to the driver, but the expectation is that the driver will get the request to the hardware as quickly as possible, generally passing requests on in the order that they are received.

Another important difference between the multi-queue layer and the single-queue layer is that, for multi-queue, the request structures are all preallocated. Each request structure has an associated tag number that is unique among requests for that device, and that travels with the request all the way into the hardware — where supported — and back again. Allocating the tag number early achieves smoother transit for the request through lower layers once the request layer has decided to release it.

Rather than providing a single request_fn(), the underlying driver must provide a struct blk_mq_ops that lists up to eleven functions. Of these, the function of most interest here is queue_rq(). Other functions support timeouts, polling for completion, request initialization, and the like.

Once the configured scheduler decides that a request is ready and that it no longer wants to keep it on a queue to allow for re-ordering or extension, it will call the queue_rq() function. This design places the responsibility for handing off requests on the request layer itself, in contrast to single-queue where it was the responsibility of the driver to collect them. queue_rq() is given a hardware context and will normally place the request on some internal FIFO, or possibly handle it directly. The queue_rq() can refuse to accept a request by returning BLK_STS_RESOURCE, which causes the request to remain on the staging queue. Any return value other than BLK_STS_RESOURCE or BLK_STS_OK is treated as an I/O error.

Multi-queue scheduling

A multi-queue driver does not need to have a scheduler configured, in which case an approach similar to the "noop" single-queue scheduler is used. Consecutive bios are grouped into a single request, while non-consecutive bios each end up in their own request. These are queued in a simple FIFO order in the software staging queue, though when there are multiple submission queues, the default scheduler tries to submit new requests directly and only uses the staging queue after a BLK_STS_RESOURCE response. The software queues are then passed to the driver by calls to blk_mq_run_hw_queue() or blk_mq_delay_run_hw_queue(), which typically happens when the block device is unplugged.

The pluggable multi-queue schedulers have 22 entry points, two of which are insert_requests(), which should add a list of requests to the software staging queue, and dispatch_request(), which should choose a request to be passed to the given hardware queue. If insert_requests() is not provided, the requests are simply appended to the list. If dispatch_request() is not provided, requests will be collected from any staging queue that is configured to feed to this hardware queue, and these are passed down in arbitrary order. This "collect everything" step is the main point of cross-CPU locking and can hurt performance, so it is best if a device with a single hardware queue accepts all the requests it is given.

The mq-deadline scheduler provides much the same functionality as the single-queue deadline scheduler. It provides an insert_request() function that ignores the multiple staging queues and adds each request to one of two global, time-ordered queues — one for reads and one for writes. If a new request can be merged with an existing request that is done, else it is added to the end of the relevant queue. A dispatch_request() function is provided that returns the first of either of these queues, based on age, on batch size, and on not starving writes for too long.

The bfq scheduler is, to some extent, modeled after cfq, with the acronym standing for Budget Fair Queueing. There is some in-kernel documentation; it was covered in these pages a few years ago and further covered with the adaption to multi-queue more recently. bfq, much like mq-deadline, does not use multiple per-CPU staging queues. While it has multiple queues, they are accessed by all CPUs using a single spinlock.

Unlike mq-deadline and bfq, the kyber I/O scheduler, briefly discussed here earlier this year, does make use of the per-CPU (or per-NUMA-node) staging queues. It doesn't provide an insert_request() function but makes use of the default behavior. The dispatch_request() function maintains various internal queues on a per-hardware-context basis. If these queues are empty it will collect requests from all staging queues that map to the given hardware context, and will distribute them internally. When they are not empty, it delivers requests from the internal queues as appropriate. The intricacies of this strategy, how it distributes requests, and in what order they are processed, do not appear to be documented in the kernel.

End of the line for single-queue?

Having two different queuing systems, two sets of schedulers, and two driver interfaces is clearly not ideal. Can we expect to see the end of the single-queue code soon? How much better is multi-queue really?

Unfortunately these are questions that require lots of testing on hardware and this writer is mostly a software guy. From a software perspective, it is clear that, when used with hardware that supports parallel submission to multiple queues, multi-queue should bring significant benefits. When used with single-queue hardware, the multi-queue system should be able to, at least, achieve parity with single queue. It would not be reasonable to expect that parity to be achieved overnight though, as anything new must be expected to be imperfect.

An example of this imperfection is a set of patches that was recently accepted to be included in Linux 4.15. The patches make changes to the mq-deadline scheduler to address some performance issues that Red Hat's internal storage testing discovered. It is reasonable to expect that other players in the storage field are performing their own tests, and are likely to find regressions when switching to multi-queue. It is also reasonable to expect these regressions to be fixed over the coming months. 2017 might not be the year of the multi-queue-only Linux, but that year is not likely to be far away.

Comments (6 posted)

SciPy reaches 1.0

November 15, 2017

This article was contributed by Lee Phillips

After 16 years of evolution, the SciPy project has reached version 1.0. SciPy, a free-software project, has become one of the most popular computational toolkits for scientists from a wide range of disciplines, and is largely responsible for the ascendancy of Python in many areas of scientific research. While the 1.0 release is significant, much of the underlying software has been stable for some time; the "1.0" version number reflects that the project as a whole is on solid footing.

What is SciPy?

The term "SciPy" is overloaded: it refers to a library of scientific and numerical routines used with Python (and here usually spelled "scipy") and to an annual conference devoted to the use of Python in the sciences. It also is used to denote a collection of libraries, packaged together and often distributed as a whole, that work together to form a scientist's toolkit. Beyond that, the name is also used to refer to the organization that guides and curates the various parts of the libraries.

The core pieces of the SciPy collection are:

The scipy library
NumPy, which is the centrally important numerical library for Python,
IPython, which is the enhanced Python interactive shell (or REPL),
The matplotlib plotting library
The symbolic mathematics library SymPy
The pandas data analysis library

Another project that's often installed with SciPy is Jupyter, which can be used in place of IPython or the normal Python REPL. Jupyter uses a web browser as a rich interface to Python that is particularly well suited to scientific or mathematical work, as it can embed graphical output and LaTeX equations.

The libraries mentioned above are, for the most part, general-purpose infrastructure. The typical user of SciPy may also install one or more SciKits, which are specialized packages designed to be used with SciPy. They are kept apart from the main SciPy bundle when they are too specialized, have an incompatible license (SciPy uses BSD), or are not quite mature enough. Right now there are 89 SciKits listed on the official page, covering a wide range of specialties. There is software for fuzzy logic, aeronautical engineering, digital signal processing, particle physics, image processing, machine learning, and many more arcane tasks.

As for the scipy library itself, it contains such things as a database of physical constants, a library of special functions (Bessel, hypergeometric, elliptic, etc.), numerical integrators, Fourier transforms, statistics routines, and image processors.

The 1.0 release

Since SciPy, even when it refers to software, encompasses a large and various collection of programs, each of which has its own version number, the meaning of the SciPy release version needs some explanation. The significance of the 1.0 release is that some organizational milestones and some overall project goals have been reached. Much of the actual module code has been mature and stable for some time.

With this release, SciPy has adopted a formal governance structure with a Benevolent Dictator for Life (Pauli Virtanen), as well as a Steering Council. It has an official Code of Conduct and a Roadmap. The 1.0 version number is meant to reflect the maturity and stability of the organization as much as the code under the SciPy umbrella.

The project now has better Windows support, including both binary downloads and the ability to compile SciPy code using the Microsoft Visual C++ compiler and GFortran. This may strike some readers of this publication as uninteresting, but it is a boon to the users of SciPy who, for one reason or another, need to use Windows.

Some of the major technical milestones reached in this release include a new set of ordinary differential equation solvers and reorganization of the ODE library interface; some better performing functions within the optimizer library, which handles such things as searching for the minimum of functions, curve fitting, and root finding; and assimilation of more LAPACK and BLAS routines, completing the interface to the entire BLAS library. These are sets of Fortran subroutines for linear algebra and matrix computations that are widely used and highly regarded by numerical programmers working in all languages; highly optimized versions are available for various machine architectures. There is also an assortment of improvements to many other routines in the scipy library.

The SciPy community

While the first official version of SciPy was released in 2001, work on SciPy actually began in the 1990s. This was during the emergence of Python. Many young scientists and engineers, having been raised in the era of Fortran (and sometimes C) as the only language for scientific computing, began to experiment with this new interpreted, dynamically typed, expressive language. Python was fun to program in, but could it be used for serious work? Python's suitability as a glue language helped to answer that question in the affirmative.

SciPy began as a set of Python interfaces to trusted Fortran and C programs for numerical calculations. This makes sense: Fortran is used because its compilers generate fast numerical code, but its string handling and I/O are clumsy, and it has no convenient interfaces to the operating system or for interacting with other programs. Python is good at what Fortran lacks, although, before the emergence of NumPy, it was slow at numerical computation. Using Python to steer compiled numerical routines, and to explore their outputs, exploits the strengths of each language.

This turned out to be a popular idea. Now the scipy library involves over 500 contributors and is actively developed: in the past 30 days as of this writing, 28 developers have pushed 101 commits to the project (this does not include NumPy nor the other parts of SciPy). The SciPy developers are diverse. They include a physicist who now works for a New Zealand company in data science and forestry, a mathematician and physicist working for Wolfram Research, a researcher in image compression from Massachusetts, an applied mathematician from the Netherlands, a lecturer in natural language processing from Australia, a graduate student in astrophysics at the University of Notre Dame, a mathematician studying dynamical systems at Enthought in Texas, someone working in geographic information systems at MapBox from Colorado, an electronics engineer in New York, a cosmologist at Berkeley, and, of course, hundreds more in many fields of research.

Development is done in the open on GitHub, and the community is welcoming and helpful toward new contributors. This friendly culture may be due to the fact that most SciPy developers are researchers who are creating software with the overriding goal of helping to solve their research problems, with the niceties of programming practice as only a secondary concern.

Some idea of the emphases in development for the near future can be found in SciPy's official Roadmap. This document highlights the need for making the APIs consistent and improving test coverage. There are plans to fix issues with the Fourier Transform routines and clarify the differences between those in scipy and those in NumPy. Also, adding features to the interpolation routines (an opportunity for experts in splines to contribute), further generalizing the interface to LAPACK and BLAS routines, removing or rewriting the routines for wavelets (another good opportunity for contributions), improving several important routines for calculating special functions, and perhaps creating a new module for numerical differentiation are planned. From these details and others in the Roadmap, there emerges a general emphasis on not making any radical changes, but continuing to lay a solid foundation for a toolkit that will be generally usable far into the future. Hence the group has an interest in rationalizing the overall organization of the code, creating consistent interfaces that abstract the details of using legacy routines written with different styles and conventions, and keeping the documentation complete and up to date.

As I was preparing this article I came across a fascinating, current research report in Nature Communications on the biophysics of human sperm locomotion. It's not only free to read, but has links to all of the code used in the analysis of the experiments. It's all Python, and uses SciPy. I mention this not only to bring up another example of how SciPy has become so widely adopted, but to suggest that its adoption is part of a growing culture of openness in the sciences. Proprietary tools are being replaced by open-source alternatives; authors are placing data, which used to be kept locked away in laboratory filing cabinets, in open repositories. In addition, computer programs that used to be kept secret are now open to scrutiny and the grip of publishers on the dissemination of journal articles is being weakened. It's tempting to speculate that the spirit of open source is influencing science through the adoption of toolkits such as SciPy, and is reinforcing the movement toward greater transparency in research.

How to get it

SciPy is one of the free software world's huge umbrella projects, such as TeX Live, that consist of scores of other projects, many of which are developed independently of each other. As is usual with such umbrella projects, the version available through your distribution's package manager will be several releases behind the current one. In the case of SciPy, this may very well not matter to you; but if it does, and you desire the 1.0 release, you must get closer to the source.

SciPy supports multiple Python versions. For Python 2.x, it supports 2.6 and 2.7; for Python 3, all versions starting from 3.2 are supported.

The easiest way for a Python user on Linux probably is to use the pip install command—the full incantation is spelled out on the official install page, and simply uses pip to install SciPy's major components, such as NumPy and Jupyter.

Another option, which may be more convenient for Windows users and some others, is to install the self-contained Python distributions maintained by several companies, including Enthought, a major institutional sponsor of the SciPy project and its conferences. As of this writing, however, the versions packaged in these distributions, while recent, were slightly behind the latest releases available through pip.

Finally, if you have a few spare gigabytes and are willing to relinquish some control over which versions of various libraries are installed, Sage, the mathematical software covered here back in January, contains SciPy (and Jupyter, as well, since that is the currently preferred notebook interface to Sage).

Documentation

Documenting such a huge project is itself a significant undertaking. There is a reference manual in PDF form. I hesitate to link to it, as the one for version 1.0 weighs in at 2115 pages, and has no table of contents nor index. The first hundred pages are release notes, lists of authors, lists of pull requests, and similar material. This leaves search as the only way to find anything; searching through 2000 pages of PDF is not snappy. (But here is the PDF for those who want it despite the warnings.)

A better choice if you're getting started is the "documentation" link on the SciPy home page. This will lead you to other guides, including collections of recipes and some tutorials.

The typical user of SciPy will only use a fraction of its many specialized scientific and numerical libraries, so there is no need to have a huge trove of documentation at hand. One challenge may be to discover whether there exists a SciPy module that might help you in your work. Your options for exploring this space are web searches, specifically site searches aimed at scipy.org, discussion with members of your research community, browsing the scipy directory within your local machine's Python library directory tree (the source code has extensive docstrings), or using Python's online help system, which is a more convenient interface to this last approach.

To use the help system, first execute the ipython command; then, within the REPL, type help(). This works because ipython imports the pydoc module for you, which defines the help() command. This command will place you in a documentation subsystem, which you can exit with a ctrl-D. Within the help system, you can simply type the names of modules or sub-packages that you want to learn about. Typing "scipy" will give you a list of top-level modules in the scipy library; to learn about any one of these, ask about it using import syntax. For example, you will see "stats" in the list; to learn about the functions in this package, type "scipy.stats", and you will get extensive, if concise, documentation about SciPy's statistical functions, including some examples of use. If you've browsed the source code directly, you'll notice that this documentation is built from the docstrings, but organized more conveniently.

Using SciPy

Let's try out SciPy in the terminal, using IPython. The aim in this section is to provide a feel for interactive exploration in the REPL, and demonstrate the expressive power of the SciPy libraries. First, we need to spend a little time getting briefly acquainted with NumPy, which is the core extension to Python upon which everything else is based. Even if you never use any other part of SciPy, if you ever find yourself using Python to do any type of numerical computation, you will want to at least be aware of what NumPy offers.

NumPy adds a new data type to python: the numpy array. These are true arrays, distinct from Python's lists (and different, as well, from the array type provided in the standard library). They are multidimensional, homogeneous (numbers only) collections that store their elements contiguously in memory, which allows fast, vectorized operations by the Fortran or C routines that use them. NumPy provides operators and functions that operate on arrays as a whole or element-wise; while this is not the place for a NumPy tutorial, here is a taste of what is essentially an array mini-language within Python, that will seem familiar to those who have used array languages such as APL, or array Fortran. If you execute the following Python code (in either Python2 or Python3):

    import numpy as np
    a = np.array([1, 2, 3])
    a**2

You will get the output array([1, 4, 9]). Note that each element of the array is squared, and that this would be a TypeError if we tried it with a list rather than an array.

You can have multidimensional arrays, too. Here is a 2D array (a matrix):

    d = np.array([ [1, 2, 3], [4, 5, 6] ])
    d + 1

The way the result is printed shows how matrices are represented, and that the "+" operator is overloaded to act element-wise on arrays:

    array([[2, 3, 4],
           [5, 6, 7]])

NumPy provides versions of many mathematical functions, such as sin() and cos(), that are extended to act element-wise on arrays, and even makes it possible for you to extend your own functions to operate the same way. Using what is essentially a domain-specific-language for arrays can be a mind-expanding style of programming for those who are mainly familiar with explicit looping over data. NumPy enables you to use an interpreted scripting language to perform non-trivial, fast numerical computations, making the rest of the SciPy ecosystem possible.

The first figure is a screen shot of an IPython session showing the use of some of the linear algebra functions from the scipy library. First we do the necessary imports, then set some options for the way NumPy will print numbers, to keep things neat. We define a matrix a, and print it out, then calculate its inverse, and print that. The inverse of a matrix gives you the identity matrix (ones on the diagonal) when you matrix-multiply it with the original. The inverse can be used to solve sets of linear equations, for example. To verify that b is really the inverse of a, we multiply the two matrices, and get an identity matrix as the result.

Now let's import the fast Fourier transform routines to create a filtered square wave:

    from scipy.fftpack import fft, ifft

We'll need one cycle of a square wave to begin; we'll use a numpy array of 100 zeroes followed by 100 ones. We'll also need some x-values for plotting, so we make an x array with the same shape:

    sw = np.array([np.zeros(100), np.ones(100)]).reshape(200,)
    x = np.arange(0, 200, 1)

We haven't seen all of the things in the lines above explained, but their meaning should be fairly clear. Just to make sure we did this right, the figure above shows the square wave. All we need to do to see it is import pylab and then pylab.plot(x, sw) (see the matplotlib article linked above).

Now we pass the array into the fft() routine and store the Fourier coefficients in the array swt:

    swt = fft(sw)

Plotting swt with the pylab.bar() command reveals the spectrum in the figure above. As an exercise for the reader, if you now calculate the inverse transform of the spectrum using ifft(sw) and plot the result, you should get back the original square wave.

Finally, to show the effect of truncating the Fourier series, we make a copy of the coefficient array, zero out all the coefficients above the 10^th, and take the inverse transform of that:

    swtrunk = swt
    swtrunk[10:] = 0
    swtrunki = ifft(swtrunk)

Plotting the result produces a picture of the low-pass filtered square wave in the figure above.

I hope that these brief example sessions give you some idea of the interactive, exploratory powers that Python and SciPy provide. This style of expressive, fluid computation was virtually unknown to scientists a generation ago. Even if one had access to such conveniences as precompiled Fourier transform routines, in order to carry out the simple experiment that we did above would be comparatively laborious, without the immediate feedback that encourages a constructively playful approach to computation. The SciPy community deserves thanks and congratulations on reaching their recent milestone, and support for future development.

Comments (22 posted)

Page editor: Jonathan Corbet
Next page: Brief items>>