Leading items
Welcome to the LWN.net Weekly Edition for October 10, 2019
This edition contains the following feature content:
- Free software support for virtual and augmented reality: getting lost in virtual space with Linux.
- What to do about CVE numbers: CVE numbers are broken, but what should replace them? A suggestion from kernel developer Greg Kroah-Hartman.
- Why printk() is so complicated (and how to fix it): a Linux Plumbers Conference session on the difficulties of printk() and what is being done about them.
- Adding the pidfd abstraction to the kernel: fixing long-time issues with process management in Unix systems.
- An update on the input stack: lots is happening with input-device support, but the project could use some help.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Free software support for virtual and augmented reality
A talk at the recent X.Org Developers Conference in Montréal, Canada, looked at support for "XR" in free software. XR is an umbrella term that includes both virtual reality (VR) and augmented reality (AR). In the talk, Joey Ferwerda and Christoph Haag from Collabora gave an overview of XR and the Monado project that provides support for those types of applications.
Ferwerda started by defining the term "HMD", which predates VR and AR. It is a head-mounted display, which basically means "taking a screen and some sensors and duct-taping it to your face". All of the devices that are being used for XR are HMDs. They typically include some kind of tracking system to determine the position and orientation of the HMD itself. Multiple different technologies, including inertial measurement units (IMUs), photodiodes, lasers, and cameras, are used to do the tracking depending on the device and its use case.
AR is intended to augment the real world with extra information; the user sees the real world around them, but various kinds of status information and additional data are attached to objects or locations in their view of the world. AR is a rather over-hyped technology these days, he said. The general idea is that users would wear glasses that would augment their view in some fashion, but, unfortunately, what most people think of as AR is Pokémon Go.
VR uses two screens, one for each eye, to create a 3D world that the user inhabits and can interact with in some fashion. Instead of seeing the real world, the user sees a completely separate world. There are two words that are often used to describe the feel of VR, he said: "presence" and "immersion"; both refer to the user's sense of actually being part of the VR environment.
XR encompasses both. Ferwerda said that he is not really sure what the "X" stands for; he has heard "cross reality" and "mixed reality" for XR. Haag said that "extended reality" was another definition that he had heard.
Monado and OpenXR
The project that they have been working on is Monado, which is a free-software OpenXR runtime for Linux. OpenXR is a standard from the Khronos Group, which is also the organization behind OpenGL, Vulkan, and various other graphics-related standards.
Ferwerda made way for Haag, who said that prior to OpenXR, each XR engine had to support each different XR runtime. The XR runtimes are generally device-specific. That creates something of a combinatorial explosion to support all of the devices from each engine, as can be seen on slide 16 in their slides [PDF]. With OpenXR, the device runtimes provide an application interface that allows any XR engine to work with any device runtime that supports the standard. Eventually, the OpenXR standard will add a device layer between the device runtimes and the hardware, he said.
The stack for XR handling consists of three layers. There is a program that does VR or AR sitting atop a software platform that does rendering and handles the input devices. That platform interfaces with the hardware that has the sensors to provide the input as well as the devices to display the rendered data.
Haag went through a brief introduction to the application API, which is similar in some ways to the Vulkan API for 3D graphics. More information can be found in the slides, in the YouTube video from XDC, or on the OpenXR site. The standard is still a work in progress, he said, so it doesn't support everything needed for AR yet, for example.
An application will start by creating an XR instance that is a handle to be used throughout. There are modes for handheld devices, such as smartphones, and for HMDs, as well as view modes for single or dual displays, which can be attached to the instance. Sessions are created for a particular rendering type; there are multiple types available, such as OpenGL, OpenGL ES, Vulkan, and several versions of Direct3D. There is no support for multiple sessions at this point, so you cannot overlay two (or more) applications' output, though he thinks that will change in the future. VR desktops will need that ability, for example.
Rendering is handled by a cycle of three calls: waiting for the previous frame to finish rendering, beginning the next frame, and ending it. The position data for the next frame will be predicted by OpenXR so that the scene can be rendered for the expected orientation and position (i.e. "pose") when the frame will be displayed. Inputs from controllers are handled in the API as "actions", which are similar to those in the OpenVR API from Valve for its SteamVR platform.
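As a rough sketch of that cycle (not taken from the talk), the wait/begin/end sequence in the OpenXR C API looks something like the following; an already-created XrSession is assumed and error handling is omitted:

    #include <openxr/openxr.h>

    void render_one_frame(XrSession session)
    {
        /* Wait until the runtime is ready for a new frame; it returns a
         * predicted display time that the scene should be rendered for. */
        XrFrameWaitInfo wait_info = { .type = XR_TYPE_FRAME_WAIT_INFO };
        XrFrameState frame_state = { .type = XR_TYPE_FRAME_STATE };
        xrWaitFrame(session, &wait_info, &frame_state);

        /* Begin the frame ... */
        XrFrameBeginInfo begin_info = { .type = XR_TYPE_FRAME_BEGIN_INFO };
        xrBeginFrame(session, &begin_info);

        /* ... render the scene for the pose predicted at
         * frame_state.predictedDisplayTime (not shown) ... */

        /* ... then end the frame, submitting the composition layers. */
        XrFrameEndInfo end_info = {
            .type = XR_TYPE_FRAME_END_INFO,
            .displayTime = frame_state.predictedDisplayTime,
            .environmentBlendMode = XR_ENVIRONMENT_BLEND_MODE_OPAQUE,
            .layerCount = 0,   /* a real application submits its layers here */
            .layers = NULL,
        };
        xrEndFrame(session, &end_info);
    }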
The last year or two have been "really good" in terms of supporting XR graphics on Linux, Ferwerda said, taking back the microphone; there have been additions to the kernel and the display drivers to make things work better. Extended mode has been around for a while; in that mode, the VR display is treated as another screen where windows can be placed to display the content. That works, but there is somewhat painful manual setup for users and it introduces extra latency, he said.
Direct mode for VR graphics has been added more recently. It relies on Keith Packard's work to allow device "leases" for rendering using kernel modesetting (KMS) drivers without X or Wayland getting in the way. In order to get a "less nausea-inducing" display when the generated frame rate is lower than the rate supported by the device, reprojection is used to duplicate frames or make small changes to existing frames based on movement predictions to fill in, he said.
The VR hardware is pretty much all Vulkan-based, so that is what is used for rendering. If the application is OpenGL-based, it can use the Vulkan-OpenGL interoperability mode, but that only works for some hardware, Ferwerda said. Display correction is another piece of the puzzle; lenses are not perfect, they have distortion and chromatic aberrations that need to be corrected for.
Hardware support
For hardware, Monado supports "whatever we can find". That includes the Open Source Virtual Reality (OSVR) Hacker Development Kit (HDK), which is no longer available but well supported. There has been a lot of reverse-engineering work done by the project for supporting other devices. In addition, the OpenHMD project has a C library that has at least some partial support for a wide range of hardware.
The project has also gotten the Razer Hydra game controller "working to some degree". That week, support for the PlayStation VR and PlayStation Move controller was merged, he said. This is the first full-featured support for a device that people already have or can fairly easily get their hands on (for less than €300), he said.
There are other notable free and open-source software projects, he said. One is libsurvive, which supports the lighthouse tracking used by the HTC Vive HMD. Another is maplab, which supports tracking using simultaneous localization and mapping (SLAM). For SLAM tracking, the headset is actually mapping the room it is in; it is the "next big thing" in XR.
Even though XR is relatively new, Ferwerda said, there has been a community looking at the devices from a free and open standpoint for some time now. People have gotten the devices and been frustrated that there was no driver for them, so they started reverse engineering them. He has been part of that community since 2013, going to FOSDEM and other events, "getting an apartment and a lot of beer" with like-minded folks to work on making the hardware work on various operating systems. He noted that an OpenBSD developer spent a lot of time getting that system ready for VR, which put it ahead of Linux at the time.
Reverse engineering is a major part of getting the hardware to work, but there is hope that this will eventually change. There have been efforts to create more open platforms, but that is not a reality yet, he said. These devices share a fair amount of the same components, which makes it somewhat easier, but it still takes a lot of looking at Wireshark traces to see if you can find accelerometer and gyroscope information from the IMU, for example. Figuring out how to access the camera and how to interface with the display modes is also part of the fun.
Next up was a demo, of sorts, which can be seen in the photo (and in the video, of course). It was a recording of the classic glxgears running in a VR context. It was displayed by a custom compositor, with output rendered by Vulkan, in the orientation required by a particular VR headset. Each display was rotated and corrected for the needs of that device.
It has been a lot of fun working on Monado, Ferwerda said. The project was able to release an OpenXR implementation at the same time the specification was released. Only Monado and Microsoft were able to release a usable implementation at that time, "which was really cool". They hope to continue working on Monado to add more support for OpenXR and XR hardware; he asked for those interested to get in touch with the project to help make that a reality.
[I would like to thank the X.Org Foundation and LWN's travel sponsor, the Linux Foundation, for travel assistance to Montréal for XDC.]
What to do about CVE numbers
Common Vulnerabilities and Exposures (CVE) numbers have been used for many years as a way of uniquely identifying software vulnerabilities. It has become increasingly clear in recent years that there are problems with CVE numbers, though, and increasing numbers of vulnerabilities are not being assigned CVE numbers at all. At the 2019 Kernel Recipes event, Greg Kroah-Hartman delivered a "40-minute rant with an unsatisfactory conclusion" on CVE numbers and how the situation might be improved. The conclusion may be "unsatisfactory", but it seems destined to stir up some discussion regardless.
CVE numbers, Kroah-Hartman began, were meant to be a single identifier for vulnerabilities. They are a string that one can "throw into a security bulletin and feel happy". CVE numbers were an improvement over what came before; it used to be impossible to effectively track bugs. This was especially true for the "embedded library in our product has an issue" situation. In other words, he said, CVE numbers are good for zlib, which is embedded in almost every product and has been a source of security bugs for the last fifteen years.
Since CVE numbers are unique, somebody has to hand them out; there are now about 110 organizations that can do so. These include both companies and countries, he said, but not the kernel community, which has nobody handling that task. There also needs to be a unifying database behind these numbers; that is the National Vulnerability Database (NVD). The NVD provides a searchable database of vulnerabilities and assigns a score to each; it is updated slowly, when it is updated at all. The word "national" is interesting, he said; it really means "United States". Naturally, there is now a CNNVD maintained in China as well; it has more stuff and responds more quickly, but once an entry lands there it is never updated.
CVE problems
There are a number of problems with CVE numbers, Kroah-Hartman said; he didn't have time to go through the full set listed in his slides [SlideShare]. To begin with, the database is incomplete, with many vulnerabilities missing altogether or rejected for a variety of reasons. Even when CVE numbers are assigned for a vulnerability, the process tends to take a long time and updating the NVD takes even longer.
A big problem, he said, is that the system is run by the US government. People tend not to trust governments in general, and other governments are increasingly distrustful of the US government in particular. The system is erratically funded by the Department of Homeland Security, and is significantly underfunded overall. People need to trust that this sort of vulnerability database will not leak information, but government-run systems are subject to a number of pressures. During a Senate hearing on Meltdown and Spectre, Senators pressed the NVD representatives on why the Senate had not been notified about the vulnerabilities ahead of time, for example. Kroah-Hartman said that he trusts MITRE to run the NVD, but that the number of governmental representatives wanting early access to data is worrisome.
Another problem is complexity. There is a single CVE entry (CVE-2017-5753) for Spectre version 1, but there are over 100 patches addressing it, and more are still coming. A CVE number doesn't point to patches, reducing its usefulness for helping people be sure they have closed a given vulnerability. It is really not possible to handle such complex things with a single ID number, he said.
CVE numbers are abused by security developers looking to pad their resumes. As a result, a lot of "stupid things" are submitted for CVE numbers, and getting the invalid ones revoked is difficult. As an example, he gave CVE-2019-12379, which was published on May 27. It refers to an alleged memory leak in the console driver, one that, Kroah-Hartman said, poses no security threat at all. In fact, it wasn't even a leak, in the end. Even so, the NVD gave the report a security score of "medium" the day after it was submitted. Shortly thereafter the report was disputed, and it turned out that the "fix" introduced a real memory leak of its own. On June 4, Ben Hutchings reverted the patch.
One might think that the story was over at that point, but the CVE entry was only marked "disputed" in July. Distributions like Fedora have policies that require them to ship fixes for all CVE numbers, so they shipped the buggy patch in the meantime. Cleaning everything up took rather longer. This issue was eventually dealt with, but similar things happen every month — or even every week.
Then, he said, CVE numbers are also abused by engineers to bypass internal procedures — in particular, to get their company to ship a particular patch in a product update. Getting a CVE number is a good way to force a patch into an enterprise kernel, for example. Between 2006 and 2018, he said, there were 1005 kernel CVE numbers assigned. Of those, 414 (40%) had a negative "fix date", with the average fix happening 100 days prior to the CVE-number request. Many of these are just worthwhile fixes that couldn't be merged into a shipping kernel without a CVE number behind them. He summarized by saying that this shows that CVE numbers don't really matter; they no longer carry any useful information.
Bug fixes
The kernel community is currently pulling about 22 bug fixes per day into the stable trees; that is about 5% of the volume going into the mainline kernel, he said, and it should be higher. There are one or two stable-kernel releases each week. Each stable kernel is tested as a unified release and given away for free. The kernel developers are fixing about one known security problem per week, along with a vast number of other bugs that are not known to be security issues when they are fixed. All of these fixes are handled in the same way; "a bug is a bug", he said.
He mentioned a TTY fix that was understood, after three years, to close a serious vulnerability. He was the author of both the original code and the fix, and he hadn't realized that there was a security problem in the code. Users of enterprise kernels were vulnerable to this issue for three years, he said; those who were running the stable kernels were not. Only a small portion of kernel security fixes are assigned CVE numbers; anybody who is only cherry-picking CVE-labeled fixes is thus running an insecure system. Even fixes with CVE numbers often have followup fixes that are not documented.
He has audited a number of kernels for phones, he said. One popular handset was running 4.14.85, with three million added lines of out-of-tree code ("what could possibly go wrong?"). If you compare that with the 4.14.108 stable release that was current in May when this analysis was done, the phone was 1,759 patches behind. The handset vendor had cherry-picked 36 patches from later kernels, but had missed twelve fixes with CVE numbers, as well as crucial bug fixes across the kernel tree. As a result, this phone can be crashed (or worse) by a remote attacker.
The Google security team, he said, has a "huge tool" that scours the net for security reports. In 2018, every reported problem was already fixed in the long-term stable kernels before they found it; the only exceptions were for problems in out-of-tree code. There was no need for cherry-picking at all; anybody using those kernels was already secure against known issues. As a result, Google is now requiring Android vendors to use the long-term stable kernels. He called out Sony and Essential as being especially good at picking up new kernel releases; the Pixel devices are lagging a bit, he said, but are "basically there".
There are, he said, 2.5 billion instances of Linux running on Android phones; that is where Linux runs now. All other users are a drop in the bucket in comparison. So this is where security matters the most; if these devices keep up with the stable-kernel releases, they will be secure, he said.
How to fix CVE numbers
Kroah-Hartman put up a slide showing possible "fixes" for CVE numbers. The first, "ignore them", is more-or-less what is happening today. The next option, "burn them down", could be brought about by requesting a CVE number for every patch applied to the kernel. It would be "a horrible intern job for six months", he said, and somebody has even offered to fund such a position. But we know that the system is broken; abusing it will not make things better. Thus, the third option: "make something new".
The requirements for a replacement are fairly well understood. It would need to provide a unique identifier for vulnerabilities, just like CVE numbers are meant to. The system should be distributed, though; asking for identifiers from others doesn't work. It needs to be updatable over time, searchable, and public.
Consider, he said, commit 7caac62ed, which was applied in August. The changelog for this commit cites no fewer than three CVE numbers. The kernel community insists that developers break down their changes into simple patches, but this fix for three CVE numbers was still acceptable as a single patch. It really is a single issue, he said, that is better identified by the ID of the patch that fixed it than any of the three CVE numbers attached to it. He ran through a number of other patches, many of which included commit IDs as a way of identifying what was being fixed, usually in a "Fixes" tag. The use of those IDs in this way, he said, has become nearly universal in the kernel community.
Thus, he said, fixes already contain a unique ID: the "Fixes" tag showing where the problem was introduced. That ID could be used as the unique ID for a vulnerability; there is no need to introduce another one. We have, in fact, been using commit IDs this way for 14 years, and nobody has noticed. All that remains to be done is to get some marketing for this scheme. After all, CVE numbers are essentially marketing, telling a story about a particular vulnerability; this new scheme needs something similar.
The first thing that is needed to start the marketing effort, he said, is a catchy name. He ran through some possibilities, including Linux Git Kernel ID (LGKI), Kernel Git ID (KGI), or Git Kernel Hash (GKH). He paused for laughter at that last acronym (which is also his initials) before moving on. In the end, he said, the best name to use is "change ID" — the name we've been using for the last 14 years. A change ID is a world-wide, unique ID that works today, so let's use it. The format would look something like CID-0123456789ab.
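To illustrate the existing convention, a bug fix in the kernel tree already names the commit that introduced the problem in a "Fixes" tag in its changelog; the commit ID and subject lines below are invented for the example:

    tty: example: fix buffer overflow in example_ioctl()

    [... description of the problem and the fix ...]

    Fixes: 0123456789ab ("tty: example: add example_ioctl() support")
    Cc: stable@vger.kernel.org

Under the proposed rebranding, such a commit ID would simply be cited in the CID-0123456789ab form; the identifier already exists in Git, so no separate numbering authority is needed.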
Kroah-Hartman concluded by returning to his list of things to do about CVE numbers. We should indeed "ignore CVEs", but he supplemented the list with a fourth entry: rebrand what we have been doing all along.
Questions
Dmitry Vyukov led off the questions by asking about the claim that stable kernel releases are fully tested. Subsequent stable releases fix a lot more stuff, he said, so how, exactly, is that testing happening? Kroah-Hartman answered that the kernel certainly has problems with too many bugs. The stable releases in particular, though, benefit from a lot of effort to avoid regressions; he claimed that only 0.01% of the patches going into stable kernels cause regressions now.
Vyukov answered that he is not seeing any tests being added for bugs found by his syzkaller testing. So how can the community actually prevent regressions? The answer was that we certainly need more tests.
Your editor had to question the 0.01% figure, since some analysis done a few years ago showed a rate closer to 2%. Kroah-Hartman said that the number came from the Chrome OS team, which was counting "noticeable regressions".
The final question was about users who are stuck with vendor kernels that will not be upgraded; what are they to do? Kroah-Hartman responded that this is a real problem. Those vendors typically add about three million lines of code to their kernels, so they are shipping a "Linux-like system". The answer is to force vendors to get their code upstream; to do that, customers have to push back. Sony, in particular, has been insisting that its vendors have their code in the mainline kernel. That is how we solved the problem for servers years ago; it is still the approach to use today.
[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event.]
Why printk() is so complicated (and how to fix it)
The kernel's printk() function seems like it should be relatively simple; all it does is format a string and output it to the kernel logs. That simplicity hides a lot of underlying complexity, though, and that complexity is why kernel developers are still unhappy with printk() after 28 years. At the 2019 Linux Plumbers Conference, John Ogness explained where the complexity in printk() comes from and what is being done to improve the situation.
The core problem, Ogness began, comes from the fact that kernel code must be able to call printk() from any context. Calls from atomic context prevent it from blocking; calls from non-maskable interrupts (NMIs) can even rule out the use of spinlocks. At the same time, output from printk() is crucial when the kernel runs into trouble; developers do not want to lose any printed messages even if the kernel is crashing or hanging. Those messages should appear on console devices, which may be attached to serial ports, graphic adapters, or network connections. Meanwhile, printk() cannot interfere with the normal operation of the system.
In other words, he summarized, printk() is seemingly simple and definitely ubiquitous, but it has to be wired deeply into the system.
The path to the present
Ogness then launched into a detailed history of printk(); see his slides [PDF] for all the details. The first kernel release — v0.01 — included a printk() implementation; it was synchronous and simply pushed messages directly to a TTY port with a bit of assembly code. It was reliable, but not particularly scalable; once the kernel started supporting multiple CPUs, things needed to change.
Version 0.99.7a added console registration; the "log level" abstraction was added in v0.99.13k. The bust_spinlocks() mechanism, which prevents waiting for spinlocks when the system is crashing and "goes against everything everybody has learned", was added in 2.4.0. With 2.4.10, big changes to printk() made it asynchronous. By 2.6.26, printk() was causing large latency spikes; kernel developers dealt with this problem by ignoring printk() in the latency tracer, thus sweeping it under the rug. The 3.4 release added structured logging, sequence numbers, and the /dev/kmsg interface. The "safe buffers" mechanism, used for printing in NMI context, showed up in 4.10. A problem where one CPU could get stuck outputting messages indefinitely was (somewhat) addressed in 4.15. In 5.0, the concept of caller identifiers was added.
So printk() has seen a lot of development over the years, but there are still a number of open issues. One of them is the raw spinlock used to protect the ring buffer; it cannot be taken in NMI context, so printk() must output to the lockless safe buffers instead. That creates bogus timestamps when the messages are finally copied to the real ring buffer, can lose messages, and can leave the safe buffers unflushed when CPUs do not go offline properly.
Then, there is the issue of the console drivers, which are slow but are nonetheless called with interrupts disabled. Most console devices are not designed to work in a kernel-panic situation, so they are not reliable when they may be needed most.
Other problems include the fact that all log levels are treated equally by printk(); chatter treated like urgent information can create latency problems, causing some users to restrict the levels that are actually logged. The problem with one CPU being stuck logging forever has been fixed, but the last CPU to come along and take over log output can still be saddled with a lot of work. That makes any printk() call potentially expensive. The whole bust_spinlocks() mechanism can be described as "ignoring locks and hoping for the best"; there should be a better way, he said.
The better way
The difficulties with printk() over the years, Ogness said, come down to the tension between non-interference and reliability. Trying to achieve both goals in the same place has been shown not to work, so a better approach would be to split them apart. Non-interference can be addressed by making printk() fully preemptible, making the ring buffer safe in all contexts, and moving console handling to dedicated kernel threads. Reliability, instead, can be achieved by providing a synchronous channel for important messages, an "atomic consoles" concept, and the notion of "emergency messages".
Both goals depend on support from the printk() ring buffer. This buffer has multiple concurrent readers and a single writer; it is stored contiguously in memory and is protected by a special spinlock (the "CPU lock") that can be acquired multiple times on the same CPU. This lock, he said, feels a lot like the old big kernel lock.
Like any self-respecting kernel-development project, the printk() work starts with the creation of a new ring buffer meant to address the problems with the current one. It is fully lockless, supporting multiple readers and writers in all contexts. Metadata has been pushed out to a separate descriptor mechanism; it handles tasks like timestamps and sequencing. The ring buffer has some nice features, but it is also complicated, including no less than nine memory-barrier pairs. It is hard to document and hard to review; he is also not convinced that the multiple writer support — which adds a lot of the complexity — is really needed.
The addition of the per-console kernel threads serves to decouple printk() calls from console handling. Each console will now go as fast as it can, and each can have its own log level. Shifting the responsibility for consoles simplifies a lot of things, but leads to some unresolved questions about locking and whether a thread-based implementation can be relied upon to always get the messages out. But reliability, Ogness said, will be handled by other mechanisms; the per-console threads are a non-interference mechanism.
For reliability, the plan is to add an "atomic console" concept. Consoles with this support would have a write_atomic() method that can be expected to work in any context. This method is defined to operate synchronously, meaning that it will slow down the system considerably; it is thus only to be used for emergency messages. The advantage is that there is no longer any need for bust_spinlocks() or the global oops_in_progress variable.
The challenge is creating console drivers that can actually implement write_atomic(). He did an implementation for consoles attached to an 8250 UART; it was "not trivial". There will almost certainly be a lot of systems that never get atomic-console support, so some other sort of solution will be needed. He said that options include creating a special console that fills a memory area instead, trying to print synchronously outside of atomic context, or just "falling back to the current craziness".
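As a purely illustrative sketch (the write_atomic name comes from the talk, but it is not in the mainline console structure and the exact signature here is an assumption), an atomic-capable console driver would register an extra callback that pushes characters out by polling the hardware, rather than taking locks or relying on interrupts:

    /* Hedged sketch only; the real patch series may differ in detail. */
    static void my_uart_write_atomic(struct console *con, const char *s,
                                     unsigned int count)
    {
            /*
             * Poll the transmit-ready bit and write each character
             * synchronously, so this works from any context, including
             * NMIs and a panicking CPU.  my_uart_poll_put_char() is a
             * hypothetical helper, as is the rest of this driver.
             */
            while (count--)
                    my_uart_poll_put_char(con, *s++);
    }

    static struct console my_uart_console = {
            .name         = "ttyMY",
            .write        = my_uart_write,        /* normal, threaded path */
            .write_atomic = my_uart_write_atomic, /* synchronous emergency path */
            /* ... */
    };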
Associated with atomic consoles is the idea of "emergency messages" that must go out right away. The biggest problem here may be deciding which messages are important enough to warrant that treatment. Log levels are "kind of a gray area" and, he said, not really the way to go. There are only a few situations where printk() output is really that important; the right solution might be to hook into macros like BUG().
Ogness concluded by noting that this work began in February, with the current version having been posted in August. Most of the features described above have been implemented, he said, giving developers "something to play with".
Further discussion
A separate session was held later in the conference; your editor was unfortunately unable to attend. Ogness has posted a summary of the conclusions that were reached there, though. He thanked the community for its participation in this meeting, which "certainly saved hundreds of hours of reading/writing emails".
From the summary, it seems that an alternative ring buffer implementation posted by Petr Mladek will be used instead; it is much simpler and thus easier to review. Ogness has ported the rest of his work to use this buffer and shown that it works. The per-console kernel threads will be used.
The "emergency messages" idea seems to have been superseded by the idea of an "emergency state" that affects the system as a whole. When the kernel is in that state, all messages will be flushed using the write_atomic() callback where it is available. Flushing to consoles without atomic support will not be supported. The CPU lock will remain, but its purpose will be to synchronize the console threads when the system is in the emergency state.
There will be other changes, including the addition of a pr_flush() function that will wait for all messages to be sent out to all consoles. Patches implementing all this work have not yet been posted, but presumably they can be expected soon.
[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event.]
Adding the pidfd abstraction to the kernel
One of the many changes in the 5.4 kernel is the completion (insofar as anything in the kernel is truly complete) of the pidfd API. Getting that work done has been "a wild ride so far", according to its author Christian Brauner during a session at the 2019 Kernel Recipes conference. He went on to describe the history of this work and some lessons for others interested in adding major new APIs to the Linux kernel.
A pidfd, he began, is a file descriptor that refers to a process — or, more correctly, to a process's thread-group leader. There do not appear to be any use cases for pidfds that refer to an individual thread for now; such a feature could be added in the future if the need arises. Pidfds are stable (they always refer to the same process) and private to the owner of the file descriptor. Internally to the kernel, a pidfd refers to the pid structure for the target process. Other options (such as struct task_struct) were available, but that structure is too big to pin down indefinitely (which can be necessary, since a pidfd can be held open indefinitely).
Why did the kernel need pidfds? The main driving force was the problem of process-ID (PID) recycling. A process ID is an integer, drawn from a (small by default) pool; when a process exits, its ID will eventually be recycled and assigned to an entirely unrelated process. This leads to a number of security issues when process-management applications don't notice in time that a process ID has been reused; he put up a list of CVE numbers (visible in his slides [SlideShare]) for vulnerabilities resulting from PID reuse. There have been macOS exploits as well. It is, he said, a real issue.
Beyond that, Unix has long had a problem supporting libraries that need to create invisible helper processes. These processes, being subprocesses of the main application, can end up sending signals to that application or showing up in wait() calls, creating confusion. Pidfds are designed to allow the creation of this kind of hidden process, solving a persistent, difficult problem. They are also useful for process-management applications that want to delegate the handling of specific processes to a non-parent process; the Android low-memory killer daemon (LMKD) and systemd are a couple of examples. Pidfds can be transferred to other processes by the usual means, making this kind of delegation possible.
Brauner said that a file-descriptor-based abstraction was chosen because it has been done before on other operating systems and shown to work. Dealing with file descriptors is a common pattern in Unix applications.
There are, he said, quite a few user-space applications and libraries that are interested in using pidfds. They include D-Bus, Qt, systemd, checkpoint-restore in user space (CRIU), LMKD, bpftrace, and the Rust "mio" library.
Implementing pidfds
Brauner said that he started by looking at what other operating systems have done. He made a mistake, though, by not looking at how other systems implemented this feature until after he had gotten code of his own written. Illumos has an API — procopen() and friends — that is implemented in user space. Neither OpenBSD nor NetBSD has a pidfd implementation at all, but FreeBSD does in the form of its process file descriptors. The idea is the same, but that implementation differs in the details.
There have been previous attempts to add this idea to Linux as well, he said. These include a forkfd() system call and the CLONE_FD flag for clone(). None of these made it in; Brauner looked at them to try to figure out why. The CLONE_FD idea in particular tried to do too many things at once, he said.
In an attempt to avoid a similar fate, Brauner did the pidfd work over the course of four kernel releases. That gave him (and the community) plenty of time to think about how the various parts of the API should work. The first piece that he bit off was sending signals to processes in a race-free way; it was "the obvious use case", he said. People had a lot of ideas about how this feature should work, so focusing the discussion was a bit of a challenge. These ideas included using /proc files, new ioctl() calls, and more; they were all aimed at the signaling problem in particular, but he had a more general API in mind from the beginning. In the end, pidfd_send_signal() went into 5.1.
There was still a race condition involved, though, since a pidfd had to be obtained for a process after the process had been created. The answer was to return a pidfd directly from clone(). There was some uncertainty about just what should be returned, though; should it be a file descriptor referring to a /proc file or something else? In the end, he sent two separate RFC patch postings, one using /proc and one using anonymous inodes instead. The /proc version was "nasty", he said, and would have probably led to an eventual need to rework procfs. After seeing the two ideas, a consensus formed around using anonymous inodes.
One important design decision, he said, was to mark each pidfd to be closed by default on execve() calls. He didn't want to see pidfds being leaked into unrelated applications.
Returning a pidfd from clone() was added in 5.2. That work left Brauner feeling a little guilty, though, since he used the last available clone() flag bit for CLONE_PIDFD. That led to the implementation of clone3(), which has a dedicated return argument for a pidfd. 5.3 also saw the addition of polling support for pidfds; this is important since it will be the main way to return an exit status to non-parent processes. pidfd_open() was also added in 5.3; it allows the creation of a pidfd for an existing process.
In 5.4, the waitid() system call gained a new P_PIDFD flag, allowing a process to wait on a pidfd directly. That essentially completes the pidfd API as it had been originally envisioned.
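Here is a minimal user-space sketch (not from the talk) tying those pieces together. It uses raw syscall() invocations, since C-library wrappers were not yet common when these interfaces were added; the SYS_pidfd_* constants come from recent kernel headers and P_PIDFD may need to be defined by hand on older systems:

    #include <poll.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #ifndef P_PIDFD
    #define P_PIDFD 3                /* waitid() idtype added in 5.4 */
    #endif

    int main(void)
    {
        pid_t pid = fork();
        if (pid == 0) {              /* child: just wait to be signaled */
            pause();
            _exit(0);
        }

        /* Get a pidfd for the child (pidfd_open(), 5.3); a process created
         * with clone3() could return one directly via CLONE_PIDFD instead. */
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        if (pidfd < 0) {
            perror("pidfd_open");
            return 1;
        }

        /* Signal through the pidfd (pidfd_send_signal(), 5.1); unlike
         * kill(), this can never hit an unrelated process that happens to
         * have been handed a recycled PID. */
        syscall(SYS_pidfd_send_signal, pidfd, SIGTERM, NULL, 0);

        /* Wait for the exit: the pidfd becomes readable (5.3), then the
         * process can be reaped with waitid(P_PIDFD, ...) (5.4). */
        struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
        poll(&pfd, 1, -1);

        siginfo_t info = { 0 };
        waitid(P_PIDFD, pidfd, &info, WEXITED);

        close(pidfd);
        return 0;
    }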
Future work and lessons
Like any other kernel API, pidfds will continue to evolve over time, Brauner said. One feature he would like to add is sending a SIGKILL signal to a process when the last pidfd referring to it is closed. That is something FreeBSD supports now, but Linux will need to do things a bit differently. When a FreeBSD close() call returns, all of the work in the kernel is done; Linux, instead, can defer work to a workqueue to be done asynchronously later. Thus, the process may continue to exist for a while after that last close() call returns, which may not be what the application expects. He has a proof-of-concept implementation of how this feature could work in Linux, but he's not entirely happy with it yet.
Another upcoming feature is "exclusive waiting": marking a process so that only a pidfd can be used to wait for it. In other words, an ordinary wait() call (or any of its variants) will not return such a process's exit status, which will go only to a waitid() call that provides the right pidfd. This feature is aimed at the "invisible helper process use case". We probably want it, he said, but he still has to work out all of the semantics for it.
Pidfds also need to be better integrated with the namespace mechanism. One potentially useful feature would be to pass a pidfd to setns(); the result would be to enter several namespaces simultaneously. That is not something that can be done on current Linux systems. He is also thinking about adding a socket option to get a pidfd rather than an ordinary process ID for the peer on a local connection.
Brauner concluded with the lessons he has learned from this work. The first is that "speed matters". But, in this case, he was not arguing for going as fast as possible; instead he recommends picking a sustainable speed for the addition of new features. That will give time to respond to people and get things right the first time. Developers should, he said, be open about what they don't know; that encourages other developers to help out. In this case, he got help from a number of senior kernel developers while implementing pidfds. Finally, he said, "be resilient" in the face of reviews. He felt that he "looked dumb" after the first posting of pidfd_send_signal(), but he is glad he pushed through that experience and got the work into the kernel.
[Your editor thanks the Linux Foundation, LWN's travel sponsor, for supporting his travel to this event.]
An update on the input stack
The input stack for Linux is an essential part of interacting with our systems, but it is also an area that is lacking in terms of developers. There has been progress over the last few years, however; Peter Hutterer from Red Hat came to the 2019 X.Org Developers Conference to talk about some of the work that has been done. He gave a status report on the input stack that covered development work that is going on now as well as things that have been completed in the last two years or so. Overall, things are looking pretty good for input on Linux, though the "bus factor" for the stack is alarmingly low.
High-resolution mouse scrolling
High-resolution mouse-wheel scrolling should be arriving in the next month or two, he said. It allows for a different event stream that provides more precision on the movement of the mouse wheel on capable devices. Instead of one event per 15-20° of movement, the mouse will send two or four events in that span. Two new event types were added to the 5.0 kernel (REL_WHEEL_HI_RES and REL_HWHEEL_HI_RES) to support the feature. The old and new event streams may not correlate exactly, so they probably should not be used together, he cautioned.
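On the kernel side, a minimal sketch (not from the talk) of reading the new codes straight from an evdev node might look like this; the device path is an example, and values are reported in fractions of a notch, with 120 units corresponding to one full detent:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <linux/input.h>

    #ifndef REL_WHEEL_HI_RES
    #define REL_WHEEL_HI_RES 0x0b    /* added to the UAPI headers in 5.0 */
    #endif

    int main(void)
    {
        /* Example device node; pick the event node for the actual mouse. */
        int fd = open("/dev/input/event0", O_RDONLY);
        struct input_event ev;

        while (read(fd, &ev, sizeof(ev)) == sizeof(ev)) {
            if (ev.type == EV_REL && ev.code == REL_WHEEL_HI_RES)
                /* A high-resolution wheel might report 30 or 60 here
                 * rather than a full 120 per event. */
                printf("hi-res wheel: %d\n", ev.value);
        }
        return 0;
    }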
Likewise, libinput has added a new event type (LIBINPUT_EVENT_POINTER_AXIS_WHEEL) for high-resolution scrolling; it should be handled with its own event stream as with the kernel events. That code is sitting on a branch; it works, but it has not been merged into the master branch yet. For Wayland, a new event type was also added in a now-familiar pattern. He pointed to a mailing-list post where all of the gory details of high-resolution scrolling for Wayland are explained.
libinput user devices
Hutterer has been working on input devices simulated in user space for libinput in order to simplify testing the library. Instead of needing a real kernel device to provide evdev events to libinput, a program can pass a file descriptor to libinput as the place to read its evdev events from (instead of it opening /dev/input/event0 for example). Then the client can provide evdev events to itself, in some sense, that will be treated exactly the same as a regular evdev source by libinput. Diagrams on slides 15 and 16 in his slide deck [PDF] may help visualize the idea.
The libinput test suite takes around an hour to run and the only place it currently runs is on his laptop. The test failures are all based on timeouts that are tweaked for his system; beyond that, it needs to be run as root, has dependencies on udev and uinput, and it can mess up the login session when it is run. The test suite will not run in a container as it stands, so it is not part of the continuous-integration (CI) testing. User devices will make it easier to run the test suite in a container, but there are other use cases for the feature as well. He mentioned having a keyboard macro daemon for Wayland that would generate evdev events to be handled just as any other keyboard input. He is currently planning to put this new code in libinput-testing.so, rather than into libinput itself, but that may need to change down the road to support the other use cases.
ratbag-emu
Hutterer also mentioned ratbag-emu, which is a mouse firmware emulator that Filipe Laíns, an intern at Logitech, has been working on. When a gaming mouse needs to be configured, that is done by the ratbag daemon (ratbagd)—part of libratbag—which sends device-specific commands via a raw HID device (e.g. /dev/hidraw0). In order to test that, an actual mouse needs to be plugged in and removed multiple times.
Ratbag-emu will emulate the mouse so that the test suite can exercise ratbagd. There is a REST interface exposed by ratbag-emu that allows specific devices to be chosen for emulation and to check on the settings that were sent to them. The idea is to be able to emulate any mouse, but right now it is mostly for Logitech mice because they have full specifications for those, Hutterer said.
Tuhi
Moving on to things that had been done in the last year or two, Hutterer mentioned Tuhi, which is a GTK program to manage Wacom Smartpad devices (e.g. Bamboo Spark). These are notepads with paper that you can draw on with a pen, but underneath the device records all of the strokes made; using Bluetooth, you can retrieve the drawings. He has a recent blog post about Tuhi as well.
Tuhi does not need to be sophisticated, he said, it simply pulls down the images and provides a way to save them as SVG or PNG format files. The protocol had to be reverse engineered, but the project eventually did get some specifications from Wacom that helped untangle a few misunderstandings. The only way to get a drawing from the tablet is to get the oldest one stored there; if newer drawings are wanted, the older ones must be deleted. That makes it imperative that Tuhi not lose any drawings when it is pulling down several; there are multiple safeguards in Tuhi to ensure that, Hutterer said.
Bus factor of one
The xf86-input-evdev driver is in maintenance mode at this point. The last commit was in May 2018 and, since the 2.10.0 release four years ago, there have been a total of 19 commits. It is still shipped in RHEL 8 in order to support "crazy devices" that don't work otherwise. Similarly, xf86-input-synaptics had a 1.9.0 release in 2016 and has had nine commits since. It is effectively dead and all touchpads should be working with libinput at this point. Since libinput took over from xf86-input-synaptics three years ago, no one has stepped up to say they want to continue maintaining it, Hutterer said.
"Libinput is good and has a problem at the same time", he said. Since the 1.9.0 release roughly two years ago, it has had 1100 commits—980 of those were by Hutterer. "The bus factor is one. The input stack you are all relying on has one developer." Over those two years, there have been 50 other developers, but only four of them have more than five commits.
The move to GitLab has given him the ability to add tags to bugs, but he found that adding the "help needed" tag to libinput bugs was a reliable way to never hear anything about that bug again. Moving to GitLab has been a mixed blessing, he said. He is more efficient and the CI integration will make things even better, but, as far as he can tell, libinput changes are getting no code review. When he used to post patches to the mailing list, he would get the occasional "drive-by review", but that doesn't seem to happen when the patches are "seven clicks away in the GitLab web interface".
The story with libratbag is somewhat familiar. It was meant to be the "one true mouse-configuration API" when it was introduced four or five years ago, Hutterer said. For a year or so, that was going well but it fell by the wayside as he and others ran out of time to work on it—and no one else stepped up. There are a lot of people who want their mouse to work correctly, he said, but there are not a lot who are willing to help make that happen. Given that, he is not sure what the future holds for libratbag.
libinput quirks
Around a year and a half ago, the "libinput quirks" feature was added. There are a lot of devices that are broken in some fashion, so there needs to be a way to indicate that. For example, some devices claim to have buttons they don't have, don't claim buttons they do have, are upside down, and so on. Back in 2014, these quirks were stored in the udev hardware database (hwdb), which is a simple-to-use key/value store that is available on every system.
Over time, though, the hwdb approach became unmaintainable; libinput started using hierarchical and nested quirks that were essentially unable to be debugged. In addition, depending on how the hwdb was updated, quirks would seemingly randomly be applied or overridden. For those reasons, over the last two years the project switched to using .ini files to describe the quirks in one place that should be easy for users to find and work with.
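A quirk is just an ini-style section that matches a device and applies overrides; the entry below is invented to show the general shape of the format, and the exact Match* and Attr* key names should be checked against libinput's quirks documentation:

    [Example Vendor Touchpad]
    MatchUdevType=touchpad
    MatchName=*Example Vendor Touchpad*
    AttrPressureRange=150:130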
libinput-record
The evemu utilities to record and replay event streams have been replaced with libinput-record and libinput-replay. The custom format used by evemu was not extensible, so the new tools use a YAML format. Most importantly, Hutterer said, the new tools are in the libinput repository and are released at the same time as libinput, which should eliminate the version mismatch problems that have plagued debugging efforts in the past.
Hold gestures
Something that he has been thinking about adding to libinput is "hold" gestures. It already has "swipe" and "pinch" gestures, so adding a hold gesture, where you place three fingers, for example, down on the touchpad without moving them, would make sense. There could be hold gestures for one, two, and three fingers.
The hold event might simply be the start of a pointer movement, though, so some kind of "hold cancel" event would be needed before the pointer-movement events. If the fingers are placed down again, another hold event would be reported. This would allow a two-finger flick to start kinetic scrolling (which is implemented in the applications), and then make it possible to detect when to stop the scrolling because the user has touched down again after seeing the content they were interested in scroll past. There is no code for hold gestures as yet, he said, so he encouraged those interested to get in touch to discuss how it should all work.
Hutterer covered some other topics in the talk (YouTube video) as well, including adding support for the Dell Canvas Dial totem device, which is meant as a secondary input device for menu selection on a drawing tablet. While it is now supported in libinput, he was not optimistic that any applications would actually add functionality for it. There were also some changes to simplify XKB configuration. In truth, it all sounds like quite a bit was accomplished with a really small base of developers. We need to hope that changes—and that Hutterer steers clear of buses.
[I would like to thank the X.Org Foundation and LWN's travel sponsor, the Linux Foundation, for travel assistance to Montréal for XDC.]
Page editor: Jonathan Corbet