Leading items
Welcome to the LWN.net Weekly Edition for June 14, 2018
This edition contains the following feature content:
- Linux distributions and Python 2: a Python Language Summit discussion on how distributors will handle the upcoming end of Python 2 support.
- A Python static typing update: tools for adding static type checking to a dynamic language.
- Python virtual environments: a discussion on the ups and downs of Python's virtualenv.
- Year-2038 work in 4.18: changes merged in this development cycle to address the year-2038 issue.
- 4.18 Merge window, part 1: a summary of the first 7,500 changesets merged for 4.18.
- Heterogeneous memory management meets EXPORT_SYMBOL_GPL(): a disagreement over the GPL-only status of an internal function used by HMM.
- Handling I/O errors in the kernel: the perennial LSFMM error-handling topic returns.
- Messiness in removing directories: fixing some unpleasant race conditions in directory removal.
- Filesystem test suites: making it easier for filesystem developers to test their changes.
- XArray and the mainline: getting the XArray patches upstream.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Linux distributions and Python 2
Python 2.7 will reach its end of life in less than two years—at least for the core development team. Linux distributions need to figure out how to handle the transition given that many of their users are still using that version of the language—and may still be using it well beyond the end-of-life date. Petr Viktorin and Matthias Klose led a session at the 2018 Python Language Summit to discuss distributions' approaches to deprecating Python 2.
Viktorin works for Red Hat and focused on the Fedora distribution. He wants to figure out how to help Python's downstream distributors so that Python 2 can be fully discontinued. There are two different ways to do that: either make sure that everyone switches to Python 3 or simply deprecate Python 2 and "wash our hands" of the problem. He would prefer the first alternative. He will be working on this transition for Red Hat as part of his day job and would like to do it in the community as well; that will minimize the need to maintain Python 2 going forward.
About two-thirds of Fedora's packages are now Python 3 compatible. What is left is mostly stuff that no one cares about, he said, but there are still some important packages that need Python 2. There are also packages where the Python 2 version could be dropped in favor of one that works with Python 3, but the distribution does not know whether that will upset users. In theory, the package maintainer should have some idea, but in many cases, they do not.
The list of things that have not been ported to Python 3 includes the Mercurial and Bazaar distributed version-control systems, plugins for tools like GIMP and Inkscape, and bindings for Samba and various database-management systems. In addition, OpenStack is not yet Python 3 ready.
Fedora is a community distribution and it can at times suffer from a lack of interested maintainers. Some think the model of the interested distribution package maintainer is outdated and that package repositories like the Python Package Index (PyPI) should be used instead. But distribution packagers still have their place, Viktorin said. For example, the licensing information on PyPI is sometimes wrong, which is often caught by the distribution maintainers. One other barrier for community distributions is that there is no real authority to direct the volunteer efforts, so the distribution can't really deprecate Python 2 directly.
In companies it is difficult to get funds allocated for refactoring working Python 2 code to Python 3, so that change is usually pushed by developers from below. On the other hand, advertising the use of Python 3 is useful for recruiting developers, or so he has heard from several sources.
One of the goals for Fedora in this process is to make sure that it doesn't make things harder for those users who have already switched to Python 3. The first step in the transition is to move leaf packages (i.e. those that have no other packages dependent on them) to Python 3. Those changes have to be tested to ensure there are no lingering Python 2 dependencies. If there are problematic packages, where it is hard to switch them over to Python 3, he wants to help get them working.
Fedora still needs to support Python 2.7 as long as it is needed, he said. It also supports Python 3.3, 3.4, and 3.5, but only for testing that packages and applications work with them. Łukasz Langa asked what "as long as needed" means for Python 2.7; Viktorin said that it is as long as someone is depending on it. That didn't sit entirely well; Langa said that people will be reporting bugs in 2.7 to the Python community even though it is past its end of life. But Klose pointed out that the situation is no different than today; "end of life" Python versions are still supported, especially by the enterprise distributions.
Nick Coghlan noted that RHEL 7 shipped with Python 2.7 and it will be supported until 2024. Viktorin said the problem is going to keep happening, no matter who is maintaining old Python versions; what he wants is for whoever does the maintenance to care about it.
There are some "little annoyances" with gradually shifting over to Python 3, however. PEP 394 says that executing "python" should currently invoke Python 2. It also says that "python" in the shebang ("#!") line should mean that the script is compatible with both 2 and 3. However, maintainers of Python scripts may either be unaware of the PEP or just have not gotten around to switching, so the two "requirements" are at odds a bit. As detailed in a pull request to change the PEP, it would make sense to have no python symbolic link at all in some environments, especially in places where removing dependence on Python 2 is desired.
Langa suggested that perhaps switching the python symbolic link to Python 3 could be aligned with a long-term support (LTS) version of Python as was discussed in an earlier session. If, say, the 3.9 release were made in mid-2020 as an LTS, that might make a good point to change the recommendation in the PEP.
Klose took over at that point to give his perspective as a Debian and Ubuntu Python maintainer. For Ubuntu 16.04 LTS and Debian 9 ("stretch"), the python symbolic link points to Python 2. For Ubuntu 18.04 LTS, Python 2 is not part of the installation image. Invoking python will either remind users how to install Python 2 or will note the existence of Python 3 if it is installed.
There is ongoing work to get the distributions Python 3 ready. There are bug reports for things that do not work with Python 3, but they may be ignored if the bugs do not show up for Python 2, he said. What will not be so easily ignored is Lintian, which is a tool that finds problems in Debian packages. Adding Python 3 checks to Lintian should help reduce the number of problems.
Debian has made the decision to ship Python 2 with Debian 10 ("buster"), which will be released in mid-2019; that means Python 2 will be supported in Debian well past its 2020 end of life. For Ubuntu, the plan for the 20.04 LTS release has Python 2 removed from the main repository. That means it will not be supported by Canonical, so the community will need to pick it up if it is to continue at that point.
A Python static typing update
One of the larger features added to Python over the last few releases is support for static typing in the language. Static type-checking and tools to support it show up frequently as topics at the Python Language Summit (PLS) and this year was no exception. Mypy developers Jukka Lehtosalo and Ivan Levkivskyi gave an update on static typing at PLS 2018.
Lehtosalo started things off by talking about stub files, which contain type information for libraries and other modules. If you are going to type-check code that uses outside modules, from the standard library or a third-party library, the tool needs to understand the types used in the public interfaces of the library. The type-checking that can be done is limited if there are no stubs for the libraries used.
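For readers unfamiliar with the format, a stub is just a .pyi file that repeats a module's public signatures with the bodies elided; the module and names below are invented for illustration, not taken from any real package:

    # geometry.pyi -- a hypothetical stub for an imaginary "geometry"
    # module; only signatures and attribute types appear, bodies are "..."
    class Point:
        x: float
        y: float
        def __init__(self, x: float, y: float) -> None: ...

    def distance(a: Point, b: Point) -> float: ...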
Right now, static typing is only partially useful for large projects because they tend to use a lot of packages from the Python Package Index (PyPI), which has limited stub coverage. There are only 35 stubs for third-party modules in the typeshed library, which is Python's stub repository. By comparison, there are 200 stubs for standard library modules even though PyPI is much larger than the standard library.
He suggested that perhaps a centralized library for stubs is not the right development model. Some projects have stubs that live outside of typeshed, such as Django and SQLAlchemy. But that makes those stubs a hassle to use. PEP 561 ("Distributing and Packaging Type Information") will provide a way to pip install stubs from packages that advertise that they have them. The current draft version of PEP 561 is supported by mypy.
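As a rough sketch of how a package might advertise its own inline type information under the PEP (the details here follow the PEP's general shape and may not match the draft exactly; the package names are hypothetical), the package ships a py.typed marker file and lists it in its packaging metadata:

    # setup.py for a hypothetical "example_pkg" distribution that ships
    # inline type annotations; the py.typed marker file tells PEP
    # 561-aware tools that the package's own annotations can be used.
    from setuptools import setup

    setup(
        name="example-pkg",
        version="1.0",
        packages=["example_pkg"],
        package_data={"example_pkg": ["py.typed"]},
    )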
There are still some areas where Python's type hints (as defined in PEP 484) are not sufficient, especially where the types are derived from runtime parameters. For example, Django models have a create() method that is dependent on the definition of the model; PEP 484 has no way to express that relationship. Lehtosalo mentioned other examples, including NumPy arrays where the array dimensions are not compatible for a particular operation; once again, type hints cannot help there.
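As a much-simplified sketch of the general problem (this is a hypothetical function, not the Django case itself): the precise return type below depends on a value only known at runtime, so the best a PEP 484 annotation can do is a union.

    from typing import Union

    # Hypothetical example: the return type really depends on the value
    # of "kind", but the annotation can only express the union.
    def make_default(kind: str) -> Union[int, str]:
        if kind == "int":
            return 0     # callers passing "int" actually get an int
        return ""        # ...while everyone else gets a str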
At that point, Levkivskyi took over to describe the changes coming in Python 3.7 for type hints. PEP 560 ("Core support for typing module and generic types") has been accepted and will improve the performance of the typing module. Importing the typing module is seven times faster now, he said.
PEP 563 ("Postponed Evaluation of Annotations") will solve some of the pain points that people have experienced using the typing module, he said. It will mean that forward references to types will no longer need to be escaped as string literals. It will also stop automatically processing type annotations when importing modules, so that only those doing type-checking will pay the price at import time. Importing annotations from the __future__ module will enable the functionality for programs that want it (and are prepared to deal with the break in backward compatibility).
Work is also in progress on PEP 544 ("Protocols: Structural subtyping (static duck typing)"). It will allow type-checkers to infer support for some Python protocols (such as Iterable) for a class rather than require that the class be explicitly marked to support the protocol. The PEP is close to acceptance, Levkivskyi said, and mypy fully supports it.
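A small sketch of what structural subtyping looks like; note that, at the time of writing, Protocol comes from the third-party typing_extensions package rather than the standard library's typing module:

    from typing_extensions import Protocol

    class SupportsClose(Protocol):
        def close(self) -> None: ...

    class Resource:              # no explicit inheritance from the protocol
        def close(self) -> None:
            print("closed")

    def shutdown(thing: SupportsClose) -> None:
        thing.close()

    shutdown(Resource())         # accepted: Resource matches structurally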
Another addition is the TypedDict type, which allows a dictionary with fixed keys and known value types to be declared. If a Movie type is a dictionary with two keys (name and year), it could be declared this way:
    Movie = TypedDict('Movie', {'name': str, 'year': int})

That would allow type-checkers to infer types, complain about incorrect keys, and give errors when the wrong types are assigned. TypedDict needs a short PEP that has not been written yet; mypy does not fully support it yet, either.
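A brief usage sketch (the movie values are invented, the errors shown are what a checker with full TypedDict support would report, and the type is currently provided by the mypy_extensions package):

    from mypy_extensions import TypedDict

    Movie = TypedDict('Movie', {'name': str, 'year': int})

    movie: Movie = {'name': 'Metropolis', 'year': 1927}  # OK
    movie['year'] = 'next year'    # checker error: expected "int", got "str"
    other: Movie = {'name': 'Metropolis'}  # checker error: missing key "year"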
Python virtual environments
In a short session at the 2018 Python Language Summit, Steve Dower brought up the shortcomings of Python virtual environments, which are meant to create isolated installations of the language and its modules. He said his presentation was "co-written with Twitter" and, indeed, most of his slides were of tweets. At the end, he also slipped in an announcement of his plans for hosting a core development sprint in September.
The title of the session was taken from David Beazley's tweet on May 1: "Virtual environments. Not even once." Thomas Wouters defended virtual environments in a response.
But Beazley and others (including Dower) think that starting Python tutorials or training classes with a 20-minute digression on setting up a virtual environment is wasted time. It does stop pip install from messing with the global environment, but it has little or nothing to do with actually learning Python. Dower noted that Pipenv is supposed to solve some of the problems with virtual environments, but it "feels a bit clunky", according to a tweet by Trey Hunner.
In another Twitter "thread", there was a discussion of potential changes to pip so that it would gain the notion of local versus global installation. That might be a path toward solving the problems that folks see with virtual environments and Pipenv. Dower said he is willing to create a PEP if there is a consensus on a way forward.
He would like to see a way to do local package installation without using virtual environments. He also would like to have a way to invoke the "right" Python (the right version from the right location) without using virtual environments. But for those who are using virtual environments, he would like them to be relocatable, so that users can copy them elsewhere and have them still be functional. Barry Warsaw suggested making pip --user the default as it is in Debian and Ubuntu; Dower said that only "localizes the damage" and doesn't really solve the problem.
Core development sprint
Dower has volunteered to host a core development sprint to work on CPython. He has scheduled it for September 10-14, 2018 in Redmond, Washington on the campus of his employer, Microsoft. They will have an entire building to use for the sprint. There will be a hotel block reserved in Bellevue, since it is a more interesting place to stay, he said. Around 25-30 developers will be invited to attend; active developers or those with a PEP under consideration should expect to get an invite. He is hoping that the Python Software Foundation will pick up the travel expenses for the invitees, but any core developer is welcome to attend.
Year-2038 work in 4.18
We now have less than 20 years to wait until the time_t value used on 32-bit systems will overflow and create time-related mayhem across the planet. The grand plan for solving this problem was posted over three years ago now; progress since then has seemed slow. But quite a bit of work has happened deep inside the kernel and, in 4.18, some of the first work that will be visible to user space has been merged. The year-2038 problem is not yet solved, but things are moving in that direction.

If 32-bit systems are to be able to handle times after January 2038, they will need to switch to a 64-bit version of the time_t type; the kernel will obviously need to support applications using that new type. Doing so in a way that doesn't break existing applications is going to require some careful work, though. In particular, the kernel must be able to successfully run a system where applications have been rebuilt to use a 64-bit time_t, but ancient binaries stuck on 32-bit time_t still exist; both applications should continue to work (though the old code may fail to handle times correctly).
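As a point of reference, the limit itself is easy to pin down: a signed 32-bit time_t holds at most 2^31 - 1 seconds past the Unix epoch, which runs out early on January 19, 2038. This little Python snippet (purely an illustration, unrelated to the kernel patches) shows the arithmetic:

    # A signed 32-bit time_t overflows after 2**31 - 1 seconds past the
    # Unix epoch (January 1, 1970 UTC).
    from datetime import datetime, timezone

    max_time_t = 2**31 - 1
    print(datetime.fromtimestamp(max_time_t, tz=timezone.utc))
    # 2038-01-19 03:14:07+00:00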
The first step is to recognize that most architectures already have support for applications running in both 64-bit and 32-bit modes in the form of the compatibility code used to run 32-bit applications on 64-bit systems. At some point, all systems will be 64-bit systems when it comes to time handling, so it makes sense to use the compatibility calls for older applications even on 32-bit systems. To that end, with 4.18, work has been done to allow both 32-bit and 64-bit versions of the time-related system calls to be built on all architectures. The CONFIG_64BIT_TIME configuration symbol controls the building of the 64-bit versions on 32-bit systems, while CONFIG_COMPAT_32BIT_TIME controls the 32-bit versions.
Internally, some work has been done to keep the handling of time formats as simple as possible. The new __kernel_timespec type describes how 64-bit timespec values will be passed between the kernel and user space; it is designed to be the same for both 64-bit applications and those running under 32-bit emulation.
The long-term plan for many system calls with year-2038 issues is to create new versions, under new system-call numbers, that handle times in the __kernel_timespec format. The old versions, which will not handle 2038 correctly, will retain the old system-call numbers, so they will still be there for applications that expect them. Applications that are built for 64-bit time values will use the new versions and function correctly. For the most part, the patches for this phase of the work exist but have not yet found their way into the mainline.
Among the system calls that have changed are those managing System V interprocess communication. These system calls, providing access to semaphores, shared memory, and message queues, are not universally loved, but they do have users and need to continue to work. They also have interfaces using time_t values. For example, the semctl() system call uses the semid_ds structure, defined as:
    struct semid_ds {
        struct ipc_perm sem_perm;      /* Ownership and permissions */
        time_t          sem_otime;     /* Last semop time */
        time_t          sem_ctime;     /* Last change time */
        unsigned long   sem_nsems;     /* No. of semaphores in set */
    };
This structure looks like it would be difficult to extend to 64-bit time values without breaking compatibility, but the reality of the situation is a good illustration of how the view of system calls provided by the C library does not always match the actual interface provided by the kernel. The structure that is actually passed into and out of the kernel is rather different; the C library takes responsibility for converting between the two. The kernel's structure looks like this:
    struct semid64_ds {
        struct ipc64_perm sem_perm;    /* permissions .. see ipc.h */
        __kernel_time_t   sem_otime;   /* last semop time */
        unsigned long     __unused1;
        __kernel_time_t   sem_ctime;   /* last change time */
        unsigned long     __unused2;
        unsigned long     sem_nsems;   /* no. of semaphores in array */
        unsigned long     __unused3;
        unsigned long     __unused4;
    };
This is the 32-bit version of the structure with some #ifdef lines taken out; the full definition can be found in include/uapi/asm-generic/sembuf.h. What jumps out here is the padding that exists between the time fields. Somebody, years ago (before the beginning of the Git era), decided that the kernel should use the semid64_ds structure on all systems and ensured that enough space existed to pass 64-bit time values at some time in the future.
Many years later, that decision is paying off. In 4.18, the kernel will be able to unconditionally return 64-bit times for sem_otime and sem_ctime, with no compatibility issues to worry about. To that end, the structure (on 32-bit systems) now looks like:
    struct semid64_ds {
        struct ipc64_perm sem_perm;    /* permissions .. see ipc.h */
        unsigned long     sem_otime;   /* last semop time */
        unsigned long     sem_otime_high;
        unsigned long     sem_ctime;   /* last change time */
        unsigned long     sem_ctime_high;
        unsigned long     sem_nsems;   /* no. of semaphores in array */
        unsigned long     __unused3;
        unsigned long     __unused4;
    };
The extra bits in the _high fields will be ignored until the C library is upgraded to use them, but that can happen independently. There are some minor issues to be dealt with (the padding values are in the wrong place on big-endian systems, necessitating a swap operation, for example), but the change is essentially painless.
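For the curious, the split representation is simple bit arithmetic; this sketch (in Python, purely as an illustration of the scheme rather than of any actual C-library code) shows how the two halves would be combined:

    # Reassemble a 64-bit timestamp from the split fields shown above;
    # the low 32 bits live in sem_otime, the rest in sem_otime_high.
    def combine_time(low: int, high: int) -> int:
        return (high << 32) | (low & 0xFFFFFFFF)

    # A time just past the signed 32-bit limit (i.e. after January 2038):
    print(combine_time(0x80000000, 0))    # 2147483648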
The one remaining piece, involving a bit more pain, is semtimedop(), which takes a struct timespec parameter. That call will have to be split into old and new versions, as described above — a change that has not found its way into 4.18.
The merging of these changes for 4.18 shows that the work on the year-2038 problem is progressing. There is still quite a bit to do; beyond the new system calls, there are a bunch of ioctl() operations that will need to be found and fixed, for example. But, from the kernel point of view at least, perhaps there is some light visible at the end of the tunnel. A complete solution will also require a lot of work at the C-library, distribution, and application levels, though, so we are likely to be hearing about year-2038 work for a while yet.
4.18 Merge window, part 1
As of this writing, 7,515 non-merge changesets have been pulled into the mainline repository for the 4.18 merge window. Things are clearly off to a strong start. The changes pulled this time around include more than the usual number of interesting new features; read on for the details.
Architecture-specific
- The 32-bit ARM architecture has gained fixes for Spectre variants 1 and 2.
- 32-bit x86 systems now have a just-in-time compiler for eBPF programs.
Core kernel
- There is a new polling interface for use with asynchronous I/O.
- The new no5lvl command-line parameter turns off five-level paging even if the kernel and the hardware support it. This is essentially a "chicken bit" that can turn off this new feature if it creates problems.
- The power domain performance levels patch set has been merged. This code extends the power-management subsystem to be able to run the entire system (including peripheral devices) according to the needed power/performance balance.
- Trace markers (described briefly in this article) can now be used to fire triggers for actions like histogram generation. See this documentation patch for details.
- The control-group memory controller supports a new memory.min parameter. Like the existing memory.low, it is meant to ensure that the group has a minimum amount of RAM available to it, but it is meant to provide a stronger guarantee even when no reclaimable memory exists. This commit includes the documentation for this new parameter.
Filesystems
- There have been a few Btrfs improvements this time around. An empty subvolume can now be deleted with rmdir(); no special capabilities are required. The new FS_IOC_FSGETXATTR and FS_IOC_FSSETXATTR ioctl() commands can be used to manipulate various file attributes (the append-only and immutable flags, for example). There is also a new set of ioctl() commands to allow unprivileged users to look up subvolume information.
- It is now possible for users running as root within a user namespace to mount filesystems, even if they lack privilege outside of the namespace. The filesystem type itself must be marked to allow this type of mount; the only filesystem so marked at this point is the filesystems in user space (FUSE) module, which has seen a number of changes to enable this functionality.
- The fscrypt module, used for encryption of F2FS and ext4 filesystems, has gained support for the Speck128 and Speck256 ciphers. Speck is somewhat controversial (the US NSA seems a little too enthusiastic about promoting it), but it does enable encryption on the lowest-end devices. Ted Ts'o said that it's unlikely to be enabled on higher-end devices at all. "This is really intended for 'The Next Billion Users'; phones like Android Go that was disclosed at the 2017 Google I/O conference, where the unsubsidized price is well under $100 USD".
Hardware support
- Audio: Realtek RT1305/RT1306 amplifiers, Realtek RT5668B codecs, Mediatek MT6351 audio codecs, Analog Devices SSM2305 speaker amplifiers, Atmel I2S controllers, and Tempo Semiconductor TSCS454 codecs.
- Graphics: NVIDIA Volta GPUs, Xen paravirtualized front-end displays, Allwinner A31 MIPI-DSI controllers, Thine THC63LVD1024 LVDS decoder bridges, Cadence DPI/DSI bridges, Samsung Exynos image scalers, Broadcom V3D 3.x (and newer) GPUs, NXP TDA9950/TDA998X HDMI CEC engines, and AMD Vega 20 GPUs.
- Media: Video devices using I2C transports, Sharp QM1D1B0004 tuners, Cadence CSI-2 RX/TX controllers, OmniVision OV7251 sensors, Renesas R-Car MIPI CSI-2 receivers, and Sony IMX258 sensors. The (170 KLOC) atomisp driver has been removed from the staging tree due to a lack of progress.
- Miscellaneous: ChipOne icn8505 touchscreens, Crane CR0014114 LED boards, Spreadtrum Communications SC27xx LED controllers, TI LM3601x LED controllers, Lattice MachXO2 SPI FPGA controllers, Rave SP EEPROM controllers, IBM virtual management channel adapters, Rockchip PCIe endpoint controllers, and STMicroelectronics STM32 inter-processor communication controllers.
- Networking: Texas Instruments DP83TC822 PHYs and Microsemi Ocelot Ethernet switches.
- Pin control: Actions Semi S900 pin controllers, Renesas R8A77470 and R8A77990 pin controllers, and Allwinner H6 R_PIO pin controllers.
- USB: Richtek RT1711H Type-C USB controllers, Aspeed vHub virtual hubs, HiSilicon STB xHCI host controllers, Atheros AR71XX/9XXX USB PHYs, and MediaTek XS-PHY transceivers.
Miscellaneous
- The crypto subsystem now supports the Zstandard compression algorithm and the AEGIS and MORUS encryption algorithms.
- The /proc interface for IPMI statistics has been removed; that information is still available in sysfs.
- The (scrupulously undocumented) BPF type format mechanism provides a metadata format for the description of the data types used by BPF programs. Its initial use is for the pretty-printing of values in BPF maps.
Networking
- The TCP protocol now supports zero-copy receive operations under some conditions.
- The AF_XDP subsystem has been merged; it allows zero-copy networking under the control of one or more BPF programs loaded from user space. This commit contains a sample AF_XDP application.
- The core bpfilter mechanism has been merged. It is not truly functional for packet filtering at this point, but the infrastructure is now there to build on. That infrastructure includes a reworked user-mode blob helper mechanism that is likely to see use well beyond bpfilter.
- The in-kernel TLS protocol implementation has gained support for offloading that protocol support into suitably capable hardware. The Mellanox mlx5 driver now supports TLS offloading.
- The TCP protocol supports selective acknowledgment (SACK) compression; its purpose is to limit the number of SACK packets sent when the network is already overloaded.
- It is now possible to attach a BPF program to a socket and have it run on sendmsg() calls; that program can do things like rewrite the IP addresses in the outgoing packet.
Internal kernel changes
- As part of the new AIO polling mechanism, the interface to the poll() method has changed. The new function is:

    __poll_t (*poll_mask) (struct socket *sock, __poll_t events);
Many internal poll() implementations have been converted to this interface. To be able to support AIO polling, drivers should also implement the new get_poll_head() method, which returns the wait queue used for polling.
- The qspinlock implementation has been improved to eliminate potential starvation problems.
- There is a new __kernel_timespec structure:
    struct __kernel_timespec {
        __kernel_time64_t tv_sec;      /* seconds */
        long long         tv_nsec;     /* nanoseconds */
    };
Its purpose is to facilitate the creation of year-2038-safe system calls on 32-bit systems by making the internal time representation be the same for both the 32-bit and 64-bit versions. Various implementations of system calls with year-2038 problems (nanosleep(), for example) have been updated to use this new type.
- The Sys V interprocess communication system calls have seen some work to make them year-2038 safe.
- The kernel configuration language has grown a new macro definition subsystem; it is intended to facilitate moving various build-time tests from the makefiles into the Kconfig files. See Documentation/kbuild/kconfig-macro-language for details on how it works.
- A number of the improvements to struct page discussed at LSFMM 2018 have been merged.
By the normal schedule, the 4.18 merge window should continue through June 17. The second half is likely to be somewhat slower than the first, though, since Linus Torvalds has indicated that he will be traveling during that time. If all goes to schedule, the final 4.18 release can be expected on August 5 or 12.
Heterogeneous memory management meets EXPORT_SYMBOL_GPL()
One of the many longstanding — though unwritten — rules of kernel development is that infrastructure is not merged until at least one user for that infrastructure exists. That helps developers evaluate potential interfaces and be sure that the proposed addition is truly needed. A big exception to this rule was made when the heterogeneous memory management (HMM) code was merged, though. One of the reasons for the lack of users in this case turns out to be that many of the use cases are proprietary; that has led to some disagreements over the GPL-only status of an exported kernel symbol.

The HMM subsystem exists to support peripherals that have direct access to system memory through their own memory-management units. It allows the ownership of ranges of memory to be passed back and forth and notifies peripherals of changes in memory mappings to keep everything working well together. HMM is not a small or simple subsystem, and bringing it into the kernel has forced a number of low-level memory-management changes. After a multi-year development process, the core HMM code was merged for the 4.14 kernel, despite the lack of any users.
The immediate issue has to do with HMM's use of devm_memremap_pages(), which allows the mapping of pages that exist in device memory. Early versions of HMM used this function before switching to an internal version with some changes. Dan Williams recently posted a patch series adjusting devm_memremap_pages() and changing HMM to use it, getting rid of the duplicated code. That change is not controversial, but one other part of the patch set is: he changed the export declaration of devm_memremap_pages() to EXPORT_SYMBOL_GPL().
There are, of course, two ways to export symbols from the kernel to loadable modules, with and without the _GPL suffix. Symbols exported with that suffix will be unavailable to any module that does not declare a GPL-compatible license. It is a statement that, in the developers' belief, any use of those symbols will necessarily make the module a derived work of the kernel. In this case, the proposed changes will make it harder for proprietary modules to use HMM.
Jérôme Glisse, the author of HMM, is naturally opposed to this change, since it defeats part of the purpose for HMM in the first place. Dave Airlie has also questioned the change, noting that devm_memremap_pages() was exported normally for three years and wondering what has changed.

Williams responded that the initial marking of the symbol was "an oversight" that is being corrected now. In support of the claim that any user of devm_memremap_pages() must be derived from the kernel, he pointed out that turning on this remapping capability changes the kernel fundamentally. The reverse of Airlie's logic also works: if a user of this functionality was a derived work of the kernel before, the non-GPL status of the export will not have changed that fact.
Williams further explained the reasoning behind his proposed changes. The rest of his message, though, perhaps gets closer to the real source of this particular dispute: the fact that there are no in-tree users of the HMM functionality.
Glisse has a response to all of these complaints. HMM, he says, is meant to isolate drivers from core memory-management internals rather than tying them together. There is a user now in the form of patches to the Nouveau driver for NVIDIA GPUs; he said he hopes to get that code upstream in 4.19. And upstreaming the pieces, he said, has been "a big chicken and egg nightmare" with a lot of independent pieces to prepare together; that has made it hard to get the users merged along with the infrastructure.
The merging of the Nouveau code, if and when it happens, should resolve the question of whether HMM should be in the kernel at all; it might reopen some questions about specific HMM interfaces, though. The question about the GPL-only export may prove harder to reach a conclusion on. There is no easy or objective standard for deciding whether the use of a specific kernel function makes a module into a derived work; it usually comes down to the judgment of the developers who wrote the code in the first place. In this case, those developers are Williams and Christoph Hellwig, who has stated that he is willing to enforce the GPL against users of devm_memremap_pages().
While a case could thus be made for changing the status of this symbol, it's not at all clear what will actually happen. Either Andrew Morton or Linus Torvalds will almost certainly end up making the final decision. It is more clear, though, that a number of developers are unhappy with the no-users status of HMM in the kernel. The most likely outcome of this particular episode may end up being a redoubling of the community's determination not to accept new subsystems into the kernel until users exist.
Handling I/O errors in the kernel
The kernel's handling of I/O errors was the topic of a discussion led by Matthew Wilcox at the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) in a combined storage and filesystem track session. At the start, he asked: "how is our error handling and what do we plan to do about it?" That led to a discussion between the developers present on the kinds of errors that can occur and on ways to handle them.
Jeff Layton said that one basic problem occurs when there is an error during writeback; an application can read the block where the error occurred and get the old data without any kind of error. If the error was transient, data is lost. And if it is a permanent error, different filesystems handle it differently, which he thinks is a problem. Dave Chinner said that in order to have consistent behavior across filesystems, there needs to be a definition of what that behavior should be. There is a need to distinguish between transient and permanent failures and to create a taxonomy of how to deal with each type.
Kent Overstreet said that transient errors should be handled by the lower layers so that filesystems never see a transient write error. But others brought up transient errors that might make sense to return to the filesystems (e.g. no blocks on a thinly provisioned device, some kind of authorization problem or expiration). So, Chinner said, that might add another class of error beyond transient and permanent: user-response required.
Because of thin provisioning, ENOSPC can be a transient or permanent error, but what about transient errors that last so long they should be treated as permanent, Layton asked. Overstreet said there should be a global setting for how long operations with transient errors should be retried. XFS has multiple settings, Chinner said: try forever, do not retry, or a timeout coupled with a number of retries.
It might be nice if there were a single knob for administrators to set the behavior they want, Ted Ts'o said. But it would need to be a per-device setting and administrators would probably want to control it per filesystem as well, so it really wouldn't be a single knob. Chinner said that is as it should be, there are complicated systems out there and administrators need to be able to tweak things at every level. Another attendee said that different transports have different characteristics, so treating USB, PCI, and Fibre Channel devices the same does not really make sense.
But you do want a random user to be able to plug a USB drive into their laptop and get reasonable settings, Ts'o said. That's where defaults come in, Chinner said. Layton said there was a need for lots of knobs, but sane defaults. Overstreet said it would help if there was some consistency in where the knobs are, though.
Wilcox asked what changes could come before next year's LSFMM. Chinner said that XFS will probably extend its metadata error handling to its data blocks over the next year. Ts'o wondered if other filesystems should follow XFS's lead on that. Layton asked what XFS would do for pages with transient errors during writeback and whether they would be left in the dirty state, unlike what is done today. Chinner said that it would leave the pages dirty for transient errors that are still being retried.
Layton said that informing user space will be tricky; if the writes are still being retried, that means they could still eventually fail. A call to fsync() is an opportunity to throw the dirty pages away (by marking them as clean) since the error can be reported to user space at that point. Whatever is done, it should be done in the VFS layer so that each filesystem does not have to do it, Wilcox said.
In a final exchange as the time for the session was expiring, Layton wondered about permanent errors. If the application does a read after a failed write that effectively throws away the changes by marking the page clean, it might get the new data, which won't make it to permanent storage. Chinner suggested that perhaps the read operation could return an error under those conditions.
Messiness in removing directories
In the filesystem track at the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM), Al Viro discussed some problems he has recently spotted in the implementation of rmdir(). He covered some of the history of that implementation and how things got to where they are now. He also described areas that needed to be checked because the problem may be present in different places in multiple filesystems.
The fundamental problem is a race condition where operations can end up being performed on directories that have already been removed, which can lead to some rather "unpleasant" outcomes, Viro said. One warning, however: it was a difficult session to follow, with lots of gory details from deep inside the VFS, so it is quite possible that I have some (many?) of the details wrong here. Since LSFMM there has been no real discussion of the problem and its solution on the mailing lists that I have found.
Viro said that some reports from the syzkaller fuzzer bot (syzbot) just prior to the summit had started him looking at rmdir(). The easiest way to trigger the problem syzbot found is to remove a directory with an enormous directory entry (dentry) tree in the cache. The call will fail because the directory is not empty, but in the process it will call shrink_dcache_parent() for historical reasons. The code previously checked that the directory inode reference count was one and returned EBUSY if it was not. It was an easy check that would prevent anyone from creating an entry in the directory after it was deleted, which could lead to filesystem corruption.
But then the dentry cache (dcache) was added; there was no longer a reference to the inode for a cached reference to the directory dentry. The test could change to check the dentry instead of the inode, but negative dentries would have references to the directory dentry, which would make the test fail. The solution to that was to try to evict child dentries from the cache before doing the check. It was done after the check to ensure the directory is empty, but there is still a race.
The ext2 filesystem added a step where it set the victim's i_size to zero, which would allow removing the directory even when it was busy. Around the beginning of the 2.4 era, Viro got "sufficiently annoyed" by races around directory removal that he lifted the ext2 solution into the VFS layer. Instead of changing i_size, though, his code would just mark the victim while it was locked. All of the filesystem primitives would then check that the directory was not marked dead before operating on it.
Around 2011, it was noticed that the dcache could still have negative dentries for children and a positive dentry for the directory itself after it had been removed. The obvious solution was to use shrink_dcache_parent() and to remove the directory dentry after an rmdir(). It turned out that rename() had a similar case with the exact same problems, he said.
The "real mess" that he has spotted recently has to do with removing a directory on a special filesystem (e.g. configfs, debugfs) if something is mounted on it. It used to be that a directory with something mounted on it could not be removed, but the container folks complained about that. One container could block many others from cleaning up by making a directory in a shared filesystem and then mounting something on it. That was changed so that the directory can be deleted, but doing so leaks a struct vfsmount object.
It is not just rmdir() that is affected; if it were, it could simply be fixed there. For example, write() has "no idea this kind of thing is possible". The problem affects other filesystems too, including sysfs, selinuxfs, and apparmorfs, but not procfs.
rmdir() and rename() obviously need to be fixed, Viro said. He looked at NFS and thinks it does not suffer from this problem, but he is not sure about CIFS or AFS (and said he doesn't even want to think about ncpfs). The 4.18 merge window should clear up the ncpfs problem, since that filesystem was removed from the kernel as part of the staging tree pull. Viro hopes to get the cluster filesystem developers looking at those. He also asked that filesystem developers check that all of their filesystem's operations (ioctl(), chmod(), ...) will not operate on a directory that has been removed.
Filesystem test suites
While the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) filesystem track session was advertised as being a filesystem test suite "bakeoff", it actually focused on how to make the existing test suites more accessible. Kent Overstreet said that he has learned over the years that various filesystem developers have their own scripts for testing using QEMU and other tools. He and Ted Ts'o put the session together to try to share some of that information (and code) more widely.
Most of the scripts and other code has not been polished or turned into a project, Overstreet continued. Bringing new people up to speed on the tests and how they are run takes time, but developers want to know how to run the tests before they send code to the maintainer.
Ts'o said that he had a goal for his xfstests-bld tool: give people submitting ext4 patches no excuses for not running the smoke tests. He wants to make sure that patches have been at least minimally tested before spending time reviewing them. Xfstests-bld has support for a handful of different filesystems, but just with default options for any beyond ext4; his hope is that other filesystems will also use it and provide suitable configurations.
J. Bruce Fields said that he runs the smoke tests on all patches anyway, so he would rather have patch submitters run their own tests on the code. But Ts'o was adamant that since there are more submitters than maintainers, he wants to know that the smoke tests have been run before looking at a submission.
Overstreet has a test framework that builds a test kernel, boots into it, and builds the tests and runs them with QEMU. It uses debootstrap to build a root filesystem and users do not have to mess with Kconfig; there is a stripped-down configuration that the test uses. Ts'o said that he has two configurations, one for QEMU and another for Google Compute Engine. Overstreet wondered if it was "silly for us all to maintain our own" test harnesses.
Mimi Zohar had a different complaint: getting started with xfstests is difficult. There is minimal documentation and no default configuration, which means it takes a long time to get going. That is why some have created these test scripts, Ts'o said.
Ts'o further described his framework. It can run tests in KVM/QEMU or push them to Android devices. The smoke parameter will run the default tests for the filesystem being tested; for ext4, those tests take about ten minutes. Full testing of ext4 takes about 20 hours, but an intern he worked with was able to reduce that to two hours by sharding the tests in Google Compute Engine. The Android xfstests require a rooted device and, likely, one you are not particularly attached to; the tests do a lot of writes, which may drastically reduce the lifetime of the device.
Eric Sandeen noted that the documentation for the different tests and frameworks is scattered among various web pages; he wondered if it could be centralized somewhere. Dave Chinner suggested patches to the xfstests documentation that at least pointed those interested to all of the different pages.
XArray and the mainline
The XArray data structure was the topic of the final filesystem track session at the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM). XArray is a new API for the kernel's radix-tree data structure; the session was led by Matthew Wilcox, who created XArray. When asked by Dave Chinner if the session was intended to be a live review of the patches, Wilcox admitted with a grin that it might be "the only way to get a review on this damn patch set".
In fact, the session was about the status of the patch set and its progress toward the mainline. Andrew Morton has taken the first eight cleanup patches, Wilcox said, which is great because there was a lot of churn there. The next set has a lot of churn as well, mostly due to renaming. The 15 patches after that actually implement XArray and apply it to the page cache. Those could be buggy, but they pass the radix-tree tests so, if they are, more tests are needed, he said.
Jeff Layton wondered if XArray should spend some time in the linux-next tree. Chinner said that it should already have been in linux-next by the time of the presentation (April 25) if it was meant to go into 4.18. Wilcox said that his code is based on linux-next and he would check to see if he could get it into that tree. Based on his June 11 posting of the XArray patches, it would seem that his plan has slipped by a development cycle.
Ted Ts'o asked about merge conflicts between the XArray patches and XFS, ext4, and Btrfs in Linus Torvalds's tree. Wilcox said there were few, since most of those were handled in Morton's merge. Wilcox said that he has only converted the page cache to use XArray in the patches.
David Howells noted that the radix-tree code is still available in parallel for now. Wilcox said that code using XArray needs to use its locking scheme as well or lockdep will complain. The patches add roughly two-thirds the number of lines they delete; the "diffstat is wrongish" because there is lots more documentation and the radix tree is still present.
Chinner asked about performance numbers; Wilcox said that he had not done any measurements, but he would welcome anyone who wanted to. Chinner said he had done some performance testing a ways back and found the difference to be in the noise.
There are more radix-tree users in the kernel, though; Chinner wondered whether those conversions would go through the maintainer trees or via some other mechanism. Wilcox said that there are 50 or 60 radix trees in the kernel; he intends to allow XArray to live for a cycle in the mainline then start submitting conversions of other radix trees to their maintainers. Anna Schumaker asked when he planned to get rid of the radix tree entirely. Wilcox said he had about a dozen users of the radix tree left to convert in his tree and those are "not scary ones"; he thinks the radix tree could be gone within six months.
Page editor: Jonathan Corbet