Leading items
Welcome to the LWN.net Weekly Edition for May 23, 2024
This edition contains the following feature content:
- The KeePassXC kerfuffle: a change to the Debian-testing KeePassXC package has users upset.
- GitLab CI for the kernel: a proposal for a whole-kernel continuous-integration system.
- Trinity keeps KDE 3 on life support: KDE 3 is not for everybody, but some users want to stick with it.
- The first half of the 6.10 merge window: the first set of changes to find its way into the mainline for the 6.10 release.
- Lots of coverage from the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit:
- An update and future plans for DAMON: DAMON and DAMOS provide a toolkit for the control of memory-reclaim policies from user space. DAMON author SeongJae Park updated the group on recent developments in this subsystem and talked about where it is going next.
- Extending the mempolicy interface for heterogeneous systems: the kernel's memory-policy API has not kept pace with hardware changes; how can that be fixed?
- Better support for locally attached memory tiering: CXL memory holds out the promise of significant cost savings, but only if the kernel can manage it properly.
- What's next for the SLUB allocator: the current and future status of the kernel's one remaining object allocator.
- Facing down mapcount madness: managing the mapping count of pages is trickier than it seems, but the situation is being improved.
- Dynamically sizing the kernel stack: kernel stacks are simultaneously too small and too big; making their size variable would solve that problem.
- Memory-allocation profiling for the kernel: a once-contentious discussion on this new feature refocuses on future improvements.
- Another try for address-space isolation: mitigations for hardware vulnerabilities have cost us a lot of performance; address-space isolation offers protection against present and future vulnerabilities while giving us that performance back.
- Faster page faults with RCU-protected VMA walks: the faster way to search through the VMA tree.
- Toward the unification of hugetlbfs: the hugetlbfs subsystem is arguably an outmoded way of accessing huge pages that imposes costs on memory-management maintenance. Coalescing it into the core will help, but it will not be an easy job.
- Merging msharefs: this proposal to allow the sharing of page tables between processes has been under consideration for some time; what is needed to get it upstream?
- Documenting page flags by committee: an attempt at large-scale collaborative authoring.
- Two sessions on CXL memory: Compute Express Link is promoted as a boon to data-center computing; two sessions looked at how the kernel can support this functionality.
- The path to deprecating SPARSEMEM: the kernel has several ways of representing physical memory; one of them may be on its way out.
- A plan to make BPF kfuncs polymorphic: a proposal that would allow kfuncs to use different implementations depending on where and how they are called.
- Virtual machine scheduling with BPF: a talk about solving the "double scheduling" problem for virtual machines.
- The interaction between memory reclaim and RCU: the reclaim process can be accelerated by using the read-copy-update mechanism to avoid locking, but there are still some problems to work out.
- Supporting larger block sizes in filesystems: another discussion of what needs to be done for filesystems in order to support block sizes larger than 4KB.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
The KeePassXC kerfuffle
KeePassXC is an open-source (GPLv3), cross-platform password manager with local-only data storage. The project comes with a number of build options that can be used to toggle optional features, such as browser integration and password-database sharing. Controversy ensued when Debian Developer Julian Klode used those flags to disable the optional features, in the name of better security, in the keepassxc package uploaded to Debian unstable for the upcoming Debian 13 ("Trixie") release.
One of the selling points of KeePassXC, in the age of everything-as-a-service, is that it stores user passwords and secrets locally. It does have a few network features, such as downloading site favicons to display next to passwords for web services and checking passwords against the "have I been pwned?" service. It also has interprocess communication (IPC) functionality to talk to browsers like Firefox, Chrome, and others that have KeePassXC browser extensions. The project provides build flags to turn these additional features off, if desired.
The idea of turning off optional features in KeePassXC's Debian package is not new; requests to do so go back as far as 2020, though they had gone unanswered until now. A Debian user filed a bug report against KeePassXC in March 2020, asking for the program to be compiled without some of its optional features. Andreas Kloeckner disagreed, saying that he found the features useful, and suggested creating a second package if users wanted a keepassxc package without those features. Another user, Tony Mancill, chimed in a year later saying he supported disabling networking. And then the bug sat for almost three years until April 2024, when it was closed after Klode uploaded the package with networking and IPC functionality turned off, along with a new keepassxc-full package with the original functionality. This meant that the new package no longer supported any features that depended on networking, on communicating with other processes such as Firefox, or on USB input from hardware tokens.
XZ-inspired
Why did Klode decide to take action on this bug several years after the fact? In an email response to questions, Klode wrote that the XZ backdoor spurred on the change. It made him "more concerned about building optional code into the default package". He suggested that optional features were more likely to contain vulnerabilities than the core application, because those features may only be reviewed by a single maintainer. It is easier, he said, for vulnerabilities to creep in if features have fewer reviewers.
The minimal build, he said, was well-aligned with what typical Debian users would want. Debian is well-known for stripping out features to improve security and privacy, and these users had chosen a password manager that stores its data locally "because they don't trust sharing their files with other users". Klode said that he did not think there would be much overlap between "these paranoid users" and users who wanted network access and to allow other applications to interact with the password database. "Certainly I've only ever used the clipboard for it and none of these features."
Users hit snags
Klode may not use the disabled features, but others do. Michael Musenbrock filed a bug report against the new package almost immediately after it was uploaded to Debian unstable, and reported that he was unable to open his password database without hardware-key support. He endorsed the idea of splitting the package up so the default package would be a "real network-less password manager", but suggested restoring the hardware-key functionality.
On May 6, Hernán Cabañas commented on the same bug report, after he had asked the upstream about missing browser-integration functionality. Jonathan White, a member of the KeePassXC project team, said that removing the functionality was a terrible decision, and recommended complaining to Debian. "They should have made a keepassxc-min instead."
On May 10, Cedric Schmeits opened an issue on the KeePassXC GitHub repository, complaining that browser integration had stopped working with KeePassXC version 2.7.7, and most of the application's options had disappeared from its settings. White tagged Klode and said that the change "needs to be reverted asap". He later edited the comment to add that Schmeits's bug report was the fourth that the project had received about the package change, and demanded: "Put the base package back where it was and create a keepassxc-minimal."
Klode replied that that was not going to happen. It had been a mistake to ship with these plugins enabled, he said, and it is "our responsibility to our users to provide them with the most secure option possible as the default". He acknowledged that it would be painful for a while, "as users annoyingly do not read the NEWS files they should be reading", and went on to say that the disabled features didn't belong in a local password manager anyway. But, if users needed them, they could "install the crappy version but obviously this increases the risk of drive-by contributor attacks".
To that, White responded: "Good luck to you. Really bad decision. We will be sure to let everyone know." That day, the KeePassXC project posted a notice on Fosstodon that Debian users should "be aware" that "the maintainer of the KeePassXC package for Debian has unilaterally decided to remove ALL features from it". This escalated things quickly, as the Fosstodon post made the front page of Lobste.rs and Hacker News.
White made it clear that the intention was to put Klode's feet to the virtual fire:
Would have been nice to have [a discussion] about a month ago when this was unilaterally put into action. Alas, here we are. So yah flaming arrows will absolutely be thrown because there was no chance at proper discourse. This thread should be a lesson to downstream folks who think they know what's "best" for the user.
That seems to have succeeded in directing angry users at Klode. In his email to LWN, he acknowledged that some of his wording was unfortunate, but said that he was in a hurry and wanted to provide a response before traveling. The hasty reply ended up sparking a disproportionate backlash: "I don't think it's healthy for people being subjected to a hate mob on multiple channels for several days like this."
Klode's comments did more to inflame upstream (and a number of users and onlookers) than inform, and they responded in kind. That is unfortunate: objectively speaking, a legitimate case can be made for adhering to the upstream's default configuration, or for the packager to exercise their best judgment about the configuration that is best suited to the distribution's users. Neither position is inherently right or wrong; it is a matter of perspective and opinion.
For example, former KeePassXC team member Davide Silvetti initially agreed with the decision to offer a package with additional options turned off as the default. He said that the option to compile without plugins was added in 2016 to reduce the attack surface and potential vulnerabilities. He suggested that the project add a message, in place of the usual browser-integration feature tab, suggesting that users install the "full" version if they want that capability.
That did not appeal to Janek Bevendorff, a current member of the project team. "If anything, we will reduce the number of such compile-time flags in the future, so these things cannot be disabled anymore." The safest version of the program, he said, "is not the one with all flags disabled, but the one which is tested best by us and by the majority of our users".
Silvetti disagreed and asserted that disabling compile-time flags for additional features "actually removes code and reduce[s] attack surface", making the final version more secure. He also noted that package maintainers for the Linux distributions "(contrary to KeePassXC maintainers) have actual telemetry data and crash reports affecting the end users". That said, he then wrote that it would be better to continue with the keepassxc package with all features enabled, and offer the minimal version as keepassxc-minimal.
Much ado about...
For all the turmoil over the change, the only users who are actually at risk of a surprise change to the keepassxc package are those running testing or unstable releases. In response to a question about whether it is really reasonable to expect that users read NEWS files, Klode wrote that users of Debian unstable or testing are not average Linux users, but "people who have chosen to test the next Debian release in a rolling development state". Those users should have apt-listchanges installed "which will print these critical news items before the upgrades are installed".
The actual impact will be negligible for users of stable versions of Debian, Ubuntu, and other Debian-derived distributions. Klode said that when Debian Trixie is released, upgrades and new installs of the keepassxc package will receive a transitional package that prompts them to decide between "full" and "minimal" packages. He said that this will allow users upgrading from bookworm to preserve their current setup. Future releases will have a "virtual" keepassxc package that, again, requires the user to explicitly select one or the other.
Even if one takes the position that Klode is completely wrong in his rationale for and handling of this change, the real impact is minor. One of the things we should have learned from the XZ backdoor episode is that no one benefits from making participation in open-source development and distribution more unpleasant and stressful. Maintainers should be able to screw up in public without fear of an internet pile-on.
GitLab CI for the kernel
Working on the Linux kernel has always been unlike working on many other software projects. One particularly noticeable difference is the decentralized nature of the kernel's testing infrastructure. Projects such as syzkaller, KernelCI, or the kernel self tests test the kernel in different ways. On February 28, Helen Koike posted a patch set that would add continuous integration (CI) scripts for the whole kernel. The response was generally positive, but several people suggested changes.
Koike's patch set adds a new top-level ci directory that contains YAML configuration files for GitLab's continuous-integration feature, as well as shell scripts to tie those to the existing kernel tests. It reuses some of her existing work from the kernel's graphics subsystem tests, which also use GitLab as a continuous-integration platform. The patch set currently includes code to run checkpatch and Smatch against proposed patches, and to attempt to build the kernel on a few different architectures, but Koike plans to expand the coverage if this initial work is accepted. The patch set also includes a top-level .gitlab-ci file that instructs GitLab to run the tests by default.
The patch set
Several people responded positively to the proposal. Tim Bird thought that the change was useful, saying: "I don't currently use gitlab, but I might install it just to test this out."
Maxime Ripard also thought the work could be useful, but questioned how well it could support different use cases. Ripard pointed out that different subsystems and maintainers probably have different uses for continuous integration. "I don't see how the current architecture could [accommodate] for that."
Nikolai Kondrashov pointed out that GitLab can be configured to look elsewhere inside the repository for the configuration file, and suggested that having a top-level configuration might not be best: "This way all the different subtrees can have completely different setup, but some could still use Helen's work and employ the 'scenarios' she implemented."
Linus Torvalds agreed that it would be better not to include the top-level configuration: "I'm not at all interested in having something that people will then either fight about, or - more likely - ignore, at the top level because there isn't some global agreement about what the rules are." Torvalds suggested that it might be better to have the CI project be separate from the kernel.
However, there are benefits to keeping the CI scripts in the same repository as the code itself, according to Kondrashov. This allows developers to ensure that the tests and the code remain in sync. He suggested that "we reframe this contribution as a sort of template, or a reference for people to start their setup with". He also raised the possibility of having the new code live somewhere other than the ci directory.
The suggestion that the code could be a library of pre-made pieces for maintainers to pick and choose from was "a lot more palatable" to Torvalds. He suggested that the tools/ci folder would be the logical place since it is "kind of alongside our tools/testing subdirectory."
Koike was fine with that approach, saying that it would make her work as the maintainer of the kernel's graphics subsystem (DRM) tests easier and still support extending test coverage to other subsystems. Her colleague Nicolas Dufresne later recapped the discussion, stating that the top-level configuration file would be removed.
Guenter Roeck thought that there were at least some basic requirements which all kernel developers could agree upon.
Sure, argue against checkpatch as much as you like, but the code should at least _build_, and it should not be necessary for random people to report build failures to the submitters.
Randy Dunlap and Geert Uytterhoeven agreed. Ripard pointed out that running CI like this takes funding, and that the supporters of existing testing efforts like the DRM-CI infrastructure might not want to pay for tests of unrelated parts of the kernel.
I don't really expect, say, the clock framework, to validate that all DRM kunit tests pass for each commit they merge, even though one of them could totally break some of the DRM tests.
A follow-up message clarified that Ripard is in favor of automated tests, but doesn't think it's reasonable to expect people "to pay for builders running things they fundamentally don't care about."
KernelCI
Guillaume Tucker had a different question: "Where does this fit on the KernelCI roadmap?" KernelCI is a project that works to provide a distributed test-automation system for kernel development. It was started in 2012 by Arnd Bergmann, Olof Johansson, and Kevin Hilman to detect build failures for Arm builds of the kernel. In 2019, it was adopted by the Linux Foundation with plans to expand the coverage and infrastructure to cover the entire kernel community's needs.
Kondrashov replied that the work was "an important part of KernelCI the project," although not currently part of the KernelCI service. The project does have existing build infrastructure, which Kondrashov would like to be able to reuse by sending GitLab jobs to the same pipeline.
The existing KernelCI work and the new GitLab-based tooling were meant to serve different purposes, Dufresne suggested. The KernelCI testing is largely "integration" testing that incorporates multiple changes from across the kernel at once. Koike's proposed GitLab testing would run on every push to a repository, effectively testing changes in isolation. That will "help catch submission issues earlier, and reduce [the] kernelCI regression rate", according to Dufresne.
Tucker was not satisfied with that response, saying that the new code was not restricted to that use case, and that it provides "a platform able to cope with the diversity of workflows across the kernel subsystems".
Tucker's message also pointed out that the code contained a lot of mentions of KernelCI in variable names and documentation, which serves to confuse whether this work is meant to be part of the project or not. If it is part of the project, Tucker questioned why it was diverging from the project's previous plans to provide a comprehensive platform with new features.
Gustavo Padovan replied that what Tucker was missing was that the community is not really working with the KernelCI project. "If one asks people around, the lack of community engagement with KernelCI is evident." The new work is the result of a renewed effort to provide high-quality tests following a change in the project's leadership, according to Padovan. He hopes that the increased involvement will translate not only into better tests, but also the feedback and funding required to bring KernelCI to the level that previous plans had envisioned.
Tucker emphasized that he was in favor of the new CI work, and that he had just wanted clarity on how it was related to the KernelCI project. Despite the overall positive reaction, Koike has not yet sent a new version of the patch set. It seems clear that many kernel developers would like to see more automatic test coverage, however. The BPF developers, for example, have a talk about continuous integration scheduled for the upcoming Linux Storage, Filesystem, Memory-Management and BPF Summit. So chances are good that an updated version of this work will make its way in sooner or later.
Trinity keeps KDE 3 on life support
As the shiny new KDE Plasma 6 desktop makes its way into distribution releases, a small group of developers is still trying to preserve the KDE experience circa 2008. The Trinity Desktop Environment (TDE) is a continuation of KDE 3 that has maintained the old-school desktop with semi-regular releases since 2010. The most recent release, R14.1.2, was announced on April 28. TDE does deliver a usable retro desktop, but with some limitations that hamper its usability on modern systems.
TDE got its start in the wake of the rocky launch of KDE 4.0 in 2008. The final KDE 3 release was 3.5.10 in August 2008. That final release was followed up in April 2010 by TDE 3.5.11, which brought modest improvements, bug fixes, and made it possible to install TDE alongside KDE 4. The project broke from the 3.5.x versioning with R14.0.0, announced in December 2014. ("R" stands for "release".) One of the highlights of that release was an upgrade to TDE's fork of Qt 3, TQt3, which added multi-threading support.
Since then, the project has not had another major release, but has continued with incremental updates with bug fixes, small feature enhancements, and work to keep the desktop up-to-date with mainstream Linux distribution releases. None of the major Linux distributions have an official TDE spin or include its packages in their official repositories, so a large part of the project's work is creating packages for popular distributions. TDE packages are available for Arch, Debian, Devuan, Fedora, Mageia, openSUSE, PCLinuxOS, Raspbian, Red Hat Enterprise Linux, and Ubuntu. Instructions are also available to build TDE for FreeBSD from source. R14.1.2 comes with a handful of new themes, minor feature enhancements for TDE applications, and a number of bug fixes. It also adds support for Fedora 40 and Ubuntu 24.04, and drops support for several distributions that are at end-of-life.
Obviously, the target audience is the user who loved KDE 3 and has no desire to switch to later KDE releases or alternate desktops. What might make it compelling for other users is its low resource usage, themes, and extensive configurability. The desktop and its applications were snappy even in a virtual machine configured with only 2GB of RAM and two vCPUs. Users can tweak the user interface and behavior of TDE down to the most minute details. Want Windows 95-ish buttons and title-bars, a purple and gray color scheme, and drop-shadows for windows? All of that is possible. A violation of good taste, perhaps, but possible. TDE also works nicely with other old-school applications that have tray icons, such as Claws Mail, which don't integrate quite so well with recent desktops like GNOME 46.
Applications
TDE comes with a full suite of classic KDE applications, including the Konqueror double-duty file manager and web browser, Konsole terminal emulator, Okular document viewer, Kontact "personal information manager" (PIM) suite, DigiKam photo manager, and others. These applications are, a few enhancements and bug fixes aside, largely as they were when KDE 3 was current. Most of these applications have continued to evolve within the KDE project, and have more modern counterparts as part of the KDE Gear set of applications that work on Plasma. Most, but not all. In some cases, TDE resurrects applications that would otherwise be lost to the dustbin of history, though the relevance of some of those programs today is questionable.
For example, the collection includes KPilot, an open source replacement for the Palm Desktop software for Palm Pilot devices. KPilot has long since been dropped from KDE as unmaintained software, but if any users are still depending on a Palm Pilot to organize their affairs, they can rely on TDE.
The Knmap front-end for nmap might be more relevant to a wider audience. That application seems to have disappeared from the KDE library of software, but it's still chugging along in the Trinity collection. One of my old favorite applications, the Basket free-form note-taking tool, is also available and works well with other TDE applications.
Showing its age
For the most part, TDE is a usable desktop, but it does show its age beyond its retro look-and-feel. Though the Trinity web site claims compatibility with newer hardware, it had some significant issues with a high-resolution (HiDPI) laptop display and external monitors over Thunderbolt connections. For example, on a 13" laptop display with 2256x1504 resolution, TDE's user-interface elements were too small to use comfortably. Current GNOME and KDE releases can be scaled up on HiDPI displays to provide a more usable interface, but TDE lacks this feature. Trying to change the display to use a lower resolution caused things to go haywire, with inverted colors and artifacts that made the desktop completely unusable.
TDE's System Settings application is outdated in some areas, or missing functionality entirely. Trying to use the network settings utility pops up an "unsupported platform" warning, and provides a list of supported distributions, the most recent of which is from 2015. The backend for the network settings is the knetworkconf package, a collection of Perl scripts that are far out of date for managing networking on current Linux systems. Network configuration is still possible with NetworkManager, but it isn't integrated into TDE. Users have plenty of configuration options for mice, but no trackpad options at all.
Some of the applications are in need of modernization or replacement to be useful in 2024. Konqueror is still a decent file manager, but it doesn't handle modern web sites well at all. The Kopete instant-messaging application offers to connect users to networks and protocols that are either dead and gone (AIM, Yahoo, Windows WinPopup) or well out of mainstream use (Novell GroupWise, Lotus Sametime). Support for more recent protocols, such as Matrix instant messaging, is not to be found. The vintage version of Amarok that is included still lists internet radio services that are defunct, and it immediately crashes when trying to play AAC files.
While on the topic of modernization, it is worth noting that TDE only has support for X11. Porting to Wayland seems to be considered a problem for the distant future.
With the exception of hardware support, however, these problems are not show-stoppers. Most users will simply choose Firefox or another browser, which works just fine along with TDE. It would be interesting to be able to use Kopete with Matrix, but there are plenty of Matrix clients available. Likewise, users have no shortage of music players to choose from.
In some ways, running TDE is like driving a lovingly restored classic car from the 1950s or 1960s. The commitment and effort toward preserving a cultural artifact is impressive. Its visual appeal and handling are satisfying for a specific audience, and it can be a lot of fun to take out for a weekend spin. It may not be a suitable option, though, for most users who want a desktop that will keep pace with the times. It is perfect for stalwart KDE 3 fans, for making use of aging hardware, or for users who want to spend a little time reliving an earlier era of the Linux desktop.
The first half of the 6.10 merge window
The merge window for the 6.10 kernel release opened on May 12; between then and the time of this writing, 6,819 non-merge commits were pulled into the mainline kernel for that release. Your editor has taken some time out from LSFMM+BPF in an attempt to keep up with the commit flood. Read on for an overview of the most significant changes that were pulled in the early part of the 6.10 merge window.
Architecture-specific
- Support for a number of early Alpha CPUs (EV5 and earlier) has been removed. As noted in the merge message, these were the only machines supported by the kernel that did not provide byte-level memory access, and that created complications for support throughout the kernel. Alpha was also the first non-x86 architecture to which the kernel was ported. Linus Torvalds amended the merge message to add:
I dearly loved alpha back in the days, but the lack of byte and word operations was a horrible mistake and made everything worse - including very much the crazy IO contortions that resulted from it.
It certainly wasn't the only mistake in the architecture, but it's the first-order issue.
So while it's a bit sad to see the support for my first alpha go away, if you want to run museum hardware, maybe you should use museum kernels.
- The x32 subarchitecture now supports shadow stacks.
- Arm64 systems have gained support for the userfaultfd() write-protect feature; a brief user-space sketch of this interface appears after this list.
- There is now a BPF just-in-time compiler for 32-bit ARCv2 processors.
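To illustrate the userfaultfd() write-protect interface mentioned in the arm64 item above, here is a minimal user-space sketch. It is not taken from the kernel tree; it simply registers one page for write-protect handling and then marks it write-protected, with error handling omitted. Running it may require appropriate privileges or the unprivileged_userfaultfd sysctl to be enabled.

    /* Minimal sketch: register one page with userfaultfd in
     * write-protect mode, then write-protect it.  Error handling
     * is omitted for brevity. */
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        void *area = mmap(NULL, page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

        /* Negotiate the API, asking for write-protect support. */
        struct uffdio_api api = {
            .api = UFFD_API,
            .features = UFFD_FEATURE_PAGEFAULT_FLAG_WP,
        };
        ioctl(uffd, UFFDIO_API, &api);

        /* Register the page for write-protect tracking. */
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)area, .len = page },
            .mode  = UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        /* Mark the page write-protected; subsequent writes will show
         * up as events on uffd until the protection is removed. */
        struct uffdio_writeprotect wp = {
            .range = { .start = (unsigned long)area, .len = page },
            .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
        return 0;
    }
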
Core kernel
- Rust abstractions for time handling within the kernel have been added. This work was discussed in early 2023 and has finally found its way in; see this commit for the current form of this interface.
- BPF programs now have the ability to use workqueues in the kernel; see this merge message for some more information. It is also now possible for BPF programs to disable and enable preemption.
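As a rough illustration of the preemption-control part of that work, a BPF program might look something like the sketch below. The kfunc names bpf_preempt_disable() and bpf_preempt_enable(), and the use of a TC program type, are assumptions based on the 6.10 work; the verifier requires that every disable call is paired with an enable call on all paths.

    /* Sketch of a BPF program disabling preemption around a short,
     * bounded critical section (kfunc names assumed from the 6.10
     * work; loading and attachment details omitted). */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    extern void bpf_preempt_disable(void) __ksym;
    extern void bpf_preempt_enable(void) __ksym;

    SEC("tc")
    int preempt_demo(struct __sk_buff *skb)
    {
        bpf_preempt_disable();
        /* ... per-CPU work that must not be preempted goes here ... */
        bpf_preempt_enable();
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";
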
Filesystems and block I/O
- The new F_DUPFD_QUERY operation for fcntl() allows a process to check whether two file descriptors refer to the same underlying file; a brief example appears after this list. This functionality is also provided by kcmp(), but F_DUPFD_QUERY is more restricted, leaks less information from the kernel and, as a result, should be available even on systems where kcmp() is disabled.
- The block-throttling low-limit mechanism, described in the Kconfig file as "a best effort limit to prioritize cgroups", has been removed. It was marked as "experimental" since being introduced in 2017, does not appear to have acquired users, and complicated the maintenance of the block layer.
- The EROFS filesystem now supports Zstandard compression.
- The dm-crypt device-mapper target has a new high_priority option that allows it to use high-priority workqueues for its processing. This option can improve performance on larger systems, but defaults to "off" to avoid creating latency problems for other workloads (such as audio processing) on smaller systems.
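As a brief example of the F_DUPFD_QUERY operation mentioned at the top of this list, the sketch below compares a duplicated descriptor with a separately opened one; the fallback value for the constant is an assumption for systems whose headers do not yet define it.

    /* Sketch: ask whether two descriptors refer to the same open file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef F_DUPFD_QUERY
    #define F_DUPFD_QUERY 1027    /* assumed value from the 6.10 uapi headers */
    #endif

    int main(void)
    {
        int fd1 = open("/etc/hostname", O_RDONLY);
        int fd2 = dup(fd1);                        /* same open file */
        int fd3 = open("/etc/hostname", O_RDONLY); /* same path, new open file */

        /* Expected: 1 for the dup()ed pair, 0 for the independent open. */
        printf("fd1 vs fd2: %d\n", fcntl(fd1, F_DUPFD_QUERY, fd2));
        printf("fd1 vs fd3: %d\n", fcntl(fd1, F_DUPFD_QUERY, fd3));
        return 0;
    }
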
Hardware support
- GPIO and pin control: pin controllers using the SCMI message protocol and Intel Granite Rapids-D vGPIO controllers.
- Graphics: Samsung S6E3FA7 panels, ARM Mali CSF-based GPUs, LG SW43408 panels, Raydium RM69380-based DSI panels, and Microchip LVDS serializers.
- Hardware monitoring: Analog Devices ADP1050 power-supply controllers, Lenovo ThinkStation EC sensors, and Infineon XDP710 hot-swap controllers.
- Input: WinWing Orion2 throttles. Also: the BPF for HID drivers framework is finally seeing some use with the addition of a number of small fixup programs to the kernel tree, the first of which is for the XPPen Artist 24 device. Some new udev functionality is used to load these programs as needed.
- Miscellaneous: STMicroelectronics STM32 firewall-framework controllers, Arm Trusted Services secure partitions, NXP DCP key-storage devices, NVIDIA Tegra security engines, and Airoha SPI NAND flash interfaces.
- Networking: Airoha EN8811H 2.5 Gigabit PHYs, Realtek 8922AE PCI wireless network (Wi-Fi 7) adapters, Realtek 8723CS SDIO wireless network adapters, TI Gigabit PRU SR1.0 Ethernet adapters, Microsemi PD692x0 I2C power sourcing equipment controllers, TI TPS23881 I2C power sourcing equipment controllers, Renesas RZ/N1 Ethernet controllers, and Intel HCI PCIe Bluetooth controllers.
- Sound: Rockchip RK3308 audio codecs and Texas Instruments PCM6240 family audio chips.
Miscellaneous
- The version of the Rust language used with kernel code has been moved up to 1.78.0. Among other things, this change has made it possible to drop the kernel's forked version of the alloc crate, removing about 10,000 lines of code. A number of other changes have been made as well; see this merge message and this commit for the full list.
Networking
- The performance of zero-copy send operations using io_uring has been significantly improved. It is also now possible to "bundle" multiple buffers for send and receive operations, again improving performance.
- The sending of file descriptors over Unix-domain sockets with SCM_RIGHTS messages has long been prone to the creation of reference-count cycles; see this 2019 article for one description of the problem and attempts to resolve it. The associated garbage-collection code has been massively reworked for 6.10, leading to a simpler and more robust solution; see this merge message for some more information.
- There is now basic support for setting up packet forwarding control protocol (PFCP) filters, though much of the work must be done in user space and only IPv4 is supported.
- TCP sockets now support the SO_PEEK_OFF socket option in the same way that Unix-domain sockets do. This allows the specification of an offset to be used when looking at data with MSG_PEEK.
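For illustration, the sketch below shows how this option is used; it assumes sock is a connected TCP socket with data queued for reading, and omits error handling.

    /* Sketch: peek at received TCP data starting at a chosen offset. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static void peek_at_offset(int sock)
    {
        char buf[128];
        int off = 16;    /* start peeking 16 bytes into the receive queue */

        setsockopt(sock, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

        /* Looks at the queued data without consuming it; the peek offset
         * advances with each MSG_PEEK, so a later peek continues where
         * this one left off. */
        ssize_t n = recv(sock, buf, sizeof(buf), MSG_PEEK);
        printf("peeked %zd bytes\n", n);
    }
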
Security-related
- The kernel now supports encrypted interactions with trusted platform module (TPM) devices; this documentation commit has more information.
- The "crypto usage statistics" feature, which is seemingly unused, has been removed from the kernel. See this commit for a detailed justification for this removal.
- BPF programs now have access to the kernel's crypto framework.
The 6.10 merge window can be expected to remain open until May 26. Once it has closed, LWN will be back with a summary of what was pulled into the mainline for the latter part of this merge window.
An update and future plans for DAMON
The DAMON subsystem was the subject of the first session in the memory-management track at the Linux Storage, Filesystem, Memory Management, and BPF Summit. DAMON maintainer SeongJae Park introduced the data-access monitoring framework, which can generate snapshots of how memory is accessed, enabling the detection of hot and cold regions of memory in both the virtual and physical address spaces. The session covered recent changes and future plans for this tool.
While DAMON can acquire memory-usage information, DAMOS extends DAMON by enabling the specification of policies to take action on that information. It can, for example, be instructed to force out any page of memory that has not been accessed in the last five seconds. Recent work on DAMOS includes the addition of a quota feature to control how aggressively it works; it can be used to limit the amount of memory processed in a given time period. There is also a new filter mechanism to better focus its efforts; for example, DAMOS can be directed at specific NUMA nodes, or to only work on file-backed pages.
DAMOS is seeing increased use, Park said. A number of products are using it now for proactive reclaim, and there is interest in using it for Compute Express Link (CXL) memory management. DAMOS has also been picked up by researchers, leading to some 20 citations in the literature.
At a 2023 LSFMM+BPF session, Park was told that better documentation for DAMON would be appreciated; that documentation has since been written and merged. That session also concluded that keeping the DAMON user-space tools in the kernel tree would not be a good idea. Part of the motivation for raising that idea had been to generate better test coverage. Park is now working on adding that to kselftest instead.
Another improvement is pseudo moving-sum-based fast snapshots. By default, DAMON produces snapshots over a period of 100ms. In 6.7, it gained the ability to create "reasonable snapshots" over shorter sampling intervals, 5ms by default. That is useful when the user wants to aggregate data over longer intervals, but would like to be able to get shorter-term data as well.
There are some new filter types. Aggregation can now be filtered on address ranges, and narrowed to NUMA nodes, memory zones, or virtual-memory areas. DAMON can also filter on the "page is young" flag, which can be used to double-check the status of a page before acting on it. The biggest change, though, is "aim-ordering, feedback-driven, aggressive auto-tuning". It allows the DAMOS quota to be automatically adjusted with a feedback loop. The user can provide the quota value, based on a parameter such as workload latency, or the kernel can drive it using existing system metrics, such as targeting a given pressure-stall rate.
What's next
Looking to the future, the first objective is control of tiered-memory management with automatic tuning; this is an area that is being explored now. The initial objective will be two-tier promotion and demotion; some patches are available now. The algorithm, roughly, is proposed to eventually work like this:
- If a node has a lower (slower) node available to it, then demote cold pages to that lower node, keeping the amount of free memory above a minimum threshold.
- If the node has an upper (faster) node, then push hot pages up to the upper node, trying to keep the utilization rate on that node high.
The objective here is to maximize the utilization of memory on the faster nodes, while keeping pages that are accessed less frequently in slower memory. The algorithm aims for a slow but continuous movement of pages between nodes, and will be extendable to systems with more memory tiers.
Another objective is "access/contiguity-aware memory auto scaling" or ACMA. The model here is that the user will specify the minimum and maximum memory requirements for their workload; a service provider will then run the workload somewhere, aiming for both good performance and low cost. Optimizing this scenario in current kernels requires the provider to orchestrate four kernel features: memory overcommit, reporting more reclaimable pages with DAMON, periodic compaction, and memory hotplug to set hard limits and to minimize the page-structure overhead.
Systems using these techniques have been working well in real-world deployments for years, Park said. But, he added, it is also a rather complex solution. Relying on memory hotplug is both slow and prone to failure — there are many ways to block the hot-removal of memory. System-level memory compaction is wasteful, especially in the absence of access information. Users can access pages at any time, thwarting the system's efforts to better organize memory. As a result, control of non-collaborative guests is difficult or impossible.
Park proposed an alternative for allocation of memory to guests based on two core actions. damos_alloc will allocate a memory region with a minimum level of contiguity, then inform the user about that allocation; damos_free returns memory to the system, also maintaining minimum levels of contiguity. These actions are driven by the system's current pressure-stall level. Memory is allocated to keep the stall level below an acceptable maximum, while freeing happens to keep that level above a minimum threshold. Since notifications are provided for memory changes, collaborative guests can react accordingly; ballooning can be used to control non-collaborative guests.
The objective is to limit the complexity involved in making such a system work; there are just three parameters to adjust. Since ACMA scales memory in 2MB chunks, it maintains the contiguity of memory on the host, even under high memory pressure. This system could also be extended to support the contiguous memory allocator or for power management by powering down memory banks when they are not needed.
Michal Hocko pointed out that the kernel should be providing mechanisms rather than policy, and asked how user space would control this feature. Park answered that control is currently managed through the DAMON sysfs interface, but the plan is to create simpler modules with fewer knobs to adjust. Hocko said that he was concerned about creating long-term API issues; developers are still trying to figure out what the best interfaces should be for the control of memory tiering, and it is important to be careful about which interfaces we commit to. "Sysfs is terrible", he continued; it allows the addition of too many interfaces without sufficient review. There needs to be more consideration of the API before this work can be merged.
Dan Williams asked whether there was a path to migrate DAMON-based features to more formal kernel interfaces. DAMON is a good way to do "science experiments", he said, but perhaps there should be a promotion path into the core kernel for the experiments that succeed. David Hildenbrand expressed worries about interference with the core memory-management code, and said that it was important that DAMON doesn't start taking on too much work. As the session ran out of time, Park said that he is trying to keep DAMON simple and to avoid that kind of interference.
Extending the mempolicy interface for heterogeneous systems
Non-uniform memory access (NUMA) systems are organized with their CPUs grouped into nodes, each of which has memory attached to it. All memory in the system is accessible from all CPUs, but memory attached to the local node is faster. The kernel's memory-policy ("mempolicy") interface allows threads to inform the kernel about how they would like their memory placed to get the best performance. In recent years, the NUMA concept has been extended to support the management of different types of memory in a system, pushing the limits of the mempolicy subsystem. In a remotely presented session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, Gregory Price discussed the ways in which the kernel's memory-policy support should evolve to handle today's more-complex systems.
Heterogeneous-memory systems may seem like exotic beasts, Price began, but they are actually common; even a simple two-socket server, with its two banks of memory with different access characteristics, is a heterogeneous-memory system. On such systems, developers have to think about where their tasks run, or performance will suffer. Future systems will be worse, though; they will be "a monstrosity", equipped with ordinary DRAM (at various distances), CXL memory, high-bandwidth memory, and more. The kernel's mempolicy API was not designed for this kind of system — or even today's basic two-socket system, he said.
Memory tiering has been a frequent topic of discussion at LSFMM+BPF for some years now, and memory policy clearly will be a part of the tiering solution, but tiering and mempolicy are aimed at slightly different problems. The tiering discussion is all about memory movement between different memory tiers, while the mempolicy interface is about allocation. The former is focused on migration, while the latter is about node selection. In a perfect world, the kernel would always place memory allocations perfectly, but we do not live in that world. Allocations will be wrong, or usage patterns will change over time. Thus, he said, tiering is useful and necessary — but so is better allocation policy.
In current systems, every thread can have its own memory policy; that policy can even be different for each virtual-memory area in the thread. There are four policy types available to control where allocations are placed: default to the local node, allocate on a set of preferred nodes, interleave across a set of nodes in a round-robin fashion, and weighted interleaving.
The last option, weighted interleaving, was added for the 6.9 kernel. It is controlled with a set of global weights managed via sysfs. The administrator can use these weights to try to obtain optimal bandwidth use across all memory interconnects; putting some frequently used data in slower memory can improve performance overall if it keeps all of the interconnects fully busy. Weighted interleaving can thus improve throughput, but can also complicate the latency story. This mechanism is sufficient for simple tasks, and a number of useful lessons have been learned from its implementation.
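As a small illustration, a thread can opt into the new mode with set_mempolicy() (the libnuma <numaif.h> wrapper is used here; link with -lnuma). This is a sketch rather than a recommendation: the fallback definition of MPOL_WEIGHTED_INTERLEAVE uses the value from the 6.9 uapi headers in case the installed headers do not yet provide it, and the sysfs path in the comment is the one described in the kernel documentation.

    /* Sketch: interleave this thread's future allocations across nodes
     * 0 and 1 according to the global sysfs weights. */
    #include <numaif.h>
    #include <stdio.h>

    #ifndef MPOL_WEIGHTED_INTERLEAVE
    #define MPOL_WEIGHTED_INTERLEAVE 6    /* assumed value from the 6.9 uapi */
    #endif

    int main(void)
    {
        unsigned long nodemask = (1UL << 0) | (1UL << 1);    /* nodes 0 and 1 */

        if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
                          8 * sizeof(nodemask)) != 0) {
            perror("set_mempolicy");
            return 1;
        }
        /* Allocations made after this point are spread across the two
         * nodes in proportion to the weights configured under
         * /sys/kernel/mm/mempolicy/weighted_interleave/. */
        return 0;
    }
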
Lessons learned
One of those lessons is simply that the kernel's memory-policy features have not kept up with the evolution of the computing environment in which they run. Consider, he said, a single-socket system running with attached CXL memory, which is slower than DRAM. A streaming benchmark will run 78% slower on that system than on a machine with DRAM only. But, with a proper, task-wide weighted-interleaving policy, that benchmark will run somewhere between 6% slower and 4% faster. That is better, "but it still sucks". It is possible to get good results on such systems, but processes are forced to be NUMA-aware to get those results.
The current mechanism is built around the idea that either the administrator or some sort of daemon will manage the weights used for interleaving. He has an RFC patch circulating to do this automatically using information from the system's heterogeneous memory attribute table (HMAT), but that is not an easy thing to do, especially in systems where memory hotplugging is in use, on complex NUMA systems, or on systems with other types of complex memory topologies. Task-local weights can help, but that feature was dropped out of the patch set merged for 6.9, because it needs some new system calls; he has another RFC patch set out there that adds them.
While the current memory-policy API can be made to work, it is unwieldy at best on large NUMA systems. Sub-NUMA clustering (a recent hardware feature that partitions NUMA zones into smaller sub-zones) is hard to use well with this API. In general the number of nodes showing up on systems is growing, but that makes the system as a whole harder to reason about, he said.
The memory-policy interface is entirely focused on the currently running task; there is no way for one thread to change another's policies. Within the memory-management subsystem, policy changes require a level of access to the virtual-memory areas (VMAs) that will be painful to extend. The current design is not without its advantages; it allows the implementation of memory policies to be lockless in the allocation paths. Widening access without hurting performance will require some significant refactoring and movement toward the use of read-copy-update (RCU). Memory policies also have complex interactions with control groups, and must not violate any restrictions imposed by control groups.
Michal Hocko asked how VMA-level manipulation could be implemented without creating other problems; Price answered that there is a patch for a new system call (process_mbind()) circulating now. Hocko answered that the patch "is not wrong", but that it is complicated and has security implications.
David Hildenbrand asked whether Price was thinking that a system would run a process that adjusts the VMAs of others, or whether applications would opt into some sort of management scheme. Price answered that allowing the first case is the important part of this work; other types of mechanisms can come later if need be. There is no agreement on the existing work yet, though, so there will be changes to those patches, including trying to make more use of existing system calls (like madvise()) when it makes sense.
Liam Howlett asked how memory policies would be affected if the scheduler moves a task elsewhere in the system. This is a problem that has been talked about a lot, Price answered. One of the reasons for the global interleaving weights is that they ease the problem of dealing with process migration. That is also part of why the other system calls have been pushed back.
Proposals
Price concluded with a quick look at what is being proposed for the memory-policy subsystem. It would be good to get to the point where a process running with a reasonable policy would get performance close to what can be had by explicitly binding memory to nodes. That involves finding ways to not interleave memory for data that is not driving the system's memory-bandwidth use. The plan is to implement process_mbind() in some form; it will use the pidfd abstraction and be analogous to process_madvise(). This mechanism could be seen as a sort of crude tiering solution that would be useful to job-scheduling systems.
There is also a wish to improve how mbind() performs memory migration. Currently, bound memory will only be migrated if a node is removed from the allowed set. But if a process is set up for interleaving, and a new node is added, there will be no migration to rebalance the allocations. That would be a nice feature to have, but implementing it could be expensive, he said. If it can be done, though, he would like to see redistribution in the interleaved case — and the configured weights should be applied when this happens.
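For reference, the interface under discussion looks like this from user space. The sketch below binds an existing region to a single node and asks the kernel to migrate pages that fall outside the new node set (libnuma's <numaif.h> wrapper, link with -lnuma; error handling omitted). Adding a node to an interleaved policy today would trigger no such migration, which is the gap Price described.

    /* Sketch: rebind an existing mapping to node 0 and migrate any pages
     * currently resident on other nodes. */
    #include <numaif.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4UL << 20;    /* a 4MB anonymous region */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned long nodemask = 1UL << 0;    /* node 0 only */

        /* Pages outside the mask are moved; pages already inside it stay
         * put, so no rebalancing happens when nodes are added. */
        mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
              MPOL_MF_MOVE);
        return 0;
    }
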
Finally he asked whether memory policies should be integrated with control groups. That could be awkward, since memory policies are lockless, while control groups are not. Hocko was skeptical, saying that control groups are all about hierarchies, and he does not see a way to define a reasonable hierarchical policy. Price said, though, that control-group integration would ease the management of sets of policies, and simplify the handling of migration. But he acknowledged that this idea has not found any sort of consensus; he will continue looking for solutions.
Better support for locally-attached-memory tiering
The term "memory tiering" refers to the management of memory placement on systems with multiple types of memory, each of which has its own performance characteristics. On such systems, poor placement can lead to significantly worse performance. A memory-management-track discussion at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit took yet another look at tiering challenges with a focus on upcoming technologies that may simplify (or complicate) the picture.A quick note: this session mentioned future plans from a number of different companies, and some participants were worried about revealing too much or breaking non-disclosure agreements. For this reason, it was requested that this session be reported without naming the people involved or attributing any statements. Apologies for the forthcoming vagueness, but hopefully the important parts get through.
The specific focus of this session was obtaining optimized placement on systems with CXL memory attached. This memory is large and flexible in its use (it can, for example, be moved from one server to another in some configurations) but, since it is more distantly attached, it is also slower. If, though, the system can learn to use this memory properly and place the right data there, there is a huge potential for both performance improvements and cost savings. Preferably, this would happen without the need for new, specialized interfaces. The kernel's tiering support should be useful for all systems, and it should be extensible, since memory types will change in the future. The hope for the discussion was to reveal any constraints in the memory-management subsystem that would impede this support and to bring the developers working in this area together.
The trouble with NUMA balancing
The current approach to tiering is based on NUMA balancing — different types of memory appear to be different (CPU-less) NUMA nodes, and the kernel manages the placement of memory on each node. The advantage is that the kernel's NUMA-balancing code is ten years old at this point, and is relatively mature. Tiering support has been added more recently, with a special mode that attempts to optimize memory placement.
The use of NUMA balancing for tiering is not ideal, though; it is too slow in a number of ways. Perhaps the biggest problem is page promotion. It is relatively easy for the kernel to notice data that is not seeing frequent use and demote it to slower memory. The promotion path — observing frequent use of data on slow memory and moving it to a faster tier — is harder. Promotion needs to be fast; once a process starts using some data, it tends to work on it for a while; if that data is not promoted immediately, performance will suffer.
NUMA balancing uses a sliding-window technique, where memory is access-protected and the resulting faults (on the pages that are actually accessed) are noted. This algorithm takes time and is not responsive enough for the promotion case; performance will decay while pages are waiting to be promoted. NUMA balancing is also a system-wide task, but it really needs to be job-wide, and should eventually be controlled with memory control groups. An additional challenge in making all of this work is a lack of good benchmarks to measure the effectiveness of tiering algorithms.
It was pointed out that one type of memory — that which is hosted on peripheral devices like GPUs — is special in its own way. Unmapping that memory (prior to migrating it and mapping it in its new location) can stop the device in its tracks and kill performance, so automatic tiering has to be disabled on such systems. The lack of device awareness in the kernel's tiering mechanisms needs to be fixed.
One possible approach that was suggested was to focus on DAMON (which was discussed earlier that day) as a flexible way to implement tiering algorithms. On the other hand, DAMON also feels a bit like a separate memory-management subsystem, and it could be better to keep this support in the core.
An upcoming change that should help with this task is that, in the future, CXL controllers will allow the kernel to easily observe which pages are being accessed. That will be a fast source of truth, under the kernel's control. But it is not clear how that information can be used. It seems that either NUMA balancing or DAMON could be extended to take advantage of CXL hot-page detection. One developer said that hot-page detection looks like many hardware-assistance features that promise to help, but where the hardware developers always get it wrong and the problem still has to be solved in software.
NUMA balancing was designed to converge on an optimal solution for a given workload and not move a lot of data around. That makes it hard to extend to this case, where active migration of data is needed. Trying to create a complex policy that will work for all workloads is impossible, developers said, so it will be necessary to make NUMA balancing more extensible — or to use a different mechanism entirely. It is important, one developer suggested, to avoid conflating the mechanism for detecting hotness for the one that moves pages; the two need to be firmly decoupled.
Possible solutions
Various ideas flew around the room. A 2023 session had looked at the use of hardware performance counters for page aging; perhaps that work could be extended here. It seems, though, that not all CPUs have performance-management units that provide the information that is needed. The multi-generational LRU already contains several tiers internally that could be used to manage tiering, but one developer said that experiments with LRU-based hot-page detection did not work out as well as had been hoped. It was also said that "hot-page detection" should really be "hot-folio detection", and that scanning should work better in general for larger folios.
One component of NUMA balancing is often called "workload follows memory"; if a task finds itself running far from its memory, it can be migrated to a closer node. That doesn't work for CXL, though, since CXL nodes have no CPU, so tasks cannot be migrated there. It was suggested that task migration should be disabled in general, that balancing workloads across CPUs is no longer relevant in our world. Task migration can throw NUMA systems out of balance, forcing the migration of memory to follow tasks around. Migration can also split communicating tasks apart from each other. Often, it was said, it is better to just leave the system alone.
This was, of course, a room full of memory-management developers; had there been CPU-scheduler developers present, that assertion would likely have been challenged. Even in this crowd, one developer disagreed, saying that the problem isn't task migration; instead, the CPU scheduler just isn't being given enough information to make the best decisions.
There is, it seems, a need for some sort of "hot-memory abstraction" for the kernel. It could take advantage of "accessed" bits in the page tables, performance-monitoring units, the upcoming CXL hot-page detection feature, or any "future hardware innovation" that might be in the works. Whatever information is available should be brought in and provided in a single interface. It could be useful for more than tiering; NUMA balancing would also benefit from better information. One possible problem is that, while tiering does not normally need to know which CPU is accessing data, NUMA balancing depends heavily on that information.
Toward the end of the session, mechanisms for acting on this information were discussed. One possibility is to push as much of it to user space as possible; the migration of memory will often require changes elsewhere in the system (such as redirecting interrupts) that only user space can know about anyway. The kernel currently provides a memory.reclaim knob to trigger reclaim; perhaps a memory.demote could be provided as well? Maybe there needs to be a kpromoted thread, or perhaps that task should be handled by user space.
Memory promotion, it was said, should be done asynchronously, unlike how NUMA migration is done. Moving memory synchronously can create latency blips that user space might notice; performing promotion asynchronously would still be noticeable, but it would not stall an application in the same way.
At the conclusion of the session, it was repeated that a proper solution in this area could lead to "massive amounts" of money being saved, especially in hyperscaler deployments. A memory-tiering working group is being formed to continue work in this area and to ensure that all of the known use cases are handled.
What's next for the SLUB allocator
There are two fundamental levels of memory allocator in the Linux kernel: the page allocator, which allocates memory in units of pages, and the slab allocator, which allocates arbitrarily-sized chunks that are usually (but not necessarily) smaller than a page. The slab allocator is the one that stands behind commonly used kernel functions like kmalloc(). At the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, slab maintainer Vlastimil Babka provided an update on recent changes at the slab level and discussed the changes that are yet to come.

Once upon a time, the kernel contained three slab-allocator implementations. That number had dropped to two in the 6.4 release, when the SLOB allocator (aimed at low-memory systems) was removed. At the 2023 summit, Babka began, the decision had been made to remove SLAB (one of the two general-purpose allocators), leaving only SLUB in the kernel. That removal happened in 6.8. Kernel developers now have greater freedom to improve SLUB without worrying about breaking the others. He had thought that nobody was unhappy about this removal, he said, until he saw the recent report from the Embedded Open Source Summit, which contained some complaints. Even there, though, the primary complaint seemed to be that the removal had happened too quickly — even though he thought it had taken too long. Nobody seems to be clamoring to have SLAB back, anyway.
Last year, some concerns had been expressed that SLUB was slower than SLAB
for some workloads. But now, nobody is working on addressing any remaining
problems. David Rientjes said that Google is still working on
transitioning to SLUB; in the process it has turned up that using SLUB
resolves some jitter problems that had been observed with SLAB, so folks
there are happy with the change.
Babka said that he has been working on reducing the overhead created by the accounting of kernel memory allocations in control groups; this cost shows up in microbenchmarks, and "Linus is unhappy" about it. There are some improvements that are ready to go into 6.10, but there is more work to do. Another area of slab development is heap-spraying defense; these patches are a bit of a problem for him. He can review them as memory-management changes, but he lacks the expertise to judge the security aspect.
Work is being done on object caching with prefilling. This feature would maintain a per-CPU array of objects that users could opt into; the cache could be prefilled (preallocated) ahead of time so that objects are ready to go when needed. That would be useful for objects allocated in critical sections, for example. The initial intended user is the maple tree data structure, which is currently bulk-allocating a worst-case number of objects before entering critical sections, then returning the unused objects afterward. The object cache would eliminate that back-and-forth while ensuring that objects could be allocated when needed.
Michal Hocko pointed out that the real problem that is driving this feature is the combination of GFP_ATOMIC allocations with the __GFP_NOFAIL flag; that combination is difficult for the kernel to satisfy if memory is tight. The allocator currently emits a warning when it sees that combination; avoidance of it on the part of developers would be appreciated, he said. The prefilled object cache is one way of doing that. In the future, some sort of reservation mechanism may be added for such situations as well.
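As a rough illustration (not code from any posted patches), the problematic flag combination and the kind of opt-in prefilling being discussed might look like the sketch below; kmem_cache_prefill() and kmem_cache_alloc_cached() are hypothetical names used only for this example:

    /* The combination the allocator warns about: atomic, but must not fail. */
    obj = kmalloc(sizeof(*obj), GFP_ATOMIC | __GFP_NOFAIL);

    /*
     * Sketch of the prefilled-cache alternative; the helper names are
     * hypothetical, and the real API is still under discussion.
     */
    if (kmem_cache_prefill(cache, 8, GFP_KERNEL))   /* sleepable context */
            return -ENOMEM;

    spin_lock(&lock);                               /* critical section */
    obj = kmem_cache_alloc_cached(cache);           /* cannot sleep, cannot fail */
    spin_unlock(&lock);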
Another problem exposed by the maple tree has to do with its practice of freeing objects with kfree_rcu() — an approach taken often in kernel code. The problem is that memory freed in this way is not immediately made available for other uses; it must wait for an RCU grace period to pass first. That can lead to an overflow of the per-CPU arrays used by kfree_rcu(), causing flushing and, perhaps, a quick refill starting the cycle all over again. To complicate the issue on Android, RCU callbacks are only run on some CPUs, which isn't useful for processing the per-CPU arrays on the CPUs that don't run them.
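The pattern in question looks roughly like the following minimal sketch (the structure and field names are invented for illustration); the key point is that the freed memory cannot be reused until an RCU grace period has elapsed:

    struct tree_node {
            /* ... payload ... */
            struct rcu_head rcu;    /* required by kfree_rcu() */
    };

    /*
     * Free the node once all current RCU readers are done with it; until
     * the grace period ends, this memory cannot be handed out again,
     * which is what can overflow the per-CPU arrays described above.
     */
    kfree_rcu(node, rcu);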
The plan is to create a kfree_rcu() variant that puts objects in an array and sets them aside to be freed as a whole. Once that has happened, the entire array can be put back into the pool and made available to all CPUs. This array is to be called a "sheaf"; it will be stored in a per-node "barn". One potential problem is that it may become necessary to allocate a new sheaf while freeing objects; allocations in the freeing path need to be avoided whenever possible. The group talked about alternatives for a while without coming to any conclusions.
Meanwhile, Babka is not satisfied with removing just SLOB and SLAB; next on the target list is the special allocator used by the BPF subsystem. This allocator is intended to succeed in any calling context, including in non-maskable interrupts (NMIs). BPF maintainer Alexei Starovoitov is evidently in favor of this removal if SLUB is able to handle the same use cases. The BPF allocator currently adds an llist_node structure to allocated objects, making them larger; switching to SLUB would eliminate that overhead. It would also serve to make SLUB NMI-safe and remove the need to maintain yet another allocator.
Babka would also like to integrate the objpool allocator, which was added to the 6.7 kernel without any consultation with the memory-management developers at all. Finally, as the session ran out of time, Babka mentioned the possibility of eventually integrating the mempool subsystem (which is another way of preallocating objects). The SLUB allocator could set aside objects for all of the mempools in the system, reducing the overhead as a whole. That, though, looks like a topic for discussion at the 2025 summit.
Facing down mapcount madness
The page structure is a complicated beast, but some parts of it are more intimidating than others. The mapcount field is one of the scarier parts. It allegedly records the number of references to the page in page tables, but, as David Hildenbrand described during the memory-management track at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, things are more complicated than that. Few people truly understand the semantics of this field, but the situation will hopefully get better over time.

There are a number of problems surrounding the page mapping count, starting with the fact that a page-table mapping is only one way to create a reference to a page. Reference-tracking confusion has led to severe bugs in the past. The adoption of folios has, in the short term at least, made things worse in some ways (while improving them in others), since mappings can happen at both the folio and page levels. Determining if a folio is mapped can require iterating over the mapping counts of all the pages it contains, which gets slower as folios get larger. All of this leads to a desire to clarify the use of mapping counts, and to eliminate the use of page-level mapping counts whenever possible.
Hildenbrand started by referring back to "simpler times" when the kernel maintained a simple, 31-bit map count for each page. If that count was zero, then the page was not mapped into user space; a count of one indicated that there was a single user, while anything larger meant that the page was shared. But then the kernel added transparent huge pages, and life got more complicated. It was a natural evolution that led to flags like PG_double_map, which indicated that the page was mapped at both the page-table (PTE) and page-middle-directory (PMD) levels — that it was mapped as both a base page and a huge page, in other words. There followed a whole series of functions for handling the mapping count with names like page_trans_huge_map_swapcount(). Increasingly, nobody really understood what mapcount really meant.
That said, things have improved; the folio work has helped to straighten a
lot of things out. The semantics of mapcount are "almost clear"
now, he said. A count of zero means that a folio is not mapped; if it is greater
than zero, then mappings exist. A count of one indicates an exclusive
mapping; a count greater than one says that the folio might be
mapped shared. There is a function, folio_likely_mapped_shared(),
in linux-next that makes an "educated guess" as to whether a given folio is
shared.
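A caller of that function might look something like the sketch below; the handle_*() helpers are invented for illustration, and the function, as its name suggests, can only provide a best guess:

    if (!folio_mapped(folio)) {
            /* No page-table mappings at all. */
    } else if (folio_likely_mapped_shared(folio)) {
            handle_shared(folio);       /* probably mapped by more than one process */
    } else {
            handle_exclusive(folio);    /* probably mapped by exactly one process */
    }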
Part of the objective here is to stop keeping track of mappings at the page level; doing so requires fixing code that is using that information. The page_mapped() function is easy to remove, and total_mapcount() went away in 6.9. page_mapcount() is harder, since there is no direct translation to a folio function. Instead, most of the users of page_mapcount() have been removed; the last few call sites (including those in KSM and khugepaged) are going to be challenging to fix, though.
There are some other problems yet to be solved as well, he said. Large folios that are smaller than the PMD size cannot have PMD mappings, so they only have PTE-level mappings. That means maintaining the per-page map counts in each page, which is expensive; the atomic operations on each page add up. Some other planned optimizations may make maintaining those per-page counts impossible. As a result, the kernel may not be able to tell if a folio is mapped shared; it is also not possible to handle folios that are larger than the PMD size. That latter problem could perhaps be addressed by adding a map count for each PMD entry covered by a folio, and perhaps extending that solution to higher levels of the page-table hierarchy as well. That is not a pleasing solution and should be avoided if possible, but it can be a backup if nobody comes up with anything better.
Hildenbrand put up a slide showing the various use cases for the mapping
count, both in the present and the future. All is good for small folios,
he said, but it is hard to keep track of whether large folios are shared in
current kernels. That situation is somewhat improved in the mm-stable tree
(which may have moved into the mainline for 6.10 by the time you read
this), but there is still work to be done.
One place where the shared status of a folio is important is in memory-use accounting. There are three different sizes used to describe a process's memory use. The resident-set size (RSS) is the number of pages that a process has resident in memory at any given time. The unique set size (USS) only counts pages that are unique to the process, not counting the shared pages. The proportional set size (PSS) is calculated by dividing the number of shared pages by the number of processes sharing them. If a process maps 100 pages shared with three others, its PSS will increase by 25 pages.
If a process maps a single page from a 16-page folio, all three set sizes will grow by one page — 4KB. That is wrong, Hildenbrand said, since the full 16 pages are all in memory; the increase should be 64KB. But there is no way to get that result in the kernel currently. On the other hand, the current model works correctly if a folio is split.
Calculating these set sizes requires page_mapcount() to determine if a page is shared and, if so, how widely it is shared. In the absence of a per-page map count, some other solution will have to be found. One possibility is to just use the folio mapping count, and to keep a count of mapped pages at the PMD level. For most other uses, including the USS calculation, all that is really needed is to know whether a folio is mapped exclusively or not.
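In greatly simplified form, the per-page accounting that depends on page_mapcount() looks something like the sketch below; this is not the actual smaps code, just an illustration of why losing the per-page count matters:

    /* Simplified sketch of set-size accounting; not the real kernel code. */
    static void account_page(struct page *page, u64 *rss, u64 *uss, u64 *pss)
    {
            int mapcount = page_mapcount(page);    /* the per-page count at issue */

            *rss += PAGE_SIZE;
            if (mapcount == 1)
                    *uss += PAGE_SIZE;             /* mapped by this process only */
            if (mapcount > 0)
                    *pss += PAGE_SIZE / mapcount;  /* split the cost among mappers */
    }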
Upcoming changes will cause the kernel to lose its ability to track the number of pages mapped within a folio; that will result in charging a user for an entire folio if any page is mapped. It might also cause USS to be too small if a folio is mapped with a combination of exclusively mapped and shared pages, and PSS may lose precision. It is not clear that this will be a big problem; there will be a debugging option to allow developers to get a better handle on the situation.
One potential problem for the future is an overflow of the page reference count, which includes the map count but also any other types of reference that a page might have. Overflow is not seen as a problem for small folios now; Matthew Wilcox pointed out that it would require a system with terabytes of installed memory to even get close. But large folios, with more pages (and thus more reference counts to add up) are a different story, especially on 32-bit systems. Michal Hocko suggested just making the reference count a 64-bit quantity and seeing if anybody complains. Hildenbrand said that the kernel could also simply avoid incrementing the reference count if the mapping count is greater than zero; that would save some atomic operations as well.
By this point, time had run out. As the session closed, it was pointed out that some drivers use the mapcount field for their own purposes on pages that are not otherwise mapped. Wilcox suggested that such uses need to be "excised from the kernel".
Dynamically sizing the kernel stack
The kernel stack is a scarce and tightly constrained resource; kernel developers often have to go far out of their way to avoid using too much stack space. The size of the stack is also fixed, leading to situations where it is too small for some code paths, while wastefully large for others. At the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, Pasha Tatashin proposed making the kernel stack size dynamic, making more space available when needed while saving memory overall. This change is not as easy to implement as it might seem, though.

Every thread has its own kernel stack, he began. The size of each thread's kernel stack was 8KB for a long time, and it is still that way on some architectures, but it was increased to 16KB on x86 systems in 2014. That change, which was driven by stack overflows experienced with subsystems like KVM and FUSE, makes every thread in the system heavier.
Expanding the stack size to 16KB has not solved all of the problems, though; kernel code is using more stack space in many contexts, he said. I/O is becoming more complex, perf events are handled on the stack, compilers are more aggressively inlining code, and so on. Google has stuck with 8KB stacks internally for memory efficiency, but is finding that to be increasingly untenable, and will be moving to 16KB stacks. That, he said, is an expensive change, causing an increase in memory use measured in petabytes. There are applications that create millions of threads, each of which pays the cost for the larger kernel stack, but 99% of those threads never use more than 8KB of that space.
Thus, he proposed making the kernel stack size dynamic; each thread would
start with a 4KB stack, which would be increased in response to a page
fault should that space be exhausted. An initial implementation was posted
in March.
The proposed solution takes advantage of virtually mapped stacks, which make it
relatively easy to catch overflows. A larger stack is allocated in the
kernel page tables, but only one 4KB page is mapped. The result is a
significant speedup because the kernel does not have to find as much memory
for kernel stacks, and tests have shown a 70-75% savings in memory used for
the stacks. That, he said, was from a "simple boot test"; tests with real
workloads would have shown a larger savings.
There is an interesting challenge associated with page faults for stack access, though: page faults are also handled on the kernel stack, which has just run out of space. When a thread tries to access an unmapped page and causes a page fault, the fault handler will try to save the current processor state onto the kernel stack, which will cause a double fault. The x86 architecture does not allow handling double faults; code is simply supposed to abort and clean up when that happens. If the kernel tries, instead, to handle that fault and expand the stack, it is operating outside of the rules defined by the architecture, and that tends not to lead to good things.
Solutions to that problem seem to be expensive. One idea, suggested by Matthew Wilcox but also already present on Tatashin's slides, is to add an expand_stack() function that would be called by subsystems that know they will need more stack space. It would map the extra space ahead of its use, avoiding the double-fault situation. Michal Hocko responded that this solution seemed like a game of Whac-A-Mole, with developers trying to guess where the stack might overflow. But direct reclaim, which can call filesystem or I/O-related functions with deep stack use, can happen just about anywhere. If that causes an overflow, the system will panic.
A second possible solution, Tatashin said, would be to take advantage of some of the kernel-hardening work to automatically grow the stack as needed. Specifically, he would like to use the STACKLEAK mechanism, which uses a GCC plugin to inject stack-size checks into kernel functions as they are compiled. That code could be enhanced to automatically grow the stack when usage passes a threshold. This solution adds almost no overhead to systems where STACKLEAK is already in use — but it is rather more expensive if STACKLEAK is not already enabled.
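Conceptually, the injected check would behave something like the sketch below; the helper names are hypothetical, and the real STACKLEAK instrumentation (and its proposed extension) differs in the details:

    void some_deep_function(void)
    {
            /*
             * Injected at compile time for functions whose stack frames
             * exceed the configured threshold (hypothetical helpers).
             */
            if (kernel_stack_remaining() < STACK_EXPAND_THRESHOLD)
                    dynamic_stack_grow();   /* map another page before it is needed */

            /* ... original function body ... */
    }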
Finally, a third option would be to limit dynamic stacks to systems that either do not save state to the stack on faults or that can handle double faults. Tatashin suggested that x86 systems with FRED support might qualify, and 64-bit Arm systems as well.
Time for the session was running short as Hocko said that he liked the second solution but wondered what the cost would actually be. Tatashin said that he has been working on reducing it; he has refactored the STACKLEAK code to be more generic, so that it can be used for this purpose without needing to include the hardening features. A stack-frame size can be set at build time, and the plugin will only insert checks for functions that exceed that size. David Hildenbrand said that this scheme could be thwarted by a long chain of calls to small functions; Hocko said that would make the feature less than fully reliable. Tatashin answered that there is probably at least one large function somewhere in any significant call chain, but Hocko said that is not necessarily the case with direct reclaim.
Steve Rostedt said that, perhaps, the frame-size parameter could be set to zero, causing the check to be made at every function call; Tatashin answered that, while he has not measured the overhead of the check, it would certainly add up and be noticeable in that case. The final suggestion of the session came from Hocko, who said that perhaps the ftrace hooks could be used instead of the STACKLEAK infrastructure, but Rostedt said that option would be too expensive to be worth considering.
Memory-allocation profiling for the kernel
Optimizing the kernel's memory use is made much easier if developers have an accurate idea of how memory is being used, but the kernel's instrumentation is not as good as it could be. When Suren Baghdasaryan and Kent Overstreet presented their memory-allocation profiling work, which is meant to address this shortcoming, at the 2023 Linux Storage, Filesystem, Memory Management, and BPF Summit, their objective was uncontroversial but the proposed solution ran into opposition that played out at length on the mailing lists (example) over the last year. So it may be a bit surprising that, when the two returned to the memory-management track in the 2024 gathering, the controversy was gone and the discussion focused on improving details of the implementation.

As a review: the allocation-profiling work tracks all allocations of memory in the kernel and maps them back to the code that performed the allocation. It can be used to see where memory is being used and to track down memory leaks. The profiling, in turn, relies on code tagging, which inserts special structures into the code allowing locations to be identified. Both features are new to the mainline kernel.
Baghdasaryan started by saying that the patch set had been accepted into
the mm-stable tree and was poised to go upstream into the mainline (that
has since happened in the 6.10 merge window). The discussion on whether
this code should be merged was over, so it was time to talk about what
comes next.
The main topic was reducing the memory and performance overhead of the profiling mechanism. If it is enabled, it consumes about 0.2% of the system's total memory — enough to be concerned about. It turns out that almost all of that overhead is in the page_ext structures used to hold the back pointer from a page of memory to the tag identifying the code where it was allocated. That pointer is used to decrement the associated counters when the page is freed. On the performance side, allocation profiling makes page allocations 40% slower, and has a smaller, 7% impact on slab allocations.
One way of reducing that overhead would be to pack the code-tag references, of which there are 4-5,000 in the kernel. With some care, there is no need to use a 64-bit pointer for each. Instead, the references could be made smaller and, possibly, packed into the page flags, eliminating the need for the page_ext structure and reducing the allocation overhead. On the other hand, this approach would introduce complications with loadable modules, Baghdasaryan said. The group then spent a while discussing possible linker tricks to solve that problem without reaching any specific conclusions.
Assuming the loadable-module problem can be solved, the allocation-profiling code would store 16-bit references rather than 64-bit ones, resulting in a 75% reduction in the memory used — for page allocations. The overhead for slab allocations actually increases to 9.5%, though, suggesting that perhaps those references should not be packed. But if that 16-bit reference can be crammed into the page flags, then the memory overhead goes away completely and the performance overhead at allocation time goes from 40% to 7%. Without this additional step, he said, the packed references are not worth the extra complexity cost.
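A hypothetical sketch of what packing a 16-bit tag reference into the page flags could look like appears below; the merged code instead stores a full pointer in the page_ext structure, and finding free flag bits is exactly the problem discussed next:

    /* Hypothetical packing of a tag index into page flags; not merged code. */
    #define ALLOC_TAG_BITS   16
    #define ALLOC_TAG_SHIFT  48            /* assumes 16 free bits exist */
    #define ALLOC_TAG_MASK   (((1UL << ALLOC_TAG_BITS) - 1) << ALLOC_TAG_SHIFT)

    /* A real implementation would have to worry about concurrent flag updates. */
    static inline void page_set_alloc_tag(struct page *page, unsigned int idx)
    {
            page->flags = (page->flags & ~ALLOC_TAG_MASK) |
                          ((unsigned long)idx << ALLOC_TAG_SHIFT);
    }

    static inline unsigned int page_get_alloc_tag(struct page *page)
    {
            return (page->flags & ALLOC_TAG_MASK) >> ALLOC_TAG_SHIFT;
    }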
John Hubbard was the one to ask a question that was likely on the mind of many of the developers in the room: is it really possible to find 16 free page flags to use for this purpose? Page flags have long been in short supply, and developers have had to fight hard to use even a single one of them. There was not a clear answer to that question. Pasha Tatashin suggested that perhaps fewer than 16 bits would suffice for 5,000 references. There followed a winding discussion on the kernel configurations used by various distributions, their effect on the availability of page flags, and whether any of them could be changed; it did not reach any specific conclusions.
Tatashin said that it would be nice to have the ability to selectively
enable and disable tags to, for example, avoid slowing down a critical
network driver while profiling allocations in an unrelated subsystem. He
would also like to separate accounted and unaccounted allocations; the
latter, which are not charged to any specific process, represent pure
overhead imposed by the kernel. Overstreet answered that the profiling
could show the allocation flags used along with other information, but also
asked whether it might not be better to just turn on accounting for all
allocations. He acknowledged that accounting would have to be made cheaper
for that to be an option.
The allocation-profiling subsystem's path into the kernel was eased by the
dropping of a number of features that it had initially included. Now the
developers would like to bring some of those back, Baghdasaryan said.
These include capturing more information about allocation context and
dynamic fault injection (which wasn't discussed in the session; this
feature allows allocation failures to be injected into specific code paths
to test error handling). Some sort
of selection mechanism, as requested by Tatashin earlier, is also on the
list. Overstreet closed the session by saying that interest in allocation
profiling (and code tags) is increasing, and that some interesting
uses that he had never thought of were emerging.
Another try for address-space isolation
Brendan Jackman started his memory-management-track session at the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit by saying that, for some years now, the kernel community has been stuck in a reactive posture with regard to hardware vulnerabilities. Each problem shows up with its own scary name, and kernel developers find a way to mitigate it, usually losing performance in the process. Jackman said that it is time to take back the initiative against these vulnerabilities by reconsidering the more general use of address-space isolation.

In a typical exploit, he said, an attacker will start by carefully mistraining a CPU's branch-prediction hardware. Then, a call into the kernel will cause speculative execution to take a wrong path; the erroneous speculation will be mostly cleaned up when it becomes clear that it was wrong, but not without leaving a secret behind somewhere. The attacker then recovers that secret and leaks it by way of some sort of covert channel.
Keeping data unmapped
This work was covered here in 2022; see that article for the details (the patches have not been publicly posted since then). Jackman began with a brief overview, pointing out that Linux uses address-space isolation now to keep much of the kernel inaccessible (even via speculation) from user space; there are separate page tables for user and kernel mode. Keeping the kernel's address space isolated from user space protects it from Meltdown vulnerabilities.
The proposed patch addresses Spectre vulnerabilities by providing address-space isolation within the kernel. It splits the kernel page table into two: a "restricted" page table that only maps readily available (nonsensitive) data, and an "unrestricted" table that maps all of the kernel, including sensitive data. The restricted table is active until there is an actual need to access sensitive data; any attempt to do so will cause a page fault, at which point the kernel will flush caches, perhaps halt sibling processors, then continue with the unrestricted table. That switch is expensive, so the best performance will be had if most paths through the kernel only access nonsensitive data.
This is, he said, a naive solution, in that everything is either sensitive or not, with no shades of gray in between. Making it less naive involves adding a third level, called "local nonsensitive" (this approach was already reflected in the 2022 patch set). Data in this class can be leaked back to the calling process without ill effect; it is, essentially, information that this process already has access to. But locally nonsensitive data should be protected from any other process in the system. In this mode, each process will have its own set of restricted kernel page tables; it adds complication, so Jackman would like to proceed without this aspect in the beginning, if possible.
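In rough, conceptual terms, the switch to the unrestricted page tables might look like the sketch below; all of the helper names are invented for illustration, and the actual patches are considerably more involved:

    /* Invoked from the page-fault path on access to unmapped (sensitive) data. */
    static void asi_enter_unrestricted(void)
    {
            flush_sensitive_cpu_state();        /* flush caches and CPU buffers */
            maybe_stun_siblings();              /* optionally park HT siblings */
            load_unrestricted_page_tables();    /* sensitive data is now mapped */
    }

    /* On the return path toward user space, switch back to the restricted tables. */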
He put up a performance chart showing that existing mitigations for Spectre vulnerabilities have a significant performance cost. With address-space isolation in place and the other mitigations turned off, almost all of that performance was regained and the system was still protected against speculative vulnerabilities.
There are, he said, some questions that need to be answered about this work; the first of those is about how sensitivity of data is annotated. There is a new set of GFP flags that are used at allocation time for that purpose, Jackman said. In the future, it might also be possible to use the subsystem context more directly; perhaps everything touched by the crypto layer should be seen as sensitive. Eventually the desire will be to figure out sensitivity at run time.
Even with allocation flags, there are two alternatives that need to be considered, given the need to minimize the amount of restricted data in the interest of better performance. One would be to consider all allocations to be sensitive unless they are specifically marked otherwise; that is, he said, "the only competent security answer". The other is to consider data nonsensitive unless specifically marked as sensitive — "the only competent performance answer". In the end, he said, there are three objectives to aim for: full mitigation, good performance, and reviewable patches. The community, somehow, has to pick which two of those it wants.
An audience member grumbled that all of this work is just a band-aid, that the proper solution is to just keep sensitive data on a separate processor. David Hildenbrand complained that the community is stuck writing code with the assumption that the hardware is compromised. That is the situation we are in, but he worried that address-space isolation would make it easier for hardware companies to just not care about speculative-execution vulnerabilities. Address-space isolation is designed around the idea that speculative-execution bugs will always be severe, and that may end up perpetuating that situation. Jackman responded that he did not believe that it is possible to create a CPU that is entirely free of this kind of problem, so speculative vulnerabilities will be with us for a long time regardless.
He returned to his question of whether the initial version of this work should start by emphasizing security or performance. His instinct is to prioritize security, then work on performance until it reaches a point where people actually want to run it. Until that happens, though, bad performance is likely to inhibit testing of the patches. As the session closed, Dan Williams pointed out that Spectre mitigations like retpolines started by emphasizing security, leaving performance for later. That has worked out well, he said; the community tends to be more motivated to innovate around performance than security. So, chances are, that is the tradeoff we are likely to see when this patch series returns to the mailing lists.
Implementation details
The discussion was not finished at that point, though; Jackman was able to schedule another slot the next day to get into a few of the details that he was trying to resolve. The core challenge, he said, is that the kernel has to take pains to flush the translation lookaside buffer (TLB) as part of the transition between the unrestricted and restricted modes to prevent use of the TLB as a covert channel. This flushing is expensive, so it should not be done more often than is strictly necessary.
The most conservative approach, he said, would be to perform a flush every time a page is freed; that would clearly slow things down considerably. So the current approach is to free pages in batches in a kernel thread, then perform the flush once at the end. A proper solution would look different, but would require the kernel to remember the sensitivity of every free page — whether it had been mapped into the restricted address space, in other words. Then, if an allocation request comes in, and the page used to satisfy it was nonsensitive, there is no need to bother with a TLB flush before returning a page.
Jackman was unsure of how to remember the previous sensitivity of free pages, though. One possibility might be to add a new migration type to track it. Another could be to add a new memory zone; this idea was met with a resounding "no" from the room.
Michal Hocko asked how developers would request sensitive memory; the answer is to use the new __GFP_SENSITIVE allocation flag. Since all of user-space memory is considered sensitive (the kernel has no way to know which user pages actually contain sensitive data), that flag is folded into GFP_USER and need not be added separately. There is a new page flag used to mark sensitive pages. Jackman said that he hadn't realized prior to the conference that adding new GFP flags is discouraged; Hocko answered that those flags are in short supply, and that kernel code tends to use them incorrectly in any case.
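Under the proposed interface, an allocation that must never appear in the restricted address space would simply add the new flag, along these lines (a sketch, not code from the patch set):

    /* Key material that should not be reachable via speculation. */
    void *key = kmalloc(64, GFP_KERNEL | __GFP_SENSITIVE);

    /*
     * User-space pages are always treated as sensitive; the flag is
     * folded into GFP_USER, so ordinary user allocations need no change.
     */
    struct page *page = alloc_page(GFP_USER);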
Jackman asked for alternative suggestions; Hocko mentioned the scoped interface that is used to modify allocations performed from within the filesystem and I/O paths. Perhaps something similar could be done for sensitive data; that could be better than annotating specific allocations, he said. There are a lot of allocation sites in the kernel; annotating them all is not really feasible, and the end result is sure to be incorrect.
As this session came to a close (for real, this time), Jackman noted that some allocations must be marked as nonsensitive, regardless of the data to be stored there. Specifically, the kernel cannot take page faults around the system-call entry path, so memory accessed then must be nonsensitive.
Faster page faults with RCU-protected VMA walks
Looking up a virtual memory area (VMA) in a process's address space, for the handling of page faults or any of a number of other tasks, has long been bedeviled by lock contention in the kernel, especially in multi-threaded processes. As a result, developer gatherings have been subjected to many sessions on how to improve the situation. At the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, developers in the memory-management track met, in a session led by Liam Howlett, to talk about a situation that has improved considerably in recent times, but which still offers opportunities for optimization.

Howlett began by referring back to a 2022 LSFMM+BPF session where Mel Gorman had suggested performing locking during the VMA-walk process at the VMA level itself, rather than locking the whole VMA tree. At that level, Gorman thought, contention would be far less. In current kernels, Howlett said, that is what happens; the fault-handling code will first try locking the VMA tree with the read-copy-update (RCU) read lock, only falling back to the mmap_lock if it has to. The VMA of interest can be locked individually once it is located; after the fault is handled, the code calls release_fault_lock(), which will either drop the mmap_lock or the RCU lock as appropriate. It is not the most elegant solution, he said, but it does hide the details nicely.
With regard to performance, he noted that fault-handling actually got
slower in the kernels between 5.19 and 6.2 as this work began; distributors
were starting to get nervous, he said. But then, in 6.4, the per-VMA locking work went in, and performance
doubled. By the time 6.6 came around, fault handling was almost three
times better than it had been before the work began, a result that he called
"pretty awesome".
For code that needs to walk through the page tables in current kernels, he said, the common pattern is to take the RCU read lock before locating the specific VMA of interest. Code can then call lock_vma_under_rcu() to try to take the VMA-specific lock and ensure that the VMA does not go away until the work is done. That attempt could fail, though, so code has to be prepared to fall back to mmap_lock in that case. Page-fault handling is trickier, though, especially for unpopulated, anonymous memory. In that case, the code may need to examine the neighboring VMAs, and the per-VMA lock won't cover them. Locking multiple VMAs is a quick path to deadlocks, so that is not really an option. The userfaultfd() subsystem adds its own special cases as well.
For anybody else writing code that works through the page tables, he said, looking at the RCU-protected approach rather than taking the contended mmap_lock would make sense. There is still a need to work out the best API for all of the use cases out there, though.
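The pattern, much simplified, looks something like the sketch below; do_the_work() stands in for whatever the caller actually needs to do, and details such as the handling of userfaultfd() regions are omitted:

    static int walk_one_vma(struct mm_struct *mm, unsigned long address)
    {
            struct vm_area_struct *vma;

            /* Try the cheap, per-VMA locking path first. */
            vma = lock_vma_under_rcu(mm, address);
            if (vma) {
                    do_the_work(vma);
                    vma_end_read(vma);      /* drop the per-VMA lock */
                    return 0;
            }

            /* Fall back to the traditional, contended lock. */
            mmap_read_lock(mm);
            vma = vma_lookup(mm, address);
            if (vma)
                    do_the_work(vma);
            mmap_read_unlock(mm);
            return vma ? 0 : -EFAULT;
    }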
There is also a little problem in that the VMA tree is not atomic in the absence of mmap_lock. Holding the per-VMA lock will keep the VMA from going away, but some changes may appear in intermediate states. For example, if an munmap() call has to split a VMA, the splitting will become visible before the unmapping does. Matthew Wilcox said that developers need to better define what is being promised; if you found the VMA under the RCU lock, the VMA will continue to exist, but it might not still be a part of the process's address space. Suren Baghdasaryan added that some fields of the VMA, including the file pointer, are not stable under RCU.
The session (and the first day) ended with a winding discussion of one of the use cases driving this work: making /proc/pid/maps have less impact on the system. There are systems out there with a high-priority process doing work, and a low-priority monitoring process that occasionally needs to read that file. If the low-priority process takes memory-management locks that block the high-priority process, the result is the sort of priority inversion that makes users unhappy.
Having /proc/pid/maps work under the RCU lock prevents that sort of inversion, but at the cost that the VMA tree might change while the file is being read. The contents of that file can always be out of date even in current kernels, since the situation can change immediately after it is read, but now it could also be internally inconsistent. There was some debate over how much of a problem that actually is. There were various suggestions of returning sequence numbers that user space could use to detect this situation, or detecting it in the kernel and retrying, perhaps taking the mmap_lock after a few failures to ensure that the job gets done. The session came to a close with no definitive conclusions.
Toward the unification of hugetlbfs
The kernel's hugetlbfs subsystem was the first mechanism by which the kernel made huge pages available to user space; it was added to the 2.5.46 development kernel in 2002. While hugetlbfs remains useful, it is also viewed as a sort of second memory-management subsystem that would be best unified with the rest of the kernel. At the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, Peter Xu raised the question of what that unification would involve and what the first steps might be.

In theory, the kernel's transparent huge page mechanism makes hugetlbfs unnecessary. There are, though, reasons for the longevity of hugetlbfs. It allows huge pages to be reserved, so that they will remain available even if system memory as a whole is fragmented. It also implements page-table sharing across multiple processes, which is not otherwise available in Linux (a later LSFMM+BPF session talked about mshare(), which is meant to fill that gap). And, of course, software has been written using the hugetlbfs ABI, so it must continue to be supported.
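For reference, a minimal user-space use of that ABI might look like the following sketch; error handling is omitted, the mount point is an assumption, and MAP_HUGE_1GB may require <linux/mman.h> on some systems:

    #include <fcntl.h>
    #include <sys/mman.h>

    int main(void)
    {
            /* A file on a mounted hugetlbfs instance, e.g. /dev/hugepages. */
            int fd = open("/dev/hugepages/buf", O_CREAT | O_RDWR, 0600);
            void *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

            /* Anonymous mappings can draw from the same reserved pool: */
            void *q = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                           -1, 0);

            return (p == MAP_FAILED || q == MAP_FAILED);
    }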
Consolidation
Xu began by saying that his objective was not to remove hugetlbfs, but to consolidate it into the rest of the memory-management subsystem. There are 11 different code paths that are specific to hugetlbfs; he thinks that can be reduced to two or three. Making hugetlbfs into an ordinary filesystem is not a goal; doing so would likely increase complexity for little benefit.
Hugetlbfs, thus, will remain a "special", RAM-based filesystem. It is, he
said, ancient stuff, much of which is aimed at use cases that may not even
exist anymore. But developers are afraid to touch it. Hugetlbfs is a
maintenance nightmare, inflicting its special code paths on the rest of the
kernel; users have requested new features, but they have been rejected out
of fear of increasing the complexity of the system. So, he said, there is
no time like the present to deal with this problem. Fortunately, the
large-folio work is making it easier to coalesce at least some of the
hugetlbfs code into the rest of the kernel.
Xu wondered whether this work should be done by creating a new, better version of hugetlbfs, or by working to unify the existing code. His feeling, though, is that a new version would not be justified; there is no need for any sort of ABI break, which would be the biggest reason to start over. Unifying hugetlbfs means working with an ugly ABI implemented by ugly code, but starting over would bring an entirely different kind of pain.
David Hildenbrand agreed that the hugetlbfs ABI is ugly; for him, though, the biggest problem is all of the "if (hugetlbfs)" calls sprinkled through the rest of the memory-management subsystem. Many of these tests are driven by alignment requirements. Creating a new version of hugetlbfs would be too much, he said, but there would be value in being able to set a flag to remove some of the hugetlbfs restrictions; that would make it possible to, for example, free half of a hugetlb folio. Xu agreed with that view.
Hildenbrand mentioned high-granularity mapping as a proposed hugetlbfs enhancement that ended up being rejected out of fear of adding more hugetlbfs-related complexity to the memory-management subsystem. Rather than add special-case exceptions like that, he said, it would be better to just drop the hugetlbfs restrictions everywhere. Michal Hocko, though, asked the group to take a step back and summarize the features that are actually needed. Hugetlbfs came about in a time when transparent huge pages didn't exist; perhaps it would be better to make more use of transparent huge pages than to add more hugetlbfs features.
Xu answered that the use of transparent huge pages has its own performance impact; the realtime configuration disables it, for example. There are also use cases that insist on 1GB huge pages, and hugetlbfs is the only way to get them in current kernels. He would, he said, be happy to see a proposal based on transparent huge pages that addresses those concerns.
The 1GB page reservation
John Hubbard said that there are a lot of artificial-intelligence applications out there that can benefit from huge pages; some of those applications need huge pages badly, and so they use hugetlbfs. Others can just take advantage of the kernel's improving transparent huge page support and get faster with no additional effort. There are, he suspects, some applications out there that have been well tuned and benefit from not having to wait for the kernel to collapse their memory into transparent huge pages. Some applications, though, will always need huge pages that are guaranteed to be available.
A remote participant said that hugetlbfs is often most useful to allocate memory for virtual machines; this use case really wants the 1GB guarantee that hugetlbfs can provide. In this case, the 1GB aspect is the only thing that matters. Another remote attendee said that the high-granularity-mapping code was an attempt to add transparent huge page features to hugetlbfs, but that it would be better to support 1GB huge pages in the core memory-management subsystem than to add more hugetlbfs features.
Jason Gunthorpe said that he would really like to see the hugetlbfs code taken out of the core; after that, he doesn't care about any "craziness" hidden within it. Matthew Wilcox said that the biggest problem is the hugetlbfs page-table walker, which has a lot of special cases and needs to be gotten rid of, somehow.
Xu tried to reach a sort of conclusion by saying that there is still sense in having a separate allocator that can provide the guarantees that some applications need. But, he said, if he cannot implement high-granularity mappings on top of that allocator, he will lose a lot of his motivation to do this work. Hildenbrand said that, if this work is done right, high-granularity mappings should just come naturally.
Xu continued, saying that anybody who wants partial mappings in hugetlbfs should go ahead and post a patch; it will be interesting to see how that works with the 1GB-page allocator. There is still a need for a better interface to consume hugetlbfs pages, though. Gunthorpe said that memfd is that interface; it just needs to be taught how to reach into hugetlbfs, which could provide a single reservation for all users needing 1GB pages. Hildenbrand said that plans for guest_memfd() need a number of the proposed features, including partial mappings and high-granularity mapping. Gunthorpe added that there is merit in separating the various hugetlbfs components; the 1GB page pool is generally useful and should be a separate feature. In general, users want the reservation feature, but would rather do without a "screwy ABI". Accessing the reservation with an mmap() flag would be nice, he said.
Dan Williams read a suggestion from the online chat: hugetlbfs should be removed and reimplemented as an fallocate() option on the tmpfs filesystem. Xu said that, in that case, the challenge would be getting users to move over; a deprecation process would be needed. Another participant said that adding hugetlbfs features to tmpfs would require unifying the page-table walker.
Gunthorpe said that, once features become available in the core memory-management subsystem, everything else just falls into place. A new ABI could then be simply implemented as a memfd ioctl() call providing access to the 1GB-page reservation. Hocko, though, said that pushing users away from hugetlbfs would take 15 to 20 years; it is better to just leave it in place, clean up its internals, and make them usable elsewhere.
For 1GB pages, Xu said, the mechanism is already in place; all that is needed is to expose a better ABI for it. Hildenbrand suggested, again, simply dropping the restrictions on hugetlbfs pages, allowing 1GB huge pages to be mapped as needed. Xu continued that existing users do not see the hugetlbfs ABI as ugly; they are happily using it. The memory-management developers, instead, are not happy with it; is that a sufficient reason to introduce a new ABI?
As this (two-slot) session ran out of time, Hildenbrand mentioned the strange semantics that hugetlbfs imposes on MAP_PRIVATE mappings. Among other things, that makes it impossible to insert a uprobe or a breakpoint in a hugetlbfs 1GB page. He said that it was clear that Xu would have to clean up the page-table walker, but that the kernel would have to continue to provide hugetlbfs as it is, since there are users out there.
The next steps
The discussion was not done, though; another slot was scheduled later in the day. Xu got more deeply into the details, saying that, in his first attempt, he was trying to clean up the get_user_pages() code path (which is the way that the kernel maps user-space pages). After some work, that project was mostly successful; patches have been posted and since merged for the 6.10 release.
There are numerous challenges remaining, though. One of those is the "hugepd" mechanism used by the PowerPC architecture to handle huge pages. Hugepd is imposed by that architecture's special page-table requirements, but it can evidently be gotten out of the way for huge pages, simplifying the unification of the code. Christophe Leroy has posted a patch set doing that work; Xu would like some help reviewing it.
Huge pages can be represented in three ways in the kernel, he said. They can be a huge mapping as defined by the architecture (a PMD-level mapping, for example), the "cont-pte" format (where the huge page is mapped as base pages, but with a special flag set to tell the CPU that a group of physically contiguous pages exists — see this article), and the PowerPC hugepd format. The page-table-walker ABI supports only the first two of them. Unification requires adding generic support for hugepd, or just removing it; the latter approach is the direction taken by Leroy's patch set, but it needs to be extended to remove hugepd completely.
A generic page-table walker that handles all cases would be an elegant solution, he said, if it could be achieved. Wilcox said that work needs to be done to make page-table walkers easier to write, starting with figuring out what all the needs are. Gunthorpe agreed, noting that the kernel is full of duplicated page-table-walking code. It would be good to abstract out the details to create a generic ABI; Wilcox said he was tempted to just try it.
Xu asked the group if there was a need to support P4D huge pages; these are mapped one page-table-level higher than 1GB pages, and are 512GB in size. Wilcox said that 512GB pages would be ridiculous, with no practical use; the consensus in the room was that there was no need to support that size anytime soon.
As time (once again) ran low, Xu said that it may never be possible to unify all of the hugetlbfs paths in the kernel; he may have to just give up on some of them. Page-fault handling and PMD-level page-table sharing may be cases in point. There are some hugetlbfs quirks to work around. For example, a read on a MAP_PRIVATE page does not result in a page-cache entry; instead, it creates a read-only anonymous page. It makes no sense to port features like this to generic code, he said.
Wilcox agreed that there was no problem with not unifying quirks like that; they don't affect other users of the system. The PMD-sharing problem is better solved with mshare(). Perhaps the page-table sharing supported by hugetlbfs could eventually be dropped, he said. Xu concluded by listing a set of paths that he intends to address in the near future. These included page-table walking, handling userfaultfd() faults, mprotect(), mremap(), fork(), and more. Some of those, he noted, would be difficult. The session ended with Wilcox expressing his thanks to Xu for addressing this "long overdue" problem.
Merging msharefs
The problem of sharing page tables across processes has been discussed numerous times over the years, Khalid Aziz said at the beginning of his 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit session on the topic. He was there to, once again, talk about the proposed mshare() system call (which, in its current form, is no longer actually a system call but the feature still goes by that name) and to see what can be done to finally get it into the mainline.

Threads, he said, naturally share page tables, but independent processes do not. An individual page-table entry (PTE — mapping a single page) is small, but a process's page tables contain many PTEs and can add up to a significant amount of memory use. The problem is exacerbated when many processes share the same memory region; each of those processes will have its own full set of page tables for that region. He mentioned a case of a large, well-provisioned database server that had 1,500 processes all sharing the same memory area; the resulting page-table overhead was larger than the size of the shared region and ran the system entirely out of memory.
To avoid this kind of problem (and to put that memory to better use), Aziz
would like to bring thread-like page-table sharing to processes; that is
the purpose of mshare(), which was
originally created by Matthew Wilcox. It provides an opt-in mechanism by
which a process can inform the kernel that it wants to share the page
tables for a given region; the kernel then makes it possible for other
processes to map that region. Since page tables are shared, page
protections are also shared, a fact that application developers need to
keep in mind. Pasha Tatashin pointed out that, when page tables are shared, the
virtual address must also be shared — the region must be mapped at the same
address in all processes.
The first version of the mshare() patches was posted in January 2022; it was then discussed at LSFMM+BPF that year, resulting in some significant changes. The system call was renamed to ptshare() then, but Aziz would now like to move forward with mshare(), which has been redesigned around the filesystem-based msharefs concept rather than as a new system call.
To use this feature, Aziz continued, the first step is to mount the msharefs filesystem. A process will then create a file on that filesystem and map it as MAP_SHARED. The fact that the file lives in this special filesystem is the indication to the kernel that the creating process wants to share the page tables for that region with others. Those others can open the file, and read this structure from it:
    struct mshare_info {
            unsigned long start;
            unsigned long size;
    };
The start and size values can then be used by the new process to map the region at the correct location.
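Put together, a consumer of the proposed interface might look roughly like the sketch below; the interface is not merged, the mount point and file name are assumptions, and whether MAP_FIXED is the right way to satisfy the same-address requirement is a detail of the proposal:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *map_shared_region(void)
    {
            /* Open a region created by another process on a mounted msharefs. */
            int fd = open("/mnt/mshare/region", O_RDWR);
            struct mshare_info info;   /* the structure shown above */

            read(fd, &info, sizeof(info));  /* learn where and how big it is */

            /* The region must be mapped at the same address in every process. */
            return mmap((void *)info.start, info.size, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED, fd, 0);
    }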
The kernel maintains an mm_struct structure for each process, describing its address space. Use of msharefs causes the creation of a separate mm_struct, independent of any process, to describe the shared region. The kernel, running in the context of the creating process, will end up copying the relevant virtual memory areas (VMAs) over to the new mm_struct; its original VMAs will be marked with a special flag pointing to the new mm_struct.
David Hildenbrand asserted that msharefs needed to identify the new VMAs as a sort of special container that would prohibit use with features like userfaultfd(), but others objected to that idea. There is no reason, Wilcox said, that these VMAs cannot be used with userfaultfd(), just like memory shared between threads can be used there.
Michal Hocko asked how the shared memory, which has no owning process, is accounted for in control groups; Aziz admitted that accounting was "a little complication" that has not been fully solidified yet. Hocko said that it was important for the kernel to be able to find all of the processes mapping the region and kill them in out-of-memory situations; msharefs cannot be merged without this ability, he said. He added that basic memory accounting also matters; which process is charged when a new page is allocated in response to a fault? Shakeel Butt said that the kernel has no good solution for the accounting of shared memory in general, currently; memory is simply charged to the process that faults it in first.
Another complication, evidently, is that potential users of this feature want the creating process to be able to exit. Hildenbrand, though, said that the page tables should be torn down when the original process goes away. That process should also be the one that is charged for the shared memory, solving the control-group problem. Wilcox worried, though, that keeping the original process around would be an easy way to create unkillable processes.
The final topic covered was locking; Jason Gunthorpe was concerned that it would now be necessary to take two independent mmap_lock locks (one in each mm_struct) to make changes to the VMA tree. Wilcox said that there is only a single level of lock nesting, in a well-defined order, so there can be no cycles (and thus no deadlock worries). Hildenbrand said that most page-table walkers should simply refuse to deal with the special mm_struct, but Gunthorpe said that get_user_pages() needs to work, and that opens a whole can of worms. There are other use cases out there as well, he said. As the session ended, Hildenbrand suggested special-casing things as much as possible, and not trying to do complex things around this strange mechanism initially.
Documenting page flags by committee
For every page of memory in the system, the kernel maintains a set of page flags describing how the page is used and various aspects of its current state. Space for page flags has been in chronic short supply, leading to a desire to eliminate or consolidate them whenever possible. That objective, though, is hampered by the fact that the purpose of many page flags is not well understood. In a memory-management-track session at the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, Matthew Wilcox set out to cooperatively update the page-flag documentation to improve that situation.

Wilcox had no presentation to give; instead, he put up an editor window containing a new documentation file for page flags, then told the audience "shout at me, I'll write it down". The first flag to be covered was Locked; the text that resulted was:
This flag is per-folio. If you attempt to lock a page, you will lock the entire folio. The folio lock is used for many purposes. In the page cache, folios are locked before reads are started and unlocked once the read has completed. The folio is also locked before writeback starts; see the writeback flag for more detail. The truncation path takes the folio lock, and folios are also locked while being inserted into page tables in order to prevent races between truncation and page fault.
These semantics, Wilcox said, are why the lockdep locking checker does not work with this flag; it is taken and released in different contexts, which lockdep cannot handle.
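To make those semantics a bit more concrete, here is a minimal sketch (not something shown in the session) of how a filesystem read path typically interacts with the folio lock; the function name is invented and the I/O submission is elided:

    #include <linux/pagemap.h>

    /*
     * The page cache hands read_folio() a locked folio; the filesystem
     * fills it, marks it uptodate, and unlocks it so that anyone sleeping
     * in folio_lock() or folio_wait_locked() can proceed.
     */
    static int example_read_folio(struct file *file, struct folio *folio)
    {
            if (folio_test_uptodate(folio)) {
                    folio_unlock(folio);
                    return 0;
            }

            /* ... submit and wait for the read I/O here ... */

            folio_mark_uptodate(folio);
            folio_unlock(folio);
            return 0;
    }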
The next flag was Writeback, which ended up being described as:
Per-folio. This is kind of a lock. We downgrade to it having taken the lock [Locked] flag. Released after writeback completes, but lock flag may be released any time after writeback flag set. Depends on filesystem whether needs to do more between. We can wait for writeback to complete by waiting on this flag. Folio put to tail of LRU for faster reclaim. Can prevent tearing write if filesystem needs stable folios. Truncate will wait for flag to clear.
Clearly, there is some editing work yet to be done.
For the Dirty flag, the result was:
At least one byte of the folio contents is newer than on disk and the writeback flag is not yet set. Folios may be both dirty and not uptodate. Lazyfree pages can drop the dirty bit. Dirty flag clear for file folios when we start writeback. Set dirty flag when removed from swapcache. If already dirty, folios can be mapped writable without notifying filesystem. Complicated interfaces to set, easy to get wrong.
Jason Gunthorpe added that there are a lot of users of get_user_pages() that set this flag; all of them are wrong.
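As a rough illustration of the lock/dirty/writeback handoff these notes describe (a sketch only; real filesystems differ in the details and usually complete writeback from an I/O completion handler):

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* Clear the dirty flag and "downgrade" from the folio lock to the
     * writeback flag before submitting the I/O. */
    static void example_writeback_one(struct folio *folio)
    {
            folio_lock(folio);
            if (folio_clear_dirty_for_io(folio)) {
                    folio_start_writeback(folio);
                    folio_unlock(folio);
                    /* ... submit the I/O; on completion, the filesystem
                     * calls folio_end_writeback(), waking anyone waiting
                     * in folio_wait_writeback(). */
                    folio_end_writeback(folio);
            } else {
                    folio_unlock(folio);
            }
    }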
For Uptodate: "Every byte of the folio contents is at least as new as the contents of disk. Implicit write barrier". In the room, it was suggested that some filesystems clear this bit when writeback fails, but others thought that perhaps this behavior had been removed.
For the LRU flag, all that was said was: "Folio has been added to the LRU and is no longer in percpu folio_batch". The Head flag was described equally tersely as: "This folio is a large folio. It is not set on order-0 folios". The Waiters flag means: "Page has waiters, check its waitqueue. Only used by core code. Don't touch". For the Active flag: "On the active LRU list. Can be set in advance to tell kernel to put it on the right list".
When it came to Workingset, it seemed that nobody really knows what this flag means. Wilcox wrote down:
Set on folios in pagecache once readahead pages actually accessed. Set on LRU pages that were activated have been deactivated, treat refault as thrashing. Refault handler also sets it on folios that were hot before reclaimed used by PSI computation.
The Referenced flag means:
Per-folio flag. At least one page table entry has a accessed bit set for this folio. We set this during scan. Also set during buffered IO. Referenced first time, accessed second time. Used during reclaim to determine disposition (activate, reclaim, etc).
The flag named Owner_Priv_1 was described as: "Owner use. If pagecache, fs may use. Used as Checked flag by many filesystems. Used as SwapBacked flag by swap code". The final flag discussed in the session was Arch_1, with this result:
Many different uses depending on architecture. Often used as a "dcache clean" or, confusingly, as "dcache dirty". Check with your architecture. s390 uses it for basically everything.
Historically was used on a per page basis. Think we've eliminated all per-page uses now so should only be set on folios.
After the session, Wilcox posted the result on the linux-mm mailing list, where there have been a couple of follow-on comments. Whether this kind of whole-room documentation authoring will (or should) catch on remains to be seen; the information that was captured is more than was available before, but one might be forgiven for concluding that the use of these flags remains obscure for almost everybody.
Two sessions on CXL memory
Compute Express Link (CXL) is a data-center-oriented memory solution that, according to some in the industry, will yield large cost savings and performance improvements. Others are more skeptical. At the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, two sessions covered CXL and how it will be supported in future kernels.
CXL development
The first session, led by Adam Manzanares, covered the kernel's support for CXL in general. He started by saying that CXL is often mentioned in connection with memory tiering, but there is more to it than that. He would like to see more attention given to some of the other CXL-related code, such as the driver layer. CXL development is using the kernel.org patchwork server now, so it is easy for interested developers to see where the work stands.
Manzanares would especially like some help from developers with an understanding of the PCI bus. CXL, he said, is a bit of an awkward fit with the PCI core, so some effort is needed to make the pieces work well together.
He is also interested in reliability, availability, and serviceability (RAS) issues, and would like to talk with developers from other subsystems with experience in dealing with memory errors. Having a memory controller on a device complicates things, he said. He wondered why the CXL code does its own event handling rather than using the existing error detection and correction (EDAC) code.
Dan Williams answered that EDAC was invented to abstract the information about the memory controller; CXL is a standardization of that abstraction. So, in the future, the kernel will only need to understand CXL rather than EDAC, and other vendors will find themselves having to make their devices look more like CXL. He has been working on translating CXL events into the EDAC subsystem, which the RAS Daemon (the tool used to collect and report on error notifications) knows how to deal with. The RAS Daemon may be a legacy tool, but there is value in its ability to handle errors; there is, however, no desire to modify it to handle a new interface.
Hannes Reinecke pointed out that RAS Daemon is running in memory; what happens if a memory problem affects it? Williams answered that "if it kills the daemon, you lose". The result of the killing of the RAS Daemon will be a machine-check error, Manzanares said.
Williams said that there is ongoing work in defining a new scrub subsystem that is designed to proactively find memory problems. There is always a tradeoff between scrubbing frequency and performance. Both ACPI and CXL have mechanisms to handle scrubbing; EDAC does too. There are a lot of people independently solving the same problems, he said; it would be better if they worked together.
Turning to benchmarks, Manzanares said that it would be good to have a general agreement on a few workloads to run for performance measurements. Since he works for a CXL vendor, he said, he might not be the best person to be doing benchmarking; end users are better suited to that sort of task. The Open Compute Project might be a good home for this work; the newly formed tiering working group might be another. Williams echoed the need for good benchmarks; touching the memory-management code is hard, and developers never know when they are regressing somebody's workload.
The session concluded with a note that CXL is moving quickly. Hardware is currently hard to get, which does not make life easier for developers who are trying to support it. It would be good, Manzanares said, to have a central site where developers could report information about specific devices.
CXL compression
Normally, CXL memory is thought of as being voluminous and cheap, but with higher latency than normal DRAM. There is potential for other types of CXL memory as well, though. Presenting remotely, Yiannis Nikolakopoulos described the use of compression within CXL devices and how it might work with Linux.
In a conventional tiered layout, the top tier of memory lives in the host, while a lower tier is stored on a CXL device. The "densemem" concept extends that design by adding yet another CXL box as a third tier in the system. The address space on that box is oversubscribed — the box claims to have more memory than is actually installed. When data is written to that memory, it is compressed by the densemem box and mapped accordingly. The host is charged with managing this space and reacting to notifications about capacity changes; it can configure the size of the address space and the oversubscription factor.
Making this work requires the addition of a "backpressure" API that will inform the host about how much free space actually remains on the device. There are four watermark levels that can be established, and the host will be interrupted whenever usage passes one of them. The host can respond by delaying writes, but it can also take actions like changing the compression algorithm for better (but presumably slower) compression. The host can also defragment the device, or simply free memory.
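As a purely hypothetical sketch of how a host might react to those watermarks (every name below is invented; no such interface exists upstream):

    /* Invented watermark levels and helpers, for illustration only. */
    enum densemem_watermark { DM_WMARK_1, DM_WMARK_2, DM_WMARK_3, DM_WMARK_4 };

    static void densemem_throttle_writes(void) { }
    static void densemem_compress_harder(void) { }  /* denser, slower algorithm */
    static void densemem_defrag_or_free(void) { }

    /* Called when the device interrupts the host after usage crosses a
     * watermark. */
    static void densemem_pressure_event(enum densemem_watermark level)
    {
            switch (level) {
            case DM_WMARK_1:
                    break;                          /* plenty of space left */
            case DM_WMARK_2:
                    densemem_throttle_writes();     /* delay new writes */
                    break;
            case DM_WMARK_3:
                    densemem_compress_harder();
                    break;
            case DM_WMARK_4:
                    densemem_defrag_or_free();      /* reclaim device space */
                    break;
            }
    }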
Most of the upstream support for this hardware will run in user space, but there will be some kernel components too. Nikolakopoulos is working on a driver to expose the control knobs and give user space control over the device.
Davidlohr Bueso asked why developers should care about compression; it seemed to him to be a way to add latency to a technology that is already slow. Manzanares answered that compression is in the Open Compute Project specification; it is a desired feature, and it is not up to the kernel community to fight it. It is, in the end, a cost-saving measure, he said.
David Hildenbrand said that the only reasonable use for densemem is zswap; the kernel could be configured to use it while avoiding the overhead of page structures. The kernel would not have to manage the compression, it could just swap to the device. Williams agreed that it would be like zswap, but it could provide an additional advantage: since it is directly addressable, there would be no need to swap data back in to access it.
Matthew Wilcox repeated the complaint that CXL already has high latency, and that compression will make it worse. Williams answered that densemem is intended for cold memory; it is better to move that memory there than to swap it out to disk. Wilcox said that the PCI bus is intended for storage, not memory access; he agreed that using PCI-attached CXL memory as a swap device might be workable, though.
The session wound down at that point with Williams asking CXL vendors for a couple of features from this technology. One current problem is the lack of a good promotion signal — an indication that memory is being accessed and should be moved to faster storage. He also requested an interface to identify the least-compressible pages stored on the device; those could be migrated back to faster memory to free space on the densemem device.
The path to deprecating SPARSEMEM
The term "memory model" is used in a couple of ways within the kernel. Perhaps the more obscure meaning is the memory-management subsystem's view of how physical memory is organized on a given system. A proper representation of physical memory will be more efficient in terms of memory and CPU use. Since hardware comes in numerous variations, the kernel supports a number of memory models to match; see this article for details. At the 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit, Oscar Salvador, presenting remotely, made the case for removing one of those models.The SPARSEMEM model, he said, is suited to systems with sparse memory — systems with large gaps in the physical address space. The newer SPARSEMEM_VMEMAP model also works on such systems; it makes life easier for the higher layers by virtually mapping physical memory in a way that makes it appear contiguous. Salvador said that the time for SPARSEMEM has passed, and that it was time to consider removing it in favor of using SPARSEMEM_VMEMAP instead.
Michal Hocko immediately asked what the motivation was for the removal of SPARSEMEM. Salvador answered that it duplicates a lot of functionality with SPARSEMEM_VMEMMAP. When the latter was initially introduced, developers did not convert all SPARSEMEM systems out of concern for the extra memory used for the virtual mapping. SPARSEMEM_VMEMMAP also simply will not work on systems where the amount of physical memory exceeds the virtual address space. These concerns have abated in recent years, so a complete conversion to SPARSEMEM_VMEMMAP can be considered; it would allow the removal of a fair amount of code.
A participant agreed that the four architectures that only support SPARSEMEM — arm, mips, parisc, and sh — could be converted to SPARSEMEM_VMEMMAP, though parisc might be better served with the simpler FLATMEM model. David Hildenbrand worried, though, that SPARSEMEM_VMEMMAP could still be a problem for 32-bit architectures, which have limited virtual address spaces. Perhaps, he said, SPARSEMEM support could be dropped entirely for 32-bit systems; memory hotplugging, which had been one of the motivations for SPARSEMEM in the first place, is no longer supported there. Mike Rapoport, though, said that 32-bit Arm systems use SPARSEMEM to represent widely spread memory banks, a usage that is not related to hotplugging. Switching those systems to FLATMEM would require a lot of virtual address space that would have to come from the (already tight) vmalloc area.
Hocko asked what problems are caused by SPARSEMEM; one of them, it seems, is that SPARSEMEM complicates the addition of new hotplug features. He suggested just dropping hotplug support from SPARSEMEM, and not adding new features to it in general. Salvador, though, made it clear that he would rather remove the model entirely.
Rapoport said that Arm systems can support a "sparse FLATMEM" model that would allow them to reduce the address-space usage; perhaps the other 32-bit architectures could do the same. That is a question that the various architecture maintainers would have to answer.
Hocko concluded the session by saying that the removal could be a nice thing to try, since it would take out a lot of code. The first step would be to simply disable hotplug in the SPARSEMEM model. After that, it will be a matter of talking to architecture maintainers, trying to get each to move away from it.
A plan to make BPF kfuncs polymorphic
David Vernet kicked off the BPF track at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit with a talk about polymorphic kfuncs — or, with less jargon, kernel functions callable from BPF that use different implementations depending on context. He explained how this would be useful to the sched_ext BPF scheduling framework, but expected it to be helpful in other areas as well.
Alexei Starovoitov gave a talk later in the conference about the history of BPF, including the origin and motivation for kfuncs — stay tuned for an article on that. For now, knowing more about kfuncs is not really needed to understand Vernet's problem and proposed solution.
There are 151 kfuncs in the kernel as of version 6.9, so it should probably not be too surprising that they vary wildly. Some kfuncs, Vernet pointed out, are used for extremely common, basic functionality — such as the functions for acquiring and releasing locks. These kfuncs have the same meaning and implementation in every possible context, because what they do is fairly simple. Other kfuncs, however, can have context-specific semantics. Some may only have "any meaning at all [...] within specific contexts", Vernet said.
One example of this is the functions for manipulating dispatch queues — structures used in sched_ext to store lists of pending tasks. Vernet called them the basic building blocks of scheduler policy. One of the main functions for manipulating them from BPF, scx_bpf_dispatch(), always has the same meaning: adding a task to a dispatch queue. But when called from different BPF callbacks, there are subtle variations in how the function can be used.
When called from a select_cpu() or enqueue() callback, scx_bpf_dispatch() cannot drop the run-queue lock relevant to the task, and can only dispatch tasks to the CPU that triggered the call. Furthermore, only tasks that are being woken or enqueued can be dispatched.
In contrast, when called from a dispatch() callback, scx_bpf_dispatch() is free to drop the run-queue lock, dispatch to any CPU, and dispatch multiple tasks. The difference is that dispatch() is called by a CPU that is about to otherwise go idle, and so there is no existing work on the CPU that needs to be carefully worked around.
In both cases, scx_bpf_dispatch() presents the same logical API, but the differing constraints mean that the implementation in these two cases is quite different. Right now, the code tracks which case it is in with a per-CPU variable, and then uses that to choose which implementation to use. "So you can work around it," Vernet admitted, but he wanted to see if the implementation could be better.
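A minimal BPF-side sketch of the two contexts follows; the kfunc and constant names are taken from the out-of-tree sched_ext patches as they stood at the time, while the callback names and EXAMPLE_DSQ are invented for illustration:

    #include <scx/common.bpf.h>     /* header shipped with the sched_ext patches */

    #define EXAMPLE_DSQ 0           /* invented user-defined dispatch-queue ID */

    /* enqueue(): may only dispatch the task being enqueued, and only to the
     * local CPU's queue. */
    void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
    {
            scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
    }

    /* dispatch(): this CPU is about to go idle, so tasks can be pulled from
     * a shared dispatch queue and run here. */
    void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
    {
            scx_bpf_consume(EXAMPLE_DSQ);
    }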
Vernet's proposal
Right now, every kfunc is associated with a BPF Type Format (BTF) ID. This is an ID used to represent the kfunc in the debugging information for the BPF program, but it is also used with the BPF instruction that calls a kfunc to indicate which one it wants to invoke. When the BPF program is loaded and then just-in-time compiled, the BTF IDs get resolved, and the resulting code can call them directly.
Vernet suggests extending this mechanism by having the BPF verifier support multiple kfuncs with the same ID — whenever it encounters a call to a kfunc, it would ask the subsystem associated with that kfunc ID what the real kfunc should be (using a new callback). The subsystem would then reply with a "concrete" kfunc ID, and loading would proceed in the same way. This approach moves the tracking of the context of a call from run-time to load-time, and eliminates the need for tracking the state in a per-CPU variable.
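Nothing like this exists today; as a purely hypothetical illustration of the proposed mechanism, the subsystem-side callback might look something like:

    /* Hypothetical; all names here are invented. */
    struct bpf_kfunc_resolver {
            /*
             * Called by the verifier when it sees a call to a "generic"
             * kfunc BTF ID registered by this subsystem. The subsystem
             * examines the program (for sched_ext, which struct_ops
             * callback it implements) and returns the concrete kfunc
             * BTF ID to link against at load time.
             */
            u32 (*resolve_kfunc)(u32 generic_btf_id,
                                 const struct bpf_prog *prog);
    };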
Vernet said that the advantage of this approach is the ergonomic API it presents, and the control it gives subsystems over how their kfuncs can be called. But the approach does have its drawbacks. For one thing, adding additional callbacks in the verifier threatens to make one of the most complicated parts of BPF even more so. For another, it would use load-time logic for what is really a static configuration — if the compiler understood the different contexts that the kfuncs care about, the correct kfunc implementation could be chosen at build-time.
A build-time configuration would be nicer, Vernet stated, but it would be "kind of a pain in the neck to implement". He suggested that implementing it statically was probably not a high priority. Vernet did think any mechanism for polymorphic kfuncs would probably be useful to areas of the kernel other than sched_ext.
Discussion
The other attendees had questions about Vernet's proposal. One member of the audience pointed out that there is already a similar mechanism for BPF helper functions (a different kind of kernel function callable from BPF programs, with a different interface), and asked that Vernet "look at this more holistically". Vernet replied that the equivalent aspect of helper functions lets the implementations differ depending on the BPF program type — so the same helper function can be implemented differently for a BPF program attached to a trace point or registered as a callback. But that approach won't work for his use case, because the program types in question are not sufficiently granular. As far as the verifier is concerned, all of the callbacks involved in sched_ext are of the same type, because they are all struct_ops programs (a mechanism where different parts of the kernel can define a struct full of function pointers to which BPF programs can be attached). He wants to be able to handle calls from different struct_ops programs differently — which almost certainly requires information the verifier doesn't have, since it is the other subsystems or modules which define struct_ops callbacks that would know which functions should be handled differently.
The discussion went back and forth a little bit, with the other attendee trying to identify ways that the mechanism could be generalized beyond struct_ops programs. Vernet agreed that "if we can abstract it that would be much better for sure," but didn't seem to think that the existing helper mechanism was a suitable basis for that.
Another member of the audience asked whether it would be possible to have kfuncs that behave differently based on the type of their arguments. The motivating use case would be to enable different data types being inserted into a BPF map to be handled differently. "I want to skip the ownership check when the argument is an sk_buf", they explained. Vernet agreed that this would be technically feasible, since the verifier knows the types of the arguments to the kfunc. The question, in Vernet's eyes, is whether this mechanism would be confusing.
The first participant in the conversation suggested that this use case could be served just by adding new kfuncs and letting the developer use the right one. The second commenter pushed back, saying that they did not want to introduce many new kfuncs for what is effectively the same behavior — especially not when it seems likely that there will be more types that need special handling to keep in maps in the future. Vernet agreed that it makes sense to give kfuncs the flexibility to decide what they want to do.
That was the end of the discussion at the time, so it remains to be seen whether the proposal will be adopted, and if so in what form.
Virtual machine scheduling with BPF
Vineeth Pillai gave a remote talk at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit explaining how BPF could be used to improve the performance of virtual machines (VMs). Pillai has a patch set designed to let guest and host machines share scheduling information in order to eliminate some of the overhead of running in a VM. The assembled developers had several comments on the design, but seemed overall to approve of the prospect.
VMs have a variety of potential performance footguns, but a particularly persistent one is "double scheduling". When using KVM, the implementation of a virtual machine hypervisor in the Linux kernel, virtual CPUs correspond to threads. This means that the host system's scheduler will assign the thread for a given virtual CPU to a physical CPU, and then the guest system's scheduler will assign threads to those virtual CPUs. This results in a certain amount of unavoidable overhead just from running two schedulers, but it also increases the amount of jumping around between physical cores that processes on the guest need to tolerate.
This problem can be partially mitigated using CPU pinning, but that is a manual solution that still doesn't address the more subtle aspect of double scheduling: that useful information is lost between the two schedulers. Pillai and his collaborator Joel Fernandes have been working on a solution that allows the guest and host to share scheduling information, allowing the host scheduler to make more intelligent decisions about where to put vCPU threads and how to schedule them.
To make this work, their proposed system would use memory shared between the guest and the host. The guest runs a pvsched driver that allocates the necessary memory and shares it with the host. The driver then streams relevant scheduling information into that memory, and reads any information that the host wants to provide in return. The most recent version of the patch set is version 2, published in April, but Pillai is already working on a version 3 to address comments from the KVM maintainers.
On the host side, this scheme is integrated into the scheduler using BPF. The BPF program reads information from KVM, including the PIDs and assigned physical CPUs of the virtual CPU threads, and the location of the guest's shared memory, from a BPF map. The BPF program can then make scheduling decisions, and call hooks in the scheduler to override its decisions about how to schedule the virtual CPUs, Pillai said.
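The patch set's actual data layout was not described in the session; purely as an illustration of the kind of BPF map the host-side program might read, something like the following (all names invented) would carry the per-vCPU information mentioned above:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical per-vCPU record; the real patches may differ. */
    struct vcpu_info {
            __u32 pid;              /* host PID of the vCPU thread */
            __u32 phys_cpu;         /* physical CPU it currently runs on */
            __u64 shared_mem_addr;  /* location of the guest's shared area */
    };

    struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u32);                     /* vCPU index */
            __type(value, struct vcpu_info);
    } vcpu_map SEC(".maps");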
David Vernet asked whether it would make sense to define a (supposedly immutable) user-space API around the pvsched driver, or whether it would make sense to do communication wholly over a BPF channel. BPF interfaces are not considered part of the kernel's API stability promises — but the KVM interface to guest VMs is. Pillai responded that the idea of using a BPF-to-BPF channel makes sense. Vernet later suggested adding a new BPF map type that goes directly between the host and the guest. Pillai concurred with the idea of a guest-to-host map type.
Pillai said they did have one question about the design for the assembled developers — should their patch set use struct_ops callbacks or raw tracepoints to hook into the KVM subsystem? Vernet questioned whether Pillai was proposing calling kfuncs (to manipulate the scheduler) from inside a tracepoint. Pillai agreed that he was. Steven Rostedt pointed out that calling kfuncs from some tracepoints could deadlock the scheduler, so you would need some kind of allowlist of which tracepoints could be used this way.
Vernet agreed, suggesting that you could use a per-CPU variable to check whether the BPF function associated with a tracepoint was being called from one of the allowed locations. Rostedt responded by asking whether this was something that the verifier could check. Vernet indicated that this was not yet possible — and that it was an example of the need for more granularity around deciding how kfuncs can be called, as he suggested in his earlier session on polymorphic kfuncs.
Rostedt pointed out that an advantage of using tracepoints is that there would be no need to add anything to the KVM subsystem to support it, since vmm_enter() and vmm_exit() (functions that bracket any code being run in the virtual machine) already have tracepoints. Pillai clarified that those tracepoints are too late for their purposes. Rostedt suggested that it could make sense to ask the KVM maintainers whether those could be moved.
The audience had some concerns about the entire idea of opening up the ability to override the scheduler in this way. Rostedt noted that once the possibility exists for BPF to change the scheduler properties of a thread, there will be more uses for that than just this KVM change. Vernet said that this was a question for the scheduling folks, pointing out that user space can already do it "but they probably don't want BPF setting scheduler knobs".
Despite some questions about the implementation, everyone seemed receptive to the idea of eliminating the double-scheduling problem. When Pillai finishes the third version of the patch set, we will see whether the KVM and scheduler maintainers feel the same way.
The interaction between memory reclaim and RCU
The 2024 Linux Storage, Filesystem, Memory-Management and BPF Summit was a development conference, where discussion was prioritized and presentations with a lot of slides were discouraged. Paul McKenney seemingly flouted this convention in a joint session of the storage, filesystem, and memory-management tracks where he presented about 50 slides — in five minutes, twice. The subject was the use of the read-copy-update (RCU) mechanism in the memory-reclaim process, and whether changes to RCU would be needed for that purpose.
Readers who are unfamiliar with RCU may want to have a look at this article for a refresher.
After the slide deluge (for which it was not possible to take effective notes without severe keyboard damage), McKenney got to his real point: before making complicated changes to the RCU subsystem (which does not lack complexity already), a real problem with the current code will need to be demonstrated. The concern seems to be that RCU is simply too slow in getting around to freeing memory, causing the system to go into an out-of-memory state. What can be done about that?
The question of whether RCU can keep up with the work presented to it is, he said, dependent on the workload. There can be a few reasons why it would fail; perhaps the grace periods (the delay before RCU callbacks can be run) are too infrequent, or there may be readers holding the RCU lock for too long. Kent Overstreet tried to give some background for the current topic, which is a spin-off from a lengthy pre-conference discussion on buffered-I/O performance. The buffered read path is fast now, he said, but it can be made faster by using RCU. But that means using RCU to free page-cache pages, which is a critical cleanup path in the kernel.
McKenney suggested that, as an experiment, developers could try forcing page-free operations through RCU for no particular reason other than to see if anything breaks. Josef Bacik said that, while there are places where RCU can be improved, this use case is pushing for a solution to a problem that is not RCU's fault. Writeback of page-cache pages can take a long time; page reclaim is an unpredictable process in general.
Overstreet agreed that reclaim is a hard problem, and that a lot of different developers have responsibility for parts of it. He is a filesystem developer who finds himself having to solve reclaim problems, but the kernel lacks the sort of introspection that would help him to see where the problems are. Thus, he said, there is a need for a wider discussion about the interactions around the reclaim problem.
James Bottomley asked whether it was appropriate to use RCU in this way; perhaps there is a need to invent a new mechanism instead? McKenney answered that, instead, developers could use a different flavor of RCU, such as sleepable RCU. Steven Rostedt asked whether a new RCU flavor aimed at the reclaim problem is needed, but McKenney said he did not think that was the case.
It was this far into the session before Dave Chinner got up to ask what the problem to be solved was. The short answer is "making the buffer cache faster". Matthew Wilcox said that taking folio references for small reads is simply too expensive; RCU can be used instead to keep pages around while data is copied from them without the need to take a reference. McKenney suggested that perhaps hazard pointers could be used for this purpose. That would allow the immediate freeing of any object that is not currently referenced; RCU, instead, must wait for all readers to complete their work.
Bottomley said that the reference-count problem comes down to the cost of converting cache lines to exclusive access. If there is not actually a lot of contention for those reference counts, perhaps a different solution is called for. Overstreet answered that, even in the no-contention case, the reference-counting overhead is a problem; Wilcox suggested that Bottomley was underestimating the number of places in the kernel that take references.
McKenney tried to direct the conversation toward an understanding of the performance problem; Overstreet answered that better numbers are needed. He would like to be able to track just how much memory is waiting in the RCU system to be freed. McKenney answered that, while kvfree_call_rcu() is aware of the size of the memory block it has been asked to free, it is used infrequently. Most memory is freed using call_rcu(), and that function has no idea of how much memory it will eventually free (or whether it is freeing memory at all). There is also no per-subsystem accounting in RCU. Hannes Reinecke said that he would like to see subsystem-level accounting, along with the ability to force a grace period for a specific subsystem. The problem there, as somebody pointed out, is that the ability to free a specific range of memory may depend on other subsystems, and there is no way to know for sure.
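To illustrate the distinction with a generic sketch (not code from any of the subsystems discussed): with call_rcu(), RCU sees only an opaque callback and cannot know how much memory, if any, will eventually be freed, while kfree_rcu() hands it the object directly:

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            struct rcu_head rcu;
            /* ... payload ... */
    };

    static void foo_free_cb(struct rcu_head *head)
    {
            /* Only this callback knows that memory is being freed at all. */
            kfree(container_of(head, struct foo, rcu));
    }

    static void foo_release(struct foo *f)
    {
            call_rcu(&f->rcu, foo_free_cb); /* opaque to RCU */
            /* kfree_rcu(f, rcu); */        /* alternative that RCU can account */
    }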
Chinner said that this is a problem of tracking objects in flight. It is possible to count slab objects, since they know which slab they belong to and their size; it's just a matter of adding the tracking. Calls to kfree_rcu() could recognize slab objects and account for them. McKenney said that he would like to see kfree_rcu() merged into the slab allocator; slab maintainer Vlastimil Babka said that he had plans to do exactly that. Now that the SLOB allocator has been removed, he said, kernel code can pass any memory pointer to kfree() (and thus kfree_rcu()) and the right thing will happen.
As this somewhat inconclusive session came to a close, McKenney said that there were two problems to be solved. If the system is loaded with memory demands, how are those to be accounted for? And, for memory freed with call_rcu(), more information will need to be provided somehow. Overstreet got in the last word by saying that, if a kernel subsystem is using call_rcu(), the duty of performing the accounting is also there. kfree_rcu() should be used instead whenever possible.
Supporting larger block sizes in filesystems
In a combined storage and filesystem session at the 2024 Linux Storage, Filesystem, Memory Management, and BPF Summit, Luis Chamberlain led a discussion on filesystem support for block sizes larger than the usual 4KB page size, which followed up on discussion from last year. While the session was meant to look at the intersection of larger block sizes with atomic block writes that avoid torn (partial) writes (which was also discussed last year), it mostly focused on the filesystem side. Over time, the block sizes offered by storage devices have risen from the original 512 bytes; Chamberlain wanted to discuss filesystem support for block sizes larger than 4KB.
Chamberlain started by saying that he wanted to gauge the interest of filesystem developers in adding large-block support to their filesystems; in order to do so, a filesystem needs to be able to support large folios. The next obvious targets for this work are bcachefs and EROFS. Matthew Wilcox said that the large-folio support for EROFS is mostly done at this point, though there are a few places where it still uses struct page, for decompression in particular. For supporting large block sizes, EROFS is ready, he said, but the full folio-conversion job is not yet complete.
Adding this support will require a lot of testing, Chamberlain said; beyond that, fstests has some baked-in assumptions about block size that need to be fixed. Some of those problems were found when testing with page sizes larger than 4KB, so they have been fixed at this point, but others may be lurking. He warned that filesystems with their own test suites may also have those kinds of assumptions.
Damien Le Moal said that zonefs developers are also interested in adding support for large block sizes. There are no fstests for zonefs, which Chamberlain suggested would be a useful addition to the suite. But Wilcox noted that zonefs uses iomap, so there is probably little work that needs to be done. It is mostly a matter of making a few calls to tell iomap that zonefs wants to use large folios. Le Moal said that large-folio support was being actively worked on for zonefs and was close to being ready.
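For an iomap-based filesystem, the "few calls" in question amount to opting each file's page cache into large folios; a minimal sketch follows (whether this is all that zonefs needs was not covered in the session):

    #include <linux/pagemap.h>

    /* Opt a file's page cache into large folios at inode-setup time. */
    static void example_setup_inode(struct inode *inode)
    {
            mapping_set_large_folios(inode->i_mapping);
    }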
Iomap only handles the data path, Jan Kara pointed out, not the metadata path, which will still need more work for large blocks. Chamberlain agreed that was an outstanding problem. He wondered if filesystem developers even wanted to tackle it, because if they did not, there was not much point in going down the large-block path.
Josef Bacik said that Btrfs is backward from other filesystems; by default it uses 16KB blocks for metadata, so it is the data path that needs converting. The plan is for Btrfs to switch to using iomap, then to turn on support for large folios "and pray". At that point, though, Btrfs should be able to use larger blocks. The iomap conversion is in progress, with direct I/O working now; buffered I/O is next. The support for the metadata follows what XFS has done, Bacik said.
That led Dave Chinner to suggest that the XFS buffer-cache code be turned into a library that other filesystems can also use. Iomap came out of a similar process. Chamberlain wondered what other filesystems could benefit. Chinner said that any filesystem that uses the (deprecated) buffer heads API could benefit; the XFS buffer cache can support up to 64KB block sizes and already uses large folios. It could be pulled out of XFS, as it is already fairly generic; it is a wrapper around the page and slab allocators that provides compound buffers, which are made up of multiple discontiguous blocks but treated as a single contiguous buffer. Filesystems like ext4 that use buffer heads could be adapted to use this buffer cache in a fairly straightforward way.
Neal Gompa thanked the developers working on making it easier to support larger block sizes in more filesystems, in part because he works on different distributions. He has encountered lots of problems when a filesystem created on a distribution with one page or block size is then used on a distribution that made different choices. But the terms he used for larger groups of pages, "superblock" or "superpage", were not popular; James Bottomley said that "superblock" was confusing because of its long-established use for filesystems, while Wilcox pointed out that a superpage should simply be called a folio.
Ritesh Harjani asked about the benefits of supporting larger block sizes in filesystems, apart from the portability considerations. Chamberlain said that the hardware vendors are driving the move to larger blocks, but that he wanted to stick with the software side. He thinks larger blocks will help reduce file fragmentation, but deferred to the filesystem developers in the room.
Darrick Wong said that he would actually like to get rid of discontiguous buffers for XFS because they are difficult to work with and to test, since they "cause all sorts of weird bugs to show up" in fstests. His advice to the other filesystem developers is: "try not to do that". It is not truly desirable to have metadata scattered in memory that way anyway.
He has some patches for fs-verity support for XFS "stuck in the three-mile freight train of everything that's in my development tree that's blocking traffic all over the city". As part of that, he found a need for a buffer cache, so he reused some of the XFS code for it. That work could be used as the basis of a new library for filesystem metadata handling as Chinner had suggested. He is trying to figure out how to integrate ("staple") that work onto the jbd2 journal layer; doing that would mean that ext4 could use it, but that also requires porting OCFS2 to use the new buffer cache. Since he believes no one actually uses OCFS2, perhaps the filesystem could just be deprecated instead.
Before even setting up iomap, though, there needs to be a mechanism to read from the disk, Hannes Reinecke said. Chamberlain suggested using iomap to read that data, but Reinecke insisted that it cannot read the data for the superblock, from which iomap can be configured. That requires buffer heads. But Chamberlain said that block-device operations can be used to retrieve the needed superblock data, thus buffer heads were not required. He agreed that more discussion on that was needed, however.