Leading items
Welcome to the LWN.net Weekly Edition for May 19, 2022
This edition contains the following feature content, once again dominated by coverage from the 2022 Linux Storage, Filesystem, Memory-Management, and BPF Summit:
- The netfslib helper library: a relatively new library to collect up common operations for network filesystems.
- Dynamically allocated pseudo-filesystems: a discussion on the right path for adding a general facility that would allow pseudo-filesystems (e.g. tracefs, debugfs) to reduce their memory footprint by allocating inodes and directory entries only when needed.
- Bringing bcachefs to the mainline: The bcachefs filesystem may be getting close to ready for merging.
- Snapshots, inodes, and filesystem identifiers: Filesystems that support snapshots, and thus can have duplicate inode numbers, can be problematic.
- Unique identifiers for NFS: how to create and manage unique IDs needed by NFS.
- Solutions for direct-map fragmentation: a number of new technologies need to carve pages out of the kernel's direct map; how can that functionality be supported without hurting performance?
- Merging the multi-generational LRU: an extended discussion concluded that the time has come to merge this huge change, but some open questions remain.
- CXL 1: Management and tiering: the first three sessions on the Compute Express Link and how it should be managed by the Linux kernel.
- Proactive reclaim for tiered memory and more: there are reasons to want to reclaim memory in a more proactive manner, but it is not clear how any such feature should be controlled.
- Sharing page tables with mshare(): page tables are not normally shared between processes, which can lead to massive overhead in situations where memory is highly shared. This session discussed a proposal for a new system call to enable page-table sharing between cooperating processes.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Finally, note that LWN is hiring. This is your chance to write for one of the best reader communities on the net.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
The netfslib helper library
A new helper library for network filesystems, called netfslib, was the subject of a filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). David Howells developed netfslib, which was merged for 5.13 a year ago, and led the session. Some filesystems, like AFS and Ceph, are already using some of the services that netfslib provides, while others are starting to look into it.
Howells launched right into netfslib and some of its features without much in the way of a high-level introduction to the library. His topic proposal email does some of that, however:
I've been working on a library (in fs/netfs/) to provide network filesystem support services, with help particularly from Jeff Layton. The idea is to move the common features of the VM interface, including request splitting, operation retrying, local caching, content encryption, bounce buffering and compression into one place so that various filesystems can share it. This also intersects with the folios topic as one of the reasons for this now is to hide as much of the existence of folios/pages from the filesystem, instead giving it persistent iov iterators to describe the buffers available to it.
Goals
The basic goal, he said in the session, is to get the virtual-memory (VM) handling out of the network filesystems and into a common library. The library sits between the memory-management subsystem and the filesystem and handles all of the address-space operations, except, perhaps, for truncation. All of the folio handling will go into the library as well. Local caching is done there too, which allows the cache to use multi-page folios more easily.
Netfslib will allow for content encryption, which is distinct from transport encryption; a client can access the content of its files locally, without the server having any way to do so because the content is encrypted. This means that the local cache should only have encrypted file data; the client will decrypt it on read operations and encrypt it on write operations. Keeping the decrypted data out of the cache helps ensure that losing your laptop does not mean someone can access the contents of those files, he said.
It is easier to do all of that handling in one place and give all network filesystems access to the same services. To get the content encryption part working, he had to add buffering capabilities to netfslib, so it can handle read, modify, and write operations: it can issue a read to the file server, allow modifications to the data, then write it back. The write will not necessarily be using data in the page cache, he said; the library can do a large batch of writes directly to the server from memory, and then remove the data from memory.
The library allows network filesystems to get rid of all knowledge of pages or folios in their code, he said. The library uses hooks for two operations: asynchronous read and write. Those hooks are passed iov_iter structures, which point to data stored using a variety of mechanisms, "maybe in a bvec, maybe in an XArray, maybe in the page cache", and the filesystem does not need to know which it is. The library can thus handle direct I/O, encrypted direct I/O, and buffered I/O (possibly with encryption); all of that is working, he said.
There are two functions that network filesystems have to provide if they want to support content encryption: functions to encrypt and decrypt blocks. The idea is that filesystems that use fscrypt, as Ceph is looking at doing, can simply point the hooks at fscrypt. The fscrypt information will simply be stored in the inode, he said.
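The overall shape of those hooks can be pictured with a short sketch. The structure and member names below are hypothetical stand-ins, not the actual netfslib definitions, which differ in naming and detail:

    /*
     * Illustrative only: the shape of the operations a filesystem
     * hands to netfslib. These names are hypothetical.
     */
    struct netfs_request_sketch;    /* request state; carries an iov_iter */

    struct netfs_ops_sketch {
            /* Asynchronous I/O: the request's iov_iter describes the
             * buffer (bvec, XArray, page cache, ...); the filesystem
             * moves bytes to or from the server without needing to
             * know which form the buffer takes. */
            void (*issue_read)(struct netfs_request_sketch *rreq);
            void (*issue_write)(struct netfs_request_sketch *wreq);

            /* Content encryption: a filesystem that uses fscrypt can
             * simply point these at fscrypt routines. */
            int (*encrypt_block)(struct netfs_request_sketch *req,
                                 loff_t pos, size_t len);
            int (*decrypt_block)(struct netfs_request_sketch *req,
                                 loff_t pos, size_t len);
    };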
Beyond that, netfslib also uses a hook for readahead that can handle filesystems with complicated requirements. He gave the example of Ceph, which has 2MB blocks for its files and those blocks may be scattered around on different servers. The readahead hook can queue up multiple blocks, from multiple servers, then issue all of those reads at once. Or they can be dispatched in order, which is a feature the CIFS filesystem needs, he said; the library effectively provides some basic queueing services.
Other support?
Steve French asked about compression support; many of the network filesystems can do compression over the wire to reduce the bandwidth required. Howells said that he is working on making that available as well. It is a bit tricky to do, he said, because the compression block size is usually bigger than the page or folio size. Since there are different compression schemes used by the filesystems, there will need to be hooks for compressing and uncompressing.
Amir Goldstein asked about support for directory caching. Howells said that he had some patches to support AFS directory caching, but AFS directories are just blobs that get passed back and forth. He can look at adding directory information caching, where the directory entries are read from the server and stored in some standard format locally.
Josef Bacik asked about the eventual goal: is it to replace a bunch of code in NFS, Ceph, CIFS, and others? Howells agreed that was the goal; the Plan 9 filesystem (9P) is another target and he has been asked about FUSE. Goldstein said that FUSE would make sense and should be converted.
Bacik continued by wondering about the status of this work. Howells said that the read helpers are all working and that AFS, Ceph, and 9P are using them; he has patches for CIFS, which were tested and did not seem to have any performance impact. He is working on the write helpers, and they are mostly working, other than truncation support, which is up next. The write helpers might get added to the mainline in the next merge window, though that may be a bit tight timing-wise. Bacik asked if the overall goal was simplification; Howells said that it was, and he has already been able to remove around 8000 lines of code.
Chuck Lever asked about support for direct placement of data; it is important for CIFS, NFS, and 9P, so he wanted to know what Howells planned to do for RDMA transports. Howells said that he had not really looked at it much and did not have hardware to test with, though he thought he could probably come up with some. Lever said that hardware was not needed, since there are two software RDMA drivers in the kernel that work with standard Ethernet cards. Howells said that he would look into it and Lever said that he was volunteering to help; "it's not as bad as you think". With a chuckle, Howells said: "I've heard that before."
On the chat, Layton said that he did not see any reason that netfslib could not add that RDMA support. Howells said that when doing buffered reads and writes using the page cache, netfslib hands off an iov_iter with the page cache pages in it to the network filesystem. Similarly, direct I/O reads and writes simply get an iov_iter. Presumably, the network filesystem will do whatever is needed to do RDMA to or from those pages, he said. Layton agreed with that.
Bacik said that he thought that the netfslib work was a good start, though there were some things, like RDMA and FUSE, that would need to be looked at before too long. Converting network filesystems to use netfslib is probably a more pressing concern. Howells (and the rest of the room) seemed to agree with that.
Dynamically allocated pseudo-filesystems
It is perhaps unusual to have a kernel tracing developer leading a filesystem session, Steven Rostedt said at the beginning of such a session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). But he was doing so to try to find a good way to dynamically allocate kernel data structures for some of the pseudo-filesystems, such as sysfs, debugfs, and tracefs. Avoiding static allocations would save memory, especially on systems that are not actually using any of the files in those filesystems.
Problem
He presented some statistics on the number of files and directories on one of his systems in /sys, /proc, /sys/kernel/tracing (the usual mount point for tracefs), and /sys/kernel/debug (debugfs). In all, he found 29,384 directories and 290,807 files. That's a lot of files, but, he asked, why should he care about that? To answer that, he noted that at one point, he had suggested that Alexei Starovoitov use tracing instances, which add another set of ring buffers for trace events and add a bunch of control files in tracefs. But Starovoitov tried that and complained that new instances used too much memory. The ring buffers are fairly modest in size, a bit over a megabyte per CPU, so Rostedt dug in a bit deeper. It turns out that whenever another instance gets added to tracefs, it adds around 18,000 files. Adding up the in-memory size of the inodes and directory entries (dentries) shows that 14MB is consumed for each tracing instance that gets added; at several hundred bytes for each inode/dentry pair, 18,000 files add up quickly.
Looking beyond that, /sys consumes 42MB and /proc uses a whopping 202MB for these in-memory inodes and dentries, he said. But David Howells pointed out that /proc does not keep dentries and inodes around. Rostedt said that if he can use the same technique as procfs, "my talk is over". Ted Ts'o cautioned that it was a procfs-specific hack that had never been generalized, though Howells thought that perhaps it could be.
On the other hand, Chris Mason looked at a Meta production server to see what its /proc looked like; a find from the root took multiple minutes and pegged the CPU at 100%, turning up 31 million files. He suggested that the procfs-specific hack "might not be the right hack" to use.
Christian Brauner said that since tracefs is its own filesystem, the procfs technique could simply be used there. But Rostedt was adamant that he did not want a hack just to fix the problem for tracefs; he wanted to find a proper solution that could be generalized for others to use. There should be a generic way for any pseudo-filesystem to opt into a just-in-time mode, where the inodes and dentries are allocated when the files and directories are accessed.
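As a rough illustration of what such a just-in-time mode implies, a pseudo-filesystem can defer inode creation to its lookup operation, building VFS objects from a compact internal description only when a file is first accessed. In the sketch below, the VFS entry points (new_inode(), d_splice_alias()) are real, but jit_entry and jit_find_entry() are hypothetical helpers:

    /* Minimal sketch of just-in-time inode allocation in a
     * pseudo-filesystem's lookup operation. */
    struct jit_entry {
            umode_t mode;
            const struct file_operations *fops;
    };

    static struct dentry *jit_lookup(struct inode *dir,
                                     struct dentry *dentry,
                                     unsigned int flags)
    {
            struct jit_entry *entry;
            struct inode *inode;

            /* Consult a compact internal table instead of keeping a
             * dentry/inode pair pinned for every control file. */
            entry = jit_find_entry(dir, &dentry->d_name);
            if (!entry)
                    return d_splice_alias(NULL, dentry); /* negative */

            inode = new_inode(dir->i_sb);
            if (!inode)
                    return ERR_PTR(-ENOMEM);
            inode->i_mode = S_IFREG | entry->mode;
            inode->i_fop = entry->fops;

            /* The VFS caches the result; under memory pressure the
             * dentry and inode can be reclaimed and rebuilt later. */
            return d_splice_alias(inode, dentry);
    }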
eventfs
Rostedt noted that Ajay Kaher gave a presentation at the 2021 Linux Plumbers Conference (LPC) on eventfs, which dynamically allocates the dentries and inodes for all of the tracing events that appear in tracefs. It is a kind of sub-filesystem for tracefs to handle the event files dynamically so that new instances do not consume so much memory. It only does the dynamic allocation for the events, and not for the other control files that appear in tracefs, Rostedt said. He did some testing with and without eventfs and found that it made a huge difference. Creating a new instance without eventfs used around 11MB extra, while doing that with eventfs only used about 1MB. At LPC, some attendees said that the feature is something that should be added as an option for all pseudo-filesystems, which is what brought Rostedt to LSFMM. He wanted to get a sense for the best way to accomplish this goal and to figure out what the internal API would look like.
In particular, since the event dentries and inodes are only present while they are being used, at least in eventfs, he is concerned that the API needs to have a way to keep them in memory while a trace involving them is running. The worry is that memory pressure could cause eventfs to be unable to create the file to disable the event. David Howells suggested that an emergency pool could be used to handle that particular problem.
Brauner asked which API was used for tracefs; did it use the sysfs API, for example? Rostedt said that tracefs has its own API and is completely separate from any of the other pseudo-filesystems. Tracefs came about because people wanted tracing information available on production systems but did not want to build debugfs into them. So, at Greg Kroah-Hartman's suggestion, Rostedt started with the debugfs code and turned it into tracefs.
Since tracefs has its own API, and does not rely on sysfs or kernfs, for example, that gives it more leeway to define an API for the just-in-time feature without having to convert the others, Brauner said. He thinks it will be difficult to come up with something that could be shared between tracefs and procfs, however, because procfs is so special.
Rostedt said that perhaps tracefs "could be the guinea pig" for the feature, then other filesystems could convert over in time if that was seen as useful. He too wonders if procfs is too special to fit in, however. Mason's concern about procfs being slow because it creates its entries on the fly may also mean that other filesystems will not want the feature. Howells said with a chuckle that if Rostedt wanted to thoroughly test the feature, "putting it in procfs would be one good way to do that".
Approach
Currently eventfs covers just a portion of the control files in tracefs; Rostedt would like to handle all of the tracefs files that way. But the feedback he has gotten from virtual filesystem (VFS) layer developers is that this should not be done solely for tracefs, so he was wondering what the right approach would be.
Amir Goldstein asked if Rostedt had talked with Kroah-Hartman to see if he would be interested in this feature for debugfs. It would seem that debugfs might also benefit from it. Rostedt said he had not asked Kroah-Hartman about that. But Brauner said that debugfs and sysfs have an ingrained idea that it is the responsibility of the creator of the directories and files to clean them up, which is different from the centralization in eventfs (or something along those lines); it might be difficult to rework those other filesystems to use a different model.
Rostedt is also concerned about race conditions and lock-ordering problems, based on his review of the eventfs code. Howells said those kinds of problems "have all been pretty well sorted in procfs". Processes come and go, as do their entries in procfs, even if they are being used. Procfs has its own structure that describes just the pieces it needs, he said, and it creates dentries and inodes on demand. It already deals with the problem of the process directory going away when the process does, though files in that subtree may still be open.
Rostedt wondered whether he should continue working on eventfs with Kaher or if they should drop that and try to make it work for all of tracefs. Eventfs might make a good test case for where the problem areas are. Brauner asked if there were other users who wanted this functionality, which might help guide which way to go. Howells reiterated the idea that procfs might provide the best model to look at since it already handles many of the same kinds of problems.
Overall, Rostedt said that he was not hearing anyone argue that he should not continue working on the idea. In addition, he said that he now has some good ideas of what code to look at as well as names of people to ask questions of. Patches are presumably forthcoming once he and Kaher determine the path they want to pursue.
Bringing bcachefs to the mainline
Bcachefs is a longstanding out-of-tree filesystem that grew out of the bcache caching layer that has been in the kernel for nearly ten years. Based on a session led by Kent Overstreet at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), though, it would seem that bcachefs is likely to be heading upstream soon. He intends to start the process toward mainline inclusion over the next six months or so.
Overstreet is often asked what the target use cases for bcachefs are; "the answer is everything". His longstanding goal is to be "reliable and robust enough to be the XFS replacement". It has been a few years since he last gave an update at LSFMM, so he began by listing the features and changes that have been added.
Support for reflinks, effectively copy-on-write (COW) links for files, has been added to bcachefs. After that support was added, Dave Chinner asked him about snapshots; he had been avoiding implementing snapshots but some reworking that he did on how bcachefs handles extents made it easier to do so. He added snapshot support and there are no scalability issues; he has done up to a million snapshots on test virtual machines without any problems. Snapshots in bcachefs have the same external interface as Btrfs (i.e. subvolumes), though the internal implementation is different.
More recently, the bcachefs allocator has been rewritten. Bcache, which is the ancestor of bcachefs, had some "algorithmic scalability issues" because it was created in the days when SSDs were around 100GB in size. But he has bcachefs users on 50TB arrays; things that work fine for the smaller sizes do not scale well, he said. So he has been reworking various pieces of bcachefs to address those problems.
There are now persistent data structures for holding data that used to require the filesystem to periodically "walk the world" by scanning the filesystem structure. Backpointers have been added so that data blocks point to the file that contains them, which is important to accelerate the "copygc" operation. That operation does a form of garbage collection, but it (formerly) required scanning through the filesystem structure. He said that it is also important for supporting zoned storage devices, which is still a little ways off but is coming.
Merging
Overstreet wants to be able to propose bcachefs for upstream inclusion "but not go insane and still be able to write code when that happens". The to-do list is always expanding, but the "really big pain points" have mostly been dealt with at this point. There is good reason to believe that upstreaming is close, he said.
Amir Goldstein asked about where and how bcachefs is being used in production now. Overstreet said that he knows it is being used, but he does not know how many sites are using it. He generally finds out when someone asks him to look at a problem. Bcachefs is mostly used by video production companies that need to deal with multiple 4K streams for editing multi-camera setups, he said; they have been using it for several years now. Bcachefs was chosen because it had better performance than Btrfs for those workloads and, at the time, was the only filesystem with certain features that were needed.
Josef Bacik said that he looked at the to-do list and noted that it was mostly bcachefs-internal items. He said that the goal when bcachefs was discussed at LSFMM in 2018 was to get the interfaces to the rest of Linux into good shape, since that would be the focus of any mailing-list review. None of the other filesystem developers know much about the internals of bcachefs, so they would not be able to review that code directly. He wondered what was left to do before the upstream process could begin.
Overstreet said that the ioctl() interface was one of the things discussed, but it has not changed in a while. He is more concerned about ensuring that the on-disk format changes are settling down. He had been pushing out those kinds of changes fairly frequently, and the backpointer support requires another, but after that, he does not see any other changes of that sort on the horizon.
Bacik asked how much more work Overstreet wanted to do internally before he would be ready to start talking about merging bcachefs and what was holding it back. Bacik also wanted to know what Overstreet needed from other filesystem developers as part of that process. The biggest thing holding him back, Overstreet said, is that he wants to be able to respond to all of the bug reports that will arise when there are lots more users of bcachefs. So he wants to make sure that the bigger development projects get taken care of before he gets to that point.
He said that it is far faster for him to fix a bug when he finds it himself, rather than having to figure out a way to reproduce a problem that someone else has found. So he is hoping to get rid of as many bugs as he can before merging. That process has been improved greatly by the debugging support he added to bcachefs over the last few years; over the last six months, he said, that effort "has been paying off in a big way". For example, the allocator rewrite went smoothly because of those tools.
Much of that revolves around the printbuf mechanism that he recently proposed for the kernel. That work came out of his interest in getting better logging information for bcachefs. There are "pretty printers" for various bcachefs data structures and their output can be logged. He is now able to debug using grep, rather than a debugger, for many of the kinds of problems he encounters. He said that he would be talking more about that infrastructure in a memory-management session the next day.
Wart sharing
Chris Mason said that he had a question along the lines of those from Bacik, but "a lot more selfish". Btrfs has a lot of warts in how it interfaces with the virtual filesystem (VFS) layer, in part because its inode numbers are effectively huge, but also due to various ioctl() commands for features like reflink. He is looking forward to some other filesystem coming into Linux that is "sharing my warts"; that may lead to finding better ways to solve some of those problems, he said.
Overstreet said that bcachefs has the same basic problem that Btrfs does with regard to inode numbers, subvolumes, and NFS; he has not spent a lot of time thinking about it but would like to use the Btrfs solution once one is found. Mason said that every eight months or so, someone comes along to say that the problem is stupid and easy to fix, then the Btrfs developers have to show once again that the problem is stupid, but hard to fix. Bacik agreed that a second filesystem with some of the same kinds of problems will help; it is difficult to make certain kinds of changes because there "seems to be an allergic reaction" to interface changes that are only aimed at Btrfs problems.
Ted Ts'o had two suggestions for Overstreet; first, before adding a whole lot of new users, some kind of bcachefs repair facility is probably necessary. Overstreet said that part was all taken care of. Ts'o also said that having an automated test runner that exercised various different bcachefs configuration options would be useful. He has a test harness, and Luis Chamberlain has a different one, either of which would probably serve the needs of bcachefs. Bacik noted that there is a slot later in LSFMM to discuss some of that.
Overstreet returned to the subject of debugging tools, as it is "the thing that excites me the most". The pretty-printer code is shared by both kernel and user space, which makes it easier to find problems, he said. grep is his tool of choice for finding problems, even for difficult things like deadlocks. He demonstrated some of the kinds of information he could extract using those facilities.
Mason suggested looking into integrating this work into the drgn kernel debugger, which was the subject of a session at LSFMM 2019. It is a Python-based, live and post-crash kernel debugger that is used extensively at Facebook; every investigation of a problem in production starts by poking around using the tool. Bacik agreed, noting that drgn allows writing programs that can step through data structures in running systems to track down a wide variety of filesystem (and other) problems. Overstreet said that he would be looking into it.
Overstreet pointed to the bcachefs: Principles of Operation document as a starting point for user documentation. It is up to 25 pages at this point, organized by feature, and will be getting fleshed out further soon.
While Overstreet's hesitance to push for merging bcachefs is understandable, Bacik said, he and others have some selfish reasons for wanting to see that happen. He said he did not want to rush things, but did Overstreet have a timeline? Overstreet said that he would like to see it happen within the next six months. Based on the recent bug reports, he thinks that is a realistic goal.
Goldstein wondered when the Rust rewrite would be coming. Overstreet said that there is already some user-space Rust code in the repository; as soon as Rust support lands in the kernel, he would like to make use of it. There are "so many little quality-of-life improvements in Rust", such as proper iterators rather than "crazy for-loop macros". Bacik said that many were waiting for that support in the kernel; Overstreet suggested that those who are waiting be a bit noisier to make it clear that there is demand for it. With that, time expired on the session, but it seems we may see bcachefs and Rust racing to see which can land in the kernel first.
Snapshots, inodes, and filesystem identifiers
A longstanding problem with Btrfs subvolumes and duplicate inode numbers was the topic of a late-breaking filesystem session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). The problem had cropped up in the bcachefs session but Josef Bacik deferred that discussion to this just-created session, which he led. The problem is not limited to Btrfs, though, since filesystem snapshots for other filesystems can have similar kinds of problems.
Background
Bacik started with an overview of the problem, in part because he has to re-explain it every few years when it is "discovered" again. Btrfs has subvolumes that contain their own unique inode-number space. Subvolumes can be used for snapshots, so a common use case is to have a subvolume for a home directory so that it can be snapshotted. A snapshot is just a metadata block with a pointer to an existing block and a reference count. That means it has the same files, the same data, and the same inode numbers as the subvolume at the time of the snapshot.
That situation confuses tools like rsync, so Chris Mason came up with a way to make separate subvolumes appear to be on different filesystems, which meant that the tools would do the right thing. Tools like rsync and find use the st_dev value returned by stat() to decide whether they have traversed into a different filesystem; otherwise, the duplicate inode numbers cause them to think they have already seen the files. So Btrfs assigns an anonymous block device to each subvolume, which is what it reports via stat().
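The effect is visible with the same check those tools perform; the paths here are only examples:

    /* Compare st_dev across a suspected subvolume boundary, as find
     * and rsync do; the paths are illustrative. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
            struct stat parent, child;

            if (stat("/home", &parent) || stat("/home/user", &child))
                    return 1;

            /* On Btrfs, a subvolume reports a different (anonymous)
             * device number even within one mounted filesystem. */
            if (parent.st_dev != child.st_dev)
                    printf("filesystem boundary: do not descend\n");
            return 0;
    }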
That was an easy way to solve the problem, but every time it comes up, "people yell and complain about how terrible and broken it is". There is no other filesystem that does this, he said, so it may not be a great solution, but it did resolve the problem at hand. Internally, Btrfs has a subvolume ID that distinguishes the different inode-number spaces; it is used when Btrfs is being exported via NFS or Ceph to create the unique ID (or filehandle) needed, which works well, he said.
On the client side, though, the fact that those identical inode numbers are on different subvolumes gets lost, at least for NFS. So if a directory containing a subvolume and its snapshots gets exported, the fact that they are separate subvolumes is not available to the client, so find and rsync get confused by the duplicate inode numbers. Periodically, someone encounters this problem and "then tells me all the ways that it is easy to fix"; they realize quickly that it is not that easy to fix, Bacik said. The most recent attempt to do so was by Neil Brown, who tried multiple solutions, but the problem is still not resolved.
Possible solution
What Bacik would like to do is to extend statx() to report the subvolume ID, which is something that Bruce Fields said would probably work for NFS. The current st_dev behavior is still sometimes problematic when tools want to know if two files are on the same filesystem, so he would like statx() to report two things: the universally unique ID (UUID) of the containing filesystem and some way to identify the subvolume. Btrfs has a unique object ID for the root of the filesystem, which is a 64-bit value, or the subvolume ID, which is a 128-bit UUID. Either of those could be used by NFS (and others) to determine if the inode numbers are in their own space. But the subvolume UUID is Btrfs-specific, while the root object ID may apply more widely.
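For comparison, statx() already reports a mount ID, which identifies the mount but not the subvolume; the proposed fields would be additions of the same shape. The sketch below uses only the interface that existed at the time (STATX_MNT_ID dates to 5.8); the subvolume field itself remained a proposal:

    /* What statx() can already report; the proposed subvolume ID (or
     * filesystem UUID) would be a new field alongside these. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    int main(void)
    {
            struct statx stx;

            if (statx(AT_FDCWD, "/home/user/file", 0,
                      STATX_INO | STATX_MNT_ID, &stx))
                    return 1;

            /* Two files can share stx_ino and stx_mnt_id yet sit in
             * different subvolumes; the proposed ID would let
             * clients tell such files apart. */
            printf("ino=%llu mnt_id=%llu\n",
                   (unsigned long long)stx.stx_ino,
                   (unsigned long long)stx.stx_mnt_id);
            return 0;
    }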
Amir Goldstein asked how the situation was different for ext4 snapshots. Bacik said that the problem was the same for any filesystem that does snapshots. It is only different for snapshots at the block layer, for example using the Logical Volume Manager (LVM).
On the Zoom chat, Jeff Layton said that Bacik's idea would be formalizing the idea of filesystem and subvolume IDs, which might be a good thing, but other filesystems need to be considered. Bacik agreed, but said that all of the local filesystems he is aware of have a UUID; others wondered about filesystems like FAT. Ted Ts'o said that some FAT filesystems have a 32- or 64-bit ID, but not a UUID. That has come up before in the context of adding a generic mechanism to set the UUID on a filesystem, since some do not have that concept.
Ts'o also wondered what it meant when Bacik said that a file was in the same filesystem but in a different subvolume. One definition of "the same filesystem" might be that files can be renamed or hard linked within it, but he did not think that was true for Btrfs subvolumes, which Bacik confirmed. Ts'o said it will be important to clearly define what it means for two files to be in the same filesystem, since there may be different expectations among user-space tools. The main use for whether two files are on the same filesystem, Bacik said, is for maintenance tasks to determine which filesystem to mount or unmount, for example.
Not perfect
In general, this mechanism does not have to be perfect, Bacik said, it just needs to give NFS and others some additional information so that they can do whatever it is they need to do. NFS itself works fine, he said, because it uses the unique ID, but find and such have problems in those exported directories, so he wants to provide a standard way that network filesystem clients can differentiate those files with the same inode numbers.
David Howells wondered if statx() was the right place for this kind of information; it might make more sense in the statfs() information. While Bacik thought that might be a reasonable place to report the UUID for the filesystem, there is still a need to specify which filesystem a given file belongs to, which means statx(), he thinks. But, at some level, that is a "nice to have" feature; the real crux of the problem is being able to differentiate the inode-number spaces, which requires a way to identify the subvolume.
Ts'o pointed out that POSIX-following tools (e.g. rsync, find) are not going to change to start calling statx(); beyond that, those tools are already baked into various enterprise distributions and will need to be supported for a long time. That means the problem will still exist on exported filesystems, unless the NFS client does something different.
Bacik said that Btrfs has various unique IDs that can be used to recognize and handle the problem, somehow; he just wants to know which IDs are desired and how he should deliver them. Historically, his attitude has been "play stupid games, win stupid prizes"; he suggests not combining the local subvolume and the snapshot in the same export. "Problem solved."
Bacik said that Christoph Hellwig always suggests that each subvolume have its own VFS mount, but that is a non-starter, because each VFS mount needs its own superblock. That could potentially change, but the problem remains because there are often thousands of subvolumes on a given filesystem. Goldwyn Rodrigues pointed out that each mount gets its own thread, which is "another nightmare to take care of". He said there had been some work on "views" a few years back that had a lightweight superblock for each sub-mount, though he was not sure how far that work progressed.
Bacik said that he vaguely remembered that work, but, overall, he is tired of talking about this problem. His solution is to extend statx() to give NFS and others a way to figure things out. The st_dev solution will stay forever, he said, since it works for local filesystems. But for network filesystems, he suggests exporting the UUID of the filesystem and the UUID of the subvolume or the 64-bit object ID of the root, either of which would work. No one present really objected to that plan, so patches should presumably be forthcoming.
Unique identifiers for NFS
In a combined filesystem and storage session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Chuck Lever wanted to discuss the need for a permanent, globally unique ID for network filesystems. He was joined by Hannes Reinecke who has worked on the problem for NVMe storage devices; Lever said something along those lines is needed for NFSv4. He was hoping to find a solution during the session, though it would seem that the solution may lie in user space—and documentation.
The general problem is that network filesystems and network storage devices need to have a unique ID, durable over reboots, that clients can use to identify them, Lever said. Clients need IDs of their own as well, so that servers can keep track of them when the clients are rebooted. On a physical host, something like the machine ID can be used, but once virtualization enters the picture, "things get a little foggier".
There are a number of questions, he said. When a container is created, how is the ID created and where is it stored? If a virtual machine (VM) is cloned from an existing VM, how does the system ensure that the unique ID changes for the new guest? He and others are looking for a solution for NFS, so he was soliciting ideas and thoughts from the assembled developers.
The connection to NVMe was not entirely clear to me from the session, though the problems described have a similar scope. Reinecke said that for NVMe, it is just a matter of storing the right value; there is already a defined location for it. But the question is how that value should be generated and who should be able to change it. Part of the problem is in defining what "the host" is in a world where containers and VMs are constantly being created and destroyed. A system may have several interfaces that are partitioned or shared among the VMs and containers, so what does it mean to be a "machine" or a "host" in those settings? To a certain extent, that governs when and how these unique IDs can and will change.
Ted Ts'o gave an example of an NFS server that is implemented in a VM and exports a filesystem that is stored in cloud storage. If the VM needs to be killed and restarted at some point for maintenance, the new VM is effectively the same server as the old. It is analogous to swapping the motherboard of a hardware server; the underlying "machine" has changed, but the disks and the functionality it provides are still the same. So the definition of the host depends on various factors that may not be amenable to a set of rules.
But Lever said the server side is easier because it has persistent storage where a unique ID can be placed; clients do not necessarily have that. On the server, it could be put into an /etc file. Clients can get a unique ID as a module parameter from the kernel command line, for example; it could be calculated as a hash of the machine ID. A hash would be used since machine IDs are not supposed to be put on the wire, he said. That works fine for real hardware, but containers on the same system would get the same "unique" ID, which is a problem.
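As an illustration of that hashing step, something along these lines could derive a value for the NFS module's nfs4_unique_id parameter; the FNV-1a hash here is an arbitrary stand-in for whatever keyed cryptographic hash a real deployment would use:

    /* Derive a client uniquifier from the machine ID without putting
     * the raw ID on the wire; hash choice is illustrative only. */
    #include <stdio.h>
    #include <stdint.h>

    static uint64_t fnv1a(const char *s)
    {
            uint64_t h = 0xcbf29ce484222325ULL;

            while (*s) {
                    h ^= (unsigned char)*s++;
                    h *= 0x100000001b3ULL;
            }
            return h;
    }

    int main(void)
    {
            char id[64] = "";
            FILE *f = fopen("/etc/machine-id", "r");

            if (!f || !fgets(id, sizeof(id), f))
                    return 1;
            fclose(f);

            /* The result could be passed on the kernel command line. */
            printf("nfs.nfs4_unique_id=%016llx\n",
                   (unsigned long long)fnv1a(id));
            return 0;
    }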
James Bottomley said that the problem was basically solved on the server side by using its persistent storage. Clients could simply use a random ID, he suggested, but Lever said those values need to be preserved over a reboot. Bottomley wondered why it mattered since restarting the container was effectively bringing up a new instance, but others cautioned that not all containers work that way. Bottomley said that containers that continue living from generation to generation will need to have persistent storage, though, so those can store the unique ID there.
Christian Brauner said that it should be up to the container manager to store that information and provide it as needed to the containers it creates; it just needs to be standardized. Lever agreed, noting that he and others have been trying to document the requirements for use by container orchestration system developers. Those developers will need to figure out where they want to store those values in order to provide them to the containers.
Bottomley asked about systems that scale containers up and down by a factor of ten or 100; he suggested that new IDs would be created whenever these new containers were created, not reused from previous instances. Lever agreed and said that while each container needed its own unique ID, he did not think the values needed to persist across container instances, since once the container is destroyed it no longer has any open or locked files. The unique ID (or "uniquifier") is used to recover when clients go away and come back while files are open or locked.
Steve French said that a container might be moved, so it could be checkpointed and then restored somewhere else. The server needs to be able to detect that it is the same client in order to maintain its state. In that case, though, the ID should still be available in the restored container.
Ts'o said that maybe clients that care about preserving their open/locked-file state need to have a persistent location in /etc to store the ID. If there is nothing there (or no persistent storage), then the ID should be random and that client does not participate in the state-recovery handling.
Containers on Linux generally rely on separate network namespaces, an attendee said, but each namespace needs its own unique ID. Reinecke disagreed with that, however, as it is dependent on the kind of container and application being run. If the namespace has its own IP address, Lever said, then it will need its own ID.
Josef Bacik said that Facebook uses containers exclusively and it would expect that the IDs would be provided by some central authority. Those values would be configured per container by consulting some service running in the internal network. He suggested that NFS just provide a generic interface to set the client ID and allow user space to figure out how to set it to the proper value based on the use case.
Lever asked if administrators of these kinds of systems with thousands of containers needed tools to configure and manage the IDs or if documentation would suffice. Bacik said that documentation is all that's needed. "Tell us what to do" in order to use the facility, he said, and the user-space developers would run with it.
Lever said that he was concerned that some would not read the documentation, then their filesystem would not work correctly out of the box. But Bottomley said that the fallback should be to use a randomly generated ID; those who want something different will have to arrange to make that happen. That is not what happens today, Lever said; if there is no ID provided, it uses the same value as the host. "That's probably wrong."
Part of the difficulty here is that containers are a user-space concept, Ts'o said. That means that the container orchestration system needs to handle setting these values; the kernel is really in no position to do so.
Lever said that he has some documentation that he had been working on. He would be updating that and asked Bacik to review it to see if it would be sufficient for the container developers at Facebook. Bacik agreed to do that and the session soon trailed off.
Solutions for direct-map fragmentation
The kernel's "direct map" makes the entirety of a system's physical memory available in the kernel's virtual address space. Normally, huge pages are used for this mapping, making it relatively efficient to access. Increasingly, though, there is a need to carve some pages out of the direct map; this splits up those huge pages and makes the system as a whole less efficient. During a memory-management session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Mike Rapoport led a session on direct-map fragmentation and how it might be avoided.Rapoport started by saying that the direct-map fragmentation problem is specific to the x86 architecture at this point; some other architectures cannot fragment their direct map at all. There are a number of activities that can lead to direct-map fragmentation, including allocations for BPF programs, various secret-memory mechanisms, and virtualization technologies like SNP and TDX. Other changes envisioned for the future, including the permission vmalloc() API and using protection keys supervisor (PKS) to protect page tables, will make things worse. As more subsystems carve pieces out of the direct map, the performance of the system will decline; this is an outcome worth avoiding.
Rapoport's proposal is to coalesce these various uses into a single region of memory as a way of minimizing the fragmentation they create. Once a huge page has been split for carved-out memory, further requests for such memory should be satisfied from the same huge page, if possible. To that end, he suggests adding a new GFP flag (__GFP_UNMAPPED) so that normal page-allocator calls can be used to obtain memory that has been removed from the direct map. Callers using this flag would have to map the allocated memory in whatever way makes sense for their use case. A new migration type (MIGRATE_UNMAPPED) would be added to prevent this memory from being accidentally migrated back into direct-mapped memory. He has posted a patch set implementing this idea in a prototype form; it "kind of works", he said.
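From the caller's side, the proposed interface might be used as in the sketch below; __GFP_UNMAPPED exists only in the prototype patch set, while the remaining calls are standard kernel API:

    /* Sketch of allocating direct-map-excluded memory under the
     * proposal; __GFP_UNMAPPED is from the prototype, not mainline. */
    static void *alloc_unmapped_buffer(void)
    {
            struct page *page;

            /* The allocator would satisfy this from an already-split
             * huge page when possible, limiting new fragmentation. */
            page = alloc_pages(GFP_KERNEL | __GFP_UNMAPPED, 0);
            if (!page)
                    return NULL;

            /* The page is absent from the direct map, so the caller
             * must establish its own mapping before using it. */
            return vmap(&page, 1, VM_MAP, PAGE_KERNEL);
    }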
Michal Hocko said that using the page allocator might not be the best approach; it will be adding overhead to highly optimized fast paths for a rare case. Mel Gorman agreed that using the page allocator was overkill, creating a special case for a single user. Rapoport's addition of a separate migration type, he added, would end up fragmenting memory anyway because those pages cannot be moved. Rapoport answered that, in a long-running machine, direct-map fragmentation is inevitable, leading Gorman to answer that he does not want to see the extra complexity added to the page allocator to address a problem that will still happen.
An alternative, Rapoport said, would be to have a separate allocation mechanism that sits next to the page allocator. In this case, each user would have their own cache, which is a less attractive option. But Gorman replied that migration types are not free either; each new one adds a set of linked lists and increases the size of the page-block bitmap. A better solution, he said, might be a special slab cache.
David Hildenbrand said that, in his role working on memory hotplug, he hates memory that is not movable; Rapoport's proposal would create more unmovable memory and make the problem worse. Rapoport said that his patch tries to avoid movable zones when performing unmapped allocations, which should minimize the problem. Hocko repeated, though, that the page allocator is not the best place to make this type of allocation; users "count every CPU cycle" for memory allocations, and any extra overhead there is unwelcome. It would be better to build something like a slab allocator on top of the page allocator, he said.
At the end of the session, Rapoport said that he would try to create some sort of slab-like solution. Vlastimil Babka cautioned that the existing slab allocator cannot be used for BPF programs; the slab allocator hands out objects of the same size, but every BPF program is different. Rapoport concluded by saying he wasn't sure how to solve all of the problems, but would be making the attempt soon.
Merging the multi-generational LRU
Many types of kernel changes can be hammered into shape on the mailing lists. There are certain types of patches, however, that have a hard time getting to the finish line that way; they are sufficiently large and invasive that they need an actual gathering of the developers involved. The multi-generational LRU work (MGLRU) falls into this category, which is why it was the subject of a full-hour session at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). The discussion held there may well have opened the doors for this code to be merged in the near future.
MGLRU introduction
The session was led by Yu Zhao, the developer of the MGLRU work. He started by saying that RAM performance will continue to play a major role in the performance of our systems as a whole. Getting the best performance, he said, requires overcommitting memory, leading to a couple of problems. One of those is deciding which objects should be present in memory at any given time; that is what MGLRU is for. Increasing the number of pages that can be kept in memory is also an active area of work — defragmentation and zram, for example — but he wasn't there to talk about those.
He provided an introduction to MGLRU, which is also described in the above-linked article. At its core, it divides memory into a number of buckets called "generations". A page's generation reflects its "age" — how long it has been since the page was last accessed. The management of these pages is done by a mechanism Zhao described as a "clock with two hands". The aging hand scans the accessed bit of pages to see if they have been used since the last scan; pages that have been used are marked to be moved to the youngest generation. The eviction hand will actually move pages to the correct generation; those that end up in the oldest generation are the coldest and can be considered for reclaim.
One of the interesting design decisions behind MGLRU is that its scanning walks through process page tables rather than scanning physical memory. This is partly for efficiency, Zhao said; the LRU walk in current kernels is constantly having to switch between different processes' page tables, which creates cache misses and slows things down. The problem with walking page tables, though, is that they can be sparse, with a lot of empty entries; scanning those brings no benefit. So the MGLRU code includes a Bloom filter that helps it to avoid walking page-table pages that contain few active entries. The MGLRU code also tries to learn from its mistakes by noticing when pages it reclaims are quickly brought back into memory. To that end, it incorporates a proportional-integral-derivative (PID) controller to redirect its attention when it seems to be making the wrong decisions.
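A toy model can make the "two hands" concrete. This is for intuition only; the real state lives in struct lruvec and uses page flags, though the max_seq/min_seq generation counters are genuine MGLRU concepts:

    /* Toy model of MGLRU's sliding generation window. */
    #include <stdbool.h>

    struct toy_page {
            unsigned long gen;              /* generation stamp */
    };

    struct toy_lruvec {
            unsigned long max_seq;          /* youngest generation */
            unsigned long min_seq;          /* oldest generation */
    };

    /* Aging hand: a page found accessed since the last scan is
     * marked for the youngest generation. */
    static void age_page(struct toy_lruvec *lv, struct toy_page *p,
                         bool accessed)
    {
            if (accessed)
                    p->gen = lv->max_seq;
    }

    /* Eviction hand: pages stamped with the oldest generation are
     * reclaim candidates; draining a generation advances min_seq. */
    static bool reclaim_candidate(struct toy_lruvec *lv,
                                  struct toy_page *p)
    {
            return p->gen == lv->min_seq;
    }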
Johannes Weiner started the discussion by asking about how the aging works. If a page in the oldest generation is seen to be accessed, is it moved to the youngest generation, or just to the next-younger generation? Zhao answered that it actually depends on the type of access. If the page was accessed via a page table, then it goes to the youngest generation; if, instead, it was accessed via a file descriptor, it only goes up one generation. There are two reasons for that: the cost of evicting file-backed pages is lower, and the system can see every access (since they are done through the kernel), while accesses via page tables can only be observed once on every scan. Andrew Morton asked whether the dirtiness of a page is factored in; the answer was that dirty pages are moved up one generation.
Extensions
Weiner continued by noting that, in general, generational garbage-collection algorithms try to look at how long objects have been in use. Everything starts in the oldest generation, and becomes less likely to be reclaimed over time if it is used. The MGLRU, though, starts everything in the youngest generation, and will also promote pages directly there. What, he asked, do generations buy when they are managed this way?
The answer was a bit surprising: it seems that the full mechanism for moving pages between generations is not yet in place. When MGLRU was first posted, Zhao said, it was called a "framework". There are a lot of different use cases out there, from servers to phones and more, and there is a lot of variety even within a single category like phones. Coming up with a generation-assignment algorithm that works everywhere would be a challenge, so MGLRU will allow it to be customized. There will be BPF hooks that will be called on each page needing generation assignment; they will be provided with the associated process ID, the page address, its type (anonymous or file-backed), and, for page faults, the type of the fault. The called program can then tell the memory-management subsystem which generation the page should be placed in. The networking subsystem, Zhao continued, started with a single congestion-control algorithm. The number of those algorithms has grown over time, and now their implementation is moving to BPF. The MGLRU, he said, is heading down the same path.
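To make the idea concrete, a hook of that kind might look vaguely like the following. Everything here is speculative, since the machinery has not been posted; the context fields simply mirror the parameters Zhao described:

    /* Purely speculative sketch of a generation-assignment hook;
     * no such hook point or context structure exists upstream. */
    struct gen_hook_ctx {
            int pid;                /* associated process ID */
            unsigned long addr;     /* page address */
            int type;               /* anonymous or file-backed */
            int fault_type;         /* set for page faults */
    };

    /* Returns the generation the page should be placed in. */
    static int assign_generation(struct gen_hook_ctx *ctx)
    {
            /* Example policy: start file-backed pages one generation
             * older than anonymous ones (values hypothetical). */
            if (ctx->type == 1 /* file-backed */)
                    return 1;       /* second-youngest generation */
            return 0;               /* youngest generation */
    }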
Weiner admitted that he hadn't known about this aspect of the MGLRU. Zhao said that this machinery is not in the current patch posting, but should probably be added.
A future MGLRU feature, Zhao continued, could be detection of internal fragmentation with transparent huge pages. There are a lot of applications that suggest turning off this feature now; if their memory is sparsely accessed, using huge pages can end up wasting a lot of memory. He said that Redis and memcached are among the applications that are affected by this. The problem is that access to a single base page can make an entire huge page appear to be hot, potentially wasting up to 511 base pages in each huge page.
Internal fragmentation of huge pages can be detected by initially mapping them using base-page entries, then watching the access pattern with MGLRU. If most of the base pages end up in the younger generations (and are thus being used), the mapping can be turned into a huge-page mapping; otherwise, the unused pages can simply be reclaimed. Michal Hocko asked whether this code exists now; the answer was "no, but it is likely to happen within the next four years". Hocko then suggested focusing on the code that is being considered now. Before allowing that to happen, Zhao suggested that ballooning for virtual machines is another potential extension, allowing unused pages to be taken away.
Enabled by default?
Hocko said that there have been concerns about MGLRU expressed on the mailing lists. He asked where things should go from here. MGLRU has some nice potential for extension, he said, but the current LRU implementation has been improved over many years and benefits from a lot of accumulated experience. He suggested that MGLRU could be merged alongside the existing LRU with an opt-in approach. Merging is the only way to find out how well MGLRU really works across workloads, he said, but he was nervous about switching over to it by default.
That said, he continued, perhaps enabling by default could be considered; that would be a "trial by fire" for both the code and its developer. It would obviously be necessary to tell users clearly how to turn it off. There are advantages to both approaches, he said, and maintaining two LRUs will have a huge cost for as long as it lasts. What, he asked, can the group agree on?
Mel Gorman said that, if this code is merged, it should be enabled by default. That said, he worried that most distributions would not be able to enable it because the MGLRU currently places a relatively low limit on the maximum number of CPUs that the kernel can support. Zhao said that this limitation would be removed in the next version of the patch set. Part of the problem with CPU counts is evidently the number of page flags that MGLRU needs; Zhao suggested that he had a way to free up some page flags, provoking curiosity and raised eyebrows in the group. There followed a digression on how this might be done that didn't reach any firm conclusions.
Bringing the discussion back into focus, Gorman said that his testing shows that MGLRU performs reasonably well; it is better on single-node machines than on NUMA systems, though.
Morton said that, if this code is going to succeed, it will start with a relatively small number of users. It will get better over time as the problems are addressed, and people will start switching over to it. He worried, though, that the development history of the MGLRU code is hidden, and that the code itself is "inscrutable"; he suggested putting a lot of time into internal documentation. Zhao, Morton said, needs to "tell a story" to bring developers up to speed.
Continuing, Morton said that the current LRU still appears to perform better for some workloads. The kernel still has multiple slab allocators, but he would rather not do that again; MGLRU should be made better for all workloads. On the other hand, we don't have that concern for filesystems; we encourage users to choose between them. Perhaps the same could be done for the LRU.
He concluded that he could envision merging this code "in the next cycle", but it is going to be a challenge. Adding MGLRU takes developers who have worked on memory management for decades and "turns them into new hires" who will have to face a complex code base with no comments, and with nobody in the next cubicle to ask about something they don't understand.
Zhao said that he has been having a hard time getting users to try MGLRU without it being upstream. This has evidently been a problem even within Google. His group ended up hiring an outside firm to do the benchmarking on this code.
"Expect a few bug reports"
Hocko said that, if this code is merged, nobody seems to object to enabling it by default. He told Zhao to "expect a few bug reports" when that happens, and asked whether Zhao was prepared for a massive amount of work. Is he prepared to see the whole thing reverted if he can't keep up with that work? Morton suggested enabling MGLRU in linux-next for a single day to see what happens, but Gorman said that it wouldn't be possible to get useful information in that little time "unless it's an outright failure". Real memory-management issues tend to be more subtle and take more time to come out, he said; the only way to find them will be for major distributors to enable MGLRU by default.
Zhao sought to reassure the group that Google would continue to support this work; he said that both Android and the data center would start to use it once it's merged. The budget for this work has been planned for years ahead, he said. He has also been working with other companies, in private, to get them to use it, and will continue to do so. Hopefully he will eventually be able to take a break, someday, but not soon.
Morton said that merging it disabled by default can still have a lot of value, especially if Google can put resources behind testing and improving it. Gorman said that there are no good choices; if it is not enabled by default, most people won't use it. He is not opposed to merging this code, but would like to see it enabled. When transparent huge pages were merged, he said, the feature was enabled by default and "set everything on fire". It took three years to sort it all out, but without having been enabled for all users, it would never have been fixed.
Zhao said he would prefer to follow the process used at Google, which involves starting with a small group of users and slowly ramping up. Gorman answered that an approach like that is good for a fleet, but it doesn't help major distributors decide when to make a switch. Even when something works in the whole fleet, it doesn't mean that it will work in the general case. Weiner said that some distributors would turn MGLRU on even if it were disabled by default; he mentioned Arch in particular. That might be a good way to avoid a "total flag day". Matthew Wilcox agreed, saying that enabling transparent huge pages by default was actually the wrong thing to do. Documentation lives forever, he said, and vendors are still telling users to disable transparent huge pages even though the problems have long since been fixed.
Zhao said that he could live with MGLRU by default, but then it could become a problem for others; if he breaks things, users will suffer. So he thinks that is a risky approach; switching to MGLRU by default after a year might be better. Weiner said that, if MGLRU is off by default, there should be a set time frame to enable it — a maximum of a couple of development cycles.
At that point, a rather tired set of memory-management developers called an end to the session, and to the day. It seems highly likely that this work will be merged in the near future, though Morton's suggestion of doing it for 5.19 might strike others as a bit hasty. Whether it will be enabled for all users, though, is far from clear.
CXL 1: Management and tiering
Compute Express Link (CXL) is an upcoming memory technology that is clearly on the minds of Linux memory-management developers; there were five sessions dedicated to the topic at the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM). The first three sessions, on May 3, covered various aspects of memory management in the presence of CXL. It seems that CXL may bring some welcome capabilities, especially for cloud-service providers, but that will come at the cost of some headaches on the kernel-development side.

At its core, CXL is a new way to connect memory to a CPU. That memory need not be on the local memory bus; indeed, it is likely to be located on a different device entirely. CXL vendors seemingly envision "memory appliances" that can provide memory to multiple systems in a flexible manner. Supporting CXL raises a number of interesting issues around system boot, memory hotplug, memory tiering, and more.
A CXL memory interface for containers
The first session was led by Hongjian Fan over a remote link; it was focused on how to use CXL memory to support containers. Figuring this out, he said, is complicated by the fact that CXL is new technology and there are no real devices to play with yet. So, much of the work being done is at the conceptual level. The Kubernetes container storage interface provides a flexible way to allocate storage to containers; he is working on a "container memory interface" (CMI) to do the same thing with CXL memory.
Systems can use CMI to provide functionality like memory tiering and to manage resources in a pooled-memory system. There are a few scenarios that Fan envisions for how this would all work. One would be that containers would have access to all of the memory available to the system (though managed by the control-group memory controller, of course); in this case, CXL would bring little change. If, instead, the container implements tiered memory, then CMI will control access to the different memory types. There are also pooled-memory scenarios, where the memory is located on an appliance somewhere.
Fan had a series of questions he was seeking to answer. The first was whether it is possible to create a common CMI standard that would work across all CXL vendors. With regard to memory tiering, he asked, is it better to do it within the containers, or instead at the host level? There are also open questions about how to manage pooled-memory servers. An attendee started the discussion by asking whether all of this could be managed with control groups, with the different types of memory presented as CPU-less NUMA nodes. That might be the simplest place to start, Fan answered, but he was not sure that control groups had sufficient flexibility.
Michal Hocko said that cpusets could perhaps help with the management, but they provide no way to control how memory is distributed across nodes. Dave Hansen said that there is interest in providing control over memory allocation; providers could charge lower rates for access to slower memory, for example. The problem exists now, and people try to manage things with the numactl utility, but it's not up to the task. It can block users from certain types of RAM, he said, but it's an all-or-nothing deal. It can't provide the finer quality-of-service control that providers want.
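To make that all-or-nothing point concrete, here is a minimal sketch of the kind of binding that numactl performs by way of the mbind() system call; the node number is arbitrary and error handling is minimal:

    /* Minimal sketch: strict binding to a single node, which is all
       that MPOL_BIND can express; build with -lnuma for <numaif.h>. */
    #include <numaif.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 1UL << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        /* Permit allocations from node 0 only; a node is either
           allowed or it isn't, with no quality-of-service middle
           ground in between. */
        unsigned long nodemask = 1UL << 0;
        if (mbind(buf, len, MPOL_BIND, &nodemask,
                  8 * sizeof(nodemask), 0))
            return 1;
        return 0;
    }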
Dan Williams said that the current work has been focused on DRAM and slower types of memory. CXL is going to bring a broader spectrum of vendors and speeds, and multiple performance classes. While it might make sense to design a system to handle two tiers of memory service now, developers should be thinking about five tiers in the future. Matthew Wilcox said that enterprise vendors are unlikely to want to manage that many tiers, though.
Adam Manzanares suggested starting with well-defined use cases and just two tiers; otherwise, he worries that things will get out of control quickly. Wilcox said that there is a sane three-tier case consisting of CXL memory, DRAM, and persistent memory. But Hansen warned that there could be multiple CXL-attached tiers, and that developers should expect "a lot of weird CXL devices". It is an open standard, and vendors are free to do interesting things with it.
Fan said that, for any sort of management to work, the kernel will need some idea of the relative performance of each available memory tier. Hansen answered that there is a lot of standards work in this area. ACPI has a way of enumerating NUMA latency, for example, and other mechanisms are under development. The Heterogeneous Memory Attribute Table (HMAT), for example, can provide bandwidth information for each memory type. UEFI, meanwhile, has specified the Coherent Device Attribute Table (CDAT) with CXL memory, among other types, in mind.
Williams said that Linux is too dependent on the notion of NUMA distance as a way of describing memory capabilities. There is better information about memory available from the firmware now, but the memory-management code does not make use of it. A baby step might be to boil that information down into a single distance value to at least make some use of it. Manzanares said that distance doesn't work for persistent memory, though, since it cannot capture the asymmetry between read and write speeds.
Hansen said that the relevant information is available now if an application knows where to look. The harder problem is making decisions about memory placement in the kernel. Different workloads may have different preferences depending on their access patterns; currently, applications have to figure out which memory they want and set up an appropriate NUMA policy. But the kernel could be using memory information to make smarter decisions; moving frequently written pages off of persistent memory, for example.
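On systems that provide an HMAT, some of this information is already visible in sysfs under /sys/devices/system/node/, where the kernel exposes per-node bandwidth and latency attributes. A quick sketch of where an application might look (the node and access-class numbers are arbitrary):

    /* Read the HMAT-derived read bandwidth for node 1 as seen from
       its best-placed initiators (access class 0). */
    #include <stdio.h>

    int main(void)
    {
        char buf[64];
        FILE *f = fopen("/sys/devices/system/node/node1"
                        "/access0/initiators/read_bandwidth", "r");
        if (!f)
            return 1;
        if (fgets(buf, sizeof(buf), f))
            printf("node1 read bandwidth (MB/s): %s", buf);
        fclose(f);
        return 0;
    }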
There was some discussion about where decisions on tiering should be made. Putting the logic into the kernel makes life easy for applications that don't care about NUMA placement, which is most of them, Williams said, but he worried that there could be fights between the kernel and user space about tiering. Hansen said those fights could happen now, but the kernel's NUMA-placement logic mostly stays out of the way if user space has set an explicit policy. That may be sufficient for future needs as well.
Williams asked for an explanation of the perceived deficiencies in the current NUMA API. Fan answered that there needs to be a way to set memory limits on a per-node basis; that will require a new control-group or numactl knob. Manzanares suggested adding better tiered-memory support to QEMU so that this work could go forward, but Davidlohr Bueso pointed out that it's not possible to get real performance numbers that way. The concern at this point, Manzanares said, is to work out the interface issues rather than to optimize performance. Hansen said that a lot can be done by putting some persistent memory into a system and treating it like another tier; the result "kind of looks like CXL if you squint at it funny". That would give ways to play with interfaces and get some initial performance data.
Fan thanked the group for having provided a bunch of good information for him to work with, and the session drew to a close.
Managing CXL memory
The next session, led by Jon Trantham, delved into some of the other issues that come up when trying to manage CXL memory. CXL, he said, is a way to attach memory devices that cannot go onto the DDR memory bus. Putting DDR interfaces onto devices can be hard for manufacturers, and DDR does not work all that well with persistent memory. But CXL memory has different performance characteristics than normal RAM. Its latency and bandwidth will differ, and they can change as the device ages. Persistence, endurance, and reliability can all differ as well.
There are various ways of reporting the characteristics and status of CXL memory, starting with the above-mentioned CDAT table. The CDAT is useful in that it can be updated as performance changes. CXL devices can also produce a stream of event records, indicating that maintenance is required or that performance is falling, for example. CXL 2.0 enables switches that can sit between memory and the computer, allowing memory to live in a different enclosure entirely. That makes actions like hot unplugging possible, but it will be necessary to figure out how to communicate that to the kernel.
Decisions must be made about how much memory a CXL device allocates to each processor; this can involve an "out-of-band fabric manager" that controls the switches. Memory can be interleaved at a granularity as small as 64 bytes, which is great for performance but harder for error recovery; a memory failure can leave small holes in the address space. Wilcox observed that the usual management technique in such situations is crashing.
On the security side, there are access and encryption keys shared between hosts and devices; that brings in the whole key-management problem. The sum of all this, he said, is that help is needed. How is all of this to be managed? Should it be done in the kernel or in user space?
Williams asked if the encryption features were only for persistent memory. Evidently CXL can provide link encryption for DRAM, but does not encrypt data at rest. Hansen said that it will never be possible for the kernel to recover from errors on a 64-byte boundary; it only handles memory at the page level. He suggested looking at the existing mechanisms and asking whether anything different was really needed; perhaps all of those CXL capabilities aren't really necessary.
Williams said that CXL makes it possible to turn bare metal into virtual machines; techniques like memory ballooning become possible. So it seems that the same interfaces should be used. Hocko said that ballooning relies on memory hotplug, which "mostly works", but shrinking memory is hard; the memory to be removed can only be used for movable allocations. This is equivalent to a return to the old high-memory systems, where much of the installed memory could not be used by the kernel.
Hansen answered that the kernel does a reasonable job of emptying a memory area that is to be removed, but there is always the case where a few pages simply cannot be cleared. If there were some way to retain those pages after the memory goes, he said, life would be easier and the whole mechanism would be more reliable.
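For reference, memory-block offlining is driven through sysfs today; here is a minimal sketch of removing one (arbitrarily chosen) block, an operation that fails if the kernel cannot clear the block of unmovable pages:

    /* Ask the kernel to offline memory block 32; the write fails if
       the block still contains unmovable allocations. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/devices/system/memory/memory32/state", "w");
        if (!f)
            return 1;
        if (fputs("offline", f) == EOF) {
            fclose(f);
            return 1;
        }
        return fclose(f) ? 1 : 0;
    }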
The session closed with Manzanares suggesting more coordination between developers and vendors. Perhaps there needs to be some sort of regular group call where these issues are worked out. Chances are that something like that will be set up soon.
Tiering
The final CXL session on Tuesday was led by Jongmin Gim, who wanted to talk about tiering in particular. A lot of things are changing in the CXL 2.0 specification, he began, including the addition of a number of memory types. Tiering will allow the system to make the best use of those memory types, putting frequently used pages in fast memory while using slower memory to hold pages that are not needed as often.
Support for tiering is not currently upstream, but developers are working on it. There are various issues around promotion and demotion of pages between tiers to be worked out. The demotion side is easy, he said; if there is not enough fast memory available, kick out some pages. Promotion turns out to be harder, though. Current patches (described in this article) use the NUMA-balancing scan to try to determine which pages in slower memory are currently being used. When hot pages are found, they can be migrated to faster memory. A heuristic requiring two accesses before promoting a page helps to prevent rapid bouncing of pages between memory types.
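As a rough model of that heuristic (this is an illustration, not the kernel's code), the promotion decision can be sketched as follows:

    /* Illustrative model of two-access promotion: a page on a slow
       node is promoted only after scans have seen two accesses,
       which damps bouncing between tiers. */
    #include <stdio.h>

    struct page_info {
        int node;             /* node currently holding the page */
        int recent_accesses;  /* accesses seen by recent scans */
    };

    /* Called when a NUMA-balancing scan observes an access. */
    static void note_access(struct page_info *page, int fast_node)
    {
        if (page->node == fast_node)
            return;                     /* already in the fast tier */
        if (++page->recent_accesses >= 2) {
            page->node = fast_node;     /* promote; migration elided */
            page->recent_accesses = 0;
        }
    }

    int main(void)
    {
        struct page_info page = { .node = 1, .recent_accesses = 0 };
        note_access(&page, 0);          /* first access: no change */
        note_access(&page, 0);          /* second access: promoted */
        printf("page now on node %d\n", page.node);
        return 0;
    }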
One possible optimization might be to promote contiguous groups of pages together in a single operation. There was some discussion of implementing some sort of predictive algorithm to improve page promotion, but it was all at a fairly high level.
Manzanares said that the kernel's NUMA balancing was designed when all nodes in a system were more-or-less equal, and it is CPU-centric. He wondered whether the assumptions built into NUMA balancing are still valid in the CXL world. Gorman said that there is no assumption that nodes are the same size in the current code. Hansen said that NUMA balancing is used now for moving data to and from slower persistent-memory nodes, which are always mismatched in size, and it seems to be working now.
The discussion wandered around the details of NUMA balancing with no real conclusion. At the end of the session, though, there were two points of agreement: CXL devices are highly diverse, and tiering is the way to manage them.
Proactive reclaim for tiered memory and more
Memory reclaim in Linux is largely a reactive practice; the kernel tries to find memory it can repurpose in response to the amount of free memory falling too low. Developers have often wondered if a proactive reclaim mechanism might lead to better performance, for some workloads at least, and optimal use of tiered-memory systems will likely require more active reclamation of memory as well. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Davidlohr Bueso led a brief session on the topic.

Bueso started by suggesting the addition of a per-node knob that would enable proactive reclaim; an administrator would write a number indicating the amount of memory that should be reclaimed, and the kernel would attempt to make it happen. It might also be possible, he said, to extend the debugfs knob added by the multi-generational LRU patches rather than adding a new knob. Michal Hocko opposed that latter idea, though, saying that he did not want to make anything in debugfs into an API that would have to be maintained.
Instead, Hocko said, a knob of this sort should be put into sysfs. There are two ideas for how this control could work: there could be a single knob that would accept a mask indicating which nodes to target for reclaim, or there could be a per-node knob as described by Bueso. Hocko likes the per-node knob idea better, since it provides better control to the administrator. Johannes Weiner said that he has tried to add a similar sort of knob to the control-group memory controller; it would accept a count of pages to reclaim from a given group. That controller does round-robin reclaim across the processes contained within the group, which might be good enough, he said. He suggested testing this mechanism on tiered-memory systems to see how it works.
Bueso asked whether that sort of interface can be counted on to work in the future; not every system uses control groups in this way, and control at the global level might be handled differently. Weiner said that users want all of the features in both the global and control-group settings, so there should not be any divergence there.
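As a sketch of how such a control-group interface might be used (the file name "memory.reclaim" and the group path are assumptions; the page-count value follows the proposal described above, which is not a settled kernel API):

    /* Hypothetical: ask the kernel to reclaim 262144 pages (1GB with
       4KB pages) from one control group. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/sys/fs/cgroup/mygroup/memory.reclaim", "w");
        if (!f)
            return 1;
        fprintf(f, "262144\n");
        return fclose(f) ? 1 : 0;
    }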
Another attendee pointed out a couple of other use cases for proactive reclaim. Migration of virtual machines will go faster if there are fewer pages to copy, so administrators would like to be able to force a virtual machine to reclaim as much memory as possible before the migration begins. The virtual machine can report which pages have been freed to the hypervisor, and those pages can be left out of the copy to the new host. A similar use case is suspend-to-disk, which will happen more quickly if there are a lot of free pages that need not be written to persistent storage.
Bueso turned the topic to testing of proactive-reclaim mechanisms; there are a lot of ideas going around, he said, but not a lot of numbers showing how well they actually work. For example, he likes the hot-page selection algorithm that is part of the tiered-memory work, but there is only one benchmark result that gives any information on its performance. The minimal approach to benchmarking appears to be the standard for this kind of work, he said, and that worries him.
He continued with a request for an easier way to subject a patch set to a variety of workloads. He has been hacking on MMTests toward that end, trying to get an indication of just when a workload starts to push pages out of DRAM and into a slower memory tier. That helps to know whether the tiering algorithm is actually working, he said. But he would like to find a way to add tests that exercise the memory-management subsystem in ways beyond just consuming lots of RAM.
As the session wound down, he also said that he would like a way to export the kernel's view of the various memory tiers to user space. The consensus seemed to be that a sysfs file should be added for that purpose.
Sharing page tables with mshare()
The Linux kernel allows processes to share pages in memory, but the page tables used to control that sharing are not, themselves, shared; as a result, processes sharing memory maintain duplicate copies of the page-table data. Normally this duplication imposes little overhead, but there are situations where it can hurt. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Khaled Aziz (remotely) and Matthew Wilcox led a session to discuss a proposed mechanism to allow those page tables to be shared between cooperating processes.
Some mshare() background
There was not much discussion of the motivation for this work or the proposed API in this session, which was focused on implementation. That information can be found, though, in this patch set posted in April. Eight bytes of page-table entry per page is not much overhead — until you have thousands of processes sharing the page, at which point the space taken by page tables is more than the shared page itself. There are applications out there that run that many processes, so there is a desire to reduce the overhead of non-shared page tables.
The proposal is a pair of new system calls, the first of which is mshare():
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode);
A process wanting to share a range of memory (along with the page tables) will first create a region, probably with mmap(); this region must be aligned to a 512GB boundary, which is the amount of address space covered by a single top-level page-table entry on x86-64 systems with four-level page tables. The call to mshare() provides the address and size of this region, along with a name to identify it. This call, if successful, will create a file with the given name under /sys/fs/mshare that, when read, will provide the given addr and length values.
Any other process that wishes to share this region of memory will start by opening that file and reading the associated address and size; it can then call mshare() with that information to set up the mapping. The permissions on the file in /sys/fs/mshare control the access to this region. The mapping shares the memory, but also the page tables that control it. As a result, any changes to those page tables, with mmap() or mprotect() for example, will affect all processes that are sharing the region.
When a process is finished with the shared area, it can call mshare_unlink(), passing the given name; when all processes detach from the region, it will be destroyed.
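Putting the pieces together, the creating side of this dance might look something like the sketch below; the prototypes come from the proposal quoted above, the name and sizes are arbitrary, and, since these system calls are not upstream, there are naturally no C-library wrappers for them:

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    /* Hypothetical prototypes for the proposed system calls. */
    extern int mshare(char *name, void *addr, size_t length,
                      int oflags, mode_t mode);
    extern int mshare_unlink(char *name);

    #define REGION_ALIGN (512UL << 30)   /* 512GB, per the proposal */

    int main(void)
    {
        size_t len = 1UL << 30;          /* share 1GB, for example */

        /* Create a mapping at a 512GB-aligned address. */
        void *addr = mmap((void *)REGION_ALIGN, len,
                          PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED,
                          -1, 0);
        if (addr == MAP_FAILED)
            return 1;

        /* Publish the region as /sys/fs/mshare/demo; other processes
           read addr and len from that file, then call mshare() with
           the same values to attach to the shared page tables. */
        if (mshare("demo", addr, len, O_CREAT | O_RDWR, 0600))
            return 1;

        /* ... use the region, then drop the reference ... */
        return mshare_unlink("demo");
    }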
Wilcox began the session by noting that a process's address space is described by struct mm_struct, of which each process has one. When mshare() is used to create a shared area, a new mm_struct is created to describe that part of the address space. This structure has no tasks assigned to it, but it is pointed to from the virtual memory areas (VMAs) in each process that have the area mapped. Since one process's actions on the shared area affect all of them, this mechanism is suitable for cooperating processes that trust each other.
Scary
Aziz had a set of questions for the group. What, he asked, is the right granularity for page-table sharing? The current patch set shares page tables at the PMD level, but there might be value in sharing higher-level page directories. He asked whether the proposed API makes sense, and whether it should be possible for a process to map only a portion of the shared region (which is not supported now). Should mremap() be supported in a shared region? He also had questions about how userfaultfd() should interact with this feature.
Michal Hocko started by saying that this feature "sounds scary". He had a number of questions of his own. Who, in the end, is in charge of the shared mm_struct structure? How is memory accounting handled? What about mapping with the MAP_FIXED flag (used by a process that wants to tell the kernel where in its address space a mapping should be placed)? Wilcox answered that, for the most part, this mapping is handled in the same way as a mapping shared by threads within a single process. Aziz said that a worry of his own is that the shared area might be useful for processes trying to hide malware. Before getting into that sort of issue, though, he asked whether the mshare() concept seems useful in general.
Mike Rapoport asked why the SCM_RIGHTS mechanism, which allows passing file descriptors over a Unix-domain socket, wasn't used to control access to the shared region. Wilcox answered that the first design for this feature did exactly that, but users were requesting the ability to open a file to access the area instead. John Hubbard said that the API looked elegant to him, and requested that the developers stick with it.
Dan Williams asked how page pinning and accounting were being handled; Aziz replied that the work was mostly focused on the basic functionality so far. Making get_user_pages() and such work was on the list of things to do, though. David Hildenbrand echoed Hocko's sentiment that the feature seemed scary; he suggested making an allowlist describing the actions that were permitted on a shared area. System calls like mlock() would not be on that list, he suggested, until the implications were well understood. Page pinning, too, should not be there at the outset, he said.
Wilcox said that the users driving this work want to use it with DAX (direct access to files stored in persistent memory). These users can have over 10,000 processes sharing the area, which causes the page-table overhead to exceed the amount of memory being shared. In a sense, he said, mshare() can be seen as giving DAX the same functionality as hugetlbfs, but nobody likes hugetlbfs, so the desire is to make something that is not so awful. Hocko suggested that the new API is "a different awful".
Continuing, Wilcox said that, with mshare(), the kernel now has the concept of a standalone mm_struct with a file descriptor attached to it. What else, he asked, could be done with that functionality? Perhaps there would be value in a more general system call that would create an mm_struct and allow processes to attach things to it. That would be an interesting concept, he said, but Hildenbrand suggested it would be something more like Frankenstein's monster. Wilcox responded that Frankenstein would have loved this idea; he was "a misunderstood genius, just like us".
API alternatives
Hubbard suggested that perhaps a different model would make more sense; it could be called a "lightweight process" (or just a "Frankenstein"). These new processes would have a set of rules describing their behavior. But Hocko said that he couldn't understand the consequences of such a feature; they would be "beyond imagination", he said. He asked why processes can't just share page tables on a per-mapping basis, using a feature that looks like hugetlbfs but in a more shareable way. Wilcox answered that "the customer" wants the described semantics where, for example, mprotect() applies across all processes, just as if they were threads sharing that part of the address space. That raises an obvious question, he said: why not just use threads? The answer was that "mmap_lock sucks". It is also not possible to change the existing behavior of MAP_SHARED, since that would break programs, so there would need to be, at a minimum, a new mmap() flag if not a new system call. Aziz said that the separate system call makes the page-table sharing explicit rather than it just being a side effect. That makes the decision to opt into this behavior explicit as well.
Liam Howlett asked how many mshare() regions are supported in any given process; Wilcox answered that there is no particular limit. A process can create as many files as it wants, but he does not expect the API to be used that way. A more typical pattern would be for processes to share a single large chunk of memory, then perhaps map pieces of it. Howlett responded that, in that case, it might be better to only allow a single region per process. That might simplify the impact on other parts of the memory-management subsystem.
Jason Gunthorpe said that, rather than using a separate mm_struct, a process could (via some mechanism) just instantiate a VMA mapped at a high level in the page-table hierarchy. The associated memory would be owned by that VMA (or the inode of a file backing it), and the reference counting could be done there. Hocko noted that this is how hugetlbfs works now. Wilcox answered that an explicit opt-in from the processes involved is still needed, since developers need to understand the changed semantics of system calls like mprotect(). Gunthorpe suggested a new mmap() flag. Aziz said that an approach like this was possible, but that the use of a separate mm_struct has the advantage of simplifying the use of existing mechanisms for working with page tables.
Wilcox started to wind down the session by saying that, if the memory-management developers found this idea too scary, something else could be done. Aziz said that he was about to send the next version of the patch set (which hasn't happened as of this writing) and he would see what the feedback is at that point.
As things were coming to a close, Jan Kara jumped in to say that the mmap_lock for the shared region will have the same contention problems as it does now. Wilcox said that he knew somebody would bring that up; to an extent, that problem does exist. But mshare() allows processes to have more than one memory region and separate private memory from shared memory. The effect, he said, is like splitting mmap_lock in half. But even separating out 20% of the contention, he said, would be an improvement. Kara asked whether it might be better, instead, to give threads a way to separate their private address space. Wilcox said that he had thought the same way a year ago, but the result in the end is about the same. Kara said that the concept might be easier for developers to grasp.
At that point the session came to an end for real. The next step will be further discussion on the mailing list once the updated patch set comes out.
LWN is hiring
LWN does its best to provide comprehensive coverage of the free-software development community, but there is far more going on than our small staff can handle. When expressed that way, this problem suggests an obvious solution: make the staff bigger. Thus, LWN is looking to hire a writer/editor.

The job description is appended below, but LWN readers will already know what we are looking for: writers who can create our type of clearly written, technical coverage of what the community is up to. Our writers must understand how free software is made and distributed, and they must be prepared to write for an audience that knows more than they do. It is challenging, but also a lot of fun.
While we hope to find somebody who can cover a broad spectrum of free-software development, we also wish to find somebody who can complement and deepen our coverage in one or more of the following areas:
- Distribution development and project governance
- The development of the Rust language
- Language, toolchain, and low-level library development in general
- Linux kernel development
- Embedded systems and Android
- System-administration tools and containers
The above list is not exhaustive; we would certainly be interested in talking with authors whose passion takes them into a different area.
LWN will complete 25 years of publication next January. It has been a spectacular ride and we have no intention of stopping, but there will come a time when, if this show is to go on, a new generation will need to take over. We would like to get that generation in place and up to speed well before the situation becomes urgent. Today's new writers, we hope, will become tomorrow's senior editors.
If this appeals to you, please contact us at editorjob@lwn.net. If you know somebody else who might make a good candidate, please encourage them to talk to us. The free-software community shows no signs of slowing down anytime soon; with your help, LWN will be able to keep up with it for the next quarter-century — and beyond.
The job description
LWN.net is seeking a full-time technical journalist to provide high-quality coverage of the Linux and free-software communities for our readers. This is an opportunity to be a part of a community that has changed the world and is far from finished. LWN has been covering this community from within since 1998.
Responsibilities will include finding and researching topics, writing articles on a regular schedule, reviewing articles written by others, interacting with readers, and traveling to and reporting from community events. Additionally, we all take part in the tasks of running the business and making important decisions about where we are trying to go.
Requirements include:
- A university degree in technical writing, engineering, or the sciences — or equivalent experience.
- Top-level English writing and editing skills.
- The willingness and ability to work remotely full-time.
- An understanding of free software and the communities that create it.
- A willingness to take on a wide range of challenges in a small-company environment.
We would also like to see:
- A demonstrated history of writing for a highly technical audience.
- Software development experience, especially in the form of contribution to one or more free-software projects.
- Experience with web technologies and web-site design.
- Python development experience.
LWN is located in Colorado, but we are willing to consider applicants from anywhere in the US who can legally work here. The salary range for this position is from $70,000 to $140,000 per year, along with periodic bonuses and an annual profit-sharing distribution. Compensation also includes participation in our health and 401(k) retirement plans. Applicants from the rest of the world who can work as consultants can also be considered.
Page editor: Jonathan Corbet