A discussion on readahead
Readahead is an I/O optimization that causes the system to read more data than has been requested by an application—in the belief that the extra data will be requested soon thereafter. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), Matthew Wilcox led a session to discuss readahead, especially as it relates to network filesystems, with assistance from Steve French and David Howells. The latency of the underlying storage needs to factor into the calculation of how much data to read in advance, but it is not entirely clear how to do so.
Wilcox began by describing readahead a bit. If user space is reading a file one byte at a time, Linux does not actually read the data that way; instead, it issues reads for a bigger chunk, say 64KB, which gets stored in the page cache. There is a certain amount of latency between the time a page is requested from the storage and when it appears in the page cache; that latency varies greatly over the wide variety of storage types that Linux supports. For network storage, those types can range from locally stored data on multi-gigabit Ethernet to data stored halfway around the world over decidedly slower links. Similarly, for local storage it can range from a 5GB-per-second NVMe SSD to some "crappy USB key picked up from a vendor at a trade show". There is "a lot of stuff to contend with there".
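The same mechanism can be exercised explicitly from user space; Linux provides the readahead(2) system call for populating the page cache ahead of time. The sketch below is only meant to illustrate the concept; the implicit readahead Wilcox is describing happens automatically for ordinary sequential read() calls, with no application involvement.

```c
/* Illustrative only: explicitly prefetch the start of a file into the
 * page cache, then consume it in tiny reads that should hit the cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char byte;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Ask the kernel to pull the first 64KB into the page cache. */
	if (readahead(fd, 0, 64 * 1024) < 0)
		perror("readahead");

	/* Read the file one byte at a time; these reads now hit the cache. */
	while (read(fd, &byte, 1) == 1)
		;	/* process one byte at a time */

	close(fd);
	return 0;
}
```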
In his experience, block-layer developers tend to do all of their testing using direct I/O; they think that "the page cache sucks" so they avoid it in their testing. James Bottomley said they are often trying to exclude the effects of the page cache from their testing in order to eliminate variables outside of their control. Wilcox said that it was unfortunate, since the performance including the page cache is "what the users are actually seeing"; it would be nice to notice problems where either too much or too little readahead is affecting the performance—before users do.
He said that he has a KernelNewbies wiki page where he is collecting his thoughts about readahead and the page cache in general. The page cache "is awesome in some ways, in other ways it's terrible", but it would be good to fix the problems that it has. The Android developers encountered a problem with readahead, but "they worked around it in the worst way possible". They changed a setting for readahead, moving it from 256KB to several hundred megabytes, he said, in order to shave "some fraction of a second" from application startup time. That has other effects, of course; when they tried to upstream the patches, the memory-management developers said "no".
Howells suggested that the Android developers should be using fadvise() (or, he amended, madvise()) to indicate that the file should have more aggressive readahead. But Wilcox did not agree: "can we stop trying to pretend that user space knows what it is doing?" Android is a specialized environment, Bottomley said; it uses a log-structured filesystem, so the workaround may actually make sense. Wilcox expressed some skepticism on that score.
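For reference, Howells mentioned fadvise() and madvise(); on Linux the corresponding calls are posix_fadvise(2) and madvise(2). A minimal sketch of applying them to a file that will be read front to back follows; the hint_sequential() helper is hypothetical, and whether these hints would have been aggressive enough for the Android case is exactly what was in dispute.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: hint that fd will be read sequentially, both for
 * regular read() calls and for a private, read-only mapping of the file. */
static void hint_sequential(int fd)
{
	struct stat st;
	void *map;

	/* Descriptor-based hint; on Linux this roughly doubles the file's
	 * readahead window for subsequent read() calls. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

	if (fstat(fd, &st) != 0 || st.st_size <= 0)
		return;

	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED)
		return;

	/* Mapping-based hint: the equivalent advice for mmap()ed I/O. */
	madvise(map, st.st_size, MADV_SEQUENTIAL);

	/* ... read through the mapping ... */

	munmap(map, st.st_size);
}
```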
Overall, there are a bunch of readahead problems to solve. "I'm going to say the 'f' word; folios play into this a bit." The use of larger folios is driven by readahead, currently; the larger the readahead data gets, the larger the folios to hold it get. That is useful for testing. Filesystems that support larger folios, which is only XFS for now (though AFS and v9fs patches are queued), will allocate four-page (i.e. order-2) folios.
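As a rough illustration of how a filesystem opts in, it sets a flag on each inode's address space telling the page cache that multi-page folios are acceptable; the readahead code can then allocate order-2 (or larger) folios for that file's data. This sketch is loosely modeled on what XFS does at inode-setup time; the example_setup_inode() name is made up.

```c
/* Hedged sketch: a filesystem opting an inode's page cache into large
 * folios.  Once the flag is set, readahead is free to allocate folios
 * larger than a single page for this file. */
#include <linux/fs.h>
#include <linux/pagemap.h>

static void example_setup_inode(struct inode *inode)
{
	mapping_set_large_folios(inode->i_mapping);
}
```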
Wilcox said that French made him aware that Windows does huge reads for readahead on CIFS. French agreed, noting that, because of the expected network latency, Windows can read ahead by as much as 16MB, though that seems to degrade performance. He tested with a number of different sizes (256KB, 512KB, 1MB, 4MB) and found better performance with each of those, the most dramatic improvement coming with the move to 512KB. On the Azure cloud, the value was set to 1MB because there were performance decreases for some workloads at 4MB.
The Linux CIFS client defaults to 4MB, he said, based on the results of his testing. It is clear that anything less than 1MB performs worse unless there is a fast network in between. The problem he sees is how this value can be changed sanely, throttled or raised as appropriate. Sometimes the page cache knows more than the filesystem, sometimes the reverse is true, and the network layer needs to factor in as well. It is not clear to him how all of that can be resolved.
There is a mechanism to communicate this kind of information from the filesystem to the virtual filesystem (VFS) layer and page cache using the BDI (struct backing_dev_info), Wilcox said. That is where the VFS looks to find out the performance characteristics of the underlying storage. It may not currently have all of the right information, he said, but that's the place to put it.
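As a hedged sketch of what that looks like on the filesystem side: the per-superblock BDI has an ra_pages field, measured in PAGE_SIZE units, which is the maximum readahead the VFS will attempt for that device. The example_fill_super() function and the 4MB figure below are illustrative, not taken from any particular filesystem.

```c
#include <linux/backing-dev.h>
#include <linux/fs.h>

/* Hedged sketch of a fill_super() fragment for a hypothetical filesystem:
 * set up a per-superblock BDI and size its readahead window.  ra_pages is
 * the maximum readahead in PAGE_SIZE units; the readahead code consults it
 * when deciding how much data to read in advance. */
static int example_fill_super(struct super_block *sb)
{
	int ret;

	ret = super_setup_bdi(sb);
	if (ret)
		return ret;

	/* Allow up to 4MB of readahead for this (illustrative) filesystem. */
	sb->s_bdi->ra_pages = (4 * 1024 * 1024) / PAGE_SIZE;

	return 0;
}
```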
When user space is reading a file one byte at a time, a 64KB read is issued instead; a "was this readahead useful" marker is placed around 20KB into the buffer and, when it is reached, another 64KB read is issued. The intent is that the second read completes before user space consumes the remaining 44KB, but the filesystem has no idea what the latency of the read actually is. One could imagine measuring how long the read takes and comparing that with the rate at which user space consumes the data to better determine when to schedule the next read, he said, but that is not done.
That second 64KB read has its marker placed right at the beginning of the buffer; when that marker is reached, it decides to grow the readahead buffer and reads 128KB. It will increase once more (to 256KB), but not any further, though it should probably go up from there. The 256KB limit has been that way for 20 years and "I/O has hardly changed at all in that time", Wilcox said sarcastically; increasing that limit is many years overdue at this point. Josef Bacik read a chat comment from Jan Kara that said SUSE has had a limit of 512KB for years. Given that, Wilcox thought an immediate move to a 1MB maximum was in order.
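A rough model of that ramp-up is shown below; the thresholds are approximations of the kernel's window-growth logic (which has varied between versions), with the 256KB cap that is under discussion.

```c
/* Rough model of how the readahead window grows: it ramps up from the
 * initial size and then saturates at the per-device maximum (256KB by
 * default).  The thresholds are approximate. */
#include <stdio.h>

#define INITIAL_RA	(64 * 1024)
#define MAX_RA		(256 * 1024)

static unsigned long next_ra_size(unsigned long cur, unsigned long max)
{
	if (cur < max / 16)
		return cur * 4;	/* small window: grow aggressively */
	if (cur <= max / 2)
		return cur * 2;	/* medium window: double */
	return max;		/* otherwise clamp to the device maximum */
}

int main(void)
{
	unsigned long ra = INITIAL_RA;

	for (int i = 0; i < 5; i++) {
		printf("readahead window: %lu KB\n", ra / 1024);
		ra = next_ra_size(ra, MAX_RA);
	}
	return 0;
}
```

Running it prints 64KB, 128KB, 256KB, and then 256KB from there on, matching the progression described above.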
But, rather than increasing the size of the read directly, Wilcox would rather issue multiple 256KB reads back to back because that will help the page cache track if the data that is read ahead is actually being used. French said that would be preferable for CIFS, as multiple smaller reads are better performance-wise. Jeff Layton came in via Zoom to say that he thought a single, larger read would be better for NFS and was surprised to hear that smaller reads were better for CIFS.
Howells said that which of the options is better is totally filesystem-dependent. Ted Ts'o said that it will also depend on the underlying storage; increasing the readahead size on memory-constrained devices may also be problematic. There is not going to be a single "magic readahead formula" that can be used in all situations. He suggested making it possible to experiment with different algorithms using BPF; that way a grad student, for example, could easily try out different ideas to see which worked best. Wilcox said that he had wanted to argue with that idea but then decided he liked it because it would make it someone else's problem, which was met with much laughter.
Chuck Lever said that he thought the readahead situation was being greatly oversimplified in the discussion. In the original example of a program reading a byte at a time, there is no reason to increase beyond 64KB, since that will keep up with the program just fine. There are problems with requesting too much data and filling the page cache with pages that are not actually going to be read.
Another problem is that queueing up a 1MB read, for example, on a network filesystem will hold up other smaller requests, like metadata updates, while the read is ongoing. He agrees with the need for experimentation and thinks that should take precedence over any immediate increase in the kernel's readahead size. There is a need for testing with a lot of different workloads, filesystems, and so forth to determine what the overall systemic effects of that kind of change would be.
| Index entries for this article | |
|---|---|
| Kernel | Readahead |
| Conference | Storage, Filesystem, Memory-Management and BPF Summit/2022 |

Posted Jun 15, 2022 18:42 UTC (Wed) by zblaxell (subscriber, #26385):
To be fair, the state of the art in networking probably doesn't apply directly to a filesystem + block device stack, so someone will have to do some adaptation; however, it's not like a lot of people haven't been studying a very similar problem domain for a very long time.

Posted Jun 18, 2022 17:22 UTC (Sat) by willy (subscriber, #9762):
One purpose of this session was to gather feedback from filesystem people about their requirements... which turned out to be contradictory.
I think I have a pretty good handle on the performance characteristics of block storage (from floppies to CD-ROMs to laptop HDs to RAID arrays to USB sticks to SSDs), but I'm pretty clueless about the performance characteristics of network servers, so that was where I was looking to learn.

Posted Jun 18, 2022 17:07 UTC (Sat) by willy (subscriber, #9762):
It's a standard latency vs bandwidth tradeoff. If we were notified for each page, we'd spend much more CPU processing completions, but reads would complete faster. We've chosen a model that readahead should always be in advance of the application, so we don't need to spend the extra CPU to track at that granularity.

Posted Jun 23, 2022 17:13 UTC (Thu) by willy (subscriber, #9762):
But generally all of the range is brought in speculatively, so this probably won't help many workloads. If you try to implement this, I'd love to see the results.