Leading items
Welcome to the LWN.net Weekly Edition for October 28, 2021
This edition contains the following feature content:
- Lessons from the linux-distros mailing list: a vulnerability falls through the cracks, leading to a review of how linux-distros manages security problems.
- Android wallpaper fingerprints: yet another way to track users on mobile devices.
- Controlling the CPU scheduler with BPF: giving user space a chance to influence scheduling decisions.
- Synchronized GPU priority scheduling: a proposal to extend the priority-scheduling mechanism to graphics accelerators.
- Replacing congestion_wait(): replacing an interface that memory management depends on — and which has been broken for three years.
This week's edition also includes these inner pages:
- Brief items: Brief news items from throughout the community.
- Announcements: Newsletters, conferences, security updates, patches, and more.
Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.
Lessons from the linux-distros mailing list
The oss-security mailing list is specifically set up for reports and discussion of security flaws in open-source software after their embargo, if any, has expired. But the response to a recent report of the fix for a security flaw in the Linux kernel went in a different direction than usual. The report did not break the two-week embargo period; instead, it was "late", which highlighted some problems in the management of flaws of this nature.
The report from Lin Ma was for a use-after-free vulnerability in the near-field communication (NFC) protocol stack in the kernel. It had been found by fuzzing and was duly reported to the closed security@kernel.org and linux-distros mailing lists on September 1; it was assigned CVE-2021-3760 on the same day. Ma gave a detailed report of the problem to oss-security on October 26—nearly two months later. The flaw itself is difficult to trigger; it may require a compromised NFC device to send the malicious packet sequence.
Alexander Peslyak (or "Solar Designer"), who administers oss-security and linux-distros, replied, noting the large gap in time before the public disclosure, which Ma had apologized for in the report. Peslyak said that there were multiple problems in the handling of the report to linux-distros: "Let's use this opportunity to learn from the mishandling of this issue and avoid that for other issues."
To start with, Ma's original message to linux-distros asked for a 14-day embargo, which is reasonable, but no specific date was attached to that. Since the time period was "OK'ish", it was easy for those looking at the report to accept it, Peslyak said, but the guidelines for making reports do ask for a date. Using a date rather than a number of days has some advantages:
When it's a specific date/time, it's easier for everyone to notice it approaching - not only for people specifically tasked with that. That's just a psychological detail that I guess nevertheless statistically affects the outcomes.

So I think that the distros tasked with reviewing initial notifications should insist on the actual date/time being present in there, or add it on their own in an immediate follow-up.
The linux-distros list has representatives from multiple Linux distributions, as might be guessed; problems affecting more than just Linux should be reported to the separate distros list, which adds in representatives from FreeBSD, NetBSD, and Solaris. As described in the policies and instructions for members section of the linux-distros information page, individual distributions are listed as having a primary role in handling specific management tasks for bug reports; there is also a distribution listed as the backup for most tasks. This is part of the "contributing back" commitment that distributions make when they are added to the list. In this case, neither of the two distributions tasked with looking at initial reports assigned an actual end date for the embargo.
In addition, while the patch that Ma proposed did get into the mainline kernel, linux-distros was not apprised of the status of that effort along the way. Whatever work Ma did with upstream was not reported to the list, nor did the distribution representatives stay on top of its progress. Beyond that, the relevant disclosure channels (e.g. kernel mailing lists) were not monitored for mention of the bug, nor was Ma prompted to make the required oss-security post in a timely manner. Several different distributions dropped the ball on various parts of that. Peslyak described the activity as follows:
The only "contributing back" activity on this issue consisted of 3 postings to linux-distros: prompt CVE ID assignment by Red Hat, a reminder about 14 days having passed by SUSE on September 17 (that is, already 3 days past the embargo period end), and another reminder by (a different engineer from) SUSE on October 25 (this one worked).
Lastly, the oss-security report lacked a reference to the upstream commit that fixed the problem and a date for when it became public. He asked Ma to fill that information in; he also pointed out that this was a relatively low-impact bug but it could provide lessons for the future:
This issue itself is not that important, which is part of why it almost slipped through the cracks, but it's our reminder and opportunity to fix things before anything more important is mishandled.
Ma agreed with Peslyak's suggestions, but was unaware of the need to keep linux-distros in the loop on the progress of the fix. Ma pointed to the commit fixing the problem and thanked Peslyak for his help. As Peslyak said, though, the fix had been posted to the netdev mailing list on October 7, effectively breaking the embargo at that point; since the embargo was supposed to have ended far earlier, though, no harm was done. He also said that someone could have made the required post to oss-security in Ma's stead once the embargo period was over, noting that there were a number of points in time where that could have happened.
So far, at least, only one of the distributions assigned to the tasks that Peslyak highlighted as dropped has responded. Anthony Liguori at Amazon acknowledged that the company, as the backup, had missed staying on top of the progress of the bug and fix, but indicated that it wanted to keep working on the "contributing back" tasks going forward. If the linux-distros mailing list is going to be able to continue functioning, it clearly requires a cooperative effort among those who are participating. That seems to have broken down in this case, so Peslyak is trying to nip the problem in the bud.
Another part of the task is slipping through the cracks, Peslyak said in his first message: statistics gathering. No one is currently assigned to do that, and he would like to see a volunteer or two step up. The data he is looking for is shown on this page and spelled out in the guidelines:
Keep track of per-report and per-issue handling and disclosure timelines (at least times of notification of the private list and of actual public disclosure), at regular intervals produce and share statistics (most notably, the average embargo duration) as well as the raw data (except on issues that are still under embargo) by posting to oss-security.
There are good reasons to collect that kind of information in order to monitor the health of the linux-distros community and processes, but it also may have helped directly with this bug:
[...] an important desirable side-effect of keeping the statistics up-to-date is that this would catch issues that were not reported to oss-security in time or at all. For example, if someone were updating statistics for September on October 15 (by which point nothing from September is supposed to still be embargoed), they'd catch this issue 10 days earlier.
The mailing list can provide some benefits for its members: early disclosure of important bugs, coordinated releases so that some distributions do not get left behind, and so on. But that only works if the process functions well, which requires a level of commitment from each participating organization.
Linux-distros came about after its predecessor, the vendor-sec list, was compromised and disbanded in 2011. There were a number of problems with vendor-sec, not least the size of the closed list (80-100 people), so linux-distros set out to tighten things up and to codify what is expected of participants. Peslyak would clearly like to see things get back on track, so one hopes that the linux-distros community heeds his little wakeup call.
Android wallpaper fingerprints
Uniquely identifying users so that they can be tracked as they go about their business on the internet is, sadly, a major goal for advertisers and others today. Web browser cookies provide a fairly well-known avenue for tracking users as they traverse various web sites, but mobile apps are not browsers, so that mechanism is not available. As it turns out, though, there are ways to "fingerprint" Android devices—and likely those of other mobile platforms—so that the device owners can be tracked as they hop between their apps.
While cookies provide an easy mechanism to assign a unique ID to a particular browser instance, there are ways around being tracked that way. Since cookies are stored locally, they can be deleted or the browser can restrict how they can be used. Beyond that, users can instruct their browsers to reject cookies. Because of that, at least in part, browser fingerprinting came about.
Browser fingerprinting originally used JavaScript to query various characteristics of the browser environment (e.g. display size, plugins and fonts installed, localization settings) and combined that with information like the User-Agent string sent by the browser to derive an ID that was often unique to the user. As browser makers tried to reduce the amount and diversity of information revealed, tracking companies evolved newer techniques (e.g. canvas fingerprinting). The Panopticlick tool from the Electronic Frontier Foundation (EFF) helped demonstrate fingerprinting and the organization now has the Cover Your Tracks tool that shows how well the browser is protecting against fingerprinting.
In the mobile space, many of the same fingerprinting techniques work within the browsers, but these days users often use apps to access content, rather than a browser. Apps can simply directly send whatever information they deem necessary to do their job; they do not have to rely on users to store and preserve cookies. But Android apps do not have access to JavaScript and the browser environment directly, and the Android API is somewhat restrictive on what kinds of information about the environment apps can get. They also cannot directly share an ID with each other on the phone. So other techniques are needed.
In a recent blog post, Alexey Verkhovsky at FingerprintJS detailed one way to fingerprint devices using information extracted from wallpaper images on Android phones. Up until Android 8.1 (released in December 2017), apps could simply access the wallpaper images directly; in that release, Google restricted the getDrawable() call by requiring the READ_EXTERNAL_STORAGE permission. At the same time, though, a new getWallpaperColors() call was added to allow apps to get the three main colors used by the wallpaper images for the home or lock screens without requiring any special permissions. Android 12, released in October 2021, uses that information to theme the phone user interface.
The post looks at how those color values can be combined into a device fingerprint that will only change when the user, presumably infrequently, changes their wallpaper. There is a demonstration app on the Google Play store; a screen shot from running it on my phone is shown at right. It notes that my color combinations are unique in a small sample size, but my wallpaper also changes daily, so the tracking value of the ID generated would seem to be fairly low—and the same as others using the Android-provided "seascape" wallpaper.
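The fingerprint itself is conceptually simple: the dominant colors reported for each screen are combined and hashed into a single identifier. Here is a minimal sketch of the idea in C; the hash choice (FNV-1a) and the color values are illustrative assumptions, not FingerprintJS's actual scheme (its app is an Android app, not C code):

    #include <stdint.h>
    #include <stdio.h>

    /* Combine a set of dominant wallpaper colors into one identifier.
     * FNV-1a is used purely for illustration; the real app's hashing
     * scheme may differ. */
    static uint64_t wallpaper_fingerprint(const uint32_t *colors, int n)
    {
        uint64_t hash = 0xcbf29ce484222325ULL;   /* FNV-1a offset basis */

        for (int i = 0; i < n; i++) {
            for (int shift = 0; shift < 32; shift += 8) {
                hash ^= (colors[i] >> shift) & 0xff;
                hash *= 0x100000001b3ULL;        /* FNV-1a prime */
            }
        }
        return hash;
    }

    int main(void)
    {
        /* Hypothetical primary/secondary/tertiary colors (0xRRGGBB)
         * as returned by getWallpaperColors() for the home and lock
         * screens. */
        uint32_t colors[6] = {
            0x1a6fb0, 0x88c5e8, 0xf0f4f8,   /* home screen */
            0x123a5c, 0x5590c0, 0xdfe8f0,   /* lock screen */
        };

        printf("wallpaper ID: %016llx\n",
               (unsigned long long)wallpaper_fingerprint(colors, 6));
        return 0;
    }

Any app making the same unprivileged calls computes the same identifier; the value only changes when the wallpaper does, which is what gives it its tracking value.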
The post suggests using default wallpapers and not changing them as mitigations for the information leak. Custom wallpapers, or those made from personal photos, will make the phone more identifiable. Automatically changing wallpapers frequently would seem to help thwart the stability of the ID as well, though running through the same set of personal photos, for example, would add another level of identifiability if that were deemed important by an app author.
FingerprintJS is a company focused on device fingerprints for fraud prevention in banking, commerce, gaming, and so on. Much of its code is available on GitHub, including the source for the wallpaper ID app and a general library for Android fingerprinting. There are other mechanisms for device identification, as an earlier blog post covers, but some of those either have been removed or may disappear over time. In addition, those identifiers may not be stable or can be spoofed, which makes them less than ideal for fraud prevention. But, of course, IDs that can be used to detect unauthorized transactions can also be used for other things—user tracking, for example.
The library has a "playground" app that can be installed to further investigate the kinds of information that can be gleaned from a phone. The variety and amount of information available is truly eye-opening, including such things as installed apps and localization choices—all of which are available to an app without giving it any extra permissions.
While the instability of wallpaper fingerprints may make them unsuitable for most use cases, the ability for any app to gain access to the data shows something of an unintended consequence of providing information for theming. As the earlier blog post notes, other properties of the device can be combined to create IDs that are likely to be unique and are stable, possibly over the entire lifetime of a device. As Android ratchets down access to some of that kind of information, which seems inevitable, Google probably will not remove all of it, for reasons the wallpaper blog post makes clear:
Google has not restricted these for a number of years now, and it is unlikely that it ever will. At the end of the day, doing so would impact Android's efficacy as an advertising platform — and for the world's largest tech firm, it's a constant juggle between balancing these interests with protecting user privacy.
It is no surprise that unique IDs are desired for more than just the browser. Fraud prevention is certainly a laudable goal, for example. But being able to peer inside users' activities is rather less laudable, though it is even more desirable for entities ranging from advertisers to criminals to governments (and all of the shades of gray in between). It all adds up to more evidence, if any was truly needed, that our phones are privacy nightmares, which is something that we are probably never going to escape—at least in the standard mobile operating systems.
Controlling the CPU scheduler with BPF
While the BPF virtual machine has been supported by Linux for most of the kernel's existence, its role for much of that time was limited to, as its full name (Berkeley packet filter) would suggest, filtering packets. That began to change in 2012 with the introduction of seccomp() filtering, and the pace picked up in 2014 with the arrival of the extended BPF virtual machine. At this point, BPF hooks have found their way into many kernel subsystems. One area that has remained BPF-free, though, is the CPU scheduler; that could change if some version of this patch set from Roman Gushchin finds its way into the mainline.

There are several CPU schedulers in the kernel, each of which works cooperatively to handle specific types of workloads. In systems without realtime processes, though, almost all scheduling is done by the Completely Fair Scheduler (CFS), to the point that most people probably just think of it as "the scheduler". CFS is a complicated beast; it embodies a set of hard-learned heuristics that seek to maximize performance for a wide variety of workloads, and has a number of knobs to tweak for the cases where the heuristics need help. CPU scheduling is a complex task, though, and it is not surprising that the results from CFS are not always seen as being optimal by all users.
Gushchin started the cover letter for the patch set by observing that an extensive look at the effects of the various CFS tuning knobs revealed that most of them have little effect on the performance of the workload. In the end, it came down to a couple of relatively simple decisions:
In other words, some our workloads benefit by having long running tasks preempted by tasks handling short running requests, and some workloads that run only short term requests which benefit from never being preempted.
The best scheduling policy varies from one workload to the next, so there is value in being able to tweak the policy as needed. That said, Gushchin noted that most workloads are well served by CFS as it is now; it may not make much sense to add more tweaks for the relatively small set of workloads that can benefit from them.
This is just the sort of situation where BPF has made inroads into other parts of the kernel. It gives users the flexibility to change policies to meet their needs while being fast enough that it can sensibly be used in performance-critical subsystems like the CPU scheduler while not increasing overhead for systems where it is not in use. It is somewhat surprising that there have been no serious attempts to integrate BPF into the scheduler until now.
Gushchin's patch set creates a new BPF program type (BPF_PROG_TYPE_SCHED) for programs that influence CPU-scheduler decisions. There are three attachment points for these programs:
- cfs_check_preempt_tick is called during the handling of the scheduler's periodic timer tick; a BPF program attached here can then look at which process is running. If that process should be allowed to continue to run, the hook can return a negative number to prevent preemption. A positive return value, instead, informs the scheduler that it should switch to a different process, thus forcing preemption to happen. Returning zero leaves the decision up to the scheduler as if the hook hadn't been run.
- cfs_check_preempt_wakeup is called when a process is woken by the kernel; a negative return value will prevent this process from preempting the currently running process, a positive value will force preemption, and zero leaves it up to the scheduler.
- cfs_wakeup_preempt_entity is similar to cfs_check_preempt_wakeup, but it is called whenever a new process is being selected for execution and can influence the decision. A negative return indicates no preemption, positive forces it, and zero leaves the decision to other parts of the scheduler. A sketch of a program using the first of these hooks appears below.
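As an illustration of how such a program might look, here is a minimal sketch. The section name, hook signature, and direct access to task_struct fields are assumptions based on the descriptions above; they have not been checked against the patch set itself:

    /* Sketch of a scheduler-hook BPF program; names and signatures
     * here are assumptions, not taken from the patch set. */
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    #define BATCH_NICE 10   /* hypothetical: niceness marking batch tasks */

    SEC("sched/cfs_check_preempt_tick")
    int BPF_PROG(check_preempt_tick, struct task_struct *curr,
                 unsigned long delta_exec)
    {
        /* Let a task running at our designated "batch" niceness keep
         * the CPU; in the kernel, static_prio is the nice value
         * plus 120. */
        if (curr->static_prio == BATCH_NICE + 120)
            return -1;   /* negative: prevent preemption */

        return 0;        /* zero: leave the decision to CFS */
    }

    char LICENSE[] SEC("license") = "GPL";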
Gushchin noted that, at Facebook, the first experiments using these hooks "look very promising". By posting the patch set, he hoped to start a conversation on how BPF could be used within the scheduler.
For the most part, it seems that this goal has not been attained; the conversation around these patches has been relatively muted. The most significant comments have come from Qais Yousef who, since he comes from the mobile world, has a different perspective on scheduler issues. He noted that, in that realm, vendors tend to heavily modify the CPU scheduler (see this article for a look at one vendor's scheduler changes). Yousef would like to see the scheduler improved to the point that these vendor changes are no longer necessary; he worried that the addition of BPF hooks could thwart that effort:
So my worry is that this will open the gate for these hooks to get more than just micro-optimization done in a platform specific way. And that it will discourage having the right discussion to fix real problems in the scheduler because the easy path is to do whatever you want in userspace. I am not sure we can control how these hooks are used.
Yousef later recognized that there could be value in this feature, but suggested it should be tightly controlled. Among other things, he said, BPF programs used as scheduler hooks should be distributed within the kernel tree itself, with any out-of-tree hooks causing the kernel to become tainted, much like how loadable modules work.
Gushchin's position was that, by making it easy to try out scheduler changes, the new BPF hooks could accelerate scheduler development rather than slowing it down. Meanwhile, he suggested, having vendors apply their scheduler changes as BPF programs might be better than the sorts of patches they create now.
Beyond this exchange, the patch set has not yet received any significant feedback from either the core scheduler developers or the BPF community. That will clearly need to change if this work is to ever be considered for merging into the mainline kernel. Allowing user space to hook into the scheduler is likely to be a hard sell at best, but it's an idea that seems unlikely to go away anytime soon. For better or for worse, the Linux kernel serves a wide variety of users; providing the best solution for every one of them out of the box is always going to be a challenge.
Synchronized GPU priority scheduling
Since the early days, Unix-like systems have implemented the concept of process priorities, where higher-priority processes are given more CPU time to get their work done. Implementations have changed, and alternatives (such as deadline scheduling) are available for specialized situations, but the core priority (or, in an inverted sense, "niceness") concept remains essentially the same. What should happen, though, in a world where an increasing amount of computing work is done outside of the CPU? Tvrtko Ursulin has put together a patch set showing how the nice mechanism can be extended to GPUs as well.
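On the CPU side, that niceness is managed with the long-standing setpriority() interface; the question the patch set addresses is what should happen to the work such a process has offloaded to other hardware. As a refresher, adjusting a process's own niceness looks like this:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Lower this process's priority; niceness ranges from -20
         * (highest priority) to 19 (lowest). */
        if (setpriority(PRIO_PROCESS, 0, 10) != 0)
            perror("setpriority");

        /* Note: getpriority() can legitimately return -1, so a robust
         * caller would clear and check errno; omitted for brevity. */
        printf("niceness is now %d\n", getpriority(PRIO_PROCESS, 0));
        return 0;
    }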
As Ursulin describes the situation, the "current processing landscape seems to be more and more composed of pipelines where computations are done on multiple hardware devices". The kernel directly controls the availability of CPU time for the work that is actually done on the CPU. But, increasingly, computing work is offloaded to GPUs, AI accelerators, or cryptocurrency-mining peripherals. Those processors, while capable, can also be overloaded by the demands placed on them. If they run their workloads in a way that disagrees with the kernel's idea of process priorities, the end result may not be what the user would like to see.
As an example, Ursulin pointed out that the Chrome browser will lower the priority of tabs that are not currently in the foreground. If one of those background tabs is doing a lot of rendering in the GPU, though, it may slow down the foreground tab even though the background work is supposed to be running at low priority. It turns out that at least some of these GPUs, including some Intel i915 versions, can perform priority-based scheduling internally. But that requires informing the GPU of the relevant priorities, and there is currently no way to communicate those decisions, which are made in user space, to the GPU.
Ursulin's approach is to add the concept of "context nice" to the i915 driver. This value, which is tied to the priority of the process submitting work, is used with suitably capable GPUs to influence the scheduling of that work. This approach works, but only until the priority of the process on the CPU is changed; if the browser switches to a new tab and wants to increase its priority, continuing to run the associated work on the GPU side at a lower priority would not lead to greater user satisfaction. To avoid that problem, Ursulin's patch series adds a new notifier to the scheduler so that interested kernel subsystems can be informed whenever a process's priority is changed. The i915 driver then hooks into that notifier so that it can update its priority information to keep up with the CPU priority of any process that is running work on the GPU.
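The kernel already provides generic notifier-chain machinery for this kind of subscription; a driver-side hookup of the sort described might look like the following sketch. The chain name, the convention of passing the new nice value as the notifier "action", and i915_update_context_nice() are all hypothetical, not taken from the patch set:

    #include <linux/notifier.h>
    #include <linux/sched.h>

    /* Hypothetical: an atomic notifier chain called by the scheduler
     * whenever a task's priority changes. */
    extern struct atomic_notifier_head task_prio_notifier_list;

    static int i915_prio_changed(struct notifier_block *nb,
                                 unsigned long new_nice, void *data)
    {
        struct task_struct *task = data;

        /* Propagate the new niceness to any GPU contexts owned by
         * this task; i915_update_context_nice() is illustrative
         * only. */
        i915_update_context_nice(task, (int)new_nice);
        return NOTIFY_OK;
    }

    static struct notifier_block i915_prio_nb = {
        .notifier_call = i915_prio_changed,
    };

    /* At driver initialization:
     *     atomic_notifier_chain_register(&task_prio_notifier_list,
     *                                    &i915_prio_nb);
     */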
The notifier has turned out to be the most controversial part of this patch set. Ursulin noted that there could be security concerns with calling into a device driver from deep within the scheduler whenever a process's priority has changed. John Wanghui suggested that a separate "I/O nice" value could be added to control priorities on the GPU; this would be different from the "ionice" that already exists for block I/O but would function in a similar way. Barry Song, instead, complained that the use of simple nice values is insufficient; it does not take into account the effect of control groups or accumulated run time on actual access to the CPU. That could lead to scheduling results on the GPU that would be inconsistent with what happens on the CPU.
Ursulin mostly agreed with Song's criticisms, but also made the claim that even just using the process nice value is better than no control over execution priority on the GPU at all. This initial implementation could be extended later to include support for control groups and such if that seemed warranted. Meanwhile, though, he has concluded that perhaps the scheduler notifier is not necessary after all. By using the current process priority whenever work is submitted to the GPU, similar results would be obtained; the main difference is that a priority change would not apply to work that had already been passed to the GPU. The next version of this patch set, it appears, will drop the notifier.
Ursulin has done some simple benchmark tests where a graphical application is running alongside a "GPU hog" process. If the GPU hog is given a low priority, the graphical application is able to produce significantly higher frame rates than it can in the absence of priority control. He concluded: "So it appears the feature can indeed improve the user experience". It thus seems likely that some version of this work will eventually find its way into the mainline; what remains to be seen is how much it will have to change before it gets there.
Replacing congestion_wait()
Memory management is a balancing act in a number of ways. The kernel must balance the needs of current users of memory with anticipated future needs, for example. The kernel must also balance the act of reclaiming memory for other uses, which can involve writing data to permanent storage, with the rate at which the underlying storage devices can accept that data. For years, the memory-management subsystem has used storage-device congestion as a signal that it should slow down reclaim. Unfortunately, that mechanism, which was a bit questionable from the beginning, has not worked in a long time. Mel Gorman is now trying to fix this problem with a patch set that moves the kernel away from the idea of waiting on congestion.
The congestion_wait() surprise
When memory gets tight, the memory-management subsystem must reclaim in-use pages for other uses; that, in turn, requires writing out the contents of any pages that have been modified. If the block device to which the pages are to be written is overwhelmed with traffic, though, there is little value to making the pile of I/O requests even deeper. Back in the dark and distant pre-Git past (2002), Andrew Morton proposed the addition of a congestion-tracking mechanism for block devices; if a given device was congested, the memory-management subsystem would hold off on creating new I/O requests (and throttle — slow down — processes needing more memory) until the congestion eased. This mechanism found its way into the 2.5.39 development kernel release in September 2002.
Over the years since then, the congestion-wait mechanism has moved around and evolved in various ways. The upcoming 5.15 kernel will still include a function called congestion_wait() that blocks the current task until either the congested device becomes uncongested (as signaled by a call to clear_bdi_congested()) or a timeout expires. Or, at least, that is the intent.
As it happens, the main caller of clear_bdi_congested() was a function called blk_clear_congested(), and that function was removed for the 5.0 kernel release in 2018. With the exception of a few filesystems (Ceph, FUSE, and NFS), nothing has been calling clear_bdi_congested() since then, meaning that calls to congestion_wait() almost always just sit until the timeout expires, which was not the intent.
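The call sites followed a simple pattern; with clear_bdi_congested() effectively gone, the wakeup never came and the call degenerated into a fixed sleep:

    /* The historical pattern in the reclaim paths (schematic).
     * BLK_RW_ASYNC selects the asynchronous (writeback) congestion
     * state; HZ/10 is a 100ms timeout. */
    if (too_many_pages_under_writeback)     /* illustrative condition */
            congestion_wait(BLK_RW_ASYNC, HZ/10);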
It took another year (until September 2019) for the memory-management developers to figure this out, at which point block subsystem maintainer Jens Axboe let it be known that:
Congestion isn't there anymore. It was always broken as a concept imho, since it was inherently racy. We used the old batching mechanism in the legacy stack to signal it, and it only worked for some devices.
The race-prone nature of the congestion infrastructure was actually noted by Morton in his original proposal; a task could check a device and see that it is not congested, but the situation could change before that task gets around to queuing its I/O requests. Congestion tracking also gets harder to do accurately as the length of the command queues supported by storage devices increases. So the block developers decided to get rid of the concept in 2018. Unfortunately, nobody there told the memory-management developers, a fact that led to a grumpy comment from Michal Hocko when the situation came to light.
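The race is the classic check-then-act problem; in schematic form (the helpers here, other than congestion_wait() itself, are placeholders for the historical interfaces):

    /* Nothing prevents the congestion state from changing between the
     * check and the submission, so the answer may already be stale. */
    if (!device_is_congested(bdi))                  /* check */
            queue_io_request(bdi, request);         /* act */
    else
            congestion_wait(BLK_RW_ASYNC, HZ/10);   /* back off, retry */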
This is an unfortunate case of one hand not knowing what the other is doing; it has resulted in reduced memory-management performance for years. But kernel developers tend not to sit around and recriminate over such things; instead they started thinking about how to solve this problem. They must have thought fairly hard, since that process took another two years before patches started coming to light.
Moving beyond congestion
Gorman's patch set starts by noting that "even if congestion throttling worked, it was never a great idea". There are a number of things that can slow down the reclaim process. One of those — too many pages under writeback overwhelming the underlying device — might be addressed by a (properly working) congestion-wait mechanism, but other problems would not be. So the patch set takes out all of the congestion_wait() calls and replaces them with a different set of heuristics (a schematic of the replacement interface appears after the list):
- There are places in the memory-management subsystem where reclaim will be throttled. For example, if the kswapd thread finds pages currently being written back that have been marked for immediate reclaim, it indicates that those pages have cycled all the way through the least-recently-used (LRU) lists before they can be written to the backing store. When that happens, tasks performing reclaim will be throttled for some time. Rather than waiting for the non-existent "congestion is gone" signal, though, reclaim will stall until enough pages on the current NUMA node have been written to indicate that progress is being made.
Note that some threads — kernel threads and I/O workers in particular — will not be throttled in this case; their work may be needed to clear the backlog.
- Many memory-management operations, such as compaction and page migration, require "isolating" the pages to be operated on. Isolation, in this case, refers to removing the page from any LRU lists. The reclaim process, too, involves isolating pages before they can be written. If many tasks end up performing direct reclaim, they can isolate a lot of pages that may take some time to fully reclaim; if the kernel is isolating pages more quickly than they can be reclaimed, the effort is, in the end, wasted.
The kernel already throttles reclaim if the number of isolated pages becomes too large, but that throttling waits (or tries to wait) on congestion. Gorman noted: "This makes no sense, excessive parallelisation has nothing to do with writeback or congestion". The new code instead contains a wait queue for tasks that have been throttled while performing reclaim as the result of too many isolated pages; they will be awakened when the number of isolated pages drops or a timeout occurs.
- Sometimes, a thread performing reclaim may find that it is making little progress; it scans a lot of pages, but succeeds in reclaiming few of them. This can happen as the result of too many references to the pages it is working on or various other factors. With this patch set, threads that are making insufficient progress in reclaim will be throttled until some progress is made somewhere in the system. Specifically, the kernel will wait until running reclaim threads are successful with at least 12% of the pages they scan before waking threads that were not making progress. This should reduce the amount of time wasted in unproductive reclaim efforts.
- Writeback efforts will also be throttled if an attempt to write out dirty pages fails due to a lack of memory. The throttling, in this case, lasts until a number of pages have been successfully written back (or a timeout occurs, as usual).
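In the patch set, these cases funnel into a common helper that sleeps on a per-node wait queue keyed by the reason for throttling. In schematic form (the reason names appear in the series, but the mechanics shown here are simplified and the signatures may differ between its revisions):

    /* Schematic of the series's common throttling helper. */
    enum vmscan_throttle_state {
            VMSCAN_THROTTLE_WRITEBACK,    /* too many pages under writeback */
            VMSCAN_THROTTLE_ISOLATED,     /* too many isolated pages */
            VMSCAN_THROTTLE_NOPROGRESS,   /* scanning without reclaiming */
    };

    /* Callers that previously did congestion_wait(BLK_RW_ASYNC, HZ/10)
     * instead sleep on a per-node wait queue until a wakeup signals
     * progress (pages written back, isolated pages freed) or the
     * timeout expires:
     *
     *      reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
     */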
Most of the timeout durations are set to one-tenth of a second. The wait for the number of isolated pages to drop, though, is one-fiftieth of a second, on the reasoning that this situation should change quickly. The patch setting these timeouts notes that they are "pulled out of thin air", but they are something to start with until somebody finds a way to come up with better values. As a first step in that direction, the no-progress timeout was later changed to a half-second after benchmark results showed that it was expiring too quickly.
The patch set is accompanied by an extensive set of benchmark results; as part of the testing, Gorman added a new "stutterp" test designed to exhibit reclaim problems. The results vary quite a bit but are generally positive; one test shows an 89% reduction in system CPU time, for example. Gorman concluded:
Bottom line, the throttling appears to work and the wakeup events may limit worst case stalls. There might be some grounds for adjusting timeouts but it's likely futile as the worst-case scenarios depend on the workload, memory size and the speed of the storage. A better approach to improve the series further would be to prioritise tasks based on their rate of allocation with the caveat that it may be very expensive to track.
These patches have been through five revisions to date with various changes happening along the way. It is hard to imagine a scenario where this work does not eventually get merged into the mainline; the current code is demonstrably broken, after all. But this kind of core memory-management change is hard to merge; the variety of workloads is so great that there is certainly something out there that will regress when heuristics are changed in this way. So, while something like this seems likely to be accepted, one never knows how many timeouts will expire before that happens.
Page editor: Jonathan Corbet