Shrinking shrinker locking overhead
Kernel subsystems that maintain caches should register a shrinker that can be called when the kernel needs to free memory for other uses. A shrinker is described by struct shrinker; among other things, it contains a pair of callbacks that the kernel can use to query how many cached objects could be freed, and to ask that they actually be freed. Shrinkers can be asked to focus on a specific NUMA node or memory control group, but not all shrinkers implement that functionality. Since shrinkers are called from the reclaim path when memory is tight, they should be quick and refrain from allocating memory themselves.
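As a rough illustration of that callback pair, here is a minimal userspace model. It is not the kernel's struct shrinker: the toy_ names are invented for this sketch, and the signatures are simplified (the kernel's count_objects() and scan_objects() callbacks also take a struct shrink_control describing the request).

```c
/* Hypothetical userspace model; NOT the kernel's struct shrinker. */
struct toy_cache {
    unsigned long nr_cached;    /* objects currently held in the cache */
};

struct toy_shrinker {
    /* Report how many objects could be freed, without freeing any. */
    unsigned long (*count_objects)(struct toy_shrinker *s);
    /* Free up to nr_to_scan objects; return how many were freed. */
    unsigned long (*scan_objects)(struct toy_shrinker *s,
                                  unsigned long nr_to_scan);
    void *private_data;         /* the cache this shrinker manages */
};

unsigned long toy_count(struct toy_shrinker *s)
{
    struct toy_cache *c = s->private_data;

    return c->nr_cached;
}

unsigned long toy_scan(struct toy_shrinker *s, unsigned long nr_to_scan)
{
    struct toy_cache *c = s->private_data;
    unsigned long freed = nr_to_scan < c->nr_cached ? nr_to_scan
                                                    : c->nr_cached;

    c->nr_cached -= freed;
    return freed;
}

/* One simplified reclaim pass: query first, then ask for half (rounded up). */
unsigned long toy_reclaim(struct toy_shrinker *s)
{
    unsigned long freeable = s->count_objects(s);

    if (freeable == 0)
        return 0;
    return s->scan_objects(s, (freeable + 1) / 2);
}
```

The split between the two callbacks lets the caller decide how much pressure to apply before any object is actually freed; the "half" policy in toy_reclaim() is arbitrary and only there to make the sketch concrete.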
Shrinkers can be registered and deleted as the system runs, creating a concurrency problem: a shrinker should not be deleted while it is running, and the list of shrinkers must be changed carefully given that other CPUs may be traversing it at the same time. In current kernels, the shrinker list is protected by a reader/writer semaphore (rwsem); traversing the list to run shrinkers requires read access, while changing the list requires exclusive write access. This was meant to be a fast solution; frequent traversals of the list (reads) can run concurrently, while changes to the list that would require write access are relatively rare.
This rwsem, it turns out, can be a performance bottleneck on busy systems. It is a global lock, so frequent acquisitions and releases can create a lot of cache-line bouncing, slowing the system even if the lock itself is not contended. Things can get worse if a shrinker runs (or is blocked) for a long time. If a writer comes along, it will request a write lock, which will have to wait until all existing read locks are dropped; meanwhile, the write-lock request blocks any additional read locks from being granted. In this situation, a long-running shrinker can clog up the works for some time.
Performance problems of this type come up often in the kernel, and the path to their solution is reasonably well-worn at this point; it almost inevitably involves using read-copy-update (RCU) to defer changes to existing structures until all users are gone.
In this case, the patch series starts by changing the shrinker registration interface so that all shrinkers are allocated dynamically — even those that are present from boot and cannot be removed. This change allows all shrinkers to be treated uniformly, getting rid of special cases, and sets the stage for changing how shrinker registration is handled. As seen in this patch, a new shrinker instance is created with shrinker_alloc(), made active with shrinker_register(), and released with shrinker_free().
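The allocate, register, free life cycle can be modeled in userspace along the following lines. This is only a sketch: the demo_ functions are hypothetical stand-ins, not the kernel's shrinker_alloc(), shrinker_register(), and shrinker_free(), and the list handling here has none of the locking or RCU protection discussed in this article.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a dynamically allocated shrinker. */
struct demo_shrinker {
    const char *name;
    bool registered;            /* true once visible on the list */
    struct demo_shrinker *next;
};

/* Head of the (unprotected, single-threaded) list of active shrinkers. */
struct demo_shrinker *demo_list;

/* Step 1: allocate. The shrinker is not yet visible to reclaim. */
struct demo_shrinker *demo_shrinker_alloc(const char *name)
{
    struct demo_shrinker *s = calloc(1, sizeof(*s));

    if (s)
        s->name = name;
    return s;
}

/* Step 2: publish the fully initialized shrinker. */
void demo_shrinker_register(struct demo_shrinker *s)
{
    s->next = demo_list;
    demo_list = s;
    s->registered = true;
}

/* Step 3: unlink (if it was ever registered) and release the memory. */
void demo_shrinker_free(struct demo_shrinker *s)
{
    if (!s)
        return;
    if (s->registered) {
        struct demo_shrinker **pp = &demo_list;

        while (*pp && *pp != s)
            pp = &(*pp)->next;
        if (*pp)
            *pp = s->next;
    }
    free(s);
}

/* Helper for the sketch: how many shrinkers are currently active. */
size_t demo_shrinker_count(void)
{
    size_t n = 0;

    for (struct demo_shrinker *s = demo_list; s; s = s->next)
        n++;
    return n;
}
```

Separating allocation from registration gives the caller a window in which to fill in callbacks and private data before the shrinker can be invoked; in this model, the same free function also cleans up a shrinker that was allocated but never registered.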
There are a couple of implications here. One, as noted in the cover letter, is that this change will break all out-of-tree modules that implement shrinkers; they will have to be converted to the new API or they will fail to load. This is a deliberate change to ensure that, in kernels implementing the new mechanism, no old-style shrinkers are in use. A quieter change is that, while the existing register_shrinker() interface is exported to all modules, the new functions are exported as GPL-only. As a result, proprietary kernel modules that implement shrinkers will not be fixable at all.
The bulk of this 45-part patch series is focused on converting all in-kernel shrinkers to the new API, after which the old one is deleted. The real purpose of the patch set is only achieved in patch 42, where the lockless algorithm is introduced. The shrinker structure gains three new fields: a reference count, a completion to be used for removals, and an rcu_head structure.
When a shrinker is registered, its reference count is set to one, and it is added (in an RCU-safe manner) to the shrinker list; it is then available to be called when the memory-management subsystem needs to find some memory. The traversals of the shrinker list are performed with the RCU lock held, meaning that the entries in the list will not disappear at an inconvenient time. To invoke a shrinker, the kernel will first attempt to increment its reference count; that attempt will only succeed if the count is already greater than zero. The RCU lock will then be dropped, and the shrinker invoked. Once its work is done, the RCU lock will be reacquired, and the reference count decremented. Since the reclaim code held a reference, the shrinker will not have disappeared while the lock was dropped.
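The "increment only if already greater than zero" step is the heart of this scheme. A minimal userspace sketch using C11 atomics might look like the following; the ref_ names are invented for illustration, the kernel uses its own reference-count primitives rather than this code, and the RCU lock operations appear only as comments.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model: only the reference count is represented. */
struct ref_shrinker {
    atomic_int refcount;    /* 1 while registered, +1 per active user */
};

/* Take a reference only if the count is still greater than zero. */
bool ref_shrinker_try_get(struct ref_shrinker *s)
{
    int old = atomic_load(&s->refcount);

    while (old > 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
            return true;
    }
    return false;           /* count is zero: shrinker is going away */
}

/* Drop a reference; returns true if this was the last one. */
bool ref_shrinker_put(struct ref_shrinker *s)
{
    return atomic_fetch_sub(&s->refcount, 1) == 1;
}

/* Counter used by the sample callback below. */
int ref_runs;

void ref_count_run(void)
{
    ref_runs++;
}

/* The invocation sequence described above; returns true if 'run' ran. */
bool ref_invoke(struct ref_shrinker *s, void (*run)(void))
{
    /* (the list traversal holds the RCU lock at this point) */
    if (!ref_shrinker_try_get(s))
        return false;       /* skip a shrinker that is being removed */
    /* (RCU lock dropped here; the reference keeps the shrinker alive) */
    run();
    /* (RCU lock reacquired before the reference is dropped) */
    ref_shrinker_put(s);
    return true;
}
```

A plain unconditional increment would not work here: it could revive a shrinker whose count had already reached zero and whose removal was in progress.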
When the time comes to remove a shrinker, shrinker_free() will drop the reference acquired at registration time, then use the completion to wait until all other references (if any) are also dropped. At this point, the fact that the reference count is zero means that the shrinker will not acquire any more users, since an attempt to increment the reference count only succeeds if that count is greater than zero. But there may still be threads traversing the shrinker list and seeing this shrinker's entry there, so its removal has to be handled with care. That, of course, is what RCU is for; the entry is taken off the list, but then handed to RCU until a grace period passes, after which it is known that the shrinker structure can be safely freed.
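The ordering of the removal steps can be sketched with a single-threaded userspace model. Everything here is an assumption made for illustration: the dying_ names are invented, a plain flag stands in for the completion, a function that returns false stands in for sleeping, and the RCU grace period is reduced to a comment.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a shrinker being removed. */
struct dying_shrinker {
    atomic_int refcount;    /* registration reference plus active users */
    bool done;              /* stands in for the completion */
    bool on_list;           /* still linked into the shrinker list */
};

/* Same rule as during invocation: succeed only if the count is > 0. */
bool dying_try_get(struct dying_shrinker *s)
{
    int old = atomic_load(&s->refcount);

    while (old > 0) {
        if (atomic_compare_exchange_weak(&s->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Whoever drops the last reference signals the "completion". */
void dying_put(struct dying_shrinker *s)
{
    if (atomic_fetch_sub(&s->refcount, 1) == 1)
        s->done = true;
}

/* Removal, part 1: give up the reference taken at registration. */
void dying_free_begin(struct dying_shrinker *s)
{
    dying_put(s);
}

/*
 * Removal, part 2: returns false while users remain (real code would
 * sleep on the completion instead of returning).
 */
bool dying_free_finish(struct dying_shrinker *s)
{
    if (!s->done)
        return false;
    s->on_list = false;     /* unlink; list walkers may still see it */
    /* real code would free the structure only after an RCU grace period */
    return true;
}
```

In this model, a reclaim pass that took its reference before removal began delays only the final unlink; once the count has reached zero, no new reference can be taken.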
With these changes made, the shrinker rwsem is no longer used during the invocation of shrinkers; it is only taken for write access when changes are being made to the shrinkers themselves. The final patch in the series turns the rwsem into a lower-overhead mutex, and the work is done.
This series is in its sixth revision, and the stream of comments appears to be slowing down. Benchmark results show no regressions from this change, unlike previous attempts to address the locking bottleneck that created problems elsewhere. Unless new problems turn up somewhere (always a possibility with this kind of low-level code), it looks like lockless shrinking may be reaching a point where it is ready for wider testing in linux-next.
Index entries for this article | |
---|---|
Kernel | Memory management/Shrinkers |
Kernel | Releases/6.7 |
Posted Sep 15, 2023 18:29 UTC (Fri)
by calumapplepie (guest, #143655)
In fact, I think exporting something similar to the shrinker API to userspace would make sense; either through the kernel, or via a daemon that looks at memory PSI info. MADV_FREE is useful, but if you want to add a more complicated cache, it just isn't good enough.
Posted Sep 15, 2023 18:44 UTC (Fri)
by Sesse (subscriber, #53779)
Posted Sep 15, 2023 20:29 UTC (Fri)
by ringerc (subscriber, #3071)
Especially those awful vendor endpoint security systems. They're uniformly terrible quality and they seem to be deliberately obfuscated. It can be hard to even be sure they're present initially, let alone figure out that some idiotic thing they're doing is breaking PostgreSQL.
I'm pretty sure some of them have never heard of fork() without exec(), or of multiple processes opening and doing I/O to one file at the same time.
Posted Sep 15, 2023 20:34 UTC (Fri)
by josh (subscriber, #17465)
Posted Sep 16, 2023 15:42 UTC (Sat)
by tux3 (subscriber, #101245)
The effect seems to be the same as simply marking an existing interface GPL-only, except perhaps for having an opportunity to make the change. But if there is a good reason to mark everything GPL-only, it seems to me that we may as well find the reason and do it immediately.
If it is worth doing, there is no reason to wait for an opportunity to bundle the change within a larger rework. That feels like a slightly underhanded way to avoid conflict, if that's the worry.
Posted Sep 15, 2023 20:56 UTC (Fri)
by mb (subscriber, #50428)
Posted Sep 15, 2023 21:04 UTC (Fri)
by pizza (subscriber, #46)
Huh, that makes me think... In the US, bypassing that "technical restriction" could be considered a DMCA violation, and subject to criminal penalties...
Posted Sep 15, 2023 21:09 UTC (Fri)
by mb (subscriber, #50428)
Posted Sep 16, 2023 4:50 UTC (Sat)
by NYKevin (subscriber, #129325)
> (3) As used in this subsection—
If you don't meet the test in (B), then it's not DRM and there's no 1201 liability. You don't need "the authority of the [kernel's] copyright owner" to mark your own module as GPL'd (and thereby gain access to GPL-only interfaces), so the GPL-only restriction does not effectively control access to the kernel within the meaning of 1201.
1202 is more debatable:
> (a) False Copyright Management Information.—No person shall knowingly and with the intent to induce, enable, facilitate, or conceal infringement—
So you could argue that, by falsely marking your module as GPL'd, you're violating this law. OTOH, one could make exactly the same argument about the upstream GPL-only marker, as derivative works don't really follow bright line rules like that in the first place. I suspect a US court would at least seriously consider applying the equitable doctrine of unclean hands in such a scenario (i.e. you cannot violate the law, and then sue somebody else for violating the law, when your violation directly caused their violation).
However, there is a far more straightforward defense: If you didn't have "the intent to induce, enable, facilitate, or conceal infringement" (because you honestly believe that your module is not a derivative work), then you don't violate (a) and therefore haven't broken the law at all.
Posted Sep 16, 2023 12:32 UTC (Sat)
by IanKelling (subscriber, #89418)
I'm not trying to argue that your larger point is wrong, but this particular sentence in isolation strikes me as very odd, because copyright law says you certainly do need the authority of the [kernel's] copyright owner to do things like mark your own module as GPL'd. Perhaps, for various reasons, that is not the kind of authority it is talking about, including that permission was already irrevocably granted.
Posted Sep 29, 2023 18:48 UTC (Fri)
by sammythesnake (guest, #17693)
How so? If the module isn't a derivative work of the kernel, then the copyright ownership of the kernel code is entirely irrelevant. The rights of an author do not extend to others' non-derivative works.
Even if it were, the GPL explicitly disclaims any consideration of how the code is *used* (without distribution), so the copyright holders have already granted permission to lie to the API (to whatever extent such permission is required, which is none at all in many jurisdictions, but YMMV).
A kernel module (whether derived from kernel code or not) uses an API (which includes a flag indicating "this is GPL code"). The API is described in various documentation that you can read without ever seeing the kernel code. If the module's authors never see the kernel code, then it can't be a derivative work of that kernel code.
This is a really important fact of how the GPL and other licences (including proprietary ones) work. They can only grant permission to do things the author is entitled to withhold, not withdraw entitlement the user gets by default. Stuff like parody, journalism, fair use/fair dealing etc. give certain freedoms of use/reuse otherwise covered by copyright, but even those are exceptions for where there is a relevant copyright to consider. That isn't the case if the original work (in this case, we're talking about kernel code) isn't copied/modified/distributed/whatever.
It's very important to remember that the *authors' opinion* of what counts as a derivative work is a distraction - that question is a matter for the legal system (i.e. copyright law, relevant judicial precedent and potentially the vagaries of any courts/juries that might get involved)
Posted Sep 15, 2023 21:12 UTC (Fri)
by butlerm (subscriber, #13312)
Posted Sep 16, 2023 17:25 UTC (Sat)
by Wol (subscriber, #4433)
>It's a marking that the author does not want proprietary modules to use the interface.
Actually, it's not. And this is part of the problem.
The fanatics are going round marking OTHER PEOPLE'S code as GPL-only, and in some cases the copyright holders have been screaming blue murder as a result.
That is why this whole area is a minefield.
Cheers,
Posted Sep 17, 2023 15:32 UTC (Sun)
by nelhage (subscriber, #59579)
Man, I've forgotten how many times I have encountered some version of this problem. I saw it often enough to write a post discussing the problem and several concrete instances: https://blog.nelhage.com/post/rwlock-contention/
On the whole, I've come to believe that reader/writer locks are mostly attractive nuisances and that you should usually be looking for a better option if one is at all available.
Posted Sep 17, 2023 17:28 UTC (Sun)
by Paf (subscriber, #91811)
Their limitations don’t mean they weren’t an improvement on what was there before. And those other options are almost inevitably a lot more work to implement…. So…
Posted Sep 17, 2023 17:54 UTC (Sun)
by willy (subscriber, #9762)
Your linked article about the mmap_sem is also spot on. https://www.tumblr.com/nelhagedebugsshit/140317144518/whe...
In the past few months, we've landed changes to avoid using the mmap_sem for the majority of page faults. There is more work to do (in particular, reading /proc/PID/maps needs to use RCU instead of the rwsem), but even in its current form, we see improvement. v6.4 will avoid the mmap_sem for anon memory and v6.6 will avoid it for file-backed memory too (as long as we hit in the page cache).
You can read about the scheme we're using right here on LWN: https://lwn.net/Articles/906852/
Funnily, it's _more_ rwsems, but there should rarely be contention; you have to be doing a write operation that splits an existing VMA (e.g., an mprotect()) to have a reader block. It's certainly not a traditional way to use an rwsem, but it felt better than open-coding a spinlock, a wait-queue and a "modification in progress" bit.
Posted Sep 24, 2023 23:45 UTC (Sun)
by kmeyer (subscriber, #50720)
Yes, I've come to the same conclusion. If a mutex isn't suitable in some situation, it's rare that a RW-lock is an improvement.
Posted Sep 24, 2023 23:51 UTC (Sun)
by kmeyer (subscriber, #50720)
It's a marking that the author does not want proprietary modules to use the interface.
It does not say anything about whether those modules could legally use the interface, if it wasn't technically restricted.
Therefore, all interfaces in the kernel could be marked GPL-only, if the authors decide to do so.
And that is perfectly fine.
I wrote
>if it wasn't technically restricted.
> (A) to “circumvent a technological measure” means to descramble a scrambled work, to decrypt an encrypted work, or otherwise to avoid, bypass, remove, deactivate, or impair a technological measure, without the authority of the copyright owner; and
> (B) a technological measure “effectively controls access to a work” if the measure, in the ordinary course of its operation, requires the application of information, or a process or a treatment, with the authority of the copyright owner, to gain access to the work.
> (1) provide copyright management information that is false, or
> [...]
> (c) Definition.—As used in this section, the term “copyright management information” means any of the following information [...], except that such term does not include any personally identifying information about a user [...]:
> [...]
> (6) Terms and conditions for use of the work.
Wol
R/W Locks