Restartable sequences in glibc
The kernel makes extensive use of per-CPU data structures to avoid locking. This technique works well if the kernel takes care to disable preemption while those data structures are being manipulated; as long as a task running in the kernel has exclusive access to the data, it can safely make changes. It would be nice to be able to use similar techniques in user space, but user-space code lacks the luxury of being able to disable preemption. So a different approach, which relies on detecting rather than preventing preemption, must be used.
A restartable-sequences refresher
That approach is restartable sequences, which were first proposed by Paul Turner in 2015, then later pursued by Mathieu Desnoyers and merged in 2018. Restartable sequences rely on a couple of simple rules for the creation of safe, lock-free critical sections. The first rule is that the critical section cannot make any changes to the protected data structure that are visible to other threads until the final instruction in that section. That last instruction will typically be a pointer assignment making the new state of things visible. The other rule is that the section can be interrupted at any time prior to that last instruction; when that happens, the code must be able to recover and restart the operation from the beginning.
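The two rules can be illustrated with a toy model in plain C. This is only a simulation under stated assumptions: the node type, try_push(), push(), and the "interrupted" flag are all hypothetical names invented for the illustration, and no kernel support is involved. A real restartable sequence has to be written in assembly so that the kernel, not a flag, decides when an attempt is abandoned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list node; not part of any real API. */
struct node {
    struct node *next;
    int value;
};

/*
 * One attempt at pushing onto a list. "interrupted" stands in for the
 * kernel noticing a preemption inside the critical section. Everything
 * before the last store touches only the private node, so abandoning
 * the attempt leaves *head exactly as it was.
 */
static bool try_push(struct node **head, struct node *n, bool interrupted)
{
    n->next = *head;    /* preparation: only private state changes */
    if (interrupted)
        return false;   /* abort: nothing visible needs to be undone */
    *head = n;          /* commit: the single, final visible store */
    return true;
}

/*
 * Restart from the beginning until an attempt commits. The first
 * "aborts" attempts are simulated as interrupted. Returns the number
 * of attempts that were needed.
 */
static int push(struct node **head, struct node *n, int aborts)
{
    int attempts = 0;

    assert(head != NULL && n != NULL);
    while (!try_push(head, n, attempts++ < aborts))
        ;
    return attempts;
}
```

Because the only visible change is the last store, the recovery code in this model has nothing to clean up; it simply runs the preparation step again with the current value of the list head.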
Using restartable sequences is a bit tricky because user space must be able to tell the kernel when such a sequence is running. Executing a system call would defeat the purpose of the entire exercise, though; at that point, the thread might as well just grab a lock. So, instead, restartable sequences are managed with a special region of memory shared between user space and the kernel. Specifically, user space sets up a special structure, struct rseq, and informs the kernel of this structure using the rseq() system call. The structure is a bit complex, but at its core is a field called rseq_cs, which is a pointer to a structure also called rseq_cs, containing the description of a critical section:
    struct rseq_cs {
        __u32 version;
        __u32 flags;
        __u64 start_ip;
        __u64 post_commit_offset;
        __u64 abort_ip;
    };
To set up a critical section, a user-space thread fills in an rseq_cs structure, setting start_ip to the address of the first instruction in that section. The post_commit_offset is the length of the critical section in bytes; when added to start_ip the result is the first instruction after the end of the section. abort_ip, instead, is the address of the instruction to jump to if the sequence is interrupted (via preemption or CPU migration, for example) before it completes. version should be zero, and the flags field can be used to tweak restart behavior; some information on that can be found in this man page source.
Actually running the critical section is a matter of storing the address of the rseq_cs structure into the rseq structure that was registered with the kernel; this should be done just prior to entry into the section. Whenever the kernel preempts the thread, it will check the instruction pointer to see whether the critical section was executing at the time; if so, when the thread resumes execution, it will jump to the abort_ip address, at which point it should recover and try again.
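The range check described here can be modeled in a few lines of C. What follows is a user-space sketch, not kernel code: demo_rseq_cs is a local stand-in that mirrors the fields of rseq_cs so that the example builds without kernel headers, and ip_in_critical_section() and resume_ip() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the rseq_cs fields; a stand-in for illustration only. */
struct demo_rseq_cs {
    uint32_t version;
    uint32_t flags;
    uint64_t start_ip;
    uint64_t post_commit_offset;
    uint64_t abort_ip;
};

/*
 * True if ip lies in [start_ip, start_ip + post_commit_offset).
 * The unsigned subtraction wraps around when ip < start_ip, so a
 * single comparison covers both bounds.
 */
static bool ip_in_critical_section(const struct demo_rseq_cs *cs, uint64_t ip)
{
    return ip - cs->start_ip < cs->post_commit_offset;
}

/*
 * Where a preempted thread resumes in this model: at abort_ip if it
 * was interrupted inside the section, at the same ip otherwise.
 */
static uint64_t resume_ip(const struct demo_rseq_cs *cs, uint64_t ip)
{
    assert(cs->version == 0);   /* the article says version should be zero */
    return ip_in_critical_section(cs, ip) ? cs->abort_ip : ip;
}
```

With start_ip set to 0x1000 and post_commit_offset set to 0x20, an interruption at 0x101f is redirected to abort_ip, while one at 0x1020 (the first instruction after the section) resumes where it left off.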
One potential problem with the restartable-sequences ABI is that any given thread can only register a single rseq structure with the kernel. Even checking a single structure adds a bit of overhead to the hottest parts of the scheduler; checking a list of them would be unacceptable. The restriction makes sense, but it does pose a problem in situations where there might be more than one user of restartable sequences in a thread; some of them might be buried inside libraries and invisible to users of those libraries, perhaps several layers up the call stack. For restartable sequences to be a reliable mechanism, there must be a way to prevent these users from stepping on each other's toes.
The GNU C Library's approach
If glibc is to expose restartable sequences to its users, it must have a plausible answer to the sharing problem. The implementation put together by Florian Weimer takes the approach of putting glibc in the middle for users of this mechanism. Thus, the registration of the rseq structure with the rseq() system call is done by glibc itself during initialization; by the time user code runs, that setup will have already been performed. Should an application want to perform its own registration (and not use the glibc support at all), the glibc.pthread.rseq tunable can be used to disable the automatic registration.
Applications using restartable sequences via glibc should include <sys/rseq.h>. This header defines the rseq and rseq_cs structures and a few important variables, the first of which is __rseq_size. That will be the size of the rseq structure registered by the library, or zero if registration didn't happen for whatever reason (no support in the kernel or disabled, for example).
Finding the rseq structure registered by glibc is not quite as straightforward as one might think. It is stored in the thread control block (TCB) maintained by the library; specifically, it can be found at an offset of __rseq_offset bytes from the thread pointer. Actually getting at the thread pointer is an architecture-specific affair, though; GCC offers __builtin_thread_pointer() for some architectures but not all. As it happens, x86 is one of the exceptions; there the thread pointer is stored in the FS register and applications must fetch it themselves.
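A minimal sketch of that lookup might look like the following. It assumes glibc 2.35 or later (for <sys/rseq.h>) and a compiler that understands GCC-style inline assembly; thread_pointer() and rseq_area() are hypothetical helper names, not glibc interfaces. On x86-64 the sketch reads the thread pointer from %fs:0, relying on the TCB beginning with a pointer to itself; elsewhere it falls back to __builtin_thread_pointer(), which may not exist for every architecture and compiler version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/rseq.h>   /* glibc 2.35 or later */

/* Fetch the thread pointer in an architecture-specific way. */
static inline char *thread_pointer(void)
{
#if defined(__x86_64__)
    char *tp;

    /* On x86-64 Linux, the TCB starts with a pointer to itself at %fs:0. */
    __asm__ ("mov %%fs:0, %0" : "=r" (tp));
    return tp;
#else
    return (char *)__builtin_thread_pointer();
#endif
}

/*
 * Return the rseq area that glibc registered for the calling thread,
 * or NULL if no registration happened (__rseq_size is zero).
 */
static inline struct rseq *rseq_area(void)
{
    struct rseq *area;

    if (__rseq_size == 0)
        return NULL;
    area = (struct rseq *)(thread_pointer() + __rseq_offset);
    /* Sanity check: a registered area must satisfy the struct's alignment. */
    assert(((uintptr_t)area % _Alignof(struct rseq)) == 0);
    return area;
}
```

A caller should treat a NULL return as "restartable sequences are unavailable" and fall back to some other scheme; that is the expected result on a kernel without rseq support or when registration has been disabled with the glibc.pthread.rseq tunable.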
The glibc-registered rseq structure is shared by all users within a given thread, but each user should create its own rseq_cs structure describing its critical section. Immediately prior to entering its critical section, a thread should store the address of its rseq_cs structure into the rseq_cs field of the global rseq structure; it should reset that field to NULL on exit. This setup implies that critical sections cannot nest, but these sections are meant to be short and should not be calling into other code anyway, so that will not be a problem.
The code located at abort_ip must begin with the special RSEQ_SIG sentinel, which is defined in an architecture-dependent manner. Note that, if the abort code is invoked, the rseq_cs field will be zeroed by the kernel and must be assigned anew before reentering the critical section.
There is also an __rseq_flags variable containing the flags that were used when registering with the kernel; according to Weimer's documentation patch, that variable is always set to zero for now.
With that structure in place, applications using glibc can now use restartable sequences in a cooperative way. Unfortunately, there aren't really any useful examples of code using this new API to point to; this is all new stuff at this point.
As readers have likely understood by now, actually coding the critical
section almost certainly requires resorting to assembly language. This is
clearly not a feature that is intended for casual or frequent use, but it
can evidently produce significant performance gains in systems with high
scalability requirements. Support in the GNU C Library will make
restartable sequences a bit more accessible, but it seems destined
to remain a niche feature used by few developers.
Posted Jan 31, 2022 18:42 UTC (Mon)
by roc (subscriber, #30627)
[Link] (9 responses)
I'm a bit surprised that glibc is going to default to initializing rseq for every thread. An extra syscall for every thread creation, whether or not rseq will actually be used, seems like unnecessary overhead and this is also going to break every existing rseq user. Wouldn't it have made more sense to require users to call some kind of "ensure rseq initialized" function on each thread before they use rseq?
Posted Jan 31, 2022 20:04 UTC (Mon)
by mjw (subscriber, #16740)
[Link] (3 responses)
For valgrind we opted for now to simply return ENOSYS for rseq after consulting with the glibc hackers to confirm this causes glibc to simply skip the rseq setup when running under valgrind (as if running on a kernel that doesn't implement rseq).
https://bugs.kde.org/show_bug.cgi?id=405377
A real implementation is somewhat tricky if you might have instrumented the restartable sequence. DynamoRIO lists some issues and their current "run twice" approach https://dynamorio.org/page_rseq.html
Posted Jan 31, 2022 23:54 UTC (Mon)
by zx2c4 (subscriber, #82519)
[Link]
Posted Feb 1, 2022 6:53 UTC (Tue)
by roc (subscriber, #30627)
[Link] (1 responses)
Posted Feb 1, 2022 6:54 UTC (Tue)
by roc (subscriber, #30627)
[Link]
Posted Feb 1, 2022 1:57 UTC (Tue)
by compudj (subscriber, #43335)
[Link] (4 responses)
So glibc is not just wiring rseq up for the application. It expects to use it as well even before the application main() is started.
I tried to keep track of all rseq "early adopter" open source projects (e.g. tcmalloc), and they have been made aware that they would have to update their userspace ABI to adapt to the glibc ABI. It was not an issue for them. I maintain librseq, which I have adapted to co-exist with glibc 2.35 and older glibc as well. I have not made any official release of this librseq project yet especially because I was awaiting a final choice on the userspace ABI, which is now happening with glibc 2.35. I have also sent a patch series to Peter Zijlstra to update the Linux kernel rseq selftests so they can co-exist with glibc 2.35. It is queued in the tip tree for the next merge window.
Posted Feb 1, 2022 14:29 UTC (Tue)
by khim (subscriber, #9252)
[Link]
Looks like the notion that "this is clearly not a feature that is intended for casual or frequent use, but it can evidently produce significant performance gains in systems with high scalability requirements" is completely wrong: it would absolutely be a feature that many apps would be “frequently using”… not directly, though, but because a small handful of very low-level libraries would adopt it eagerly.
Posted Feb 1, 2022 21:37 UTC (Tue)
by NYKevin (subscriber, #129325)
[Link] (2 responses)
Posted Feb 1, 2022 21:49 UTC (Tue)
by corbet (editor, #1)
[Link]
Posted Feb 2, 2022 11:41 UTC (Wed)
by compudj (subscriber, #43335)
[Link]
Posted Feb 1, 2022 1:26 UTC (Tue)
by developer122 (guest, #152928)
[Link] (3 responses)
Posted Feb 1, 2022 1:51 UTC (Tue)
by compudj (subscriber, #43335)
[Link] (2 responses)
Please keep in mind that this is user-space code, so interrupt handlers don't really make sense in this context. What happens when a signal is delivered on top of a rseq critical section is very much relevant though.
What happens in this case is that the rseq c.s. interrupted by the signal handler will be aborted (its instruction pointer moved to the abort_ip), so when the signal handler returns, the interrupted thread will continue its execution at the abort_ip. It's pretty much as simple as that.
This allows using rseq critical sections within signal handlers as well.
Posted Feb 1, 2022 16:10 UTC (Tue)
by developer122 (guest, #152928)
[Link] (1 responses)
As for how such a situation could ever occur, the article mentions calling code that makes use of restartable sequences, which I suppose could be inlined. So, one bit of restartable code could call a data manipulation library that itself naively tried to create a restartable sequence to protect its own data structures. Each is trying to protect its manipulation of its data structures from access during preemption by discarding results that were being worked on if a preemption occurred.
The tricky thing here is the matter of cleanup. If something interrupts both sequences by occurring during the nested sequence, then you could restart just the inner sequence but that's wrong because the outer sequence is interrupted and doesn't know it. BUT, if you run just the outer sequence's cleanup code, then the data structures for the inner sequence may be left in an indeterminate state with the changes not being discarded. You can't run both, because only the entry not the exit is defined.
And while we're at it, we've invented the C++ problem of memory cleanup :/
Posted Feb 1, 2022 16:24 UTC (Tue)
by compudj (subscriber, #43335)
[Link]
So I really don't think the scenario you have in mind can realistically happen with the current rseq ABI.
Posted Feb 1, 2022 1:38 UTC (Tue)
by compudj (subscriber, #43335)
[Link]
Whereas it is OK to set rseq_cs to NULL when exiting a rseq critical section, it is not actually needed. The only requirement is that the rseq_cs pointer is set to NULL at some point after exiting the rseq critical section, but before reclaim of the memory holding the rseq_cs structure and the code it points to (e.g. dlclose(3) of a shared library).
Removing the requirement for setting the rseq_cs pointer to NULL on exit from a rseq critical section is a significant performance improvement considering that the entire critical section is implemented with very few instructions, which is achieved by letting the kernel detect when it returns to user-space over an instruction pointer which is outside of the range of the rseq critical section. When this is detected, the kernel simply clears the rseq_cs pointer.
Posted Feb 1, 2022 3:27 UTC (Tue)
by compudj (subscriber, #43335)
[Link]
Its primary goal is to provide a higher-level C API as static inline functions to implement the critical sections for common use-cases on all supported architectures (currently x86 32/64, arm 32/64, powerpc 32/64, s390, s390/x, and mips). It does the heavy lifting: it implements the per-architecture assembly for each per-cpu data access pattern.
Its second goal is to provide a rseq registration API to be used with older glibc (before 2.35) which is also compatible with glibc 2.35.
librseq is available under both LGPLv2.1 and MIT licenses. This library is still under active development, with no official release yet.
Posted Feb 2, 2022 2:38 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
1. In order to use restartable sequences, someone needs to alloc a struct rseq and tell the kernel about it. Also, you can't free that struct unless the thread exits or you tell the kernel to stop using it.
glibc fills the role of the "someone" in step 3. However, I don't understand why this role must necessarily be filled by a specific, fixed userspace component at all. If the kernel exposed an API for querying the address of a thread's current struct rseq (which the kernel surely knows), then you could just take a "first to call rseq() wins" approach, and completely sidestep the ownership issue altogether. You would still have the problem that the struct must be freed when the thread exits (and no earlier!), and in practice this might result in glibc trying to be the first to initialize it anyway, but there would be no need for an explicit userspace ABI for this sort of coordination - everybody could just use the kernel to coordinate who owns the struct. OTOH, I suppose there might be some sequencing issues when the thread exits (i.e. during the thread-exiting process, exactly when does it become "safe" to free/reclaim the struct rseq?), but I tend to imagine that there are ways of solving this problem (e.g. it must be allocated on the owning thread's stack, it must be free'd from a different thread after the owning thread is gone, or something similar), and it probably wouldn't be too hard to agree on a convention for how to do that.
Have I misunderstood something?
Posted Feb 2, 2022 4:06 UTC (Wed)
by foom (subscriber, #14868)
[Link]
As far as not exposing any user space abi: if you don't, every user would then need to make a syscall to retrieve the rseq area's location separately per thread? And presumably then cache it in a tls variable for performance? That seems a bit silly and wasteful, when it's easy enough to just make a constant thread offset available in user space to anyone who needs it.
Posted Jul 20, 2023 13:53 UTC (Thu)
by vickyBishnoi (guest, #166151)
[Link]
Do you have a pointer to your patch?
Marginally related: the musl developer isn't entirely happy with the glibc plan and would like to see it delayed.
Musl
2. If multiple different libraries or codepaths try to do that, they will step on each other.
3. Therefore, someone needs to "own" the struct rseq for each thread, and everyone needs to agree on the ownership of this struct.
4. However, once everyone agrees on who owns the struct, there's nothing wrong with foreign code overwriting the struct, so long as it stays within its own thread and doesn't have any reentrancy issues (the struct only needs to be valid over short instruction sequences, and nesting is explicitly unsupported - clobbering somebody else's rseq state is a non-issue as long as you don't move the struct, free it, or clobber it from a signal handler).
if yes, how it can be done.