Restartable sequences
Imagine maintaining a per-CPU linked list, and needing to insert a new element at the head of that list. Code to do so might look something like this:
new_item->next = list_head[cpu]; list_head[cpu] = new_item;
Such code faces a couple of hazards in a multiprocessing environment. If it is preempted between the two statements above, another process might slip in and insert its own new element; when the original process resumes, it will overwrite list_head[cpu], causing the loss of the item added while it was preempted. If, instead, the process is moved to a different CPU, it could get confused between each CPU's list or run concurrently with a new process on the original CPU; the result in either case would be a corrupted list and late-night phone calls to the developer.
These situations are easily avoidable by using locks, but locks are expensive even in the absence of contention. The same holds for atomic operations like compare-and-swap; they work, but the result can be unacceptably slow. So developers have long looked for faster alternatives.
The key observation behind restartable sequences is that the above code shares a specific feature with many other high-performance critical sections, in that it can be divided into two parts: (1) an arbitrary amount of setup work that can be thrown away and redone if need be, and (2) a single instruction that "commits" the operation. The first line in that sequence:
new_item->next = list_head[cpu];
has no visible effect outside the process it is executing in; if that process were preempted after that line, it could just execute it again and all would be well. The second line, though:
list_head[cpu] = new_item;
has effects that are visible to any other process that uses the list head. If the executing process has been preempted or moved in the middle of the sequence, that last line must not be executed lest it corrupt the list. If, instead, the sequence has run uninterrupted, this assignment can be executed with no need for locks or atomic instructions. That, in turn, would make it fast.
A restartable sequence as implemented by Paul's patch is really just a small bit of code stored in a special region of memory; that code implements both the setup and commit stages as described above. If the kernel preempts a process (or moves it to another CPU) while the process is running in that special section, control will jump to a special restart handler. That handler does whatever is needed to restart the sequence; often (as it would be in the linked-list case) it's just a matter of going back to the beginning and starting over.
The sequence must adhere to some restrictions; in particular, the commit operation must be a single instruction and code within the special section cannot invoke any code outside of it. But, if it holds to the rules, a repeatable sequence can function as a small critical section without the need for locks or atomic operations. In a sense, restartable sequences can be thought as a sort of poor developer's transactional memory. If the operation is interrupted before it commits, the work done so far is simply tossed out and it all restarts from the beginning.
Paul's patch adds a new system call:
int restartable_sequences(int op, int flags, long val1, long val2, long val3);
There are two operations that can be passed as the op parameter:
- SYS_RSEQ_SET_CRITICAL sets the critical region; val1
and val2 are the bounds of that region, and val3 is
a pointer to the restart handler (which must be outside of the
region).
- SYS_RSEQ_SET_CPU_POINTER specifies a location (in val1) of an integer variable to hold the current CPU number. This location should be in thread-local storage; it allows each thread to quickly determine which CPU it is running on at any time.
The CPU-number pointer is needed so that each section can quickly get to the correct per-CPU data; to emphasize that, the restart handler will not actually be called until this pointer has been set. Only one region for restartable sequences can be established (but it can contain multiple sequences if the restart handler is smart enough), and the region is shared across all threads in a process.
Paul notes that Google is using this code internally now; it was also discussed at the Linux Plumbers Conference [PDF] in 2013. He does not believe it is suitable for mainline inclusion in its current form, though. The single-region limitation does not play well with library code, the critical section must currently be written in assembly, and the interactions with thread-local storage are painful. But, he thinks, it is a reasonable starting place for a discussion on how a proper interface might be designed.
Paul's patch is not the only one in this area; Mathieu Desnoyers posted a patch set with similar goals back in May.
Given Linus's
reaction, it's safe to say that Mathieu's patch will not be merged
anytime soon, but Mathieu did achieve his
secondary goal of getting Paul to post his patches. In any case, there is
clearly interest in mechanisms that can improve
the performance of highly concurrent user-space code, so we will almost
certainly see more patches along these lines in the future.
Index entries for this article | |
---|---|
Kernel | Restartable sequences |
Posted Jul 9, 2015 7:16 UTC (Thu)
by epa (subscriber, #39769)
[Link]
Posted Jul 13, 2015 7:43 UTC (Mon)
by HIGHGuY (subscriber, #62277)
[Link]
Under the assumption that the scheduler always tries hard to keep threads on the same CPU as long as possible, the effect is almost the same. In the best case (N threads <= N CPUs) per-thread ~ per-cpu, in the worst case there can be some bouncing.
On many occasions I believed that per-cpu data in userspace would be great, but every time I concluded that I could reach the same goal with TLS. But, then they're writing an allocator and storing data per-thread is a waste of memory...
Would a vDSO approach with seqlock + CPUID work?
Posted Jul 15, 2015 0:50 UTC (Wed)
by riking (subscriber, #95706)
[Link] (3 responses)
What if it was set up like this?
<pre>
We could say that "the code between set_restore and a critical_foo() will be run in its entirety one or more times". If any other tasks performed a critical set on the watched memory address during that time, then the move is not performed and we jump to the restart.
The critical move would be ran in a syscall that takes two pointers, and would jump to the restart point if the "fuse" was "broken", or perform the operation and "break" all other "fuses" on that address.
That would be like Redis's MULTI/WATCH/EXEC operations... You would need to get the compiler to cooperate, though.
Posted Jul 15, 2015 10:20 UTC (Wed)
by dgm (subscriber, #49227)
[Link] (2 responses)
Maybe another (naive?) approach would be to share a memory pointer with the kernel where the thread can publish a restart address. Say something like
set_atomic_restart(&ptr);
with ptr some memory owned by the thread big enough for a pointer. With that, an application can do
ptr=&&some_label;
when entering a critical section, and ptr=NULL; when leaving. If the scheduler needs to pre-empt the thread and finds that ptr has a non-null value, execution would restart at ptr instead of the saved IP.
Posted Jul 15, 2015 18:42 UTC (Wed)
by nix (subscriber, #2304)
[Link] (1 responses)
Posted Jul 15, 2015 19:54 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Jul 17, 2015 5:42 UTC (Fri)
by alkbyby (subscriber, #61687)
[Link]
Process can register a signal (and glibc could automatically register one and "steal" it from userspace similarly to what they're already doing for some pthread stuff).
So kernel would raise this signal to a thread before returning to user-space every time either:
a) this thread was previously running on different CPU
b) this thread gets CPU _back_ after some other thread run on this CPU
Then libc could implement restartable_seq by simply looking at PC in siginfo's ucontext. Quick "current cpu number" cache and perhaps a range of other things could be implemented just as easily with this signal.
This signal does not need to be queue-able. And could be block-able as normal. Seems much cleaner and more generic thing to me.
I know, people in general have bad attitude towards signals, but in this case it looks like better option that glibc can abstract further. Of course unless I'm missing something crucial.
Posted Jul 20, 2015 20:49 UTC (Mon)
by toyotabedzrock (guest, #88005)
[Link]
Restartable sequences
Restartable sequences
But on the other hand side, in this last case you still need locks, atomics or barriers for the per-cpu data case as well, since multiple threads will share the per-cpu data.
The kernel could increase a sequence number and update current CPUID for every migration (and/or context switch!). Userspace could use the sequence number like a seqlock.
Restartable sequences
// normal code
restartable_set_restore(&list_head);
// restartable section
new_item->next = list_head;
// critical instruction
restartable_critical_move_ptr(&list_head, new_item);
// normal code
</pre>
Restartable sequences
Restartable sequences
Restartable sequences
Restartable sequences
Restartable sequences