
Restartable sequences

By Jonathan Corbet
July 7, 2015
Concurrent code running in user space is subject to almost all of the same constraints as code running in the kernel. One of those is that cross-CPU operations tend to ruin performance, meaning that data access should be done on a per-CPU basis whenever possible. Unlike kernel code, though, user-space per-CPU code cannot enter atomic context; it thus cannot protect itself from being preempted or moved to another CPU. The restartable sequences patch set recently posted by Paul Turner demonstrates one possible solution to that problem by providing a limited sort of atomic context for user-space code.

Imagine maintaining a per-CPU linked list, and needing to insert a new element at the head of that list. Code to do so might look something like this:

    new_item->next = list_head[cpu];
    list_head[cpu] = new_item;

Such code faces a couple of hazards in a multiprocessing environment. If it is preempted between the two statements above, another process might slip in and insert its own new element; when the original process resumes, it will overwrite list_head[cpu], causing the loss of the item added while it was preempted. If, instead, the process is moved to a different CPU, it could mix up the two CPUs' lists, or race with another process running on the original CPU; the result in either case would be a corrupted list and late-night phone calls to the developer.

These situations are easily avoidable by using locks, but locks are expensive even in the absence of contention. The same holds for atomic operations like compare-and-swap; they work, but the result can be unacceptably slow. So developers have long looked for faster alternatives.

The key observation behind restartable sequences is that the above code shares a specific feature with many other high-performance critical sections, in that it can be divided into two parts: (1) an arbitrary amount of setup work that can be thrown away and redone if need be, and (2) a single instruction that "commits" the operation. The first line in that sequence:

    new_item->next = list_head[cpu];

has no visible effect outside the process it is executing in; if that process were preempted after that line, it could just execute it again and all would be well. The second line, though:

    list_head[cpu] = new_item;

has effects that are visible to any other process that uses the list head. If the executing process has been preempted or moved in the middle of the sequence, that last line must not be executed lest it corrupt the list. If, instead, the sequence has run uninterrupted, this assignment can be executed with no need for locks or atomic instructions. That, in turn, would make it fast.

A restartable sequence as implemented by Paul's patch is really just a small bit of code stored in a special region of memory; that code implements both the setup and commit stages as described above. If the kernel preempts a process (or moves it to another CPU) while the process is running in that special section, control will jump to a special restart handler. That handler does whatever is needed to restart the sequence; often (as it would be in the linked-list case) it's just a matter of going back to the beginning and starting over.

The sequence must adhere to some restrictions; in particular, the commit operation must be a single instruction and code within the special section cannot invoke any code outside of it. But, if it holds to the rules, a restartable sequence can function as a small critical section without the need for locks or atomic operations. In a sense, restartable sequences can be thought of as a sort of poor developer's transactional memory. If the operation is interrupted before it commits, the work done so far is simply tossed out and it all restarts from the beginning.

Paul's patch adds a new system call:

    int restartable_sequences(int op, int flags, long val1, long val2, long val3);

There are two operations that can be passed as the op parameter:

  • SYS_RSEQ_SET_CRITICAL sets the critical region; val1 and val2 are the bounds of that region, and val3 is a pointer to the restart handler (which must be outside of the region).

  • SYS_RSEQ_SET_CPU_POINTER specifies a location (in val1) of an integer variable to hold the current CPU number. This location should be in thread-local storage; it allows each thread to quickly determine which CPU it is running on at any time.

The CPU-number pointer is needed so that each section can quickly get to the correct per-CPU data; to emphasize that, the restart handler will not actually be called until this pointer has been set. Only one region for restartable sequences can be established (but it can contain multiple sequences if the restart handler is smart enough), and the region is shared across all threads in a process.

Paul notes that Google is using this code internally now; it was also discussed at the Linux Plumbers Conference [PDF] in 2013. He does not believe it is suitable for mainline inclusion in its current form, though. The single-region limitation does not play well with library code, the critical section must currently be written in assembly, and the interactions with thread-local storage are painful. But, he thinks, it is a reasonable starting place for a discussion on how a proper interface might be designed.

Paul's patch is not the only one in this area; Mathieu Desnoyers posted a patch set with similar goals back in May. Given Linus's reaction, it's safe to say that Mathieu's patch will not be merged anytime soon, but Mathieu did achieve his secondary goal of getting Paul to post his patches. In any case, there is clearly interest in mechanisms that can improve the performance of highly concurrent user-space code, so we will almost certainly see more patches along these lines in the future.

Index entries for this article
    Kernel: Restartable sequences



Restartable sequences

Posted Jul 9, 2015 7:16 UTC (Thu) by epa (subscriber, #39769)

Perhaps a similar mechanism could be used for concurrency in general, assuming a single-CPU machine. The special section runs and if the process is pre-empted for any reason at all then the restart handler is called. Then the final commit is a single instruction. If the scheduler timeslice is big enough this might let multithreaded code be written without locking. Of course the process will starve if interrupts happen so often the special section can never complete.

Restartable sequences

Posted Jul 13, 2015 7:43 UTC (Mon) by HIGHGuY (subscriber, #62277)

I've always considered that (concurrency-wise) thread-local storage is to userspace threads what per-cpu data is to kernel threads.

Under the assumption that the scheduler always tries hard to keep threads on the same CPU as long as possible, the effect is almost the same. In the best case (N threads <= N CPUs) per-thread ~ per-cpu, in the worst case there can be some bouncing.
But on the other hand, in that last case you still need locks, atomics, or barriers for the per-CPU data as well, since multiple threads will share the per-cpu data.

On many occasions I believed that per-cpu data in userspace would be great, but every time I concluded that I could reach the same goal with TLS. But then, if you're writing an allocator, storing data per-thread is a waste of memory...

Would a vDSO approach with seqlock + CPUID work?
The kernel could increment a sequence number and update the current CPUID on every migration (and/or context switch!). Userspace could use the sequence number like a seqlock.

Restartable sequences

Posted Jul 15, 2015 0:50 UTC (Wed) by riking (subscriber, #95706)

Maybe we could design backwards from an exposure of this in the stdlib...

What if it was set up like this?

    // normal code
    restartable_set_restore(&list_head);
    // restartable section
    new_item->next = list_head;
    // critical instruction
    restartable_critical_move_ptr(&list_head, new_item);
    // normal code

We could say that "the code between set_restore and a critical_foo() will be run in its entirety one or more times". If any other tasks performed a critical set on the watched memory address during that time, then the move is not performed and we jump to the restart.

The critical move would be run in a syscall that takes two pointers, and would jump to the restart point if the "fuse" was "broken", or perform the operation and "break" all other "fuses" on that address.

That would be like Redis's MULTI/WATCH/EXEC operations... You would need to get the compiler to cooperate, though.

Restartable sequences

Posted Jul 15, 2015 10:20 UTC (Wed) by dgm (subscriber, #49227)

I think the whole idea of the approach (using a memory region) is to avoid making syscalls, basically for performance. This way the scheduler can just compare the Instruction Pointer to detect if a restart is needed. Note that the critical section should run much more often than the scheduler for this to be a net win.

Maybe another (naive?) approach would be to share a memory pointer with the kernel where the thread can publish a restart address. Say something like

    set_atomic_restart(&ptr);

with ptr some memory owned by the thread big enough for a pointer. With that, an application can do

    ptr = &&some_label;

when entering a critical section, and ptr=NULL; when leaving. If the scheduler needs to pre-empt the thread and finds that ptr has a non-null value, execution would restart at ptr instead of the saved IP.

Restartable sequences

Posted Jul 15, 2015 18:42 UTC (Wed) by nix (subscriber, #2304)

I thought the whole point of using futexes for locks in NPTL was so that no syscalls were necessary for them in the uncontended case, either, in which case this whole thing is no faster than the locky approach (but is a lot more complex).

Restartable sequences

Posted Jul 15, 2015 19:54 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

Futexes have to use atomic operations for their fastpaths and they are quite slow. You can't avoid them because your thread might be migrated to another CPU, so non-atomic memory access will likely ruin everything.

Restartable sequences

Posted Jul 17, 2015 5:42 UTC (Fri) by alkbyby (subscriber, #61687)

What do people think about generalizing this a bit?

A process could register a signal (and glibc could automatically register one and "steal" it from userspace, similar to what it already does for some pthread machinery).

So the kernel would raise this signal to a thread, before returning to user space, every time either:

a) this thread was previously running on different CPU

b) this thread gets the CPU _back_ after some other thread ran on this CPU

Then libc could implement restartable_seq simply by looking at the PC in the siginfo's ucontext. A quick "current CPU number" cache, and perhaps a range of other things, could be implemented just as easily with this signal.

This signal does not need to be queueable, and could be blockable as normal. It seems a much cleaner and more generic mechanism to me.

I know people in general have a bad attitude toward signals, but in this case it looks like a better option that glibc could abstract further. Unless, of course, I'm missing something crucial.

Restartable sequences

Posted Jul 20, 2015 20:49 UTC (Mon) by toyotabedzrock (guest, #88005)

Sounds like there needs to be work done within gcc to make this workable.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds