The potential use cases of RSEQs in user space are:
- Efficiently retrieving the CPU on which the thread is currently running, e.g., for indexing into per-CPU data structures
- Modifying per-CPU data structures that would conventionally have to be protected by spinlocks
The following paragraphs focus on the latter use case (which implicitly relies on the first one).
- Access to shared data must be synchronized to avoid race conditions in case the data is modified concurrently by multiple threads
- This problem can be solved by protecting the critical section (CS) (where the shared data is modified) via a synchronization primitive, e.g., a mutex lock or a semaphore
- Introducing synchronization in a highly parallel application can however result in high contention (= many threads block and try to acquire the lock, which deteriorates performance)
- A popular approach for reducing contention is the use of per-CPU data structures
- Example (of a per-CPU data structure): Ring buffer (rb) implementation
  - Supported operations (of the data structure):
    - Offering:
      - Inserts a new item
      - Requires the producer to …
        - (1.) read the current write position (a.k.a., the head),
        - (2.) write the new item and
        - (3.) update the head. Updating the head effectively commits the change, making the item visible to the consumer.
    - Polling:
      - Reads an item
      - Requires the consumer to …
        - (1.) read the current read position (a.k.a., the tail),
        - (2.) read the item and
        - (3.) update the read position.
- The rb is "inherently" (as each CPU has its own data structure) thread safe with respect to parallel access from other CPUs
- HOWEVER, it's not "inherently" (as long as no synchronization primitive is used) thread safe with respect to other threads running on the same CPU
  - Hence, the following sequence of events could occur:
    - (1.) Producer A finishes writing a new item but gets preempted by the OS before it can commit
    - (2.) Producer B, running on the same CPU, reads the old head, starts overwriting Producer A's uncommitted item and commits its own
    - (3.) Producer A continues and tries to commit its now overwritten item
  - → Synchronization is thus needed to enforce atomicity with respect to preemption
- There are different approaches for mitigating this synchronization issue:
- Committing items using e.g., an atomic instruction like CAS:

  ```
  bool CAS(object: pointer, expected: int, desired: int):
      if *object == expected:
          *object = desired
          return true
      return false
  ```
  - This approach has a few downsides:
    - Performance penalty
    - ABA problem
- Disabling preemption altogether while manipulating per-CPU data structures
  - This ability is limited to kernel space
- RSEQs (developed by Paul Turner and Andrew Hunter at Google and Mathieu Desnoyers at EfficiOS)
  - Idea: Detect preemption (with the help of the OS scheduler) and, if necessary, restart the preempted operation
  - RSEQ = the implementation of aforesaid concept in the Linux kernel
    - This mechanism has been part of the Linux kernel since version 4.18
- The relevant data structure definitions (for user space) can be found in the header files:
  - `linux/rseq.h`, provided by `linux-libc-dev` (may contain outdated definitions depending on the libc version)
  - `rseq/rseq.h`, provided by librseq
- Definition in kernel source tree (as of 6.3): `include/uapi/linux/rseq.h`
- Serves as the kernel ↔ user space interface which is used to manage RSEQs in each thread individually
- "Lifecycle":
  - Setup:
    - It's the responsibility of each user space thread (which wants to use RSEQs) to:
      - (a) allocate the `struct` as a global TLS variable
      - (b) perform the thread registration using the RSEQ syscall
    - This is handled automatically since glibc 2.35 (can be disabled using the `glibc.pthread.rseq` tunable)
    - → The scheduler will update the member fields once the registration is complete
  - Usage:
    - A user space thread can …
      - obtain information (e.g., on which CPU it's running) by reading the `struct`
      - register / unregister a RSEQ critical section (CS) by writing to the `struct`
    - THINGS TO KEEP IN MIND when accessing the `struct` (after successful registration):
      - Access should be done via a macro like `#define RSEQ_ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x))`. This is necessary as a compiler might otherwise optimize out memory reads pertaining to `struct rseq` member fields (the `struct` is updated externally by the kernel; this happens unbeknownst to the compiler, which might assume that the `struct` never changes, as it may never be written to by the user space thread).
      - NOTE: Programs running with a glibc version ≥ 2.35 must take an additional step of indirection when accessing `struct rseq`:
        - Glibc maintains the registered `struct rseq` within the Thread Control Block (TCB)
        - Accessing it requires adding an offset, exported as the symbol `__rseq_offset`, to the thread pointer
          - This pointer can be obtained using the gcc builtin `__builtin_thread_pointer`
  - "Layout":

    ```c
    struct rseq { // Taken from Linux 6.3 source tree
        __u32 cpu_id_start;
        __u32 cpu_id;
        __u64 rseq_cs;
        __u32 flags;
        __u32 node_id;
        __u32 mm_cid;
        char end[];
    } __attribute__((aligned(4 * sizeof(__u64))));
    ```
    - `cpu_id_start` / `cpu_id`:
      - Both hold the current CPU number on which the registered thread is running
      - Range: `0 ≤ cpu_id < # of CPUs`
      - They essentially only differ in their init values:
        - `cpu_id_start` always holds a valid CPU number (even when `struct rseq` hasn't been registered yet)
        - `cpu_id` is initialized with `-1`
      - Use cases:
        - Get the CPU number on which the user space thread is running (faster than `sched_getcpu(3)`, even with vDSO)
        - Index (using the obtained CPU number) into per-CPU data structures
    - `node_id`:
      - Available on Linux 6.3+
      - Initialized with `0`
      - Holds (once inited) the current NUMA node ID
      - Use case: Same as `cpu_id`, but on the NUMA domain level
    - `mm_cid`:
      - Available on Linux 6.3+
      - Abbreviation for memory map concurrency id (cid)
      - Holds a unique, temporarily assigned value, which is allotted by the scheduler to each actively running thread within a process
      - Range: `0 ≤ mm_cid < # of actively running threads ≤ # of CPUs`
      - Use case: Indexing into per-CPU data structures (`cpu_id` alternative)
        - Advantage (compared to `cpu_id`): The cid is closer to `0`, allowing for a more efficient (re-)use of memory when indexing into per-CPU data structures (e.g., in memory allocators)
        - The main beneficiaries of this new field are processes which …
          - have fewer threads than cores or which have
          - been restricted to run on fewer cores through scheduler affinity or cgroup cpusets
    - `rseq_cs`: Pointer to the data structure which describes the CS (a.k.a., the CS descriptor)
    - `flags`: (deprecated as of Linux 6.1)
- Implementation in kernel source tree (as of 6.3): `kernel/rseq.c`
- Registers / unregisters `struct rseq` for the invoking thread
- Args of syscall:

  ```c
  SYSCALL_DEFINE4(rseq, // Taken from Linux 6.3 source tree
                  struct rseq __user *, rseq,
                  u32, rseq_len,
                  int, flags,
                  u32, sig)
  ```
  - `rseq`: Address of the allocated `struct rseq`
  - `rseq_len`: Length of the structure
    - Tells the kernel which `struct rseq` fields need to be updated (this ensures backward compatibility, as the user space program might still use an older definition of the `struct`, which doesn't include fields like `mm_cid`)
  - `flags`: `0` = perform the registration / `RSEQ_FLAG_UNREGISTER` = unregister the already registered struct
  - `sig`: Used for thwarting binary exploitation attacks
  - NOTE: As of glibc 2.35, there's no glibc syscall wrapper (there shouldn't be a need anyways, as registration is handled by glibc)
    - Can be invoked manually using the `syscall` function or inline asm
- Implementation in kernel source tree (as of 6.3): `fs/binfmt_elf.c`
- Allows the user space program to detect which feature fields (in `struct rseq`) are supported by the kernel (e.g., kernels < 6.3 don't support `mm_cid`):

  ```c
  #include <stdbool.h>
  #include <stddef.h>
  #include <sys/auxv.h>

  #define rseq_sizeof_field(TYPE, MEMBER) sizeof((((TYPE*)0)->MEMBER))
  #define rseq_offsetofend(TYPE, MEMBER) \
      (offsetof(TYPE, MEMBER) + rseq_sizeof_field(TYPE, MEMBER))

  const unsigned long auxv_rseq_feature_size = getauxval(AT_RSEQ_FEATURE_SIZE);
  const bool mm_cid_is_supported =
      (int)auxv_rseq_feature_size >= (int)rseq_offsetofend(struct rseq, mm_cid);
  ```
- A.k.a., the CS descriptor
- Definition in kernel source tree (as of 6.3): `include/uapi/linux/rseq.h`
- Difference: Critical section v.s., CS descriptor:
  - Critical section:
    - The CS itself contains the operations which shall be carried out to modify the per-CPU data structures
    - This CS is guarded against interruptions by the RSEQ mechanism
      - An active CS may be interrupted by
        - migration to another CPU,
        - unmasked / non-maskable signals or
        - preemption
  - Limitations (of CS):
    - Consists of two stages:
      - Preparatory stage:
        - The to-be-committed changes are prepared
        - Must be carried out in a discrete manner, such that they're invisible to other threads (until they're finally committed)
      - Commit stage:
        - Publishes the made changes
        - Must consist of a single atomic instruction
    - Must be written in asm. The generated machine instructions must faithfully follow the program order (as defined in the source). This is an issue in high-level languages, as the compiler might reorder stores (for optimization purposes). Such optimizations can change the program order by e.g., moving the store associated with the commit stage up into the preparatory stage.
    - Should never invoke syscalls
    - No function calls. Doing so would move the Instruction Pointer (IP) outside the CS, making the detection (whether a CS was active) impossible.
    - Stack operations should be limited to reserved stack space (e.g., local variables defined in the C code preceding the inline assembly block). Pushing elements also requires an attendant undo operation upon exit. However, undoing the operation with `pop` could result in stack memory corruption if the CS was interrupted prior to the `push` operation.
    - The CS must be kept brief. A too long CS will be perpetually restarted, as the time slice (allotted by the scheduler) always expires before the CS can finish.
    - The CS cannot be debugged. Setting a breakpoint within the CS would later interrupt it, causing it to be restarted indefinitely. RSEQ CSs should therefore be declared in the dedicated ELF sections `__rseq_cs_ptr_array` and `__rseq_exit_point_array`. Debuggers like `gdb` can simply read these sections and skip all CSs when single stepping.
  - CS descriptor:
    - `struct` describing the critical section
    - This includes e.g., where the CS starts and ends (see Ex. down below)
- Example: Ring buffer (in C-like pseudocode for better intelligibility):
  - Critical section (this includes only the pseudocode after `start:`)
  - Attendant descriptor of the CS:

    ```c
    struct rseq_cs descriptor = {
        .version = 0,
        .flags = 0,
        .start_ip = &&start,             // Points to first machine instruction in CS
        .post_commit_offset = (uintptr_t)&&post_commit
                            - (uintptr_t)&&start, // Length of CS in bytes
        .abort_ip = &&abort
    };
    ```
- Lifecycle / USAGE:
  - To 'start' a RSEQ, `struct rseq`'s `rseq_cs` is set to the CS descriptor
  - Scheduler check:
    - Note that the diagram has been simplified (the `RSEQ_SIG` security check e.g., has been omitted)
    - When scheduling a thread, the Linux scheduler will check whether
      - (1.) a CS descriptor has been set in the `rseq_cs` field of `struct rseq`. If this is the case, the kernel will check whether
      - (2.) the saved IP address falls in the range `start_ip` ≤ IP address < `start_ip` + `post_commit_offset`
    - The RSEQ CS must be restarted if this condition also holds true. This is automatically handled by the kernel by setting the IP to `abort_ip`, which is the address of the first instruction in the abort handler.
- As already mentioned, CSs are typically implemented using inline assembly.
- Note that the C language doesn't have a standardized syntax for including assembler in C source files. Its inclusion in the compiler is considered an extension to the C language.
- The gcc extended asm syntax (which will be used in this example) is best suited for mixing C and assembly (as it supports input and output operands in the form of C variables and jumps to C labels):

  ```
  asm asm-qualifiers ( AssemblerTemplate
                       : OutputOperands
                       : InputOperands
                       : Clobbers
                       : GotoLabels )
  ```
- Relevant `asm-qualifiers` for RSEQ CSs are `volatile` and `goto`:
  - `volatile` is required for asm statements which don't use output operands and instead produce the desired result by performing side effects. E.g., the CS of the `offer`ing operation takes a reference to the rb as an input operand. It then writes the new item into its buffer and updates the head. Hence, the memory referenced by the input operand is manipulated to produce the desired result. The gcc optimizer may discard the asm statement, which is prevented by this keyword. It also prevents reordering of statements by the compiler.
  - `goto` allows the asm statement to perform a jump to any C label in the `GotoLabels` list. The CS of the `offer`ing operation may use such a jump to block in case the rb is full and return an appropriate error code to the caller.
- `AssemblerTemplate` contains the actual assembly instructions and assembler directives as a string literal. It's a template which may contain tokens. These tokens refer to e.g., operands and goto labels and need to be replaced by the compiler. Once replaced, it's passed to the assembler, which produces the machine code. Gcc supports both the Intel and AT&T x86 assembler dialects, with the latter being the default.
  - Writing asm:
    - Assembler directives are prefixed with a dot (`.`, e.g., `.popsection`)
    - Local labels:
      - Declaration: `<int>:`, e.g., `1:`
      - Referencing it …
        - after its 'declaration line' requires `b` ("backwards") as suffix, e.g., `1b`
        - before its 'declaration line' requires `f` ("forwards") as suffix, e.g., `1f`
    - The AT&T syntax has these relevant traits:
      - Immediate operands are prefixed with `$`, whereas registers are prefixed with `%`
      - Instruction mnemonics follow the operand order `source, destination`. This only pertains to mnemonics with two operands.
      - Instruction mnemonics are typically suffixed with a character indicating the size of the operands. Common suffixes are `b` for byte (8 bit), `w` for word (16 bit), `l` for long (32 bit) and `q` for quadruple word (64 bit).
- `InputOperands` are passed as follows: `[ [asmSymbolicName] ] constraint (cexpression)`
  - This allows passing a `cexpression` (which may be a C variable or expression) to the `AssemblerTemplate`, which then can be referenced via the symbolic name `asmSymbolicName`
  - Multiple operands are separated using a comma
  - `constraint` specifies where the parameter should be placed by gcc. Common constraints are `m` for memory, `r` for a general-purpose register and `i` for immediate integer operands whose value is known at assembly time.
- `Clobbers` lists all locations, such as used scratch registers, which are modified by the assembly. This causes the compiler to exempt the listed locations when e.g., choosing registers for the `InputOperands`. The `flags` register is listed using the special clobber `cc`. In case memory is read and written by the assembly, the special clobber `memory` must be used (which effectively forms a memory barrier for the compiler).
- `GotoLabels` lists all C labels to which the assembly might jump (this requires the previously mentioned `goto` qualifier).
- Available on GitHub
- Library which makes it easier to integrate RSEQs into applications by providing:
  - Functions like …
    - `rseq_register_current_thread`, `rseq_unregister_current_thread`, `rseq_clear_rseq_cs`, `rseq_prepare_unload` for handling the RSEQ lifecycle
    - `rseq_available`, `rseq_mm_cid_available`, `rseq_node_id_available`, etc., for checking which `struct rseq` fields are supported
    - `rseq_current_cpu_raw`, `rseq_cpu_start`, `rseq_current_mm_cid`, `rseq_current_node_id` for reading `struct rseq` fields
  - Prewritten CSs which are supported on many ISAs (thus eliminating portability issues)
    - For instance, `rseq_cmpeqv_trymemcpy_storev(intptr_t *v, intptr_t expect, void *dst, void *src, size_t len, intptr_t newv, int cpu)` may be used to implement a rb, where e.g., the producer would pass
      - a pointer to the `head` as `v`,
      - the previously read value of `head` as `expect`,
      - the next index in the buffer as `dst`,
      - a pointer to the item pointer as `src`,
      - the size of the pointer as `len` and
      - the next `head` value as `newv`.
  - Macros `RSEQ_ASM_*` for writing own CSs (which eliminate boilerplate code):
    - `RSEQ_ASM_DEFINE_TABLE(<cs_label>, <start_ip>f, <post_commit_ip>f, <abort_ip>f)`:

      ```c
      // Expands to asm directives which emit the CS descriptor (`struct rseq_cs`)
      // for the ensuing CS + debugging information:
      ".pushsection __rseq_cs, \"aw\"\n\t"
      ".balign 32\n\t"
      "<cs_label>:\n\t"    // Local label for referencing this CS descriptor
      ".long 0x0, 0x0\n\t" // `version`, `flags`
      ".quad <start_ip>f, (<post_commit_ip>f - <start_ip>f), <abort_ip>f\n\t" // `start_ip`, `post_commit_offset`, `abort_ip`
      ".popsection\n\t"
      ".pushsection __rseq_cs_ptr_array, \"aw\"\n\t" // Debugging information
      ".quad 3b\n\t"
      ".popsection\n\t"
      ```

    - `RSEQ_ASM_DEFINE_EXIT_POINT(<start_ip>f, %l[<c_label_exit_point>])`:

      ```c
      // (Optional) Expands to asm directives which emit debugging information
      // (may be used by e.g., `gdb`) of RSEQ CS exit points in an ELF section:
      ".pushsection __rseq_exit_point_array, \"aw\"\n\t"
      ".quad <start_ip>f, %l[<c_label_exit_point>]\n\t"
      ".popsection\n\t"
      ```

    - `RSEQ_ASM_STORE_RSEQ_CS(<start_ip>, <cs_label>b, <struct_rseq_cs_ptr>)`:

      ```c
      // Expands to asm which 'registers' the CS by setting `rseq_cs`
      // in `struct rseq` to point to the defined CS descriptor:
      "leaq <cs_label>b(%%rip), %%rax\n\t" // (Uses RIP-relative addressing due to ASLR)
      "movq %%rax, <struct_rseq_cs_ptr>\n\t"
      "<start_ip>:\n\t"
      ```

    - `RSEQ_ASM_CMP_CPU_ID(<cpu_input_operand>, <struct_rseq_hw_thread>, <abort_ip>f)`:

      ```c
      // Expands to asm which checks and aborts when the current `cpu_id` / `mm_cid`
      // doesn't match `cpu` (only necessary when indexing into the per-CPU data
      // structure OUTSIDE of the CS):
      "cmpl %[<cpu_input_operand>], <struct_rseq_hw_thread>\n\t"
      "jnz <abort_ip>f\n\t"
      ```

    - `RSEQ_ASM_DEFINE_ABORT(<abort_ip>, <teardown>, <c_label_abort>)`:

      ```c
      // Expands to asm + asm directives for emitting the abort handler
      // "signature" + instructions into an eXecutable ELF section:
      ".pushsection __rseq_failure, \"ax\"\n\t"
      ".byte 0x0f, 0xb9, 0x3d\n\t"  // (Documented undefined instruction (UD1) for trapping speculative execution)
      ".long 0x53053053\n\t"        // `RSEQ_SIG` (used to thwart binary exploitation attacks)
      "<abort_ip>:\n\t"             // Local label required for defining `abort_ip` in CS descriptor
      teardown                      // Additional optional asm for teardown
      "jmp %l[<c_label_abort>]\n\t" // `c_label_abort` = C label to jump to
      ".popsection\n\t"
      ```
- The rb is defined as a global data structure which is allocated during program startup:

  ```c
  struct rb* rb_baseptr; // Global var

  int main(void) {
      // Alloc global structure during startup
      rb_baseptr = malloc(get_ncpus() * sizeof(*rb_baseptr));
      // …
  }
  ```
- Operations:
  - NOTE: This implementation uses a power-of-2 size
  - `rb_poll`:
    - No need for a RSEQ CS, as there's only one consumer (Single-Consumer implementation)
  - `rb_offer`:
    - Has to be guarded via a RSEQ CS, as there are multiple producers (Multi-Producer implementation)
    - Pseudocode for better intelligibility:

      ```c
      int rb_offer(void* item) { // Arg `item` = Item to be added
          // - Index into per-CPU data structure
          const unsigned int cpu = rseq.mm_cid; // Read current HW thread from `struct rseq`
          struct rb* rb_ptr = rb_baseptr + cpu; // Get rb for the CPU on which this thread is currently executing

          // - Register CS by setting the CS descriptor in `struct rseq`
          rseq.rseq_cs = &descriptor;

          // - BEGIN CS -
      start: // Begin of CS
          // Check whether the current 'HW thread' still matches the previously used `cpu` (which was used for indexing)
          if (rseq.mm_cid != cpu)
              goto abort;

          // - Prepare
          // Check whether there's ample space available
          if ((rb_ptr->tail & (RB_CAPACITY_BYTES - 1)) == (rb_ptr->head & (RB_CAPACITY_BYTES - 1))
              && (rb_ptr->head > rb_ptr->tail)) {
              return -1; // `block`
          }
          // Copy item into rb
          const int idx = (rb_ptr->head & (RB_CAPACITY_BYTES - 1)) / sizeof(rb_ptr->buf[0]);
          rb_ptr->buf[idx] = item;

          // - Commit (by writing the new head, which makes the copied item visible to consumers)
          rb_ptr->head += sizeof(rb_ptr->buf[0]);
      post_commit: // End of CS
          // - END CS -
          return 0;

          // - ABORT HANDLER -
      abort:
          return 1;
      }
      ```

  - The actual implementation (which leverages librseq) can be found here
- Modern memory allocators (e.g., Google's tcmalloc (Thread-Caching Malloc)) (used to) use tcaches for optimizing performance on multi-core systems
  - PRO: Reduced lock contention. Each thread can obtain 'cached' memory for allocations from its own thread-local heap (instead of a single global heap). The threads thus don't have to contend for the global lock when allocating / deallocating memory.
  - CON: Memory may be "tied up" in tcaches.
    - Consequence: Inefficient use of memory (a.k.a., heap blowup)
    - This issue mostly arises with programs that use more threads than there are CPUs available on the system. This applies e.g., to server SW, where performance is hampered most by I/O. Some threads will therefore run concurrently, i.e., interleaved, rather than in parallel. A subset of the threads will consequently be idling in the ready or waiting state. The memory cached in the tcache of an idling thread therefore isn't being used. Also, running threads which need to allocate memory cannot use the already available free memory, as it's tied up in the tcaches of idling threads. This forces running threads to request additional memory from the OS, thereby needlessly increasing the memory footprint of the program.
  - One remedy to this issue are so-called CPU-level caches (ccaches)
- Heap memory caches are maintained on the CPU level, rather than the thread level
  - The heaps (for each thread) are now allocated as part of a global per-CPU data structure which is modified using RSEQ CSs
- PRO: Reduced memory footprint, as threads can now always allocate memory from the ccache corresponding to the CPU on which they're currently running.
  - A thread can access its corresponding heap by indexing via `cpu_id` or `mm_cid` into the global data structure
    - `cpu_id` has pathological usage patterns though, which can also lead to heap blowup: When indexing with `cpu_id`, a single thread will fill and deplete the ccache of every CPU it's scheduled on. The thread might run on a CPU whose ccache is currently empty when it tries to allocate memory. A ccache of a different CPU however might be full. Memory is therefore available, but cannot be accessed by the thread. This forces the thread to request additional memory from the OS, resulting in heap blowup (again). Indexing with `mm_cid` avoids this issue.
- Both rpmalloc and jemalloc have been adapted to support ccaches (indexable with `cpu_id` and `mm_cid`)
- Rpmalloc e.g., supports both a Level 1 cache (which can be either a tcache or a ccache) and an optional Level 2 cache (global)