1½ Topics: realtime throttling and user-space adaptive spinning
Realtime throttling
Fernandes began with a quick overview of the scheduling classes supported by the kernel. The class known as the completely fair scheduler (CFS) is the normal, default scheduler; it treats all tasks equally and shares the CPU between them. The realtime scheduler implements two scheduling classes (FIFO and round-robin) at a higher priority; at any given time, the highest-priority realtime task will be given access to the CPU. Or, at least, that is true if there are no tasks running in the deadline class. These tasks do not have priorities; instead, they tell the system how much CPU time they need and how soon they must get it. The deadline scheduler will then select tasks in a way that ensures that they all meet their deadlines.
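For illustration, here is a minimal sketch of how a thread might place itself into the deadline class with sched_setattr(); the runtime, deadline, and period values are invented for the example, and the call needs privilege (CAP_SYS_NICE) to succeed:

    /* Minimal sketch: ask for 2ms of CPU time in every 10ms period.
     * Recent kernels export struct sched_attr in <linux/sched/types.h>;
     * older systems may need to declare it by hand. */
    #define _GNU_SOURCE
    #include <linux/sched.h>        /* SCHED_DEADLINE */
    #include <linux/sched/types.h>  /* struct sched_attr */
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  =  2 * 1000 * 1000, /* 2ms of CPU time... */
            .sched_deadline = 10 * 1000 * 1000, /* ...delivered within 10ms... */
            .sched_period   = 10 * 1000 * 1000, /* ...out of every 10ms */
        };

        /* There is no glibc wrapper for sched_setattr(). */
        if (syscall(SYS_sched_setattr, 0, &attr, 0))
            perror("sched_setattr");
        return 0;
    }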
The Chrome OS system is, Fernandes said, "scheduler-heavy". The browser
that forms the user interface for the system runs as a whole set of
cooperating processes, each of which consists of multiple threads. The
browser process handles overall user interaction, render processes fill
pages with content, and the "viz process" handles the graphics. Numerous
threads are involved when something happens. Fernandes traced a "mouse
down" event through the browser, outlining about ten thread transitions —
and that was if the event did not result in any display changes. If
something delays any one of those threads from running, the result will be
increased input latency.
Looking at the system, Fernandes found that the interrupt-handling thread that first receives the event runs at realtime priority; everything else was in the CFS scheduling class with a mix of priorities. He wondered whether simply moving everything to the realtime class would help the situation; an experiment showed a 32% decrease in mouse-event latency, which would seem to indicate that the change is worthwhile. The only problem is that some of the threads involved can, for example, run JavaScript programs and stay on the CPU for a long time. If the user loads a page containing a cryptocurrency-mining program, they are likely to be unhappy if it runs at realtime priority; nothing else will be able to run at all.
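The experiment itself requires nothing exotic; a thread can be moved into the realtime (round-robin) class with the standard pthread_setschedparam() call, roughly as in the sketch below. The priority value is arbitrary, and the call will fail without CAP_SYS_NICE or an adequate RLIMIT_RTPRIO:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Move the given thread into SCHED_RR at the given priority
     * (1-99); needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO. */
    static void make_realtime(pthread_t thread, int priority)
    {
        struct sched_param sp = { .sched_priority = priority };
        int err = pthread_setschedparam(thread, SCHED_RR, &sp);

        if (err)
            fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
    }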
The scheduler developers tried to address this kind of problem many years ago with a mechanism known as realtime throttling, which limits realtime tasks to 95% of the available CPU time. Once that limit has been hit, the realtime task(s) will be set aside for a little while to allow CFS tasks to run, hopefully giving a system administrator a chance to deal with a runaway realtime task. The problem with realtime throttling, Fernandes said, is that it is "horrible and broken". Specifically, that last 5% of the available CPU time will never be given to realtime tasks, even if the alternative is for the CPU to go idle; CPU time is simply wasted at a time when a realtime task would like to be running.
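The 95% figure is not hard-wired; it is the ratio of two sysctl knobs. A quick way to check them on a given system (the values in the comments are the usual kernel defaults) might be:

    #include <stdio.h>

    /* Print the throttling knobs; by default runtime/period is
     * 950000/1000000 microseconds, i.e. the 95% limit. */
    int main(void)
    {
        const char *knobs[] = {
            "/proc/sys/kernel/sched_rt_period_us",  /* default 1000000 */
            "/proc/sys/kernel/sched_rt_runtime_us", /* default  950000 */
        };

        for (int i = 0; i < 2; i++) {
            FILE *f = fopen(knobs[i], "r");
            long val;

            if (f && fscanf(f, "%ld", &val) == 1)
                printf("%s = %ld\n", knobs[i], val);
            if (f)
                fclose(f);
        }
        return 0;
    }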
Within the kernel, the throttling mechanism works by simply removing the realtime run queues from the scheduling hierarchy, making them invisible to the task-selection code, until a suitable amount of time passes. So the way to improve the situation seemed clear enough: simply keep the realtime run queue in place, but skip over it if the system is meant to be throttling — unless the CPU would go idle. An implementation was put together, but it did not survive its encounter with the scheduling community at the recently concluded OSPM Summit.
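The proposed selection rule can be modeled with a toy function; pick_next() and its boolean inputs below are hypothetical stand-ins for the real runqueue state, not actual kernel code:

    #include <stdbool.h>
    #include <stdio.h>

    enum pick { PICK_RT, PICK_CFS, PICK_IDLE };

    /* Toy model of the proposed rule: pass over a throttled realtime
     * runqueue only if there is CFS work to run instead; never idle
     * the CPU while a realtime task is runnable. */
    static enum pick pick_next(bool rt_runnable, bool rt_throttled,
                               bool cfs_runnable)
    {
        if (rt_runnable && (!rt_throttled || !cfs_runnable))
            return PICK_RT;
        if (cfs_runnable)
            return PICK_CFS;
        return PICK_IDLE;
    }

    int main(void)
    {
        /* Throttled RT task, no CFS work: current kernels would
         * idle; the proposed rule runs the RT task anyway. */
        printf("%s\n", pick_next(true, true, false) == PICK_RT ?
               "run RT" : "idle");
        return 0;
    }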
As Fernandes put it, the scheduler developers are uninterested in fixes to the realtime-throttling code. They would rather do away with it completely, but they do not want to lose the ability to recover a runaway system. The alternative approach that emerged from the OSPM discussion was to implement throttling by temporarily boosting the CFS scheduler into the deadline class, putting it above realtime tasks in the hierarchy. This idea, Fernandes said, "might work", and it would be somewhat less complex.
The specific changes required build, in large part, on contributions from Peter Zijlstra and Juri Lelli. The plan is to create a fake deadline task (called fair_server) that would be given 5% of the available CPU time. The first CFS task to wake on a given CPU will (within the kernel) start the deadline server task, while the last task to sleep will stop it. This task will run CFS tasks as usual, but within the deadline class, which will automatically limit its running time to 5% of the available time.
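In greatly simplified form, the start/stop logic might look like the toy sketch below; cpu_state, cfs_enqueue(), and cfs_dequeue() are hypothetical names standing in for the real runqueue hooks, not the actual patches:

    #include <stdbool.h>
    #include <stdio.h>

    struct cpu_state {
        int  nr_cfs_running;      /* CFS tasks queued on this CPU */
        bool fair_server_active;  /* the 5% deadline reservation */
    };

    /* The first CFS task to wake starts the deadline server... */
    static void cfs_enqueue(struct cpu_state *cpu)
    {
        if (cpu->nr_cfs_running++ == 0)
            cpu->fair_server_active = true;
    }

    /* ...and the last one to sleep stops it again. */
    static void cfs_dequeue(struct cpu_state *cpu)
    {
        if (--cpu->nr_cfs_running == 0)
            cpu->fair_server_active = false;
    }

    int main(void)
    {
        struct cpu_state cpu = { 0, false };

        cfs_enqueue(&cpu);
        printf("server active: %d\n", cpu.fair_server_active); /* 1 */
        cfs_dequeue(&cpu);
        printf("server active: %d\n", cpu.fair_server_active); /* 0 */
        return 0;
    }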
There are some details to deal with, such as ensuring that the fair_server task runs at the end of the throttling period rather than the beginning. But it appears that there is a solution for realtime throttling in sight; stay tuned.
Adaptive spinning in user space
Mutexes and similar mutual-exclusion primitives can be found in both the kernel and in user space. The kernel has a distinct advantage, though, in that it can employ adaptive spinning in its locks. If a mutex is found to be unavailable when needed, the losing thread can simply block until the mutex is freed again. However, better performance will usually be had if, before blocking, the thread simply spins for a while waiting for the lock to be freed. Often, that will happen quickly, the lock can change hands immediately, and a fair amount of overhead is avoided.
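In user space, the non-adaptive version of this pattern looks roughly like the sketch below; the lock layout (0 = free, 1 = held), the SPIN_LIMIT value, and the always-wake unlock are simplifications for illustration, not a production design:

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 100  /* arbitrary; real code would tune this */

    static void lock(atomic_int *word)
    {
        int expected;

        /* Spin briefly in the hope that the lock is freed soon. */
        for (int i = 0; i < SPIN_LIMIT; i++) {
            expected = 0;
            if (atomic_compare_exchange_strong(word, &expected, 1))
                return;
        }
        /* Give up and sleep in the kernel until the word changes. */
        for (;;) {
            expected = 0;
            if (atomic_compare_exchange_strong(word, &expected, 1))
                return;
            syscall(SYS_futex, word, FUTEX_WAIT, 1, NULL, NULL, 0);
        }
    }

    static void unlock(atomic_int *word)
    {
        atomic_store(word, 0);
        /* With this simplified two-state protocol, every unlock
         * must wake a potential waiter. */
        syscall(SYS_futex, word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }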
This technique only works, though, if the thread holding the lock is known
to be executing on another CPU. If that thread is blocked waiting for
something else, the wait for the lock to become free could go on for a long
time. Even worse, if that thread is waiting for the CPU that is spinning
on the lock, that spinning will actively prevent the lock from being freed.
So kernel code will only spin in this way after verifying that the lock
holder is actively running somewhere else.
User space is unable to make that determination, though, so it is unable to safely spin on a contended lock. Almeida would like to improve that situation. Getting there, he said, requires making two separate pieces of information available to the thread that is trying to acquire a futex (the user-space mutex implementation in Linux): which thread currently holds the futex, and whether that thread is currently running somewhere in the system.
The current owner of the futex is actually a difficult thing for the kernel to provide: it simply does not know the answer. The futex code has been carefully crafted to avoid involving the kernel at all when the lock is uncontended. To get around this, Almeida suggests creating a convention where the ID of the owning thread is stored in the lock when it is acquired. Any other thread wanting to acquire the lock could immediately see which thread currently owns it.
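Such a convention can be implemented entirely in user space; the sketch below (try_lock() and lock_owner() are invented names) stores the acquiring thread's ID in the lock word with a compare-and-swap, much as PI futexes already do:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Acquire by installing our thread ID in place of zero; a zero
     * word means "unlocked", anything else is the owner's TID. */
    static bool try_lock(atomic_uint *word)
    {
        unsigned int unlocked = 0;
        unsigned int tid = syscall(SYS_gettid);

        return atomic_compare_exchange_strong(word, &unlocked, tid);
    }

    /* A contender that fails try_lock() can read the owner's TID,
     * then decide whether that thread is worth spinning on. */
    static unsigned int lock_owner(atomic_uint *word)
    {
        return atomic_load(word);
    }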
There is also currently no way to ask the kernel whether a given thread is running or not. The information available in /proc only provides process-level information. Almeida asked whether it would be possible to add a system call to query the running status of a thread, but Steve Rostedt pointed out that a system call would defeat the purpose. The cost of entering the kernel to get that information would far outweigh the gain from spinning on the futex. Any solution to this problem has to work purely in user space, he said.
He continued by saying that there might just be a solution. The recently merged user trace events mechanism allows the kernel to inform user space when a specific tracepoint has been enabled by setting a bit in shared memory. Something similar could perhaps be set up to indicate which threads are running at any given time. An alternative, suggested by your editor (who had just conceived the idea and recognizes that it probably lacks merit) might be to hook into the restartable sequences virtual CPU ID feature, which is designed to efficiently communicate information about which CPU each thread is running on at any given time.
Almeida left the session with some thoughts on how to further pursue this
idea. He will have his work cut out for him, though; the desire to
implement adaptive spinning futexes has come and gone before. Darren Hart
made an attempt at an implementation in
2010, and Waiman Long tried in 2016, but
neither approach was merged. Maybe the third spin will be the charm for
this challenge.
Index entries for this article:
    Kernel: Scheduler/Realtime
    Kernel: Spinlocks/User-space
    Conference: Open Source Summit North America/2023
Posted May 13, 2023 18:06 UTC (Sat) by quotemstr (subscriber, #45331):
Maybe it's the right time to try that approach.
Posted May 23, 2023 16:56 UTC (Tue) by tbird20d (subscriber, #1901):
However, it appears that the article is describing a soft realtime system (a browser?), and they're trying to eke performance out of it. That seems contradictory to the goals of realtime. That is, usually there's a tradeoff between realtime and throughput, where you only get to prioritize one or the other. Am I missing something? (I'm sure I am.)
Posted Jun 1, 2023 11:07 UTC (Thu) by zse (guest, #120483):
From my understanding, simply always spinning for a short duration before making the waiting syscall should be cheap enough (assuming a good duration is chosen) when using appropriate pause instructions. Adaptive spinning would also have to have a spin limit (there is still no guarantee that the waited-on thread continues to run after the check, or that it gives up the lock anytime soon), so it would give a possible small advantage only in very specific situations, while also introducing significant bookkeeping overhead. Just my 2 cents.