1½ Topics: realtime throttling and user-space adaptive spinning
Realtime throttling
Fernandes began with a quick overview of the scheduling classes supported by the kernel. The class known as the completely fair scheduler (CFS) is the normal, default scheduler; it treats all tasks equally and shares the CPU between them. The realtime scheduler implements two scheduling classes (FIFO and round-robin) at a higher priority; at any given time, the highest-priority realtime task will be given access to the CPU. Or, at least, that is true if there are no tasks running in the deadline class. These tasks do not have priorities; instead, they tell the system how much CPU time they need and how soon they must get it. The deadline scheduler will then select tasks in a way that ensures that they all meet their deadlines.
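For illustration, here is a minimal sketch of how a thread might place itself into the deadline class with sched_setattr(); the runtime, deadline, and period values are invented for the example, and the call needs privilege (CAP_SYS_NICE) to succeed:

    /* Minimal sketch: ask for 2ms of CPU time in every 10ms period.
     * Recent kernels export struct sched_attr in <linux/sched/types.h>;
     * older systems may need to declare it by hand. */
    #define _GNU_SOURCE
    #include <linux/sched.h>        /* SCHED_DEADLINE */
    #include <linux/sched/types.h>  /* struct sched_attr */
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  =  2 * 1000 * 1000, /* 2ms of CPU time... */
            .sched_deadline = 10 * 1000 * 1000, /* ...delivered within 10ms... */
            .sched_period   = 10 * 1000 * 1000, /* ...out of every 10ms */
        };

        /* There is no glibc wrapper for sched_setattr(). */
        if (syscall(SYS_sched_setattr, 0, &attr, 0))
            perror("sched_setattr");
        return 0;
    }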
The Chrome OS system is, Fernandes said, "scheduler-heavy". The browser
that forms the user interface for the system runs as a whole set of
cooperating processes, each of which consists of multiple threads. The
browser process handles overall user interaction, render processes fill
pages with content, and the "viz process" handles the graphics. Numerous
threads are involved when something happens. Fernandes traced a "mouse
down" event through the browser, outlining about ten thread transitions —
and that was if the event did not result in any display changes. If
something delays any one of those threads from running, the result will be
increased input latency.
Looking at the system, Fernandes found that the interrupt-handling thread that first receives the event runs at realtime priority; everything else was in the CFS scheduling class with a mix of priorities. He wondered whether simply moving everything to the realtime class would help the situation; an experiment showed a 32% decrease in mouse-event latency, which would seem to indicate that the change is worthwhile. The only problem is that some of the threads involved can, for example, run JavaScript programs and stay on the CPU for a long time. If the user loads a page containing a cryptocurrency-mining program, they are likely to be unhappy if it runs at realtime priority; nothing else will be able to run at all.
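The experiment itself requires nothing exotic; a thread can be moved into the realtime (round-robin) class with the standard pthread_setschedparam() call, roughly as in the sketch below. The priority value is arbitrary, and the call will fail without CAP_SYS_NICE or an adequate RLIMIT_RTPRIO:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Move the given thread into SCHED_RR at the given priority
     * (1-99); needs CAP_SYS_NICE or a suitable RLIMIT_RTPRIO. */
    static void make_realtime(pthread_t thread, int priority)
    {
        struct sched_param sp = { .sched_priority = priority };
        int err = pthread_setschedparam(thread, SCHED_RR, &sp);

        if (err)
            fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
    }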
The scheduler developers tried to address this kind of problem many years ago with a mechanism known as realtime throttling, which limits realtime tasks to 95% of the available CPU time. Once that limit has been hit, the realtime task(s) will be set aside for a little while to allow CFS tasks to run, hopefully giving a system administrator a chance to deal with a runaway realtime task. The problem with realtime throttling, Fernandes said, is that it is "horrible and broken". Specifically, that last 5% of the available CPU time will never be given to realtime tasks, even if the alternative is for the CPU to go idle; CPU time is simply wasted at a time when a realtime task would like to be running.
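The 95% figure is not hard-wired; it is the ratio of two sysctl knobs. A quick way to check them on a given system (the values in the comments are the usual kernel defaults) might be:

    #include <stdio.h>

    /* Print the throttling knobs; by default runtime/period is
     * 950000/1000000 microseconds, i.e. the 95% limit. */
    int main(void)
    {
        const char *knobs[] = {
            "/proc/sys/kernel/sched_rt_period_us",  /* default 1000000 */
            "/proc/sys/kernel/sched_rt_runtime_us", /* default  950000 */
        };

        for (int i = 0; i < 2; i++) {
            FILE *f = fopen(knobs[i], "r");
            long val;

            if (f && fscanf(f, "%ld", &val) == 1)
                printf("%s = %ld\n", knobs[i], val);
            if (f)
                fclose(f);
        }
        return 0;
    }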
Within the kernel, the throttling mechanism works by simply removing the realtime run queues from the scheduling hierarchy, making them invisible to the task-selection code, until a suitable amount of time passes. So the way to improve the situation seemed clear enough: simply keep the realtime run queue in place, but skip over it if the system is meant to be throttling — unless the CPU would go idle. An implementation was put together, but it did not survive its encounter with the scheduling community at the recently concluded OSPM Summit.
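The proposed selection rule can be modeled with a toy function; pick_next() and its boolean inputs below are hypothetical stand-ins for the real runqueue state, not actual kernel code:

    #include <stdbool.h>
    #include <stdio.h>

    enum pick { PICK_RT, PICK_CFS, PICK_IDLE };

    /* Toy model of the proposed rule: pass over a throttled realtime
     * runqueue only if there is CFS work to run instead; never idle
     * the CPU while a realtime task is runnable. */
    static enum pick pick_next(bool rt_runnable, bool rt_throttled,
                               bool cfs_runnable)
    {
        if (rt_runnable && (!rt_throttled || !cfs_runnable))
            return PICK_RT;
        if (cfs_runnable)
            return PICK_CFS;
        return PICK_IDLE;
    }

    int main(void)
    {
        /* Throttled RT task, no CFS work: current kernels would
         * idle; the proposed rule runs the RT task anyway. */
        printf("%s\n", pick_next(true, true, false) == PICK_RT ?
               "run RT" : "idle");
        return 0;
    }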
As Fernandes put it, the scheduler developers are uninterested in fixes to the realtime-throttling code. They would rather do away with it completely, but they do not want to lose the ability to recover a runaway system. The alternative approach that emerged from the OSPM discussion was to implement throttling by temporarily boosting the CFS scheduler into the deadline class, putting it above realtime tasks in the hierarchy. This idea, Fernandes said, "might work", and it would be somewhat less complex.
The specific changes required build, in large part, on contributions from Peter Zijlstra and Juri Lelli. The plan is to create a fake deadline task (called fair_server) that would be given 5% of the available CPU time. The first CFS task to wake on a given CPU will (within the kernel) start the deadline server task, while the last task to sleep will stop it. This task will run CFS tasks as usual, but within the deadline class, which will automatically limit its running time to 5% of the available time.
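In greatly simplified form, the start/stop logic might look like the toy sketch below; cpu_state, cfs_enqueue(), and cfs_dequeue() are hypothetical names standing in for the real runqueue hooks, not the actual patches:

    #include <stdbool.h>
    #include <stdio.h>

    struct cpu_state {
        int  nr_cfs_running;      /* CFS tasks queued on this CPU */
        bool fair_server_active;  /* the 5% deadline reservation */
    };

    /* The first CFS task to wake starts the deadline server... */
    static void cfs_enqueue(struct cpu_state *cpu)
    {
        if (cpu->nr_cfs_running++ == 0)
            cpu->fair_server_active = true;
    }

    /* ...and the last one to sleep stops it again. */
    static void cfs_dequeue(struct cpu_state *cpu)
    {
        if (--cpu->nr_cfs_running == 0)
            cpu->fair_server_active = false;
    }

    int main(void)
    {
        struct cpu_state cpu = { 0, false };

        cfs_enqueue(&cpu);
        printf("server active: %d\n", cpu.fair_server_active); /* 1 */
        cfs_dequeue(&cpu);
        printf("server active: %d\n", cpu.fair_server_active); /* 0 */
        return 0;
    }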
There are some details to deal with, such as ensuring that the fair_server task runs at the end of the throttling period rather than the beginning. But it appears that there is a solution for realtime throttling in sight; stay tuned.
Adaptive spinning in user space
Mutexes and similar mutual-exclusion primitives can be found in both the kernel and in user space. The kernel has a distinct advantage, though, in that it can employ adaptive spinning in its locks. If a mutex is found to be unavailable when needed, the losing thread can simply block until the mutex is freed again. However, better performance will usually be had if, before blocking, the thread simply spins for a while waiting for the lock to be freed. Often, that will happen quickly, the lock can change hands immediately, and a fair amount of overhead is avoided.
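In user space, the non-adaptive version of this pattern looks roughly like the sketch below; the lock layout (0 = free, 1 = held), the SPIN_LIMIT value, and the always-wake unlock are simplifications for illustration, not a production design:

    #include <stdatomic.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define SPIN_LIMIT 100  /* arbitrary; real code would tune this */

    static void lock(atomic_int *word)
    {
        int expected;

        /* Spin briefly in the hope that the lock is freed soon. */
        for (int i = 0; i < SPIN_LIMIT; i++) {
            expected = 0;
            if (atomic_compare_exchange_strong(word, &expected, 1))
                return;
        }
        /* Give up and sleep in the kernel until the word changes. */
        for (;;) {
            expected = 0;
            if (atomic_compare_exchange_strong(word, &expected, 1))
                return;
            syscall(SYS_futex, word, FUTEX_WAIT, 1, NULL, NULL, 0);
        }
    }

    static void unlock(atomic_int *word)
    {
        atomic_store(word, 0);
        /* With this simplified two-state protocol, every unlock
         * must wake a potential waiter. */
        syscall(SYS_futex, word, FUTEX_WAKE, 1, NULL, NULL, 0);
    }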
This technique only works, though, if the thread holding the lock is known
to be executing on another CPU. If that thread is blocked waiting for
something else, the wait for the lock to become free could go on for a long
time. Even worse, if that thread is waiting for the CPU that is spinning
on the lock, that spinning will actively prevent the lock from being freed.
So kernel code will only spin in this way after verifying that the lock
holder is actively running somewhere else.
User space is unable to make that determination, though, so it is unable to safely spin on a contended lock. Almeida would like to improve that situation. Getting there, he said, requires making two separate pieces of information available to the thread that is trying to acquire a futex (the user-space mutex implementation in Linux): which thread currently holds the futex, and whether that thread is currently running somewhere in the system.
The current owner of the futex is actually a difficult thing for the kernel to provide: it simply does not know the answer. The futex code has been carefully crafted to avoid involving the kernel at all when the lock is uncontended. To get around this, Almeida suggests creating a convention where the ID of the owning thread is stored in the lock when it is acquired. Any other thread wanting to acquire the lock could immediately see which thread currently owns it.
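Such a convention can be implemented entirely in user space; the sketch below (try_lock() and lock_owner() are invented names) stores the acquiring thread's ID in the lock word with a compare-and-swap, much as PI futexes already do:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Acquire by installing our thread ID in place of zero; a zero
     * word means "unlocked", anything else is the owner's TID. */
    static bool try_lock(atomic_uint *word)
    {
        unsigned int unlocked = 0;
        unsigned int tid = syscall(SYS_gettid);

        return atomic_compare_exchange_strong(word, &unlocked, tid);
    }

    /* A contender that fails try_lock() can read the owner's TID,
     * then decide whether that thread is worth spinning on. */
    static unsigned int lock_owner(atomic_uint *word)
    {
        return atomic_load(word);
    }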
There is also currently no way to ask the kernel whether a given thread is running or not. The information available in /proc only provides process-level information. Almeida asked whether it would be possible to add a system call to query the running status of a thread, but Steve Rostedt pointed out that a system call would defeat the purpose. The cost of entering the kernel to get that information would far outweigh the gain from spinning on the futex. Any solution to this problem has to work purely in user space, he said.
He continued by saying that there might just be a solution. The recently merged user trace events mechanism allows the kernel to inform user space when a specific tracepoint has been enabled by setting a bit in shared memory. Something similar could perhaps be set up to indicate which threads are running at any given time. An alternative, suggested by your editor (who had just conceived the idea and recognizes that it probably lacks merit) might be to hook into the restartable sequences virtual CPU ID feature, which is designed to efficiently communicate information about which CPU each thread is running on at any given time.
Almeida left the session with some thoughts on how to further pursue this
idea. He will have his work cut out for him, though; the desire to
implement adaptive spinning futexes has come and gone before. Darren Hart
made an attempt at an implementation in
2010, and Waiman Long tried in 2016, but
neither approach was merged. Maybe the third spin will be the charm for
this challenge.
Index entries for this article:
    Kernel: Scheduler/Realtime
    Kernel: Spinlocks/User-space
    Conference: Open Source Summit North America/2023
Posted May 13, 2023 18:06 UTC (Sat) by quotemstr (subscriber, #45331):
Maybe it's the right time to try that approach.
Posted May 23, 2023 16:56 UTC (Tue) by tbird20d (subscriber, #1901):
However, it appears that the article is describing a soft realtime system (a browser?), and they're trying to eke performance out of it. That seems contradictory to the goals of realtime. That is, usually there's a tradeoff between realtime and throughput, where you only get to prioritize one or the other. Am I missing something? (I'm sure I am.)
Posted Jun 1, 2023 11:07 UTC (Thu) by zse (guest, #120483):
From my understanding, simply always spinning for a short duration before making the waiting syscall should be cheap enough (assuming a good duration is chosen) when using appropriate pause instructions. Adaptive spinning would also have to have a spin limit (there is still no guarantee that the waited-on thread continues to run after the check, or that it gives up the lock anytime soon), so it would give a possible small advantage only in very specific situations, while also introducing significant bookkeeping overhead. Just my 2 cents.