Deadline servers as a realtime throttling replacement
The POSIX realtime scheduling classes are conceptually simple; at any given time, the task with the highest priority runs to the exclusion of anything else. In the real world, though, the rule enables a runaway realtime task to take over the system to the point that the only way to recover it may be to pull the plug. Power failures, as it turns out, have an even higher priority than realtime tasks.
Yanking out the power cord is aesthetically displeasing to many, though, and tends to cause realtime deadlines to be missed; in an attempt to avoid it, the kernel developers introduced realtime throttling many years ago. In short, realtime throttling restricts realtime tasks to (by default) 95% of the available CPU time; the remaining 5% is left for lower-priority tasks, with the idea that it is enough for an administrator to kill off a runaway task if need be.
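For reference, the throttling budget is controlled by the sched_rt_runtime_us and sched_rt_period_us sysctl knobs; the defaults of 950,000µs of runtime per 1,000,000µs period are where the 95% figure comes from. A minimal sketch of inspecting them:

    /* Read the realtime-throttling knobs from /proc/sys; the defaults
       (950000 out of every 1000000 microseconds) are where the 95%
       figure above comes from. */
    #include <stdio.h>
    #include <stdlib.h>

    static long read_knob(const char *path)
    {
        FILE *f = fopen(path, "r");
        long val;

        if (!f || fscanf(f, "%ld", &val) != 1) {
            fprintf(stderr, "could not read %s\n", path);
            exit(1);
        }
        fclose(f);
        return val;
    }

    int main(void)
    {
        long runtime = read_knob("/proc/sys/kernel/sched_rt_runtime_us");
        long period  = read_knob("/proc/sys/kernel/sched_rt_period_us");

        if (runtime < 0)        /* -1 means throttling is disabled */
            printf("realtime throttling is disabled\n");
        else
            printf("realtime tasks get %ld of every %ld microseconds (%.1f%%)\n",
                   runtime, period, 100.0 * runtime / period);
        return 0;
    }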
Most of the time, this throttling is not a problem. In a properly designed realtime system, the actual realtime work should be using far less than 95% of the available CPU time anyway, so the throttling will never actually happen. But, in cases where a realtime task does need all of the available CPU time for an extended period, realtime throttling can be a problem. This is especially true because the throttling happens even if there are no lower-priority tasks waiting to run. Rather than run the realtime task that still needs CPU, the scheduler will simply force the system idle in this case. The idle time is an unwanted artifact of how the throttling is implemented rather than a desired feature in its own right.
Various efforts have been made to address this problem over the years; one approach, described in an earlier article, would disable realtime throttling when it would otherwise force the system idle. The deadline-server idea is a different approach to the problem, based on the deadline scheduling class. This class, which has a higher priority than the POSIX realtime classes, is not priority-based; instead, tasks declare the amount of CPU time they need and the time by which they must receive it, and the deadline scheduler works to ensure that those tasks meet their deadlines.
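For readers unfamiliar with the class: a task enters it with the sched_setattr() system call, supplying the runtime it needs, the deadline by which it must receive that time, and the period with which the pattern repeats. A minimal sketch of a task reserving 5% of a CPU (50ms out of every second; the numbers are purely illustrative) might look like this; glibc has historically provided no wrapper for sched_setattr(), so the raw system call is used:

    /* A task placing itself in the deadline class and reserving 5% of a
       CPU: 50ms of runtime out of every 1s period (illustrative numbers).
       The deadline parameters are given in nanoseconds. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/sched.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;     /* CPU time needed per period */
        uint64_t sched_deadline;    /* ...delivered by this time  */
        uint64_t sched_period;      /* ...in each period          */
    };

    int main(void)
    {
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy   = SCHED_DEADLINE;
        attr.sched_runtime  =   50 * 1000 * 1000;     /*   50ms */
        attr.sched_deadline = 1000 * 1000 * 1000;     /* 1000ms */
        attr.sched_period   = 1000 * 1000 * 1000;     /* 1000ms */

        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
            perror("sched_setattr");    /* needs root or CAP_SYS_NICE */
            return 1;
        }
        /* The scheduler now guarantees this task 50ms of CPU per second. */
        pause();
        return 0;
    }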
This class thus seems like a natural way to take back 5% of the CPU from realtime tasks when needed. All that is needed is to create a task in the deadline class (called the "deadline server"), declare that it needs 5% of the CPU, and have that task run lower-priority tasks with the time that it is given. The scheduler will then carve out the necessary CPU time but, if the deadline server doesn't need it, it will simply not be runnable and the realtime tasks can continue to run.
The idea, as implemented in the patch set from Daniel Bristot de Oliveira (which contains patches from Peter Zijlstra and Juri Lelli), does the job reasonably well, in that it makes space for lower-priority tasks without needlessly causing the CPU to go idle. The fact that the deadline class has a higher priority than the realtime classes makes this idea work, but it also brings one little problem: once the deadline server is enabled, it will run immediately, perhaps preempting a realtime task that would have eventually yielded anyway. The lower-priority tasks should get their 5%, but giving it to them immediately may create problems for well-behaved realtime tasks.
The proposed solution here is to delay the enabling of the deadline server. A kernel timer is used to occasionally run a watchdog function that looks at the state of the normal-priority tasks on the system. If it appears that those tasks are being starved — with starvation defined as not getting any CPU time over a half-second — then the deadline server will be started. Otherwise, in the absence of starvation problems, scheduling will run as usual.
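The actual watchdog lives in the scheduler core, but the decision it makes is simple. Purely as an illustration of the policy (the names and structure below belong to this sketch, not to the kernel):

    /* Standalone sketch of the policy described above: start the deadline
       server only when normal tasks have been starved for half a second.
       This is not the kernel's implementation. */
    #include <stdbool.h>
    #include <stdio.h>

    #define STARVATION_NS (500ULL * 1000 * 1000)    /* half a second */

    struct cpu_state {
        unsigned long long last_fair_run;   /* when a normal task last ran */
        bool dl_server_active;
    };

    /* Called periodically from a timer in the real patch set. */
    static void starvation_check(struct cpu_state *cpu, unsigned long long now)
    {
        if (!cpu->dl_server_active && now - cpu->last_fair_run >= STARVATION_NS) {
            cpu->dl_server_active = true;
            printf("normal tasks starved, starting the deadline server\n");
        }
    }

    int main(void)
    {
        struct cpu_state cpu = { .last_fair_run = 0, .dl_server_active = false };

        starvation_check(&cpu, 100ULL * 1000 * 1000);  /* 100ms: no action   */
        starvation_check(&cpu, 600ULL * 1000 * 1000);  /* 600ms: server runs */
        return 0;
    }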
With this tweak, the work is moving "in the right direction", Bristot said, but there is still room for improvement. The startup of the deadline server could be further delayed to the "zero-laxity" time — the time just before it would miss its 5% deadline entirely. The starvation monitor could perhaps be moved to CPUs that are not running realtime tasks to prevent interference there. In general, though, this work looks like it could be a plausible solution to the realtime-throttling problem.
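To make the "zero-laxity" notion concrete: it is the last instant at which the server could start running and still receive its full runtime before its deadline. With the illustrative 50ms-per-second reservation used in the sketch above, that point falls 950ms into each period:

    /* Worked example of the zero-laxity time: the latest moment at which
       the deadline server can begin running and still fit its remaining
       runtime in before the deadline.  The 50ms/1s numbers are the same
       illustrative reservation used earlier. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long period_ns   = 1000ULL * 1000 * 1000;  /* 1s   */
        unsigned long long runtime_ns  =   50ULL * 1000 * 1000;  /* 50ms */
        unsigned long long deadline_ns = period_ns;  /* deadline == period */

        /* Laxity at time t is deadline - t - remaining runtime; it hits
           zero at deadline - runtime. */
        printf("zero-laxity point: %llu ms into the period\n",
               (deadline_ns - runtime_ns) / 1000000);
        return 0;
    }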
Posted Jun 12, 2023 22:00 UTC (Mon) by geofft (subscriber, #59789)
Is there in fact any overlap in practice? For instance, could something like a realtime policy be useful for servers running web applications and similar workloads that traditionally haven't been considered realtime, to achieve a similar effect to the CFS quota mechanism? In some sense they're opposites, in that realtime guarantees processes are scheduled at least so much, and quotas guarantee processes are scheduled at most so much. But for most users of the CFS quota mechanism, limiting a process's time isn't their actual goal, it's just a means towards ensuring that all other processes do get their fair share of time. Something else that guarantees each process a minimum time (where the kernel ensures that all the minimums add up to no more than 100% of the system) would also work.
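For context, the CFS quota mechanism mentioned here is the cgroup-v2 cpu.max interface, which caps a group at a quota per period (the "at most" limit the comment describes). A minimal sketch of imposing a 50% cap, assuming a cgroup named "webapp" already exists (the name and numbers are purely illustrative):

    /* Set a CFS quota through the cgroup-v2 cpu.max file: at most 50ms
       of CPU per 100ms period, a 50% cap.  The "webapp" cgroup is assumed
       to exist already; the path and numbers are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/fs/cgroup/webapp/cpu.max";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return 1;
        }
        /* The format is "$MAX $PERIOD" in microseconds; "max" lifts the cap. */
        fprintf(f, "50000 100000\n");
        return fclose(f) ? 1 : 0;
    }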
Posted Jun 15, 2023 4:30 UTC (Thu) by alison (subscriber, #63752)
Realtime currently guarantees only that the highest priority processes run until they block, yield to interrupt handlers, exit or somehow else yield the processor. That's exactly the problem throttling is trying to solve. SCHED_FIFO makes no attempt to be "fair." "at least so much" plays no role. If the highest priority thread on a core wants to run forever, it can, and the second highest priority process can wait forever, although in real systems it may well migrate to another core.
Posted Jun 15, 2023 9:21 UTC (Thu) by farnz (subscriber, #17727)
And the problem case that throttling is meant to solve is the one where you have 20 CPU threads available (so 20 processes running simultaneously), and 20 SCHED_FIFO processes all trying to run forever (e.g. because there's a bug causing them to busy-loop instead of yielding). If that happens, we want it to be possible for the admin to access the system and take corrective action using their normal tools.
This is especially important since the admin may be accessing the system over a network, and we do not want to run remote access daemons as real-time tasks, since that opens up a DoS vector.
Posted Jun 18, 2023 17:33 UTC (Sun) by alison (subscriber, #63752)
What we really want is for the scheduler to manage the system in a latency-sensitive but efficient way, possibly also considering power usage. If a human ever has to log in to fix something, we have hard-failed.
Posted Jun 19, 2023 11:31 UTC (Mon) by farnz (subscriber, #17727)
I don't think that's a useful way to view things; if I, as the admin, accidentally run for(;;) {} as a SCHED_FIFO process at maximum priority, I've told the system that the most important thing to do is to busy-loop, and that busy-looping should take priority over any other work.
This is clearly a bug, and as admin I've clearly made a mistake doing this. But I need some way to recover from this mistake, short of pulling the power out and hoping that I've not accidentally set this up to busy-loop on boot; this is the whole case for throttling, since the scheduler cannot distinguish a busy-loop that stops me doing real work from a hard real time task that really does need 99% of my CPU to meet its deadlines.
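As a concrete illustration of the mistake being described (and emphatically not something to run on a machine one cares about), a handful of lines is all it takes:

    /* The mistake described above: a busy loop running as SCHED_FIFO at
       the maximum priority.  Without throttling (or a deadline server),
       this monopolizes its CPU; do not run it anywhere that matters. */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = {
            .sched_priority = sched_get_priority_max(SCHED_FIFO),
        };

        if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
            perror("sched_setscheduler");   /* needs root or CAP_SYS_NICE */
            return 1;
        }
        for (;;)
            ;   /* never blocks, never yields */
    }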
Posted Jun 13, 2023 14:39 UTC (Tue) by zeno_kdab (guest, #165579)
Posted Jun 13, 2023 15:14 UTC (Tue) by droundy (subscriber, #4559)
Posted Jun 13, 2023 15:40 UTC (Tue) by zeno_kdab (guest, #165579)
Posted Jun 13, 2023 19:00 UTC (Tue) by farnz (subscriber, #17727)
It's exactly that, and it's a neat implementation simplification to produce the desired outcome.
The real feature we want is to prevent real-time tasks from blocking normal tasks (such as your remote access daemon) from running at all; it should always be possible to SSH into a system and kill off rogue processes to recover control, assuming you run an SSH daemon.
Today's implementation handles that by saying that real-time tasks are only permitted 95% of CPU cycles, and we'll go idle rather than let them reach 100% of CPU cycles. But there's a better policy we've been trying to find a way to implement for some time, where real-time tasks are allowed 100% of CPU cycles, but become lower priority than normal tasks if, and only if, they're hogging the CPU; the "deadline server" is a nice trick for implementing that policy, exploiting the fact that deadline tasks are even higher priority than real-time tasks.
Posted Jun 15, 2023 18:26 UTC (Thu) by zeno_kdab (guest, #165579)
My personal feeling is still kind of that it might be better to just run the ssh server as that deadline task, but I do admit I am a backseat purist architect ;)
Posted Jun 15, 2023 18:54 UTC (Thu) by zeno_kdab (guest, #165579)
Posted Jun 13, 2023 17:41 UTC (Tue) by abatters (✭ supporter ✭, #6932)
"This class, which has a higher priority than the POSIX realtime classes"
vs.
https://lwn.net/Articles/934142/
"we needed to swap the order of deadline.c and rt.c among the scheduling classes in the Linux kernel, thus giving POSIX realtime tasks priority over deadline tasks."
"Interestingly, the possibility to swap rt.c and deadline.c in the kernel, or even to possibly make it a tunable sysfs option, was discussed for other reasons in other talks throughout OSPM."
Conflict detected...
Posted Jun 16, 2023 0:48 UTC (Fri) by Fowl (subscriber, #65667)
Keeping the CPU at 95% might be keeping things within a thermal/power envelope - if the CPU is actually at 100% more of the time, things could be scaled up, leading to expenditure of the thermal budget, followed by throttling; repeat. These fast/slow cycles could be a major source of unpredictable/variable latency.
Posted Jun 17, 2023 8:55 UTC (Sat) by anton (subscriber, #25547)
Running a CPU 95% of the time (and halting it for 5%) is similar to throttling in its effect on power consumption. When the CPU is at the power or thermal limit, it will generally perform more work if you let it run all the time than if you let it run only 95% of the time. "More work" will not be 1/0.95 times as much, but it will be >1.

If the system is competently cooled, it reaches the power limit before it reaches the thermal limit, and it reaches the thermal limit before it starts throttling. The answer to reaching the power limit or thermal limit is to lower the clock rate and the voltage, which means that the power consumption is reduced more than the performance. OTOH, throttling is an emergency mechanism that just skips clock cycles without reducing voltage (or reducing the actual clock rate, although the effective clock is reduced).
"we needed to swap the order of deadline.c and rt.c among the scheduling classes in the Linux kernel, thus giving POSIX realtime tasks priority over deadline tasks."
Deadline servers as a realtime throttling replacement
If the system is competently cooled, it reaches the power limit before it reaches the thermal limit, and it reaches the thermal limit before it starts throttling. The answer to reaching the power limit or thermal limit is to lower the clock rate and the voltage, which means that the power consumption is reduced more than the performance. OTOH, throttling is an emergency mechanism that just skips clock cycles without reducing voltage (or reducing the actual clock rate, although the effective clock is reduced).
Deadline servers as a realtime throttling replacement