Deadline servers as a realtime throttling replacement
The POSIX realtime scheduling classes are conceptually simple; at any given time, the task with the highest priority runs to the exclusion of anything else. In the real world, though, the rule enables a runaway realtime task to take over the system to the point that the only way to recover it may be to pull the plug. Power failures, as it turns out, have an even higher priority than realtime tasks.
Yanking out the power cord is aesthetically displeasing to many, though, and tends to cause realtime deadlines to be missed; in an attempt to avoid it, the kernel developers introduced realtime throttling many years ago. In short, realtime throttling restricts realtime tasks to (by default) 95% of the available CPU time; the remaining 5% is left for lower-priority tasks, with the idea that it is enough for an administrator to kill off a runaway task if need be.
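For reference, the throttling budget is controlled by the sched_rt_runtime_us and sched_rt_period_us sysctl knobs; the defaults of 950,000µs of runtime per 1,000,000µs period are where the 95% figure comes from. A minimal sketch of inspecting them:

    /* Read the realtime-throttling knobs from /proc/sys; the defaults
       (950000 out of every 1000000 microseconds) are where the 95%
       figure above comes from. */
    #include <stdio.h>
    #include <stdlib.h>

    static long read_knob(const char *path)
    {
        FILE *f = fopen(path, "r");
        long val;

        if (!f || fscanf(f, "%ld", &val) != 1) {
            fprintf(stderr, "could not read %s\n", path);
            exit(1);
        }
        fclose(f);
        return val;
    }

    int main(void)
    {
        long runtime = read_knob("/proc/sys/kernel/sched_rt_runtime_us");
        long period  = read_knob("/proc/sys/kernel/sched_rt_period_us");

        if (runtime < 0)        /* -1 means throttling is disabled */
            printf("realtime throttling is disabled\n");
        else
            printf("realtime tasks get %ld of every %ld microseconds (%.1f%%)\n",
                   runtime, period, 100.0 * runtime / period);
        return 0;
    }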
Most of the time, this throttling is not a problem. In a properly designed realtime system, the actual realtime work should be using far less than 95% of the available CPU time anyway, so the throttling will never actually happen. But, in cases where a realtime task does need all of the available CPU time for an extended period, realtime throttling can be a problem. This is especially true because the throttling happens even if there are no lower-priority tasks waiting to run. Rather than run the realtime task that still needs CPU, the scheduler will simply force the system idle in this case. The idle time is an unwanted artifact of how the throttling is implemented rather than a desired feature in its own right.
Various efforts have been made to address this problem over the years; one approach, described in an earlier article, would disable realtime throttling when it would otherwise force the system idle. The deadline-server idea is a different approach to the problem, based on the deadline scheduling class. This class, which has a higher priority than the POSIX realtime classes, is not priority-based; instead, tasks declare the amount of CPU time they need and the time by which they must receive it, and the deadline scheduler works to ensure that those tasks meet their deadlines.
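For readers unfamiliar with the class: a task enters it with the sched_setattr() system call, supplying the runtime it needs, the deadline by which it must receive that time, and the period with which the pattern repeats. A minimal sketch of a task reserving 5% of a CPU (50ms out of every second; the numbers are purely illustrative) might look like this; glibc has historically provided no wrapper for sched_setattr(), so the raw system call is used:

    /* A task placing itself in the deadline class and reserving 5% of a
       CPU: 50ms of runtime out of every 1s period (illustrative numbers).
       The deadline parameters are given in nanoseconds. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/sched.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;     /* CPU time needed per period */
        uint64_t sched_deadline;    /* ...delivered by this time  */
        uint64_t sched_period;      /* ...in each period          */
    };

    int main(void)
    {
        struct sched_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy   = SCHED_DEADLINE;
        attr.sched_runtime  =   50 * 1000 * 1000;     /*   50ms */
        attr.sched_deadline = 1000 * 1000 * 1000;     /* 1000ms */
        attr.sched_period   = 1000 * 1000 * 1000;     /* 1000ms */

        if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
            perror("sched_setattr");    /* needs root or CAP_SYS_NICE */
            return 1;
        }
        /* The scheduler now guarantees this task 50ms of CPU per second. */
        pause();
        return 0;
    }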
This class thus seems like a natural way to take back 5% of the CPU from realtime tasks when needed. All that is needed is to create a task in the deadline class (called the "deadline server"), declare that it needs 5% of the CPU, and have that task run lower-priority tasks with the time that it is given. The scheduler will then carve out the necessary CPU time but, if the deadline server doesn't need it, it will simply not be runnable and the realtime tasks can continue to run.
The idea, as implemented in the patch set from Daniel Bristot de Oliveira (which contains patches from Peter Zijlstra and Juri Lelli), does the job reasonably well, in that it makes space for lower-priority tasks without needlessly causing the CPU to go idle. The fact that the deadline class has a higher priority than the realtime classes makes this idea work, but it also brings one little problem: once the deadline server is enabled, it will run immediately, perhaps preempting a realtime task that would have eventually yielded anyway. The lower-priority tasks should get their 5%, but giving it to them immediately may create problems for well-behaved realtime tasks.
The proposed solution here is to delay the enabling of the deadline server. A kernel timer is used to occasionally run a watchdog function that looks at the state of the normal-priority tasks on the system. If it appears that those tasks are being starved — with starvation defined as not getting any CPU time over a half-second — then the deadline server will be started. Otherwise, in the absence of starvation problems, scheduling will run as usual.
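The actual watchdog lives in the scheduler core, but the decision it makes is simple. Purely as an illustration of the policy (the names and structure below belong to this sketch, not to the kernel):

    /* Standalone sketch of the policy described above: start the deadline
       server only when normal tasks have been starved for half a second.
       This is not the kernel's implementation. */
    #include <stdbool.h>
    #include <stdio.h>

    #define STARVATION_NS (500ULL * 1000 * 1000)    /* half a second */

    struct cpu_state {
        unsigned long long last_fair_run;   /* when a normal task last ran */
        bool dl_server_active;
    };

    /* Called periodically from a timer in the real patch set. */
    static void starvation_check(struct cpu_state *cpu, unsigned long long now)
    {
        if (!cpu->dl_server_active && now - cpu->last_fair_run >= STARVATION_NS) {
            cpu->dl_server_active = true;
            printf("normal tasks starved, starting the deadline server\n");
        }
    }

    int main(void)
    {
        struct cpu_state cpu = { .last_fair_run = 0, .dl_server_active = false };

        starvation_check(&cpu, 100ULL * 1000 * 1000);  /* 100ms: no action   */
        starvation_check(&cpu, 600ULL * 1000 * 1000);  /* 600ms: server runs */
        return 0;
    }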
With this tweak, the work is moving "in the right direction", Bristot said, but there is still room for improvement. The startup of the deadline server could be further delayed to the "zero-laxity" time — the time just before it would miss its 5% deadline entirely. The starvation monitor could perhaps be moved to CPUs that are not running realtime tasks to prevent interference there. In general, though, this work looks like it could be a plausible solution to the realtime-throttling problem.
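To make the "zero-laxity" notion concrete: it is the last instant at which the server could start running and still receive its full runtime before its deadline. With the illustrative 50ms-per-second reservation used in the sketch above, that point falls 950ms into each period:

    /* Worked example of the zero-laxity time: the latest moment at which
       the deadline server can begin running and still fit its remaining
       runtime in before the deadline.  The 50ms/1s numbers are the same
       illustrative reservation used earlier. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long period_ns   = 1000ULL * 1000 * 1000;  /* 1s   */
        unsigned long long runtime_ns  =   50ULL * 1000 * 1000;  /* 50ms */
        unsigned long long deadline_ns = period_ns;  /* deadline == period */

        /* Laxity at time t is deadline - t - remaining runtime; it hits
           zero at deadline - runtime. */
        printf("zero-laxity point: %llu ms into the period\n",
               (deadline_ns - runtime_ns) / 1000000);
        return 0;
    }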
Posted Jun 12, 2023 22:00 UTC (Mon) by geofft (subscriber, #59789)
Is there in fact any overlap in practice? For instance, could something like a realtime policy be useful for servers running web applications and similar workloads that traditionally haven't been considered realtime, to achieve a similar effect to the CFS quota mechanism? In some sense they're opposites, in that realtime guarantees processes are scheduled at least so much, and quotas guarantee processes are scheduled at most so much. But for most users of the CFS quota mechanism, limiting a process's time isn't their actual goal, it's just a means towards ensuring that all other processes do get their fair share of time. Something else that guarantees each process a minimum time (where the kernel ensures that all the minimums add up to no more than 100% of the system) would also work.
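For context, the CFS quota mechanism mentioned here is the cgroup-v2 cpu.max interface, which caps a group at a quota per period (the "at most" limit the comment describes). A minimal sketch of imposing a 50% cap, assuming a cgroup named "webapp" already exists (the name and numbers are purely illustrative):

    /* Set a CFS quota through the cgroup-v2 cpu.max file: at most 50ms
       of CPU per 100ms period, a 50% cap.  The "webapp" cgroup is assumed
       to exist already; the path and numbers are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/fs/cgroup/webapp/cpu.max";
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            return 1;
        }
        /* The format is "$MAX $PERIOD" in microseconds; "max" lifts the cap. */
        fprintf(f, "50000 100000\n");
        return fclose(f) ? 1 : 0;
    }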
Posted Jun 15, 2023 4:30 UTC (Thu) by alison (subscriber, #63752)
Realtime currently guarantees only that the highest priority processes run until they block, yield to interrupt handlers, exit or somehow else yield the processor. That's exactly the problem throttling is trying to solve. SCHED_FIFO makes no attempt to be "fair." "at least so much" plays no role. If the highest priority thread on a core wants to run forever, it can, and the second highest priority process can wait forever, although in real systems it may well migrate to another core.
Posted Jun 15, 2023 9:21 UTC (Thu) by farnz (subscriber, #17727)
And the problem case that throttling is meant to solve is the one where you have 20 CPU threads available (so 20 processes running simultaneously), and 20 SCHED_FIFO processes all trying to run forever (e.g. because there's a bug causing them to busy-loop instead of yielding). If that happens, we want it to be possible for the admin to access the system and take corrective action using their normal tools.
This is especially important since the admin may be accessing the system over a network, and we do not want to run remote access daemons as real-time tasks, since that opens up a DoS vector.
Posted Jun 18, 2023 17:33 UTC (Sun) by alison (subscriber, #63752)
What we really want is for the scheduler to manage the system in a latency-sensitive but efficient way, possibly also considering power usage. If a human ever has to log in to fix something, we have hard-failed.
Posted Jun 19, 2023 11:31 UTC (Mon) by farnz (subscriber, #17727)
I don't think that's a useful way to view things; if I, as the admin, accidentally run for(;;) {} as a SCHED_FIFO process at maximum priority, I've told the system that the most important thing to do is to busy-loop, and that busy-looping should take priority over any other work.
This is clearly a bug, and as admin I've clearly made a mistake doing this. But I need some way to recover from this mistake, short of pulling the power out and hoping that I've not accidentally set this up to busy-loop on boot; this is the whole case for throttling, since the scheduler cannot distinguish a busy-loop that stops me doing real work from a hard real time task that really does need 99% of my CPU to meet its deadlines.
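As a concrete illustration of the mistake being described (and emphatically not something to run on a machine one cares about), a handful of lines is all it takes:

    /* The mistake described above: a busy loop running as SCHED_FIFO at
       the maximum priority.  Without throttling (or a deadline server),
       this monopolizes its CPU; do not run it anywhere that matters. */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = {
            .sched_priority = sched_get_priority_max(SCHED_FIFO),
        };

        if (sched_setscheduler(0, SCHED_FIFO, &sp)) {
            perror("sched_setscheduler");   /* needs root or CAP_SYS_NICE */
            return 1;
        }
        for (;;)
            ;   /* never blocks, never yields */
    }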
Posted Jun 13, 2023 14:39 UTC (Tue) by zeno_kdab (guest, #165579)
Posted Jun 13, 2023 15:14 UTC (Tue) by droundy (subscriber, #4559)
Posted Jun 13, 2023 15:40 UTC (Tue) by zeno_kdab (guest, #165579)
Posted Jun 13, 2023 19:00 UTC (Tue) by farnz (subscriber, #17727)
It's exactly that, and it's a neat implementation simplification to produce the desired outcome.
The real feature we want is to prevent real-time tasks from blocking normal tasks (such as your remote access daemon) from running at all; it should always be possible to SSH into a system and kill off rogue processes to recover control, assuming you run an SSH daemon.
Today's implementation handles that by saying that real-time tasks are only permitted 95% of CPU cycles, and we'll go idle rather than let them reach 100% of CPU cycles. But there's a better policy we've been trying to find a way to implement for some time, where real-time tasks are allowed 100% of CPU cycles, but become lower priority than normal tasks if, and only if, they're hogging the CPU; the "deadline server" is a nice trick for implementing that policy, exploiting the fact that deadline tasks are even higher priority than real-time tasks.
Posted Jun 15, 2023 18:26 UTC (Thu) by zeno_kdab (guest, #165579)
My personal feeling is still kind of that it might be better to just run the ssh server as that deadline task, but I do admit I am a backseat purist architect ;)
Posted Jun 15, 2023 18:54 UTC (Thu) by zeno_kdab (guest, #165579)
Posted Jun 13, 2023 17:41 UTC (Tue) by abatters (✭ supporter ✭, #6932)
"This class, which has a higher priority than the POSIX realtime classes"
vs.
https://lwn.net/Articles/934142/
"we needed to swap the order of deadline.c and rt.c among the scheduling classes in the Linux kernel, thus giving POSIX realtime tasks priority over deadline tasks."
"Interestingly, the possibility to swap rt.c and deadline.c in the kernel, or even to possibly make it a tunable sysfs option, was discussed for other reasons in other talks throughout OSPM."
Conflict detected...
Posted Jun 16, 2023 0:48 UTC (Fri) by Fowl (subscriber, #65667)
Keeping the CPU at 95% might be keeping things within a thermal/power envelope - if the CPU is actually at 100% more of the time, things could be scaled up, leading to expenditure of the thermal budget, followed by throttling; repeat. These fast/slow cycles could be a major source of unpredictable/variable latency.
Posted Jun 17, 2023 8:55 UTC (Sat) by anton (subscriber, #25547)
Running a CPU 95% of the time (and halting it for 5%) is similar to throttling in its effect on power consumption. When the CPU is at the power or thermal limit, it will generally perform more work if you let it run all the time than if you let it run only 95% of the time. "More work" will not be 1/0.95 times as much, but it will be >1.

If the system is competently cooled, it reaches the power limit before it reaches the thermal limit, and it reaches the thermal limit before it starts throttling. The answer to reaching the power limit or thermal limit is to lower the clock rate and the voltage, which means that the power consumption is reduced more than the performance. OTOH, throttling is an emergency mechanism that just skips clock cycles without reducing voltage (or reducing the actual clock rate, although the effective clock is reduced).
"we needed to swap the order of deadline.c and rt.c among the scheduling classes in the Linux kernel, thus giving POSIX realtime tasks priority over deadline tasks."
Deadline servers as a realtime throttling replacement
If the system is competently cooled, it reaches the power limit before it reaches the thermal limit, and it reaches the thermal limit before it starts throttling. The answer to reaching the power limit or thermal limit is to lower the clock rate and the voltage, which means that the power consumption is reduced more than the performance. OTOH, throttling is an emergency mechanism that just skips clock cycles without reducing voltage (or reducing the actual clock rate, although the effective clock is reduced).
Deadline servers as a realtime throttling replacement