The long road to lazy preemption
Some review
Current kernels have four different modes that regulate when one task can be preempted in favor of another. PREEMPT_NONE, the simplest mode, only allows preemption to happen when the running task has exhausted its time slice. PREEMPT_VOLUNTARY adds a large number of points within the kernel where preemption can happen if needed. PREEMPT_FULL allows preemption at almost any point except places in the kernel that prevent it, such as when a spinlock is held. Finally, PREEMPT_RT prioritizes preemption over most other things, even making most spinlock-holding code preemptible.
A higher level of preemption enables the system to respond more quickly to events; whether an event is the movement of a mouse or an "imminent meltdown" signal from a nuclear reactor, faster response tends to be more gratifying. But a higher level of preemption can hurt the overall throughput of the system; workloads with a lot of long-running, CPU-intensive tasks tend to benefit from being disturbed as little as possible. More frequent preemption can also lead to higher lock contention. That is why the different modes exist; the optimal preemption mode will vary for different workloads.
Most distributions ship kernels built with the PREEMPT_DYNAMIC pseudo-mode, which allows any of the first three modes to be selected at boot time, with PREEMPT_VOLUNTARY being the default. On systems with debugfs mounted, the current mode can be read from /sys/kernel/debug/sched/preempt.
PREEMPT_NONE and PREEMPT_VOLUNTARY do not allow the arbitrary preemption of code running in the kernel; there are times when that can lead to excessive latency even in systems where minimal latency is not prioritized. This problem is the result of places in the kernel where a large amount of work can be done; if that work is allowed to run unchecked, it can disrupt the scheduling of the system as a whole. To get around this problem, long-running loops have been sprinkled with calls to cond_resched(), each of which is an additional voluntary preemption point that is active even in the PREEMPT_NONE mode. There are hundreds of these calls in the kernel.
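As a sketch of the pattern (kernel-style pseudocode, not taken from any real kernel source file; process_many_items() and process_one_item() are invented for this example), a long-running loop with such a voluntary preemption point looks something like this:

```c
/* Illustrative pseudocode only; the loop and both function names are
 * made up for this example, not copied from the kernel tree. */
static void process_many_items(struct item *items, unsigned long count)
{
	unsigned long i;

	for (i = 0; i < count; i++) {
		process_one_item(&items[i]);
		/*
		 * Voluntary preemption point: if a reschedule is pending,
		 * call into the scheduler; otherwise this is nearly free.
		 */
		cond_resched();
	}
}
```

The call costs little when no reschedule is pending, but each such site must be placed by hand, which is the heuristic aspect described below.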
There are some problems with this approach. cond_resched() is a form of heuristic that only works in the places where a developer has thought to put it. Some calls are surely unnecessary, while there will be other places in the kernel that could benefit from cond_resched() calls, but do not have them. The use of cond_resched(), at its core, takes a decision that should be confined to the scheduling code and spreads it throughout the kernel. It is, in short, a bit of a hack that mostly works, but which could be done better.
Doing better
The tracking of whether a given task can be preempted at any moment is a complicated affair that must take into account several variables; see this article and this article for details. One of those variables is a simple flag, TIF_NEED_RESCHED, that indicates the presence of a higher-priority task that is waiting for access to the CPU. Events such as waking a high-priority task can cause that flag to be set in whatever task is currently running. In the absence of this flag, there is no need for the kernel to consider preempting the current task.
There are various points where the kernel can notice that flag and cause the currently running task to be preempted. The scheduler's timer tick is one example; any time a task returns to user space from a system call is another. The completion of an interrupt handler is yet another, but that check, which can cause preemption to happen any time that interrupts are enabled, is only enabled in PREEMPT_FULL kernels. A call to cond_resched() will also check that flag and, if it is set, call into the scheduler to yield the CPU to the other task.
The lazy-preemption patches are simple at their core; they add another flag, TIF_NEED_RESCHED_LAZY, that indicates a need for rescheduling at some point, but not necessarily right away. In the lazy preemption mode (PREEMPT_LAZY), most events will set the new flag rather than TIF_NEED_RESCHED. At points like the return to user space from the kernel, either flag will lead to a call into the scheduler. At the voluntary preemption points and in the return-from-interrupt path, though, only TIF_NEED_RESCHED is checked.
The result of this change is that, in lazy-preemption mode, most events in the kernel will not cause the current task to be preempted. That task should be preempted eventually, though. To make that happen, the kernel's timer-tick handler will check whether TIF_NEED_RESCHED_LAZY is set; if so, TIF_NEED_RESCHED will also be set, possibly causing the running task to be preempted. Tasks will generally end up running for something close to their full time slice unless they give up the CPU voluntarily, which should lead to good throughput.
With these changes, the lazy-preemption mode can, like PREEMPT_FULL, run with kernel preemption enabled at (almost) all times. Preemption can happen any time that the preemption counter says that it should. That allows long-running kernel code to be preempted whenever other conditions do not prevent it. It also allows preemption to happen quickly in those cases where it is truly needed. Should a realtime task become runnable as the result of handling an interrupt, for example, the TIF_NEED_RESCHED flag will be set, leading to an almost immediate preemption; there will be no need to wait for the timer tick in such cases.
Preemption will not happen, though, if only TIF_NEED_RESCHED_LAZY is set, which will be the case much of the time. So a PREEMPT_LAZY kernel will be far less likely to preempt a running task than a PREEMPT_FULL kernel.
Removing cond_resched() — eventually
The end goal of this work is to have a scheduler with only two non-realtime modes: PREEMPT_LAZY and PREEMPT_FULL. The lazy mode will occupy a place between PREEMPT_NONE and PREEMPT_VOLUNTARY, replacing both of them. It will, however, not need the voluntary preemption points that were added for the two modes it replaces. Since preemption can now happen almost anywhere, there is no longer a need to enable it in specific spots.
For now, though, the cond_resched() calls remain; if nothing else, they are required for as long as the PREEMPT_NONE and PREEMPT_VOLUNTARY modes exist. Those calls also help to ensure that problems are not introduced while lazy preemption is being stabilized.
In the current patch set, cond_resched() only checks TIF_NEED_RESCHED, meaning that preemption will be deferred in many situations where it will happen immediately from cond_resched() in PREEMPT_VOLUNTARY or PREEMPT_NONE mode. Steve Rostedt questioned this change, asking whether cond_resched() should retain its older meaning, at least for the PREEMPT_VOLUNTARY case. Even though PREEMPT_VOLUNTARY is slated for eventual removal, he thought, keeping the older behavior could help to ease the transition.
Thomas Gleixner answered that only checking TIF_NEED_RESCHED is the correct choice, since it will help in the process of removing the cond_resched() calls entirely:
That forces us to look at all of them and figure out whether they need to be extended to include the lazy bit or not. Those which do not need it can be eliminated when LAZY is in effect because that will preempt on the next possible preemption point once the non-lazy bit is set in the tick.
He added that he expects "less than 5%" of the cond_resched() calls to need to check TIF_NEED_RESCHED_LAZY and, thus, to remain even after the transition to PREEMPT_LAZY is complete.
Before then, though, there are hundreds of cond_resched() calls that need to be checked and, for most of them at least, removed. Many other details have to be dealt with as well; this patch set from Ankur Arora addresses a few of them. There is also, of course, the need for extensive performance testing; Mike Galbraith has made an early start on that work, showing that throughput with lazy preemption falls just short of that with PREEMPT_VOLUNTARY.
It all adds up to a lot to be done still, but the end result
of the lazy-preemption work should be a kernel that is a bit smaller and
simpler while delivering predictable latencies without the need to
sprinkle scheduler-related calls throughout the code. That seems like a
better solution, but getting there is going to take some time.
| Index entries for this article | |
| --- | --- |
| Kernel | Preemption |
| Kernel | Scheduler |
Posted Oct 19, 2024 19:48 UTC (Sat)
by milesrout (subscriber, #126894)
[Link] (35 responses)
It is surely a good sign for Linux that it has been kept flexible enough that core changes like this (and many of the others we've read about over the years) can still be made. It isn't hard to imagine that a 33-year-old monolithic kernel with an arguably >50-year-old design would have ossified and gotten stuck with old designs forever.
The question I have is this: is this because of "good design"? Or is it because the people that work on the kernel are just better? Are changes like folios, sched_ext, this, etc. able to be done because of good design/modularity in the kernel? Or is it just a willingness to make pervasive changes across a huge kernel codebase that lets this happen - by sheer force of will, the developers will not allow the kernel to ossify?
Posted Oct 19, 2024 20:21 UTC (Sat)
by atnot (subscriber, #124910)
[Link] (29 responses)
Posted Oct 19, 2024 22:38 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (2 responses)
And this is probably down to just one person - Linus.
I'm not saying it wouldn't have happened with another person, and another OS - Bill Gates and Windows had a damn good try ...
But Linus is a damn good people manager, wasn't greedy and didn't make enemies, and importantly is trusted by everyone to be neutral (inasmuch as being "Mr Linux" can be neutral).
Any threat to Linux will have to be led by someone of Linus' abilities, and who will earn stature like Linus'.
Cheers,
Wol
Posted Oct 20, 2024 5:09 UTC (Sun)
by j16sdiz (guest, #57302)
[Link] (1 responses)
He made lots of enemies. Lots of people hate him. He acts like a dick all the time. But he is able to support his technical decisions with reasons.
Posted Oct 20, 2024 9:02 UTC (Sun)
by milesrout (subscriber, #126894)
[Link]
Posted Oct 20, 2024 0:55 UTC (Sun)
by willy (subscriber, #9762)
[Link] (25 responses)
You're not necessarily wrong, but I certainly didn't start from "I want a kernel that manages memory in large chunks, which one shall I work on?" Other projects may have. I started from "Here are some problems I see in Linux. How could we solve them?"
I did consciously ask "Which filesystem shall I start with?" and chose XFS for a number of reasons (mostly working with the people and perceiving iomap as being the future of the VFS).
Posted Oct 20, 2024 22:39 UTC (Sun)
by Paf (subscriber, #91811)
[Link] (24 responses)
Frankly, this view is a little harsh, but Pike has always seemed to me bitter that he came along a little too late, and too uninterested in the idea that a problem could be, well, fairly well solved by existing systems. I think Plan 9 wound up irrelevant not because industry is stupid or hidebound, but because it didn't represent enough of an improvement (where it was an improvement at all).
Some problems do, at a certain point, end up, for some sense of the word "solved", mostly solved. And the kind of innovation that previously defined a system/product/whatever trails off and moves elsewhere.
At a certain point, we stopped changing the basic design of how cars work with their driver and the road. A car from much after 1950 (or, really, even the 1930s for many designs) has, to a first approximation, the same human interface and basic functionality as one from 2024, leaving aside any partial self driving features (which are brand new in any case). There has been a lot of innovation, but not in the controls or basic shape. This was a problem that was solved well enough that new solutions couldn't generate space. And yeah, they might be superior, but probably not all that superior. It's a lot more than just "cost of switching" and "everyone gave up". There was, in fact, little need to make larger changes. I know that's depressing if you're a researcher, and, yes, some promising innovations end up moribund... But they always did.
Posted Oct 20, 2024 23:11 UTC (Sun)
by willy (subscriber, #9762)
[Link] (20 responses)
The car-human interface isn't as fixed as one might think. Is the gear shift in the centre console or on a stalk? Automatic or manual? Is there an ignition key or a "Start" button somewhere? How does one open the refuelling socket? Which @#$& side of the car is it on? Where are the windscreen wipers? How do you dip the headlights? None of these are big burdens if you own a car, but if you rent, you have to figure all these things out at some point (preferably before leaving the lot).
It's the same with showers. They mostly have controls for volume, temperature and which outlet(s) the water should come out of, but some hotels go out of their way to have super fancy ones that are utterly non-discoverable. I know how to work my shower at home, but if you stay at three different hotels in three nights, you're going to be confused and angry in the third shower.
Anyway, back to my point. Innovation has moved up the stack. We have something good enough, and we're all building on it. Linux has become the substrate on which we innovate. I don't think that's sad, I think that's progress.
Posted Oct 21, 2024 0:30 UTC (Mon)
by gmatht (subscriber, #58961)
[Link] (8 responses)
Posted Oct 21, 2024 1:14 UTC (Mon)
by dskoll (subscriber, #1630)
[Link]
LWN properly supports Unicode, so ◀⛽ or ⛽▶ 🙂
I didn't know about the fuel socket indicator until about 5 years ago.
Posted Oct 21, 2024 2:35 UTC (Mon)
by sfeam (subscriber, #2841)
[Link] (5 responses)
Posted Oct 21, 2024 13:43 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link] (4 responses)
Posted Oct 21, 2024 14:03 UTC (Mon)
by Wol (subscriber, #4433)
[Link] (3 responses)
IME, the engine cover lever is always in the passenger footwell by the door, while the fuel cover switch in my car is in the driver's door. And while I have little experience of said switches, I've never known them to be on the passenger side ...
My bugbear is "flash headlamps" and "wash windscreen". The number of times I've flashed people by mistake ...
Cheers,
Wol
Posted Oct 21, 2024 20:24 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Oct 22, 2024 8:13 UTC (Tue)
by anselm (subscriber, #2796)
[Link]
The engine-hood lever is usually on the left side of a car because most cars have the driver sitting on the left, and given how rarely that lever is used there is no point in moving it to the right for the others because all that will achieve is to make building the cars more complicated. So in the UK it is on the passenger side, and in places like the USA and Germany it is on the driver's side.
We can probably count ourselves lucky, though, that in right-hand-side-driver cars the pedals aren't in reverse order.
Posted Oct 22, 2024 8:55 UTC (Tue)
by farnz (subscriber, #17727)
[Link]
IME, engine cover latch is on the left of the car in the footwell, and some cars have the fuel cover release on the frame next to the driver's seat. In UK versions, that puts the fuel cover release on the right of the car, but in the French market version of the same car, the engine cover release and the fuel cover release are on the same side of the car - engine cover by your left foot, fuel cover under your left shoulder when the door is open.
Posted Oct 21, 2024 3:44 UTC (Mon)
by interalia (subscriber, #26615)
[Link]
Posted Oct 21, 2024 2:12 UTC (Mon)
by Paf (subscriber, #91811)
[Link]
You know, perhaps this is the biggest point of all: It is good enough, and very challenging to change.
Posted Oct 21, 2024 4:38 UTC (Mon)
by ebiederm (subscriber, #35028)
[Link] (1 responses)
The most questionable decision in all of that seems to be relying on hardware to define the isolation of untrusted software.
Hardware is always buggy, and so expensive to fix that it might as well be unfixable. We are effectively 7 years into the era of Spectre, and I am not aware of any high-performance CPUs that successfully isolate untrusted software.
So why do our operating system architectures by design rely on broken and unfixable hardware to get security right?
Which is to say when operating systems are failing at part of their core mission because of how they are designed I think there is room and need for innovation at that level.
Posted Dec 13, 2024 13:21 UTC (Fri)
by roblucid (guest, #48964)
[Link]
A lot of security is about process isolation and correct virtual-memory implementations; you simply cannot do something like logical-to-physical address translation efficiently in software. It needs to be initiated by the L1 cache lookup (hence the cache's tags to eliminate false-positive hits) and available for L2/L3/DRAM fetches.
Then again, software being mutable is what hostiles rely on; you need OS and hardware support to harden a system against exploitation. A program that's reentrant, relocatable, or dynamically linkable simply cannot know what logical addresses it uses. Even so, without hardware support, where would the immutable correct address tables be stored so that errors cannot be exploited to patch the program?
Posted Oct 21, 2024 17:00 UTC (Mon)
by paulj (subscriber, #341)
[Link] (7 responses)
E.g., if you got in a car from the 20s to early 30s, most people today would be unable to start it, for want of knowledge of 2 key engine controls (one of which was automated fairly early on - who here knows what the lever on the centre of the steering wheel [typically] did?; another was present until the 80s on many cars, a knob on the dash you had to pull in and out typically - I'll add a comment later with the answers ;) ). If they were able to start it, they might well damage the engine. They would also struggle to change gear without damaging the car.
UI has gotten simpler, and details hidden. A driver from the 20s would probably find it easier to get comfortable driving a modern car, than a modern driver getting into a car from 100 years ago - bit more to learn. The amazing speed of modern vehicles might be the 1 control thing the 20s driver might need to adapt to, but that wouldn't stop them driving at a slower speed.
Also, the maintenance of the car is now minimal compared to the earlier days.
Posted Oct 22, 2024 13:44 UTC (Tue)
by paulj (subscriber, #341)
[Link] (6 responses)
- Ignition advance:
This used to be something that had to be manually adjusted as you drove, to suit the engine speed, warmth and mixture.
- Choke:
Manual adjustment of mixture (interacting with previous), particularly for engine start.
- double-declutching for gear changes:
Changing gear required pausing the gear change in neutral, letting the clutch engage again, and matching the engine speed to the drive-shaft (either by letting the engine speed fall a little, if changing up; or blipping the throttle, if changing down - often while continuing to hold the brake pedal), before disengaging the clutch again and completing the gear change.
(Apparently truck drivers in the USA still have to do this on many models of tractor units.)
Posted Oct 22, 2024 16:17 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (5 responses)
You missed a couple more important details; the modern 2 or 3 pedal layout was not yet the standard in the 1920s, and some cars even then had the throttle as a lever on the steering wheel, rather than a pedal. And the pedals might well be gear selection, with possibly a foot brake, possibly not. The clutch could be a pedal, but it might also be a hand-operated lever, and even if it's a pedal, it might need lifting with your foot instead of pressing.
It's not until the 1940s that the industry finally settles on the modern control scheme.
Posted Oct 22, 2024 16:21 UTC (Tue)
by paulj (subscriber, #341)
[Link] (4 responses)
The 1927 Austin 7 he once had, which I've driven, already had the familiar 3 pedal layout. The clutch was more like a button though. Very hard to get used to. So that layout already existed in the 20s.
Posted Oct 22, 2024 16:26 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (3 responses)
The modern layout existed, but (e.g.) Fords from the 1920s had a mix of layouts - indeed, a Model T and a Model A had different control schemes, and some of the things I mentioned that now seem odd were used by different 1920s Ford models (foot pedals for gear selection, lifting the clutch not pressing it).
If you're used to that sort of array of different possibilities, where you need to read the fine manual before trying to drive because there's so many options, learning how to drive a modern car isn't that hard; just work out how the modern control map to what you expect, and complain because the car does timing advance, choking etc for you. But (as evidenced by people who can drive an automatic transmission, but can't drive a manual transmission) going the other way is harder - you have to do more things that a modern car does for you.
And I've not driven anything without a modern control layout - I've only seen them in museums with my grandfather, who wanted to show me the cars he dreamt about being able to own when he was a child.
Posted Oct 22, 2024 16:42 UTC (Tue)
by paulj (subscriber, #341)
[Link] (2 responses)
Posted Oct 22, 2024 16:50 UTC (Tue)
by farnz (subscriber, #17727)
[Link] (1 responses)
These were all models that my English grandfather had heard about as a child and really wanted at the time, but could never afford - he was a serious car nut.
But that does lead to a serious point; it isn't unusual for different countries to have kept different standards from the past, even though they "could" unify with the rest of the world. For example, on pedal cycles, some countries put the front brake on the left lever, while others put it on the right lever. Arguably, the only reason this didn't happen with the motor car is that we had a large crowd of ex-military drivers in the late 1940s who all knew the same standard no matter where in the world they were going back to, and so everyone settled on one standard.
Posted Oct 22, 2024 18:17 UTC (Tue)
by joib (subscriber, #8541)
[Link]
Case in point, the International System of Units (SI) is adopted by almost the entire world, except Myanmar, Liberia, and some other country whose name escapes me at the moment.
Posted Oct 21, 2024 7:46 UTC (Mon)
by roc (subscriber, #30627)
[Link] (2 responses)
Linux grabbed the "free-software OS for commodity PCs" ecosystem niche. Perhaps it could have been MINIX with a more community-oriented owner and a better license, but it was Linux. Reasonably well-run open-source projects in important niches accrue powerful network effects.
Over the same time period, what people expect from the OS --- userspace APIs and hardware support --- grew massively, making it much harder to build a viable competitor.
And yes, for a long time the Linux kernel design was good enough ... good enough that the cost of replacing it (including the cost of migrating higher-level software to a new design) has never been justified.
But I think it would be wrong to conclude that the Linux kernel, or the general Unix-style kernel interface, is in any sense optimal. Linux has a lot of serious problems that are becoming more serious over time. The monolithic design has led us to a point where the kernel is too big to trust and developers are overwhelmed with CVEs. Relying on namespaces and seccomp for isolation makes sandboxing brittle and very complicated; I wish the system was much more capability-oriented. ptrace and signals are notoriously problematic.
Posted Oct 22, 2024 0:27 UTC (Tue)
by Cyberax (✭ supporter ✭, #52523)
[Link]
I can see a world 15 years from now, where all the classic synchronous Linux system calls are reimplemented as user-space compat shims on top of a minimalistic native io_uring kernel.
Posted Oct 22, 2024 2:40 UTC (Tue)
by interalia (subscriber, #26615)
[Link]
Posted Oct 20, 2024 0:28 UTC (Sun)
by willy (subscriber, #9762)
[Link] (4 responses)
Linux is also an engineering project, not a research project. So a lot of work is put into making Linux understandable and modifiable.
Posted Oct 23, 2024 4:38 UTC (Wed)
by raven667 (subscriber, #5198)
[Link] (3 responses)
What I was thinking is (aside from the tty layer that no one wants to touch with a borrowed 2m pole): how many distinct Linux kernel designs have existed over the last 30+ years? What would define the eras, since change is happening all over: removal of the BKL, the switch from stable/dev branches in 2.6 to continuous integration, udev, some particular scheduler or memory allocator? What would a kernel developer see as distinct coherent design eras? How much code has been unchanged in the last 5y, 10y, and again in the 5y, 10y before that? How many Ships of Theseus have been built?
Posted Oct 23, 2024 13:36 UTC (Wed)
by raven667 (subscriber, #5198)
[Link] (2 responses)
Posted Oct 23, 2024 15:16 UTC (Wed)
by Wol (subscriber, #4433)
[Link]
Cheers,
Wol
Posted Oct 31, 2024 10:34 UTC (Thu)
by FluffyFox (guest, #162692)
[Link]
Posted Nov 11, 2024 7:02 UTC (Mon)
by wtarreau (subscriber, #51152)
[Link] (1 responses)
Posted Dec 13, 2024 14:01 UTC (Fri)
by roblucid (guest, #48964)
[Link]
A good tradeoff
As I understand it, you'll see a very small increase in throughput efficiency at the risk of a massive increase in latency on a busy system. That is why full preemption and the work on voluntary cond_resched() was done on uniprocessors decades ago, when it really mattered to interactive responsiveness and people became obsessed with the scheduler, e.g. BFS (the brain f**k scheduler), before CFS with process groups satisfied most people.
In the old days on UNIX, the much longer ticks caused blocked processes to gain priority while the running process lost priority; the scheduler, when looking at which thread to preempt, can use a similar score to pick on the longest-running first.
So what's wrong with having the more complicated logic in the scheduler, rather than making the timer tick test an extra flag instead of unconditionally setting a single one on running tasks? Effectively, the first time the scheduler fires it can prefer preempting candidates that were previously considered, and mark new ones after finding a better victim.
Often the scheduler will have idle cores and not even look at preemption; you "over-book" a CPU with frequently blocking tasks that share cores, while also running long-running ones at very low priority to soak up idle time, which as batch jobs simply don't care about latency.