BPF and the realtime patch set
Back in July, Linus Torvalds merged a patch in the 5.3 merge window that added the PREEMPT_RT option to the kernel build-time configuration. That was meant as a signal that the realtime patch set was moving from its longtime status as out-of-tree code to a fully supported kernel feature. As the code behind the configuration option makes its way into the mainline, some friction can be expected; we are seeing a bit of that now with respect to the BPF subsystem.
The thread started with a patch posted by Sebastian Andrzej Siewior to the BPF mailing list. The patch mentioned three problems with BPF running when realtime is enabled and added Kconfig directives to only allow BPF to be configured into kernels that did not have PREEMPT_RT:
- it allocates and frees memory in atomic context
- it uses up_read_non_owner()
- BPF_PROG_RUN() expects to be invoked in non-preemptible context
Siewior said that he had tried to address the memory allocation problems "but I have no idea how to address the other two issues". In that thread, he also gave an overview of what is needed to "play nicely" with the realtime patch set.
Daniel Borkmann replied that the simple approach Siewior took would not actually disable all of BPF, as there are other BPF-using subsystems that would not be affected by the change. Siewior asked for feedback on one possible way to solve that, but David Miller made it clear that he does not think this approach makes sense: "Turning off BPF just because PREEMPT_RT is enabled is a non-starter it is absolutely essential functionality for a Linux system at this point."
However, as Siewior said, there are fundamental incompatibilities between the implementation of BPF and the needs of the realtime patch set. Thomas Gleixner provided more detail on the problem areas in the hopes of finding other ways to deal with them:
- #1) BPF disables preemption unconditionally with no way to do a proper RT substitution like most other infrastructure in the kernel provides via spinlocks or other locking primitives.
- #2) BPF does allocations in atomic contexts, which is a dubious decision even for non RT. That's related to #1
- #3) BPF uses the up_read_non_owner() hackery which was only invented to deal with already existing horrors and not meant to be proliferated.
Miller replied that BPF is needed by systemd and the IR drivers already; "We're moving to the point where even LSM modules will be implemented in bpf." In his earlier message, he said that turning off BPF would disable any packet sniffing so that tcpdump and Wireshark would not function. To a certain extent, he oversold the need for BPF, as Gleixner pointed out; Gleixner was running Debian testing with Siewior's patch applied and not encountering any systemd or other difficulties. Furthermore, packet sniffing still works without BPF, Gleixner said, though it then requires a copy to user space for each packet; so that is not really an argument for requiring BPF either.
Beyond that, though, he was really looking for feedback on "how to tackle these issues on a technical level". Some of that did start to come about in a sub-thread with BPF maintainer Alexei Starovoitov, who wondered about disabling preemption and noted that he is a "complete noob in RT". Gleixner explained the situation at some length.
Essentially, the realtime kernel cannot disable preemption or interrupts for arbitrarily long periods of time. The realtime patches substitute realtime-aware locks for the spinlocks and rwlocks that do disable preemption or interrupts in the non-realtime kernel. Those realtime-aware locks can sleep, however, so they cannot be used from within code sections that have explicitly disabled preemption or interrupts.
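The lock substitution described above can be illustrated with a small userspace toy model. This is only a sketch, not kernel code: `ToyCpu`, `AtomicContextError`, and `run_with_preemption_disabled()` are hypothetical names invented for illustration, and the model reduces the kernel's rules to a single per-CPU counter.

```python
class AtomicContextError(RuntimeError):
    """Raised when the model sees a sleeping lock taken in atomic context."""


class ToyCpu:
    """Hypothetical model of one CPU's preemption state (not a kernel API)."""

    def __init__(self, realtime):
        self.realtime = realtime   # True models a PREEMPT_RT build
        self.preempt_count = 0     # > 0 means preemption is disabled

    def preempt_disable(self):
        self.preempt_count += 1

    def preempt_enable(self):
        self.preempt_count -= 1

    def spin_lock(self):
        if self.realtime:
            # The substituted lock may sleep, so taking it while
            # preemption is explicitly disabled is an error.
            if self.preempt_count > 0:
                raise AtomicContextError(
                    "sleeping lock taken with preemption disabled")
        else:
            # In this model, a non-realtime spinlock disables
            # preemption itself and never sleeps.
            self.preempt_count += 1

    def spin_unlock(self):
        if not self.realtime:
            self.preempt_count -= 1


def run_with_preemption_disabled(cpu):
    """Mimic a code path that disables preemption, then takes a spinlock."""
    cpu.preempt_disable()
    try:
        cpu.spin_lock()
        cpu.spin_unlock()
    finally:
        cpu.preempt_enable()
    return "ok"
```

Calling `run_with_preemption_disabled(ToyCpu(realtime=False))` completes normally, while the same call with `realtime=True` raises `AtomicContextError`: the code path is unchanged, but the meaning of the lock underneath it is not.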
As Starovoitov explained, BPF disables preemption "because of per-cpu maps and per-cpu data structures that are shared between bpf program execution and kernel execution". But, he said, BPF does not call into code that might sleep, so there should be no problems on that score. That is true only when looking at the BPF code from a non-realtime perspective, Gleixner said; because of the lock substitution, code that does not look like it could sleep actually can sleep, since the realtime locks (e.g. sleeping spinlocks) do so. That's what makes using preempt_disable() (and local_irq_disable()) problematic in the realtime context. He said that the local_lock() mechanism in the realtime tree might be a way forward to better handle the explicit preemption disabling in BPF.
But, he said, there is still the outstanding problem of BPF making calls to up_read_non_owner(), which allows a read-write semaphore (rwsem) to be unlocked by a process that is not the owner of the lock. That breaks the realtime requirement that the locker is the same as the unlocker in order to deal with priority inheritance correctly.
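Why the locker must match the unlocker can be sketched with another toy model. The `OwnerTrackedLock` class below is hypothetical and deliberately simplified: it models an exclusive lock rather than a real rwsem with multiple readers. It only shows that priority inheritance needs a recorded owner to boost, and that a non-owner release leaves that record meaningless.

```python
class OwnerTrackedLock:
    """Hypothetical exclusive lock that records its owner (not a kernel API)."""

    def __init__(self):
        self.owner = None

    def lock(self, task):
        if self.owner is not None:
            raise RuntimeError("lock already held")
        self.owner = task

    def unlock(self, task):
        # An owner-tracking lock has to reject a release by any other
        # task; otherwise the recorded owner would no longer be accurate.
        if self.owner != task:
            raise PermissionError(
                f"{task} is not the owner ({self.owner}) of this lock")
        self.owner = None

    def boost_target(self):
        """Task that a blocked higher-priority waiter would boost."""
        return self.owner
```

With `lock.lock("task_a")` in effect, `lock.boost_target()` returns `"task_a"`, and `lock.unlock("task_b")` raises `PermissionError`. A non-owner unlock has to bypass exactly this kind of check.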
Starovoitov also said that BPF does not have unbounded runtime within the preemption-disabled sections, since it has a bound on the number of instructions that can be in a BPF program. But the limit on the number of instructions was recently raised from 4096 to one million, which will result in unacceptable preemption-disabled windows as Gleixner noted:
Even the earlier limit of 4096 would result in 2µs of preemption-disabled time, which may be problematic, Clark Williams said; "[...] there are some customer cases on the horizon where 2us would be a significant fraction of their max latency".
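The 2µs figure is consistent with a simple back-of-the-envelope estimate. The sketch below assumes one instruction retired per cycle on a 2GHz CPU; both values are assumptions chosen for illustration, not measurements, and the function name is hypothetical. Real BPF programs may also call helpers whose cost is not captured by an instruction count.

```python
def preempt_disabled_window_us(instructions,
                               cpu_hz=2_000_000_000,
                               insns_per_cycle=1.0):
    """Estimate the preemption-disabled time in microseconds.

    cpu_hz and insns_per_cycle are assumed values, not measured ones.
    """
    cycles = instructions / insns_per_cycle
    seconds = cycles / cpu_hz
    return seconds * 1e6


if __name__ == "__main__":
    for limit in (4096, 1_000_000):
        window = preempt_disabled_window_us(limit)
        print(f"{limit} instructions: about {window:.1f} us")
```

Under those assumptions, the old 4096-instruction limit gives roughly 2µs, while the one-million-instruction limit gives about 500µs, which is 250 times longer.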
The local_lock() scheme seemed viable to Starovoitov, but he thought the overall approach taken by Siewior's patch was backward: "If RT wants to get merged it should be disabled when BPF is on and not the other way around."
But Gleixner did not see things that way at all; he noted that he was planning to investigate local locks as a possible way forward and that there was no insistence on anything. In addition, turning off realtime when BPF was enabled was always an option; "[...] I could have done that right away without even talking to you. That'd have been dishonest and sneaky." He also lamented that the discussion had degraded to that point.
For his part, Starovoitov said that he simply thinks disabling an existing in-kernel feature in order to ease the path for the realtime patches is likely to "backfire on RT". He suggested that getting the code upstream without riling other subsystem developers was a better way forward:
That's where things were left at the time of this writing. It is not clear that the practical effect of which subsystem disables the other makes any real difference to users. For now, users of BPF will not be able to use the realtime patches and vice versa. Fixing the underlying problems that prevent the two from coexisting certainly seems more important than squabbling over who disables who; there will be users who want both in their systems after all. For those who run mainline kernels, though, that definitely cannot happen until realtime gets upstream; once it does, a piecemeal approach to resolving the incompatibilities between BPF and realtime can commence.
Posted Oct 23, 2019 21:16 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (9 responses)
Posted Oct 26, 2019 8:25 UTC (Sat)
by togga (subscriber, #53103)
[Link] (8 responses)
Posted Oct 26, 2019 14:05 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (7 responses)
The exceptions are niche: XDP and network packet capture.
Posted Oct 26, 2019 15:13 UTC (Sat)
by togga (subscriber, #53103)
[Link] (6 responses)
Personally, I love RT patch going mainline, it's a long way coming and I have no issues with two conflicting features (although it'd be amazing if we can refactor BPF to play nice with RT domain) but silently dropping BPF when you enable it (and thus silently affect lots of user space tools under the hood).
I can see this happen as RT patch might be tempting for anyone with latency requirements (extreme gaming, interactive sound/video processing etc.).
Posted Oct 26, 2019 20:05 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (5 responses)
It's not fine for pretty much everything else, from syscall filters to LSMs.
Posted Oct 26, 2019 20:49 UTC (Sat)
by togga (subscriber, #53103)
[Link] (4 responses)
I didn't get what you used instead for other parts but for LSM you use compiled kernel modules?
Posted Oct 28, 2019 18:57 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
[Link] (3 responses)
I absolutely despise the push to use BPF everywhere. It's just a bad technology - undebuggable, untraceable and hard to use. Moreover, BPF is now used as a fix for broken and dysfunctional subsystems in Linux, like the whole LSM infrastructure.
I guess the only other significant valid use for BPF is perf/tracing. Although I'd like a better language there, with support for string operations.
Posted Oct 31, 2019 22:09 UTC (Thu)
by togga (subscriber, #53103)
[Link] (2 responses)
The intersection between LSM or tracing with RT patch is likely not an empty set. If RT is driven by latency requirements, i'd also say that fast high level packet routing is interesting in this area whenever the network is involved in the latency (remote interaction).
Posted Nov 1, 2019 5:08 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Feb 19, 2020 20:10 UTC (Wed)
by voodoosound (guest, #121807)
[Link]
Posted Oct 24, 2019 6:34 UTC (Thu)
by Lionel_Debroux (subscriber, #30014)
[Link] (1 responses)
The "if it's not in mainline, it doesn't exist" mentality is a disease, which allowed the current bad design and implementation of BPF, which doesn't care about latency, to blossom unchecked for a while.
The measurements reproduced in this article show that BPF is unusable for a number of real-time purposes, and therefore, some, if not most, users of the RT patchset can't use BPF anyway... so why make their life disabling it harder, and (try to) block the merging of an important feature which has been worked on for many years, on the bad design of a recent and still fortunately highly optional feature ?
Posted Oct 24, 2019 10:49 UTC (Thu)
by knan (subscriber, #3940)
[Link]
Posted Oct 24, 2019 9:07 UTC (Thu)
by cyphar (subscriber, #110703)
[Link] (1 responses)
Posted Oct 31, 2019 7:23 UTC (Thu)
by marcH (subscriber, #57642)
[Link]
In plain (and symmetric) English they're just mutually exclusive.
Is this just some Kconfig user interface nuance I'm missing?
Posted Oct 24, 2019 12:03 UTC (Thu)
by clugstj (subscriber, #4020)
[Link] (2 responses)
Posted Oct 24, 2019 13:53 UTC (Thu)
by tglx (subscriber, #31301)
[Link]
Posted Feb 19, 2020 20:16 UTC (Wed)
by voodoosound (guest, #121807)
[Link]
Posted Oct 24, 2019 14:10 UTC (Thu)
by david.a.wheeler (subscriber, #72896)
[Link]
I think it should be entirely expected that some mechanisms (such as BPF) have trouble interacting with a "new" capability. I would like to see reasoned, careful discussions about how to try to resolve them, instead of simply saying "you can't do 2 things that are both useful". I hope that's where this will go.
Posted Oct 24, 2019 14:47 UTC (Thu)
by SEJeff (guest, #51588)
[Link] (1 responses)
Posted Oct 24, 2019 16:19 UTC (Thu)
by nevets (subscriber, #11875)
[Link]
Yes, although the -rt patch set is out of tree, it was a major player in making Linux into the dominant operating system it is today. It should not be considered a second class citizen, EVER!
Posted Oct 24, 2019 23:54 UTC (Thu)
by flussence (guest, #85566)
[Link]
Posted Oct 25, 2019 2:03 UTC (Fri)
by Kamilion (subscriber, #42576)
[Link]
Ref: https://www.youtube.com/watch?v=O7oD_oX-Gio
Posted Oct 31, 2019 23:19 UTC (Thu)
by naptastic (guest, #60139)
[Link]
With -rt merged, there will be four selections for preemption model. What would make sense, IMO, is for BPF's behavior to change depending on that setting. For PREEMPT_NONE || PREEMPT_VOLUNTARY, the current behavior seems obviously correct to me. For the other settings, though, the user is saying "I'm willing to sacrifice bandwidth to get bounded latencies" and the whole kernel, including BPF, should respect that.
Could BPF do the same thing interrupts did, with a top half and a bottom half, and only the top half runs in atomic context? Could it be done in such a way that performance for BPF isn't adversely affected if preemption is voluntary or off?
Fortunately, it's not. BPF is optional for systemd and infrared drivers are used exceedingly rarely.
Couldn't the technical conflicts between BPF and RT be resolved after merging the RT patchset, which should be done rather sooner than later, to (among other reasons !) help avoid the introduction of other similarly badly designed infrastructure ?
RT merge *hooray*, let's work the problems
"You got your performance feature in my correctness feature!"
"" What?? ""
"" ... Delicious! ""
Let's all just get along.