BPF and the realtime patch set

By Jake Edge
October 23, 2019

Back in July, Linus Torvalds merged a patch in the 5.3 merge window that added the PREEMPT_RT option to the kernel build-time configuration. That was meant as a signal that the realtime patch set was moving from its longtime status as out-of-tree code to a fully supported kernel feature. As the code behind the configuration option makes its way into the mainline, some friction can be expected; we are seeing a bit of that now with respect to the BPF subsystem.

The thread started with a patch posted by Sebastian Andrzej Siewior to the BPF mailing list. The patch mentioned three problems with BPF running when realtime is enabled and added Kconfig directives to only allow BPF to be configured into kernels that did not have PREEMPT_RT:

Disable BPF on PREEMPT_RT because

it allocates and frees memory in atomic context
it uses up_read_non_owner()
BPF_PROG_RUN() expects to be invoked in non-preemptible context

Siewior said that he had tried to address the memory allocation problems "but I have no idea how to address the other two issues". In that thread, he also gave an overview of what is needed to "play nicely" with the realtime patch set.

Daniel Borkmann replied that the simple approach Siewior took would not actually disable all of BPF, as there are other BPF-using subsystems that would not be affected by the change. Siewior asked for feedback on one possible way to solve that, but David Miller made it clear that he does not think this approach makes sense: "Turning off BPF just because PREEMPT_RT is enabled is a non-starter it is absolutely essential functionality for a Linux system at this point."

However, as Siewior said, there are fundamental incompatibilities between the implementation of BPF and the needs of the realtime patch set. Thomas Gleixner provided more detail on the problem areas in the hopes of finding other ways to deal with them:

#1) BPF disables preemption unconditionally with no way to do a proper RT substitution like most other infrastructure in the kernel provides via spinlocks or other locking primitives.
#2) BPF does allocations in atomic contexts, which is a dubious decision even for non RT. That's related to #1
#3) BPF uses the up_read_non_owner() hackery which was only invented to deal with already existing horrors and not meant to be proliferated.

Miller replied that BPF is needed by systemd and the IR drivers already; "We're moving to the point where even LSM modules will be implemented in bpf." In his earlier message, he said that turning off BPF would disable any packet sniffing so that tcpdump and Wireshark would not function. To a certain extent, he oversold the need for BPF, as Gleixner pointed out; Gleixner was running Debian testing with Siewior's patch applied and not encountering any systemd or other difficulties. Furthermore, packet sniffing still works without BPF, Gleixner said; it just requires a copy to user space for each packet, so that is not really an argument for requiring BPF either.

Beyond that, though, he was really looking for feedback on "how to tackle these issues on a technical level". Some of that did start to come about in a sub-thread with BPF maintainer Alexei Starovoitov, who wondered about disabling preemption and noted that he is a "complete noob in RT". Gleixner explained the situation at some length.

Essentially, the realtime kernel cannot disable preemption or interrupts for arbitrarily long periods of time. The realtime patches substitute realtime-aware locks for the spinlocks and rwlocks that do disable preemption or interrupts in the non-realtime kernel. Those realtime-aware locks can sleep, however, so they cannot be used from within code sections that have explicitly disabled preemption or interrupts.

As Starovoitov explained, BPF disables preemption "because of per-cpu maps and per-cpu data structures that are shared between bpf program execution and kernel execution". But, he said, BPF does not call into code that might sleep, so there should be no problems on that score. But that is only when looking at the BPF code from a non-realtime perspective, Gleixner said; because of the lock substitution, code that does not look like it could sleep actually can sleep since the realtime locks (e.g. sleeping spinlocks) do so. That's what makes using preempt_disable() (and local_irq_disable()) problematic in the realtime context. He said that the local_lock() mechanism in the realtime tree might be a way forward to better handle the explicit preemption disabling in BPF.
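The lock substitution can be illustrated with a toy model. This is not kernel code; it is a minimal Python sketch in which the hypothetical ToyCpu class borrows the kernel's function names purely for illustration. It assumes that a non-realtime spinlock simply disables preemption, while its realtime replacement may sleep and therefore must not be taken while preemption is disabled.

```python
class ToyCpu:
    """Toy model of one CPU's preemption state (hypothetical, not a kernel API)."""

    def __init__(self, preempt_rt):
        self.preempt_rt = preempt_rt
        self.preempt_count = 0  # > 0 means preemption is disabled

    def preempt_disable(self):
        self.preempt_count += 1

    def preempt_enable(self):
        self.preempt_count -= 1

    def spin_lock(self):
        if self.preempt_rt:
            # Modeled realtime behavior: the lock may sleep, which is
            # not allowed while preemption is disabled.
            if self.preempt_count > 0:
                raise RuntimeError("sleeping lock taken with preemption disabled")
        else:
            # Modeled non-realtime behavior: spinning disables preemption.
            self.preempt_count += 1

    def spin_unlock(self):
        if not self.preempt_rt:
            self.preempt_count -= 1


def locked_section_in_atomic_context(cpu):
    """Take a spinlock inside an explicitly preemption-disabled section."""
    cpu.preempt_disable()
    try:
        cpu.spin_lock()
        cpu.spin_unlock()
        return "ok"
    except RuntimeError as exc:
        return str(exc)
    finally:
        cpu.preempt_enable()


print(locked_section_in_atomic_context(ToyCpu(preempt_rt=False)))
print(locked_section_in_atomic_context(ToyCpu(preempt_rt=True)))
```

The same sequence of calls is harmless in the non-realtime model and an error in the realtime one, which is the sense in which code that "does not look like it could sleep" becomes a problem.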

But, he said, there is still the outstanding problem of BPF making calls to up_read_non_owner(), which allows a read-write semaphore (rwsem) to be unlocked by a process that is not the owner of the lock. That breaks the realtime requirement that the locker is the same as the unlocker in order to deal with priority inheritance correctly.
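The ownership rule can be sketched in user space as well. The following Python example is only an analogy, not the kernel's rwsem implementation; OwnerTrackedLock and release_from_other_thread() are hypothetical names. It assumes a lock that records which thread acquired it (as priority inheritance requires, so that the holder can be boosted) and refuses a release from any other thread.

```python
import threading


class OwnerTrackedLock:
    """Hypothetical lock that records its owner and rejects foreign releases."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def acquire(self):
        self._lock.acquire()
        self._owner = threading.get_ident()

    def release(self):
        # The unlocker must be the locker; otherwise there is no
        # well-defined owner whose priority could have been boosted.
        if self._owner != threading.get_ident():
            raise RuntimeError("release by non-owner")
        self._owner = None
        self._lock.release()


def release_from_other_thread(lock):
    """Try to release `lock` from a helper thread; return the error text or None."""
    result = []

    def worker():
        try:
            lock.release()
            result.append(None)
        except RuntimeError as exc:
            result.append(str(exc))

    helper = threading.Thread(target=worker)
    helper.start()
    helper.join()
    return result[0]


lock = OwnerTrackedLock()
lock.acquire()
outcome = release_from_other_thread(lock)
print(outcome)
lock.release()  # the owning thread can still release normally
```

A non-owner release, which is what up_read_non_owner() permits for an rwsem, is exactly the operation this model rejects.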

Starovoitov also said that BPF does not have unbounded runtime within the preemption-disabled sections, since there is a bound on the number of instructions that can be in a BPF program. But that limit was recently raised from 4096 to one million, which will result in unacceptable preemption-disabled windows, as Gleixner noted:

Assuming a instruction/cycle ratio of 1.0 and a CPU frequency of 2GHz, that's 500us of preempt disabled time. Out of bounds by at least one order of [magnitude] for a lot of RT scenarios.

Even the earlier limit of 4096 would result in 2µs of preemption-disabled time, which may be problematic, Clark Williams said; "[...] there are some customer cases on the horizon where 2us would be a significant fraction of their max latency".
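Both figures follow from the same back-of-the-envelope arithmetic. The sketch below simply restates Gleixner's stated assumptions (one instruction per cycle, a 2GHz CPU) and additionally assumes that every instruction up to the limit executes exactly once; the function name is illustrative only.

```python
def preempt_disabled_us(instructions, ipc=1.0, cpu_hz=2e9):
    """Estimated run time, in microseconds, of `instructions` instructions.

    ipc is the assumed instructions-per-cycle ratio and cpu_hz the assumed
    clock frequency; both defaults come from the estimate quoted above.
    """
    return instructions * 1e6 / (ipc * cpu_hz)


for limit in (4096, 1_000_000):
    print(f"{limit} instructions: {preempt_disabled_us(limit):.3f} us")
```

With those assumptions, the old limit works out to roughly 2µs and the new one to 500µs; a lower instructions-per-cycle ratio or a slower clock would only make the window longer.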

The local_lock() scheme seemed viable to Starovoitov, but he thought the overall approach taken by Siewior's patch was backward:

But reading your other replies the gradual approach we're discussing here doesn't sound acceptable ? And you guys insist on disabling bpf under RT just to merge some out of tree code ? I find this rude and not acceptable.

If RT wants to get merged it should be disabled when BPF is on and not the other way around.

But Gleixner did not see things that way at all; he noted that he was planning to investigate local locks as a possible way forward and that there was no insistence on anything. In addition, turning off realtime when BPF was enabled was always an option; "[...] I could have done that right away without even talking to you. That'd have been dishonest and sneaky." He also lamented that the discussion had degraded to that point.

For his part, Starovoitov said that he simply thinks disabling an existing in-kernel feature in order to ease the path for the realtime patches is likely to "backfire on RT". He suggested getting the code upstream without riling other subsystem developers was a better way forward:

imo it's better to get key RT bits (like local_locks) in without sending contentious patches like the one that sparked this thread. When everyone can see what this local_lock is we can figure out how and when to use it.

That's where things were left at the time of this writing. It is not clear that the question of which subsystem disables the other makes any practical difference to users. For now, users of BPF will not be able to use the realtime patches and vice versa. Fixing the underlying problems that prevent the two from coexisting certainly seems more important than squabbling over who disables whom; there will be users who want both in their systems, after all. For those who run mainline kernels, though, that definitely cannot happen until realtime gets upstream; once it does, a piecemeal approach to resolving the incompatibilities between BPF and realtime can commence.

Index entries for this article
Kernel	BPF
Kernel	Realtime

BPF and the realtime patch set

Posted Oct 23, 2019 21:16 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (9 responses)

> Turning off BPF just because PREEMPT_RT is enabled is a non-starter it is absolutely essential functionality for a Linux system at this point.
Fortunately, it's not. BPF is optional for systemd and infrared drivers are used exceedingly rarely.

BPF and the realtime patch set

Posted Oct 26, 2019 8:25 UTC (Sat) by togga (subscriber, #53103) [Link] (8 responses)

Don't you think that degrading system performance back to the 90s is a non-starter?

BPF and the realtime patch set

Posted Oct 26, 2019 14:05 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (7 responses)

Turning off BPF does NOT degrade performance. It improves it instead.

The exceptions are niche: XDP and network packet capture.

BPF and the realtime patch set

Posted Oct 26, 2019 15:13 UTC (Sat) by togga (subscriber, #53103) [Link] (6 responses)

Interesting. What do you use as an alternative solution to BPF? (Routing, packet filtering, tracing, ...)

Personally, I love that the RT patch is going mainline; it has been a long time coming. I have no issue with two conflicting features (although it'd be amazing if we could refactor BPF to play nicely with the RT domain), but I do have an issue with silently dropping BPF when RT is enabled (and thus silently affecting lots of user-space tools under the hood).

I can see this happen as RT patch might be tempting for anyone with latency requirements (extreme gaming, interactive sound/video processing etc.).

BPF and the realtime patch set

Posted Oct 26, 2019 20:05 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

BPF is fine for what it was designed - packet filtering.

It's not fine for pretty much everything else, from syscall filters to LSMs.

BPF and the realtime patch set

Posted Oct 26, 2019 20:49 UTC (Sat) by togga (subscriber, #53103) [Link] (4 responses)

There you have it, packet filtering will be degraded without BPF.

I didn't get what you used instead for other parts but for LSM you use compiled kernel modules?

BPF and the realtime patch set

Posted Oct 28, 2019 18:57 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (3 responses)

Not many people actually care about BPF for packet filtering. It's mostly useful for high-performance routers doing crazy deep-packet inspection and routing stuff with XDP. The intersection of these people and the RT users is pretty much an empty set.

I absolutely despise the push to use BPF everywhere. It's just a bad technology - undebuggable, untraceable and hard to use. Moreover, BPF is now used as a fix for broken and dysfunctional subsystems in Linux, like the whole LSM infrastructure.

I guess the only other significant valid use for BPF is perf/tracing. Although I'd like a better language there, with support for string operations.

BPF and the realtime patch set

Posted Oct 31, 2019 22:09 UTC (Thu) by togga (subscriber, #53103) [Link] (2 responses)

Well, however much you dislike BPF, you are confirming that, today, there is no better alternative for packet filtering, LSM, or tracing.

The intersection between LSM or tracing and the RT patch is likely not an empty set. If RT is driven by latency requirements, I'd also say that fast high-level packet routing is interesting in this area whenever the network is involved in the latency (remote interaction).

BPF and the realtime patch set

Posted Nov 1, 2019 5:08 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

I seriously doubt that any RT system would need BPF-based packet filtering. Tracing is probably a more realistic use case, but probably not really that urgent.

BPF and the realtime patch set

Posted Feb 19, 2020 20:10 UTC (Wed) by voodoosound (guest, #121807) [Link]

I am in the empty set.

BPF and the realtime patch set

Posted Oct 24, 2019 6:34 UTC (Thu) by Lionel_Debroux (subscriber, #30014) [Link] (1 responses)

While it's strictly true that the RT patchset is out of tree, this characterization by the BPF maintainer is somewhat dishonest. The RT patchset has been around for much longer than the eBPF mud which has been churning out a number of vulnerabilities with significant impact, and whose sore security areas therefore need to be disabled by security-conscious administrators, unless the users of the systems really need the extra performance...

The "if it's not in mainline, it doesn't exist" mentality is a disease, which allowed the current bad design and implementation of BPF, which doesn't care about latency, to blossom unchecked for a while.

The measurements reproduced in this article show that BPF is unusable for a number of real-time purposes, and therefore, some, if not most, users of the RT patchset can't use BPF anyway... so why make their life disabling it harder, and (try to) block the merging of an important feature which has been worked on for many years, on the bad design of a recent and still fortunately highly optional feature ?
Couldn't the technical conflicts between BPF and RT be resolved after merging the RT patchset, which should be done rather sooner than later, to (among other reasons !) help avoid the introduction of other similarly badly designed infrastructure ?

BPF and the realtime patch set

Posted Oct 24, 2019 10:49 UTC (Thu) by knan (subscriber, #3940) [Link]

Please refrain from mudslinging and badmouthing in lwn comments.

BPF and the realtime patch set

Posted Oct 24, 2019 9:07 UTC (Thu) by cyphar (subscriber, #110703) [Link] (1 responses)

I'm not sure I understand the very strong objections by Miller. It appears fairly obvious (to me at least) that the requirement is intended to be dropped once BPF works properly on RT, and the proposed solution by Miller (to disable RT if BPF is enabled) has the same net result as the original patch -- users of RT will have to disable BPF for the short-term while the other problems are ironed out. RT is not going to be enabled by default, so it won't affect stock kernel.org builds nor distribution kernels (unless they enable RT). Is the issue that allyesconfig will favour the wrong thing? If so, why such a strong reaction?

BPF and the realtime patch set

Posted Oct 31, 2019 7:23 UTC (Thu) by marcH (subscriber, #57642) [Link]

The contrapositive of RT => not BPF is: BPF => not RT.

In plain (and symmetric) English they're just mutually exclusive.

Is this just some Kconfig user interface nuance I'm missing?

BPF and the realtime patch set

Posted Oct 24, 2019 12:03 UTC (Thu) by clugstj (subscriber, #4020) [Link] (2 responses)

So, how does "out-of-tree" RT work today with respect to BPF?

BPF and the realtime patch set

Posted Oct 24, 2019 13:53 UTC (Thu) by tglx (subscriber, #31301) [Link]

It disables BPF because BPF was not on the top priority of things to support on RT.

BPF and the realtime patch set

Posted Feb 19, 2020 20:16 UTC (Wed) by voodoosound (guest, #121807) [Link]

For me it works fine with XDP packet processing of audio and video streams (AVB).

RT merge hooray, let's work the problems

Posted Oct 24, 2019 14:10 UTC (Thu) by david.a.wheeler (subscriber, #72896) [Link]

I am *very* delighted that the realtime patch set is moving into the mainline kernel tree. There are massive numbers of devices that need these better real-time capabilities.

I think it should be entirely expected that some mechanisms (such as BPF) have trouble interacting with a "new" capability. I would like to see reasoned, careful discussions about how to try to resolve them, instead of simply saying "you can't do 2 things that are both useful". I hope that's where this will go.

BPF and the realtime patch set

Posted Oct 24, 2019 14:47 UTC (Thu) by SEJeff (guest, #51588) [Link] (1 responses)

Reminder that the out of tree -rt patchset is why enormous projects that benefit all Linux users, such as removing the BKL (Big Kernel Lock), were undertaken. Other big features like hires timers also were originally part of the out of tree -rt patchset. All Linux users have benefited from the work of the -rt patchset team, even if they've never used -rt enabled kernels.

https://kernelnewbies.org/BigKernelLock

BPF and the realtime patch set

Posted Oct 24, 2019 16:19 UTC (Thu) by nevets (subscriber, #11875) [Link]

The -rt patch was the push behind lockdep, and a major reason why Linux runs so well on large SMP machines today. Back in 2005, by turning spinning locks into mutexes, we were able to trigger deadlocks on a uni-processor machine that would require 8 or more CPUs to trigger upstream, as it effectively made every thread act as a separate CPU. Since the -rt developers were so tired of playing whack-a-mole in solving these deadlocks, the push for lockdep came about. Since then, the number of deadlocks in the mainline kernel has dropped significantly!

Yes, although the -rt patch set is out of tree, it was a major player in making Linux into the dominant operating system it is today. It should not be considered a second class citizen, EVER!

BPF and the realtime patch set

Posted Oct 24, 2019 23:54 UTC (Thu) by flussence (guest, #85566) [Link]

I disagree that BPF should never be turned off, or it wouldn't be optional in the first place. From where I'm standing, BPF seems about as “essential” as nvidia.ko. Or systemd. If a feature's being defended using tautologies like that it makes the whole thing suspect.

BPF and the realtime patch set

Posted Oct 25, 2019 2:03 UTC (Fri) by Kamilion (subscriber, #42576) [Link]

"You put your correctness feature in my performance feature!"
"You got your performance feature in my correctness feature!"
"" What?? ""
"" ... Delicious! ""

Ref: https://www.youtube.com/watch?v=O7oD_oX-Gio
Let's all just get along.

BPF and the realtime patch set

Posted Oct 31, 2019 23:19 UTC (Thu) by naptastic (guest, #60139) [Link]

I've been watching the -rt tree try to get merged since it started, and with how much of it has been merged, I'm frankly baffled that something that can disable interrupts and preemption for an unpredictable period of time snuck its way into the kernel.

With -rt merged, there will be four selections for preemption model. What would make sense, IMO, is for BPF's behavior to change depending on that setting. For PREEMPT_NONE || PREEMPT_VOLUNTARY, the current behavior seems obviously correct to me. For the other settings, though, the user is saying "I'm willing to sacrifice bandwidth to get bounded latencies" and the whole kernel, including BPF, should respect that.

Could BPF do the same thing interrupts did, with a top half and a bottom half, and only the top half runs in atomic context? Could it be done in such a way that performance for BPF isn't adversely affected if preemption is voluntary or off?
