An instruction-level BPF memory model

By Daroc Alden
June 4, 2024

There are few topics as arcane as memory models, so it was a pleasant surprise when the double-length session on the BPF memory model at the Linux Storage, Filesystem, Memory Management, and BPF Summit turned out to be understandable. Paul McKenney led the session, although he was clear that the work he was presenting was also due to Puranjay Mohan, who unfortunately could not attend the summit. BPF does not actually have a formalized memory model yet; instead it has relied on a history of talks like this one and a general informal understanding. Unfortunately, ignoring memory models does not make them go away, and this has already caused at least one BPF-related bug on weakly-ordered architectures. Figuring out what a formal memory model for BPF should define was the focus of McKenney's talk.

LWN has written about memory models for kernel code before. Briefly, accessing memory is so slow that modern CPUs put a great deal of effort into minimizing and parallelizing requests to memory. This often takes the form of caching, but can also involve reordering memory accesses. This means that, from the point of view of a thread running on another CPU, memory operations can appear to occur out of order. A memory model says which reorderings are permitted, and which instructions programmers can use to control this behavior. "Strong" memory models are those that permit fewer reorderings (and therefore often require less explicit synchronization), while "weak" memory models are more permissive.

McKenney opened the session by noting that — in an example of weak ordering typical of memory-modeling work — the patches he hoped to result from the talk had actually already been sent to the mailing list. Mohan had done an impressive amount of work in the week leading up to the summit, including fixing several problems with atomic instructions in BPF on weakly-ordered architectures and adding code to herd7, a memory model simulator, in order to show the changes were correct.

After acknowledging Mohan's work, McKenney then walked the attendees through the common assumptions embedded in BPF code about memory ordering, which amount to an informal memory model. BPF has a set of atomic instructions — notably xchg and cmpxchg — that provide full ordering. All CPUs and tasks agree that the side effects of all instructions that come before an atomic instruction are visible before it, and all effects of subsequent instructions are not visible until it has been executed. McKenney noted that this was "straightforward, but really important". Other atomic instructions, such as atomic adds, are unordered, unless they specify the optional FETCH flag.

The other source of memory ordering guarantees in BPF is jump instructions. Unconditional jumps don't affect ordering, but conditional branches do under some circumstances. When there is a load instruction, the result of which is used in the comparison for a branch, and then that value is stored following the branch, the store is guaranteed to occur after the load. McKenney noted that this was a pretty subtle point, and means that optimizing BPF programs requires some care. Alexei Starovoitov interjected: "Don't run BPF assembly through an optimizing compiler", which produced some chuckles.

Those properties form the bare minimum of what most BPF programs currently assume about ordering. But in the absence of a formalized model, it's impossible to say whether any given bug resulting from memory operations being reordered is a bug in the BPF program, the compiler, or the BPF just-in-time compiler (JIT). McKenney has four main goals for a formal memory model: it should operate at the level of individual instructions, so that it is directly applicable to the JIT; it should be consistent with the Linux kernel memory model (LKMM); it should support low-overhead mappings to supported hardware; and it should be able to grow as BPF does.

These goals pose some challenges. In order to be efficiently mapped to different kinds of hardware, the memory model should avoid forbidding reorderings that those architectures permit — which necessarily means that it will end up being a "weaker" memory model than any of those architectures alone. The benefit is that, if the formal model is successful, the JIT should not need to emit synchronization instructions or memory barriers in most cases.

McKenney thought that some people might wonder: don't we already have the Linux kernel memory model? Why not just map BPF assembly to LKMM C? Unfortunately, that doesn't really work, he said, at least not trivially. High-level languages like C and low-level languages like assembly have different event structures that make a simple mapping impossible. Assembly also has additional constraints around individual registers or similar low-level constructs that the LKMM doesn't address. He did think starting from an existing memory model was a good idea, however, since the basics of any memory model are likely to be fairly similar.

Of the existing memory models, McKenney considered a few alternatives. X86 is much too strong, and PowerPC is not actively developed (and missing atomic instructions). Ultimately, he chose ArmV8 as a possible starting point, since it is actively maintained and full-featured. The downside is that it includes some irrelevant hardware features, and that it is still stronger than PowerPC, so some changes will be needed.

McKenney showed the attendees a few selected sections of the ArmV8 memory model specification, to give a feeling for what adapting it would be like. Then he went into a series of examples of things "that kinda hit me over the head really hard" — unintuitive consequences of the model that would need to be considered while adapting it to BPF.

Examples

The first example dealt with dependencies between loads and stores. Suppose that there are two reads (called R1 and R2), and one write (called W1). In the program, the instructions occur in the order R1, R2, W1. When executed, R1 reads from some known address (a pointer), returning a second address that R2 reads from. W1 writes to a different address. The question is: is the CPU allowed to re-order the write to occur before the reads? After describing the scenario, McKenney paused to give the experienced kernel developers in the room a chance to guess.

As it turns out, both ArmV8 and PowerPC forbid that reordering, but the LKMM doesn't. The reason has to do with aliasing — the CPU, when it is executing those instructions, can't know whether R2 and W1 will access the same location until read R1 actually resolves. So of course W1 can't occur before R1, because that might cause R2 to read the wrong value. The LKMM, however, operates on a more abstract level. If the pointers involved in this scenario are passed in as arguments to a function written in C, the LKMM is allowed to assume that they don't alias. Since they don't refer to the same location, reordering W1 before R1 doesn't cause any problems.

In this case, BPF is in a position more like a CPU than like the C abstract machine. The JIT could theoretically know whether two accesses alias, but that would seriously complicate code generation, so it makes more sense to adopt the same restriction as ArmV8 and forbid the reordering.

The second example was somewhat more esoteric. Suppose that you have two threads and one variable X. One thread reads X twice, in read operations R1 and R2. The other thread writes to X, with an operation W1. Suppose that R2 occurs before W1 (i.e., it doesn't see the value written). All the operations use the platform's equivalent of READ_ONCE() and WRITE_ONCE(), i.e. atomic reads and writes with weak ordering guarantees. ~~Does R1 also necessarily occur before W1?~~ [McKenney clarified in the comments that this is not quite the correct framing. The proper question is: does R2 being unable to read a value older than the value read by R1 have any other ordering consequences?] After a moment of silence, one audience member pointed out that the answer can't possibly be "Yes", because otherwise McKenney wouldn't be asking, producing another round of laughter.

And indeed, on PowerPC, ~~operation R1 is allowed to read the value written by W1, even though R2 doesn't.~~ even though R2 might read a value later than that of R1, that doesn't impose additional ordering consequences. To prove it, McKenney showed the scenario running in herd7, showing that it found a proof that the reordering is possible. The scenario did need to include a few additional complications in order to actually make the test meaningful, and McKenney spent a while walking through the requirements to actually measure the scenario so that the audience could understand what was happening in the example. "If you're serious about understanding this — which I'm not sure you should be —", McKenney advised, "you can consult the example in the book", referring to his book about parallel programming.

McKenney noted that this is an example of PowerPC being exceedingly weak — ArmV8 forbids the same reordering. The BPF developers probably prefer to go with PowerPC's version, however, so that they can avoid emitting extra memory synchronization instructions during code generation. BPF doesn't currently have an equivalent of READ_ONCE() and WRITE_ONCE(), so the example doesn't necessarily apply, but it's something to remain mindful of.

In short, even starting from the ArmV8 specification will require a certain amount of adaptation and careful thought, McKenney summarized. He then showed some of the support for BPF code in herd7. Mohan is working to extend that support, which will probably prove useful for validating any proposed formal model. The session ended with some discussion about where the memory model should live once it was written. McKenney plans to get it into the kernel's tools/standardization or documentation directories, but isn't particular about where exactly it ends up.

Index entries for this article
Kernel	BPF
Conference	Storage, Filesystem, Memory-Management and BPF Summit/2024

An instruction-level BPF memory model

Posted Jun 5, 2024 1:01 UTC (Wed) by comex (subscriber, #71521) [Link] (12 responses)

Huh. Doesn’t the second example violate C++11’s read-read coherence requirement? [1] That is, assuming you translate the reads and writes described in the example into std::atomic loads and stores that use memory_ordering_relaxed. Yet the C++11 memory model was designed with the intent that relaxed loads and stores would compile into simple load and store instructions, without barriers, on the architectures common at the time – which includes POWER.

So what gives? Am I misunderstanding the example, am I misunderstanding the C++ memory model, or is there actually a conflict here?

[1] As described in: https://en.cppreference.com/w/cpp/atomic/memory_order

An instruction-level BPF memory model

Posted Jun 5, 2024 4:37 UTC (Wed) by foom (subscriber, #14868) [Link] (11 responses)

PPC definitely doesn't allow the reordering on a single memory location described here as example 2.

I suspect the description may have been oversimplified or mis-transcribed from the talk, and likely originally involved mutations on multiple memory locations.

An instruction-level BPF memory model

Posted Jun 5, 2024 13:38 UTC (Wed) by daroc (editor, #160859) [Link] (10 responses)

It is, of course, entirely possible that I have made an error. Looking at my notes, I am quite sure that McKenney's second example did involve only one memory location, but it's possible that I missed some nuance of the specific kind of loads and stores used. There was a substantial portion of the talk dedicated to explaining how the reordering could actually be detected which went far enough over my head (and into the weeds) that I elided over it in the article.

An instruction-level BPF memory model

Posted Jun 5, 2024 23:45 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (9 responses)

You got it right, the "hazard ordered before" example has only the single variable "x".

Admittedly unconventional for a litmus test, but then again, this particular litmus tests probes an obscure corner of the ARMv8 memory model. ;-)

An instruction-level BPF memory model

Posted Jun 6, 2024 12:25 UTC (Thu) by foom (subscriber, #14868) [Link] (8 responses)

Just to be clear, you are saying the article's statement "operation R1 is allowed to read the value written by W1, even though R2 doesn't." is an accurate reflection of your intended statement?

That, on powerpc CPUs, if the initial value is 0 and W1 stores a 1, you could have R1 read 1 (as written by W1), yet R2 (later in program order than R1, on the same thread) read the initial value of 0?

I'm pretty sure that would be a novel claim, if so.

An instruction-level BPF memory model

Posted Jun 6, 2024 13:53 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (7 responses)

That most certainly would be a novel claim. ;-)

But the question is instead: "In ARMv8's hazard-ordered-before (HOB), does the fact that R2 cannot read a value older than that read by R1 impose additional ordering consequences?"

The answer for PowerPC is "no". So to the extent that HOB does impose additional ordering consequences, the BPF memory model needs to avoid imposing similar ordering consequences. This is not a surprise, given that ARMv8 is other-multicopy atomic and PowerPC is non-multicopy atomic. So we should expect that there are constraints in the ARMv8 memory model that do not apply to the BPF memory model. This is all OK, because it is OK for hardware to provide stronger ordering than required, with x86 being the usual poster child for this extra strength.

For more detail, please see slides 59-77 of the presentation at https://drive.google.com/file/d/1zLkpBVL1chDfyqZATOyKb_1Q...

Perhaps one of your questions is addressed by slide 64 and later. The initial scenario involves only one variable, but the added scaffolding required to test for additional ordering does add another variable. Does that help?

An instruction-level BPF memory model

Posted Jun 6, 2024 16:48 UTC (Thu) by comex (subscriber, #71521) [Link] (2 responses)

So Daroc indeed described this incorrectly in the article.

Which is quite understandable! The property actually being tested is subtle enough that even with the benefit of reading the slides repeatedly and then reviewing the ARM architecture reference manual, I still don't really understand it :)

An instruction-level BPF memory model

Posted Jun 6, 2024 17:06 UTC (Thu) by daroc (editor, #160859) [Link] (1 responses)

Thank you for catching that; I had to read through Paul's slides again myself to figure out what was going on.

I think that I have made a correct correction to the article.

An instruction-level BPF memory model

Posted Jun 6, 2024 17:10 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Looks good to me!

An instruction-level BPF memory model

Posted Jun 6, 2024 18:43 UTC (Thu) by Alan.Stern (subscriber, #12437) [Link] (3 responses)

The situation is very confusing; a more detailed explanation may help.

For one thing, consider the description of Hazard Ordered Before (HOB) on slide 60 of your talk. The slide isn't entirely clear; does it require that R1 is coherence-before W2, or does it require that R1 executes before R3? The first meaning (which seems to be the one you're using) follows directly from the read-read coherence rule, so it is a consequence of per-variable sequential consistency. Since you undoubtedly want the BPF memory model to include this, there's no need to consider HOB. And to be crystal clear, PowerPC does not allow an earlier read to observe a later value than a later read of the same variable in the same thread.

Adding to the confusion is the fact that, despite its title, the litmus test on slide 69 does not in fact test for HOB. That is, the condition in the "exists" clause isn't 0:r3=1 /\ 0:r5=0. The whole added business about y and the memory barriers is irrelevant for HOB. (Compare it to the Linux kernel code fragment on slide 62.) If you wanted to discuss the R pattern in slides 63-68, you shouldn't have gotten it all mixed up with HOB. (That could well be what confused Daroc.) It probably would have been better to avoid both topics and consider other-multicopy atomicity instead.

Another thing is worth pointing out, relating to slide 57. Why doesn't the LKMM include Dependency Ordered Before? Not because it knows the full execution a priori -- that's equally true for the hardware memory models. The real reason goes back to the early days of development of the LKMM. Originally we had two different versions, a strong and a weak model. The strong model _did_ include Dependency Ordered Before. But we ended up dropping the strong model; the current LKMM is a descendant of our weak model.

An instruction-level BPF memory model

Posted Jun 7, 2024 15:40 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (2 responses)

No argument whatsoever with the confusing part! ;-)

Agreed, HOB has to be more than coherence-before. And the point of those slides was in fact to come up with a test for some ordering stronger than coherence-before. Slide 69's attempt does show a difference between ARMv8 and PPC using R, but you are quite right if you are claiming that I did not prove either: (1) That this difference was in fact a consequence of ARMv8 HOB, or (2) That there is not some better test for the effects of HOB beyond coherence-before. I would of course be very happy for you to show me a better litmus test demonstrating the beyond-coherence-before effects of HOB! I am with you in suspecting that other-multicopy atomicity has something to do with it, but I have not yet proved this to myself. (Nor have I yet asked the ARM folks because experience indicates that if I haven't beat my head pretty hard against the question, I won't understand their answer.)

I am not so sure that I agree with your suggested change to the "exists" clause, given that DOB requires that R3 be coherence-before E2. What am I missing here?

You are quite correct on the historical sequence of events leading to our deciding to leave DOB out of LKMM. I was instead giving a post-facto rationale for that decision. I do not see how current hardware with current restrictions on speculation could leave out DOB without undue violence to the rest of its memory model, which might well be a failure of imagination on my part. So what are your thoughts on how reasonable hardware could leave out DOB?

An instruction-level BPF memory model

Posted Jun 7, 2024 18:58 UTC (Fri) by Alan.Stern (subscriber, #12437) [Link] (1 responses)

Responding to your later points first...

> I am not so sure that I agree with your suggested change to the "exists" clause, given that DOB requires that R3 be coherence-before E2. What am I missing here?

Here we are discussing a litmus test (on slide 69) for HOB, not DOB, so I'll assume your "DOB" above is a typo.

Yes, HOB requires that R3 is coherence-before E2. Hence the "exists" clause should test for the value of R3 (i.e., 0:r5) being the original value of y, namely 0. The conclusion of HOB is that R1 must also be ordered before E2, in other words, that the value of R1 (i.e., 0:r3) must also be the original value of y, namely 0. (Or to put it another way, HOB says that while R1 might be executed later than R3, out of program order, it can't be executed so _much_ later that it observes the value written by E2.) Therefore to test for a violation of HOB, the "exists" clause has to ask whether 0:r3 is different from 0 and 0:r5 is equal to 0. That's what my suggested change does. (Notice that none of the executions listed on slide 70 violate HOB in this way.)

> I do not see how current hardware with current restrictions on speculation could leave out DOB without undue violence to the rest of its memory model, which might well be a failure of imagination on my part. So what are your thoughts on how reasonable hardware could leave out DOB?

I agree with your point of view; DOB is a practical necessity. Another thing it is needed for, which the slides don't mention, is precise exceptions. Consider the Linux-kernel litmus test on slide 41; if the value of gp happened to be NULL so that the write to r1[i].value caused an addressing exception, the exception wouldn't be precise if the write to x (later in the program) had already been executed.

Nevertheless, even though I expect DOB will be enforced by all hardware that runs the Linux kernel, there doesn't seem to be any good reason for including it in the memory model. As far as I know, nobody has suggested any use cases for it or shown that any existing code in the kernel relies on it.

> Slide 69's attempt does show a difference between ARMv8 and PPC using R,

Indeed it does. And I would claim that this difference is _entirely_ due to R and has no connection with HOB at all. Despite what the slide says.

> I would of course be very happy for you to show me a better litmus test demonstrating the beyond-coherence-before effects of HOB!

I don't know what effects you're talking about. As far as I can tell, the only effect of HOB is related to coherence ordering; it says that the value read by R1 has to be earlier in the coherence order than the value written by E2.

An instruction-level BPF memory model

Posted Jun 11, 2024 17:08 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Yes, I should have said "HOB", apologies for my confusion.

On the "exists" clause, no memory model that enforces single-object sequential consistency can have 0:r3=1 and 0:r5=0. I was therefore looking for some consequence of HOB beyond single-object sequential consistency, hence my "exists" clause matching the HOB defintion.

A simpler approach, which I should have thought of to begin with, is to disable HOB in the Aarch64 memory model, for example, like this:

let haz-ob =
  (* Exp-haz-ob | *) TLBI-ob | IC-ob

(If you try this in a clone of the herd7tools repo, don't forget to “make install” so that your herd7 command knows about the change. And don't forget to do so again after undoing this act of vandalism!)

With this change, running herd7 on catalogue/aarch64/tests/CoRR.litmus says Sometimes, which supports your analysis of HOB, and invalidates mine. So, like DOB, HOB is not optional from a hardware viewpoint.

I agree that LKMM should continue to ignore DOB, at least until someone comes up with a good use case. But given that leaving out HOB causes the ARMv8 memory model to say Sometimes for CoRR.litmus, HOB is not optional.

Thank you for keeping me honest!

An atomic instruction is a full barrier to everything?

Posted Jun 5, 2024 5:59 UTC (Wed) by epa (subscriber, #39769) [Link] (3 responses)

All CPUs and tasks agree that the side effects of all instructions that come before an atomic instruction are visible before it, and all effects of subsequent instructions are not visible until it has been executed.

Is that really true? I am sure there are CPUs that are ultra-strict like that. But I would have expected that only side effects of that instruction (and the earlier instructions which caused it to happen) are guaranteed visible.

If I do an atomic swap between register r9 and address 123, but no other code anywhere uses r9 or looks anywhere near that address, is it necessarily the case that my swap instruction makes visible the effect of all instructions before it? So I could do things with other registers and addresses, then do any swap instruction on any register to act as a barrier for everything done so far?

An atomic instruction is a full barrier to everything?

Posted Jun 5, 2024 10:46 UTC (Wed) by farnz (subscriber, #17727) [Link] (2 responses)

There are CPUs that are ultra-strict like that (all x86 CPUs, I believe, meet this requirement); in your example, though, the detail you're missing is that if no other code uses r9 or address 123, there's no way to tell if the swap instruction was executed.

And it's a one-way condition; if the swap instruction was executed, then you know that all instructions before the swap instruction were executed. But if you do not know whether or not the swap instruction was executed, you do not know if the instructions before it are visible or not - and if you neither look at r9, nor inspect address 123, how do you know it executed?

An atomic instruction is a full barrier to everything?

Posted Jun 5, 2024 10:57 UTC (Wed) by epa (subscriber, #39769) [Link] (1 responses)

Yes I see your point. So a more pedantic way to state the property would be that “if the effect of the atomic instruction is visible then the effects of all instructions before it are visible”.

An atomic instruction is a full barrier to everything?

Posted Jun 5, 2024 11:14 UTC (Wed) by farnz (subscriber, #17727) [Link]

Yes, and that's implicit in virtually all descriptions of memory ordering; what we normally talk about is "happens-before" ordering, where we say that A happens-before B if and only if we can deduce that A must have happened given that we can demonstrate that B has happened.

There's also sometimes the need to talk about whether a happens-before ordering is global (i.e. if thread X sees that A happens-before B, then thread Y must agree that A happens-before B) or local (if thread X sees that A happens-before B, thread Y could still see that B happens-before A, that A happens-before B, or that there's no ordering relationship between A and B). This is, I think what you were reaching towards; an atomic instruction does not always guarantee a global ordering, but on x86, it's defined as doing so (LOCK prefixes are defined this way).