
Zapping pointers out of thin air

By Daroc Alden
October 15, 2024
Kangrejos 2024

Paul McKenney gave a presentation at Kangrejos this year that wasn't (directly) related to Rust. Instead, he spoke about the work he has been doing in concert with many other contributors on improving the handling of subtle concurrency problems in C++. Although he cautioned that his talk was only an overview, and not a substitute for reading the relevant papers, he hoped that the things the C++ community is working on would be of interest to the Rust developers present as well, and potentially inform future work on the language. McKenney's talk was, as is his style, full of subtle examples of weird multithreaded behavior. Interested readers may wish to refer to his slides in an attempt to follow along.

Lifetime-end pointer zapping

He began with lifetime-end pointer zapping, which is a somewhat obscure topic, so he gave an example of when it could be relevant. Suppose that a C++ programmer has written a stack with two operations: push and clear. They want this to be multithreaded, so the idea is that multiple threads can push to this stack, and multiple threads can clear it, but only one clearing thread will "win" and get all of the pushed items. The natural way to model this situation is with a linked list and atomic compare-and-swap (CAS) operations to swap the pointer to the head of the list.
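A minimal C++ sketch of such a stack might look like the following; the Node type, the function names, and the memory orderings are illustrative assumptions rather than code from the talk.

    #include <atomic>

    struct Node {
        int value;
        Node *next;
    };

    std::atomic<Node *> top{nullptr};

    // Any number of threads may push concurrently.
    void push(Node *n) {
        Node *old_top = top.load(std::memory_order_relaxed);
        do {
            n->next = old_top;   // n points at what we believe is the current head
        } while (!top.compare_exchange_weak(old_top, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed));
    }

    // Clearing threads race: the one whose exchange() returns the non-null
    // head "wins" the whole list; later clearers just get nullptr.
    Node *clear() {
        return top.exchange(nullptr, std::memory_order_acquire);
    }

The trouble described next arises inside the compare_exchange_weak() loop, in the window between loading the old head and swapping in the new one.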

So suppose that the top of the stack points to object A, and a thread is trying to push a second object, B. The thread puts a pointer to A inside B, but the top still points to A. Then, the thread is preempted, and a second thread clears the whole stack. It processes the entries, and frees A. Now A is on the free list, but B still points to it. Then the second thread allocates and pushes object C. The malloc() implementation gives C the same chunk of memory as A, so now B points to C — or so we might think. The first thread resumes, its compare-and-swap operation sees the address it expects (A's address, now occupied by C), and the push of B succeeds. See the linked slide for a graphical depiction of the process.

But all is not well. The pointer from B to C is an example of a "zombie pointer", McKenney said — it's a pointer that would be legal in assembly code, and the types match, but it's not allowed in C++. Pointers in C or C++ are not just the addresses of particular locations in memory, they also have provenance. And B's pointer to A is invalidated as soon as A is freed — even if C ends up being put back into the same location. So if the compiler can prove that A will be freed and B will hold onto a reference to it anyway, then it knows any attempts to read through that pointer are undefined behavior, and it is free to make optimizations that assume this won't happen. Plus, while the simple example of this stack is "ABA tolerant", not every data structure deals so well with problems like this.

Björn Baron agreed that this particular problem would occur in Rust as well. Even though Rust's provenance rules are not completely settled, it's still definitely undefined behavior to have a pointer to an allocation that has been freed. There has been some discussion about adding a compiler intrinsic to help annotate this case, he said. McKenney said that this matched his experience working on the problem — a lot of discussion for relatively little progress.

One problem is that compiler developers really like provenance, because it permits otherwise impossible optimizations. Tracking provenance is complicated, though. Provenance can be erased, such as by converting a pointer to an integer and back (see the sketch below), but that isn't a complete solution to lifetime-end pointer zapping. Last year Davis Herring proposed "angelic provenance", McKenney said, which requires the compiler to pick the provenances that make a program valid, if there's more than one possible interpretation. This generally simplifies reasoning about provenance, but does not help with the example he gave.
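The erasure he referred to is just a round trip through an integer type; a minimal sketch (the function name is made up for illustration):

    #include <cstdint>

    // Cast the pointer to an integer and back. Traditionally this strips the
    // provenance the compiler was tracking, which is precisely what makes the
    // result hard for the optimizer to reason about.
    void *launder_via_integer(void *p) {
        std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
        return reinterpret_cast<void *>(bits);
    }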

[Paul McKenney]

One potential fix would be to make atomic<T*> values (C++'s atomic pointers) erase provenance, including pointers referenced by successful CAS operations. This proposal has gone to the C++ committee, and initial reviews have been positive. Benno Lossin said that since Rust is trying to develop its own provenance rules, it probably makes sense for the Rust developers to consider having similar APIs that remove provenance. Alice Ryhl mentioned that Miri (Rust's undefined behavior sanitizer) actually already found a problem similar to the one with the stack McKenney presented in the queue implementations in Rust's standard library earlier this year.

Out of thin air

Next, McKenney discussed out-of-thin-air (OOTA) cycles. He gave an example involving two processes and two variables (initially both zero). Suppose that one process reads the value 42 from X, using a relaxed read. Then it stores that value into Y. The other process does a relaxed read of Y — getting 42 — and then stores it into X, for the first process to read. If both variables were initially zero, where did the 42 come from? In C++ terminology, it came "out of thin air". The value does not have to be 42; a cycle like this could produce any value at all, which is a major problem for reasoning about programs, so we would like to be able to show that it does not happen. The "cycle" part of the name comes from the fact that if you draw a diagram of where the value came from, it makes a loop — X got it from Y, and Y got it from X.
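In C++ terms, the example has the shape of the classic OOTA litmus test; this is a sketch with arbitrary variable names, not code from the slides.

    #include <atomic>

    std::atomic<int> x{0}, y{0};

    void thread1() {
        int r1 = x.load(std::memory_order_relaxed);   // somehow reads 42?
        y.store(r1, std::memory_order_relaxed);
    }

    void thread2() {
        int r2 = y.load(std::memory_order_relaxed);   // somehow reads 42?
        x.store(r2, std::memory_order_relaxed);
    }

    // If r1 == r2 == 42 were allowed, each store would be justifying the load
    // that feeds the other store: a cycle in which the value has no origin.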

OOTA cycles are intuitively impossible, and in fact do not happen on real hardware — but formalizing why they cannot happen has turned out to be surprisingly hard to fit into existing memory models. This is important, because the memory model of a language is part of determining which programs are valid, and therefore which optimizations are valid as well. The C language committee eventually gave up, McKenney said, just defining the memory model and then tacking on a note saying "and also don't do this." Java ran into the same problem when its memory model was being formalized around 2000, and likewise gave up.

Researchers from the University of Kent have a promising theoretical model that takes into account compiler optimizations, and could finally put the matter to rest. But their model is theoretical and not yet ready, he said. McKenney has a simpler explanation for why this isn't an issue in real life: reads take time, and instruction execution takes time. To form an OOTA cycle, he said, at least one of those steps must go backward in time in the real world. Either a read must somehow get a future value, or executing a read and a store must take negative time. The only way this can be broken is from bad hardware speculation (but bad hardware can do anything anyway), or from the compiler making it so that the operations take no time at all to execute (such as by optimizing them out).

The reason this is a poor match for existing memory models is that they don't have a notion of time — only a notion of causal relationships. That makes it difficult to express, in that timeless formal language, why an OOTA cycle is impossible.

Baron suggested that maybe the CPU could predict that there would be a store and later check whether that guess was correct. He wasn't aware of any CPUs that actually did that, but it could theoretically happen. McKenney was glad that someone had brought that up unprompted — "and I haven't paid that man" he said with a grin.

McKenney explained that there are really two kinds of links between events: temporal and atemporal links. A store-to-load link is temporal; when a store is performed, it takes time for that store to propagate across a system. Time has to elapse for information to propagate in the real world. But store-to-store links are atemporal — one store can 'win' over another even if it happened earlier in time, because CPU caches don't keep a global clock. McKenney has done experiments on actual hardware showing that the winning store can happen more than a microsecond before the store that it overwrites. Finally, load-to-store links are also atemporal and tell you nothing about the ordering of the events, because you might get an old value.

So, on real computers, the only kind of link that you can rely on to establish the actual order in which things happened in the real world is a store-to-load link, where one CPU stores a value, and another loads it. How is this related to speculation? Well, suppose that a CPU decided to speculate a value. It would, at some point, need to check whether that speculation was correct. In doing so, it would need to perform a load. If that load sees a different value, then it needs to throw away the speculation. But if it sees the speculated value, it still can't take a negative amount of time, since it did eventually need to do that load. This means that an OOTA cycle would still be impossible, even with speculation, because speculation cannot actually get rid of a load — only move work that depends on that load earlier in time.

This sounds comforting, but what's the takeaway for compiler developers? The fact that OOTA can't occur on real hardware for physics-based reasons means nothing to compiler developers, who are forced to rely on the memory models that the C language specification provides. Ultimately, McKenney had one takeaway for compiler people: don't invent non-volatile atomic accesses. It is already against the rules to invent accesses to volatile atomics, but it is somewhat less clear why inventing accesses to non-volatile atomics would cause a problem; refraining from doing so prevents OOTA cycles.

The reason is semantic dependencies. The C abstract machine may not have a real notion of time, but it does have a notion of data dependencies. This would be enough to prevent OOTA cycles, except that compilers are permitted to invent extraneous loads (such as to reduce register pressure). McKenney showed an example where adding a second load of a non-volatile atomic boolean enabled a sequence of optimizations that actually removed a data-dependency between two other variables. With the dependency removed, the compiler could then reorder the remaining statements to produce a value out of thin air.

So McKenney's final piece of advice on the topic of avoiding OOTA cycles is simply don't do that. Inventing, duplicating, or repurposing non-volatile atomic loads is a necessary step to cause OOTA cycles, so if the compiler doesn't do that, then it doesn't need to worry about this class of errors. Whether the C++ committee will accept McKenney's paper on the topic saying as much remains to be seen — but since this issue has vexed the C++ community for some time, it seems likely.




Provenance

Posted Oct 15, 2024 16:20 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (4 responses)

Historically Rust's pointer model appears compatible with the pure PVI (Provenance Via Integers) "bag of bits" model in which a pointer is just an address. On x86-64 you can still (though it's a bad idea) just 'as' cast a pointer to the usize 64-bit unsigned integer and later cast it back and it will probably work, after all the same trick works in C and C++ on the same LLVM backend... mostly.

Aria added an experimental (nightly) set of Rust pointer APIs as an experiment to find out whether most (all?) Rust programmers could live with strict PNVI (Provenance NOT Via Integers) rules. Rust's Strict Provenance APIs are now getting an FCP (Final Comment Period) which is one of the last steps before the stabilization of the associated APIs. Aria's experiment also adds APIs for "Angelic determinism" (so-called PNVI-ae-ua: Provenance Not Via Integers, but with Address Exposure and User Disambiguation, the model currently proposed for the C programming language) but if you pick this you need to appropriately expose any pointers that you're expecting to be able to construct later this way.

In practice this latter API is equivalent to the existing (but to perhaps be deprecated one day) 'as' cast of an integer type to a pointer. The cases where PVI "should" work but PNVI-ae-ud could not are typically not actually working with the compiler you get today because LLVM never actually supported them.

At the very bottom where the Linux kernel hardware drivers live some amount of these "Angelic" APIs is inevitable. You literally can't express the idea "I know the physical address of this MMIO register" in the strict provenance model. However there's a reason to prefer strict provenance - a reason which doesn't apply as well to C++. Analysis both by a compiler and at runtime benefits from the PNVI semantics, in the weaker models (and especially PVI) it's not practical to do much analysis, for most non-trivial code the analysis results will just be a confused shrug. Maybe it's fine? In Rust it's practical to move the relatively modest amount of unsafe code that *could* use strict provenance to the strict API, and the part which is both unsafe *and* reliant on tricks not possible with PNVI to the exposure API, a tiny fraction which will include some Linux drivers but not most of the kernel or userspace software. In C++ if a window of time to attempt such a thing existed it was decades ago.

I hope the effect of this talk was not to derail stabilization of the provenance APIs in Rust, on the other hand, if the kernel people need something specific that's in neither strict nor exposed APIs (I don't think so) this is the last time to say so and not expect rotten fruit to be hurled by other Rust users.

Provenance

Posted Oct 15, 2024 23:36 UTC (Tue) by dcoutts (subscriber, #5387) [Link] (3 responses)

Quite often when I read these kinds of (fascinating) articles about the intricacies of concurrent lock free programming in C/C++, I come away thinking smugly to myself "it isn't a problem in my favourite GC'd language".

As far as I've understood, the ABA problem can't occur in a GC'd language, because the memory location cannot get recycled by the storage manager since there's still a live reference to the old object. The same goes for the "zombie pointer" example:

> Then, the thread is preempted, and a second thread clears the whole stack. It processes the entries, and frees A. Now A is on the free list, but B still points to it.

In a GC'd language this doesn't happen. B still pointing to A keeps A live. Thus the memory location for A cannot yet be reused. And thus C will necessarily have a different address from A and so the CAS will do the right thing.

Obviously, one has to pay the GC performance tax. But having done so, it appears one can take advantage of lock free programming much more easily. Am I wrong to be so smug? Is this dangerous over-simplification?

Provenance

Posted Oct 16, 2024 5:00 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

> Obviously, one has to pay the GC performance tax. But having done so, it appears one can take advantage of lock free programming much more easily. Am I wrong to be so smug? Is this dangerous over-simplification?

You're actually right. Garbage collection is great help for many lockless algorithms. There's a great article about it: https://bitbashing.io/gc-for-systems-programmers.html ("Garbage Collection for Systems Programmers")

Sidenote, not ALL garbage-collected languages are created equal. For example, Go uses "fat pointers" for interfaces, and they can't be copied atomically, so it can't guarantee this kind of safety.

Provenance

Posted Oct 25, 2024 10:02 UTC (Fri) by jch (guest, #51929) [Link] (1 responses)

> Go uses "fat pointers" for interfaces, and they can't be copied atomically, so it can't guarantee this kind of safety.

Couldn't you take a pointer to an interface, and manipulate that atomically instead of the interface itself?

Provenance

Posted Oct 25, 2024 16:13 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Only via unsafe code. It's also a source of constant issues because a `nil` in Go might not be equal to another `nil`.

C++ already gives us a tool for pointer zapping

Posted Oct 15, 2024 16:25 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (25 responses)

Specifically, if you want to do this:

> One potential fix would to make atomic<T*> values (C++'s atomic pointers) erase provenance, including pointers referenced by successful CAS operations.

Then you just call std::launder() every time you retrieve a pointer from an atomic<T*>, and if I am reading the standard correctly, this is perfectly legal, but might invoke implementation-defined behavior because the pointer is technically invalid and we are evaluating the pointer's value before it is passed to std::launder(). I would like to think that most compilers would be sane enough not to attempt to define and document some restriction on that (which they are required to do if they want to mess with it, since it is IB and not UB). On hardware such as CHERI, this whole idea is either unworkable (std::launder() cannot exist in full generality on that platform) or unnecessary (because compare-and-swap could hypothetically also compare the hardware's pointer metadata, and then you'd never end up in this position in the first place because the push of B would fail and have to be retried with the correct pointer to C).
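Concretely, the incantation being suggested would look something like the following sketch; the Node type is a placeholder, and whether this is actually legal is exactly the question raised above.

    #include <atomic>
    #include <new>   // std::launder (C++17)

    struct Node { int payload; };
    std::atomic<Node *> head{nullptr};

    Node *load_head() {
        Node *p = head.load(std::memory_order_acquire);
        // std::launder() returns a pointer to the object located at p's
        // address, discarding whatever the compiler inferred about p's
        // history; its precondition is that a suitable object really is
        // alive at that address.
        return p ? std::launder(p) : nullptr;
    }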

I have no idea if there is an analogous function in C. Rust has expose_provenance, but it's still unstable and also does not work on CHERI.

C++ already gives us a tool for pointer zapping

Posted Oct 15, 2024 20:24 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (23 responses)

Thinking about this a little more deeply, on the one hand, I'm not 100% sure that I have read the "pointer-interconvertibility" requirement of std::launder() correctly, so it may be the case that I am mistaken and this is in fact UB.

But thinking more broadly, something has to give in the realm of provenance. We cannot have all three of the following:

1. Pointers have indispensable provenance - that is, you can't/shouldn't ignore the provenance requirements (regardless of whether std::launder() is the correct incantation for doing so or not).
2. Pointer comparisons, including atomic compare-exchange, are agnostic to or ignorant of provenance (i.e. pointers with non-overlapping provenances may compare equal, and the compiler or architecture may then proceed to treat them differently anyway).
3. Pointer atomics are practically useful for implementing lock-free data structures.

Assuming we don't want to lose (3), we can give up either of (1) or (2). (1) is already talked about both in the article and my initial comment, but the problem is that on CHERI, provenance really is indispensable at the hardware level, and there are no magic compiler incantations to change that, meaning we have no choice but to drop (2) instead of (1) when implementing C++ (or Rust, for that matter) on CHERI. So if you have source code that relies on such incantations, it will not work on CHERI, and might even fail to compile depending on how the CHERI implementation decides to deal with whatever part of the standard specifies that magic incantation.

I'm not sure how the standards committee should approach this if they want to allow programmers to write one codebase that works on both CHERI and regular hardware. You could say something like "either provenance does not exist and no std::launder-like incantation is required, or else it does exist and pointers with non-overlapping provenance must compare unequal in the first place." But then you lose a lot of compiler optimizations on standard (non-CHERI-like) hardware, so they probably don't want to do that either.

Another option would be to specify that == and similar operations have the effect of propagating provenance from one pointer to the next, but now you have two problems:

1. == is symmetric, so this expands provenance rather than replacing it. CHERI does not support a pointer that has more than one provenance at the hardware level, so the compiler has to somehow divine which direction the provenance should flow (or use "angelic provenance," but that is not an algorithm for what provenance-transfer instructions you should really emit on a CHERI-like platform, it's just a hack to say that non-CHERI-like platforms should DTRT).
2. == is transitive, so this also expands provenance non-locally (i.e. if we previously know that x == y, and we later discover that y == z, then we must expand provenance to x even though it may be in an entirely different part of the program). That makes escape analysis very hard.

C++ already gives us a tool for pointer zapping

Posted Oct 15, 2024 21:32 UTC (Tue) by tialaramex (subscriber, #21167) [Link] (15 responses)

For (2), compiler reality is that it's already unclear whether you were supposed to be able to detect provenance. The language owners would like a Yes/No answer; they have a preferred answer but could live with the other option if they have to - however the compiler answer is "Um, maybe?"

LLVM has a pile of open tickets about this including from both Clang and Rust, read all of https://github.com/rust-lang/rust/issues/107975 to get some idea how insane the situation is, and if you want to weep read the related LLVM tickets too.

IIUC an actual CHERI hardware person chimed in on the Rust strict provenance stuff and said the whole point of CHERI is that the PNVI-ae-udi C or C++ style APIs are nonsense on CHERI and that's OK. If you can't live with strict provenance (and some of the kernel cannot) you can't have CHERI and that's too bad, you'd better be very, very careful instead. There is no way to "make it work" and attempts to do so were a disaster in the Morello hardware by their telling.

C++ already gives us a tool for pointer zapping

Posted Oct 15, 2024 22:29 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (14 responses)

Strictly speaking, we don't need CHERI to support non-strict provenance checking. We merely need it to answer honestly about whether two pointers are actually equivalent (substitutable, not just numerically equal) when we test them for equality, and we also need that behavior to carry over into its atomics.

Since I have already called out that I might be wrong about std::launder(), for the rest of this comment I am going to use the word "launder" to mean "whatever incantation causes the compiler to ignore provenance requirements for a given pointer, assuming that such an incantation exists or would be standardized." Under PNVI-plain, this incantation is a simple pointer-to-integer round trip, but I want to speak generally for all possible semantics and not just PNVI-ish semantics.

As for the LLVM thing, I think it depends on how this gets formalized. There are four-ish possible formalizations that I can think of:

0. Comparing pointers which have non-overlapping provenance is implementation-defined or UB, so LLVM's behavior is right (but maybe they still need to explicitly document it). In that case, we can't make any useful atomic<T*> data structure that works on CHERI (and possibly not even on conventional hardware) unless the compiler provides a language extension explicitly supporting it. Even laundering probably doesn't help in all cases because atomics do not (currently) provide an API to say "do a compare-exchange, but atomically launder your stored value first."
1. Pointers with non-overlapping provenance may compare as equal or unequal. Such a comparison has no effect on provenance. This probably also makes atomics unworkable without laundering, but at least now we're not looking at nasal demons.
2. Pointers with non-overlapping provenance must compare as unequal, but the implementation may choose to arbitrarily launder any pointer at any time, in which case it may compare as equal to any other pointer to an object at the same address. In other words, the compiler may arbitrarily pretend that a pointer has been laundered wherever it so pleases, but it must be consistent. Once the implementation launders a pointer, that pointer must stay laundered for future operations (but not past operations - the implementation is explicitly *not* required to "go back in time" and launder any other pointers that were ever compared with a given pointer, so my earlier concern about the transitivity of == would not apply here). Believe it or not, I think this probably can be implemented on CHERI-with-honest-pointer-comparisons (as described above), simply by configuring the compiler to never optimize the result of a pointer comparison to anything other than unequal (if you can prove that the provenances do not match, then they compare unequal, otherwise let the hardware sort it out, meaning that it returns true on regular hardware and false on CHERI, and that's OK because both results will cause the program to behave acceptably on their respective platforms). This would possibly require some additional tracking on the part of the compiler to ensure that it is consistent about its reasoning with respect to provenance, but I don't see this as unduly burdensome. We're not asking the compiler to solve the halting problem or anything, just don't insert optimization X unless you have already applied optimization Y to the same code.
3. Pointer comparisons must fully respect provenance. Pointers must compare equal if and only if they have the same provenance (or at least overlapping, as in the case of array element pointers) and are numerically equal. This is the most correct behavior, but I suspect it defeats some optimizations and/or would require compilers to explicitly track provenance at runtime (on non-CHERI hardware), both of which are less than ideal.

C++ already gives us a tool for pointer zapping

Posted Oct 15, 2024 22:58 UTC (Tue) by NYKevin (subscriber, #129325) [Link] (3 responses)

Oh, and there's also:

-1. Pointers compare equal if and only if they point to the same address. Provenance has nothing to do with comparisons. atomic<T*> is completely impossible to use portably, as with (0) and (1).

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 10:19 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (2 responses)

Your -1 is the option preferred by Rust's language owners.

Basically for comparison purpose Rust wants ptr1 == ptr2 is the same as writing ptr1.addr() == ptr2.addr()

Today because of bugs LLVM will sometimes decide that ptr1.addr() != ptr2.addr() even though these two integers are literally the same. You can (with some clever footwork) make this evident in safe Rust, so then either the LLVM IR is broken (as is actually the case) or Rust's compiler is lowering the Rust representation to LLVM IR incorrectly (nope) because Rust defines the safe Rust as not having this issue, so it's definitely a bug somewhere.

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 13:55 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (1 responses)

Also I don't think this actually does mean you can't use AtomicPtr<T> (Rust's analogue of C++ Atomic<T*>)

You do need to be careful. Hazard Pointers let you delay reclamation and avoid the aliasing problem here for example.

Thread 1 finds the front of the LIFO is A, it successfully points its HP at A, then it makes B, then it connects B to A, it's about to attempt the compare-exchange but then...

Thread 2 interrupts, it marks A for reclamation, but A cannot be reclaimed immediately because there's a Hazard Pointer pointing at it, so it's on the "reclaim later" pile. It makes C, and puts C at the front of the LIFO

Thread 1 resumes, it tries to atomically replace A with B at the front of the LIFO, this fails A.addr() != C.addr() and so thread 1 learns that C is now at the front. It releases the HP and points it successfully at C instead, it adjusts the address inside B to connect to C instead and this time a compare-exchange succeeds, swapping C for B.

Some time later, for any reason, reclamation is attempted for A and it now succeeds.
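For concreteness, the protect-then-validate dance looks roughly like this, sketched in C++ to match the article's examples; the single global hazard-pointer slot is a simplification, since a real implementation has per-thread slots and a retire list.

    #include <atomic>

    struct Node { int value; Node *next; };

    std::atomic<Node *> top{nullptr};
    std::atomic<Node *> hazard{nullptr};   // one slot, for one pushing thread

    void push(Node *b) {
        for (;;) {
            Node *a = top.load(std::memory_order_acquire);
            hazard.store(a, std::memory_order_seq_cst);      // announce interest in a
            if (top.load(std::memory_order_seq_cst) != a)    // must not pass the store
                continue;                                    // head changed; re-protect
            b->next = a;
            if (top.compare_exchange_strong(a, b,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
                hazard.store(nullptr, std::memory_order_release);
                return;
            }
            // CAS failed: someone else changed the head; loop and re-protect.
        }
    }

    // The reclaiming side must not free a node while a hazard pointer covers
    // it; such nodes go on the "reclaim later" pile instead.
    bool may_reclaim(Node *n) {
        return hazard.load(std::memory_order_acquire) != n;
    }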

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 14:17 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

Notice that if thread 2 interrupts us and replaces A with C while we're setting and checking the hazard pointer (so A isn't yet protected by the HP), but A and C had the same address, we end up learning that we successfully made a HP pointing at C (even though we asked for a HP pointing to A) which is the new first item, and we connect B to C not A (even though mechanically that's identical) and our compare exchange succeeds. We didn't invoke Undefined Behaviour, although we did do what looks like needless work reading a value twice - the work wasn't needless because the second read made our code correct (and in other circumstances would stop us in our tracks when there's a real problem to be fixed and we should try again).

In CHERI, this case works properly, despite A and C having distinct magic provenance bits, the checks all work and the correct pointer is stored in B, so long as the code was written very carefully. And I believe MIRI can check this is correct (in Rust) despite the fact most programmers do not have any CHERI test hardware.

Sorry I'll stop now, this was rattling in my head.

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 7:07 UTC (Wed) by comex (subscriber, #71521) [Link] (4 responses)

LLVM is working on making pointer comparisons consistent:

https://github.com/rust-lang/unsafe-code-guidelines/issue...

But even if LLVM never optimized pointer comparisons, you would still be in trouble. The problem is alias analysis. The compiler is always tracking which pointers can legally alias which other pointers, and if it thinks pointer A can’t alias pointer B, then it can reorder accesses to A with respect to accesses to B. And it assumes that a freshly malloced pointer doesn’t alias with any other pointer not derived from it. Suppose pointer A is freshly malloced, and pointer B was previously freed, but you did a pointer comparison (which was not optimized) and it turns out that A and B have the same address. So you expect to be able to, say, write a value to pointer A and then read the same pointer back from pointer B. But oops, the compiler assumes that the write and the read don’t alias and switches their order!
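A C++ sketch of that pattern, deliberately showing the problematic code; all names are made up for illustration, and reading through the dangling pointer is exactly the undefined behavior under discussion.

    #include <cstdio>
    #include <cstdlib>

    int main() {
        int *stale = static_cast<int *>(std::malloc(sizeof(int)));
        std::free(stale);                  // 'stale' now dangles

        int *fresh = static_cast<int *>(std::malloc(sizeof(int)));  // may reuse the address
        if (fresh == stale) {              // comparing a dangling pointer is already dubious
            *fresh = 1;
            int r = *stale;                // the compiler may assume a fresh allocation
                                           // cannot alias 'stale' and reorder or fold this read
            std::printf("%d\n", r);        // need not print 1
        }
        std::free(fresh);
        return 0;
    }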

This behavior can be justified by pointer zapping or by provenance. But if you wanted to get rid of the behavior instead of justifying it, then you would need the compiler to stop assuming that new allocations don’t alias. Either the compiler has to disable that optimization entirely, or it has to detect that you did the pointer comparison and conditionally disable it. But conditionally disabling it is a lot harder than it sounds, since e.g. the comparison might be hidden behind a function call, or it might even have been optimized away. And disabling it entirely… well, if it were just about heap allocations, you would get a small but measurable loss in optimization potential. But to be fully consistent you would probably need to also disable it for *stack* allocations, and that might be quite a bit worse.

Admittedly, it’s hard to be sure how bad the performance loss would be, because nobody has actually tried to write an optimizer that’s as aggressive as possible while still being formally sound under some sort of limited-provenance model.

The ‘fun’ thing is that this particular type of alias analysis issue essentially never occurs in real programs. And yet it’s extremely difficult to formally rule out without adding relatively-invasive provenance rules. And if you don’t formally rule it out, then it can probably be exploited from safe Rust code. And so we end up with the provenance rules.

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 7:24 UTC (Wed) by NYKevin (subscriber, #129325) [Link]

Well, yes, that is the main problem here, which is why I think (3) is totally impossible to implement (outside of CHERI) without something like pervasive runtime provenance tracking. But I'm still not convinced that (2) is impossible. It "just" requires the compiler to look at every pointer comparison and break it down into one of four cases:

1. We can statically prove that the pointers have the same address and overlapping or identical provenances. Fold it to true.
2. We can statically prove that the pointers do not have the same address, or have non-overlapping provenances. Fold it to false.
3. We can statically prove that if the pointers have the same address, then they also have overlapping or identical provenances. Emit a comparison instruction.
4. We can't statically prove any of the above. Insert provenance barriers for both operands, which only apply prospectively to these exact variables (i.e. calling functions don't have to know we did this unless we return one of the pointers, in which case we were already in the business of more complicated escape analysis anyway), and then emit a comparison instruction.

(3) sounds weird, but it probably happens fairly often (if both pointers are known to be valid-or-else-UB-happened, if both pointers came from the same array as in a simple for loop, etc.). Meanwhile, if we're in case (4), then I find it hard to believe that we're really losing much performance from the optimization barriers anyway. And (4) is pessimistic anyway. If you can prove that one of the pointers is valid-or-else-UB-happened, you probably don't need to insert a provenance barrier for that pointer at all. The set of optimizations we're losing here looks really, really small to me, but maybe I just don't understand how compilers work.

On CHERI-with-rich-pointer-comparison, you can replace both (3) and (4) with "We can't statically prove either (1) or (2). Emit a comparison instruction," since in that case the hardware will already DTRT anyway. For that matter, you can *also* do that in any implementation which does not have provenance, since it is semantically equivalent to inserting provenance barriers "everywhere."

C++ already gives us a tool for pointer zapping

Posted Oct 17, 2024 13:52 UTC (Thu) by epa (subscriber, #39769) [Link] (2 responses)

In a 64-bit address space, how hard would it be to guarantee that malloc() returns a unique value each time? So you can never free an object and then get the same address again. Some memory might leak, but once it became a large enough contiguous area the allocator could hand it back to the kernel somehow... right?

C++ already gives us a tool for pointer zapping

Posted Oct 17, 2024 18:35 UTC (Thu) by notriddle (subscriber, #130608) [Link] (1 responses)

> In a 64-bit address space, how hard would it be to guarantee that malloc() returns a unique value each time?

Isn't the real problem nowadays TLB and cache utilization? Because it seems like this would result in severe fragmentation.

C++ already gives us a tool for pointer zapping

Posted Oct 18, 2024 18:05 UTC (Fri) by Wol (subscriber, #4433) [Link]

It would be a bit more work for the programmer, but this sounds like it would be easily possible with arena allocs.

I understand behind the scenes malloc allocates a big area, which it hands out in small bits. If that big area is an entry in the page table, and especially if the programmer can hint to malloc that "these are long lived, these are short lived, these are small, these are large", it's then easy to move/consolidate all this in ram (copy, update the page table, you're done), and malloc can just drop the arena and not re-use the arena's logical address when it's all freed.

Cheers,
Wol

Pointer comparisons ignore provenance

Posted Oct 16, 2024 17:31 UTC (Wed) by Tobu (subscriber, #24111) [Link] (4 responses)

See this comment by Ralf Jung working out the options: your -1 preferred, 1 in LLVM would have been worked around by casting the pointers to usize before doing the comparison. Final decision is that pointer comparison ignores provenance.
There are two consistent options for pointer comparisons in Rust (given that they have to be safe):
  1. only addresses are compared: result is true if and only if the addresses are equal
  2. provenance may be taken into account:
    • distinct addresses always compare inequal
    • equal address + equal provenance always compares equal
    • equal address + inequal provenance may non-deterministically compare either way

Pointer comparisons ignore provenance

Posted Oct 16, 2024 18:40 UTC (Wed) by tialaramex (subscriber, #21167) [Link] (3 responses)

Also notice that if I implement (1) but you insist you want (2) I can just say that's what I'm offering.

You can't prove me wrong because the only difference in (2) is that it's possible to get a different answer, but never required, so, I can tell you that you just got unlucky and that's why it seems as though I implement (1).

So the rational thing to do is implement (1) which is at least explainable.

Pointer comparisons ignore provenance

Posted Oct 19, 2024 19:41 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (2 responses)

(1) is not semantically equivalent to (2). Under (2), if you tell me that pointers x and y are equal, then I am entitled to assume that they have the same provenance and so I can use either pointer to access the same object. Furthermore, (2) also extends to atomic compare-and-swap, meaning that you are required to ensure that the whole pointer-zapping problem described in the article does not arise in the first place (or at least, that the resulting binary behaves as if the problem does not arise).

(1) provides neither of those guarantees, so it is not equivalent to (2) no matter how lucky or unlucky I might get.

Pointer comparisons ignore provenance

Posted Oct 19, 2024 19:46 UTC (Sat) by NYKevin (subscriber, #129325) [Link] (1 responses)

NB: This is assuming you're talking about my (1) and (2), and not Tobu's, because that seemed obviously wrong to me. Tobu's (1) requires that casting both pointers to usize and comparing them produces the same result, whereas (2) permits it to non-deterministically return false in cases where the provenance does not match, so (2) allows optimizations that (1) forbids. It is certainly not the case that compilers will prefer to leave optimizations on the table if they are permitted by whatever standard we end up adopting.

Pointer comparisons ignore provenance

Posted Oct 20, 2024 8:43 UTC (Sun) by tialaramex (subscriber, #21167) [Link]

Unsurprisingly (surely even to you?) when I respond to Tobu, I am referring to what Tobu wrote.

If you want your numbering to exclusively refer to a specific thing you once wrote somewhere you're going to need much bigger numbers, I recommend 128-bit UUIDs for this work, however, since we're talking among people, I would suggest that naming things, though it has some undesirable quirks, is preferable, that's why the "worst case" provenance model offered in Rust and the default provenance proposed for C is referred to as "PNVI-ae-udi" rather than some hard to distinguish UUID, when I typo'd that name earlier everybody still knew what I meant.

As to the substance: Useless "optimisations" have to be painstakingly eliminated, so there's no reason to ask for them knowing that you won't use them. Yes, it is likely that LLVM will decide they can "optimise" ptrA == ptrB in some way that's not helpful, and if they want to do that Rust will just emit the IR for ptrA.addr() == ptrB.addr() instead so that it gets the required semantics. Today it doesn't matter because (as I wrote) there's an LLVM bug, and as a result LLVM can believe that, for two machine integers A and B, A != B but A - B = 0, which is nonsense; the LLVM devs know it's nonsense, but it's a consequence of this sort of "optimisation".

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 17:03 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (6 responses)

The variant of CHERI provided by ARM MTE works just fine with lockless ABA-tolerant algorithms. The "provenance" bits are part of the pointer, so when the provenance changes, the cmpxchg() operation will fail. The enclosing loop will then retry, and all will be well.

So the exact same code works for traditional pointers and for CHERI-style ARM MTE pointers, no problem!

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 17:34 UTC (Wed) by jrtc27 (subscriber, #107748) [Link] (5 responses)

MTE isn't CHERI nor is it inspired by it. It's its own unrelated thing that has some overlapping abilities.

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 18:13 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (4 responses)

MTE is completely unrelated to CHERI? I will take your word for it for this discussion.

Please note that the LIFO push algorithm described in the paper has been in use since at least the 1980s, and quite likely the 1970s. Therefore, if CHERI fails to support it, I will consider that to be a bug in CHERI rather than a bug in LIFO push. ;-)

C++ already gives us a tool for pointer zapping

Posted Oct 17, 2024 6:39 UTC (Thu) by jrtc27 (subscriber, #107748) [Link] (3 responses)

CHERI can support lock-free CAS algorithms like that just fine, the implementation (whether of a language feature or in your assembly, likely the latter given language-level provenance rules even without CHERI) just needs to do a compare of the full capability, not just the address portion. Morello, being based on an AArch64 with LSE, has CAS(A)(L) instructions that do just that. CHERI-RISC-V, and Morello if you opt to not use LSE, can emulate CAS with a load-linked/store-conditional loop (or non-loop for weak CAS) and a full capability comparison (CHKEQ on Morello) instruction in the middle.

C++ already gives us a tool for pointer zapping

Posted Oct 17, 2024 23:58 UTC (Thu) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (2 responses)

Good to hear, thank you!

If you can say, what is the performance of the CAS(A)(L) instructions that handle both the address and the capability compared to those that handle only the address?

C++ already gives us a tool for pointer zapping

Posted Oct 18, 2024 0:41 UTC (Fri) by jrtc27 (subscriber, #107748) [Link] (1 responses)

Performance is always going to be dependent on the implementation, as a tradeoff with power and area, and there are some quirks on Morello due to being an experimental prototype implementation, but in general it would be reasonable to expect them to perform somewhere between plain CAS(A)(L) and CAS(A)(L)P. On a high-performance implementation there will likely be a 128-bit data path already there (true of Morello) so they probably all perform about the same, but a lower-performance implementation could choose to have a narrower data path to reduce core area and take an extra cycle. But in terms of cache coherency traffic it's the same, it still fits in a single cache line, you're just accessing more of that cache line.

C++ already gives us a tool for pointer zapping

Posted Oct 18, 2024 16:36 UTC (Fri) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

Very good, thank you!

With that performance issue mostly out of the way, one big remaining issue is the memory occupancy. Interesting tradeoff, the overhead of double-size pointers vs. the overhead of killer static analysis. ;-)

C++ already gives us a tool for pointer zapping

Posted Oct 16, 2024 18:04 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

The C++ committee did look into use of std::launder, but it does not handle all the cases. For but one example, the pointed-to object might be freed just after the std::launder operation completed, thus zapping the just-now-laundered pointer.

OOTA was a real issue in Java around 2000

Posted Oct 15, 2024 18:37 UTC (Tue) by iabervon (subscriber, #722) [Link]

I heard from someone on a related team at Sun at the time that there were concerns that hardware might speculate a value, copy it around, and then find that the value it had speculated now matched the location it was supposed to have come from. The Java team wanted to just say this was undefined behavior that programmers needed to avoid triggering, but they were unable to convince attackers who wanted to break out of the Java applet sandbox using references to JVM internals to kindly follow the requirements for avoiding undefined behavior.

IIRC, this was at the level of whether the JVM needed some memory barriers that were relatively expensive, rather than in formalizing the Java language. I think it ended up with them not having a way to promise programmers that OOTA couldn't happen by formal specification, but the runtime has to prevent it anyway in order to provide security properties to the system owner.

Provenance checks

Posted Oct 16, 2024 6:22 UTC (Wed) by HIGHGuY (subscriber, #62277) [Link] (4 responses)

In the first case with zombie pointers, isn’t the issue at hand not simply that the compare and swap checks the pointer address but not the provenance?

If that compare and swap checked provenance too, then it would fail (because A!=C, even if addresses are the same) and the loop would retry.

Then it’s up to the compiler to prove that, under this model, the variant that doesn’t perform the additional loop (because pointers are values after all and most architectures don’t track provenance) is equivalent to the one that does. An architecture with CHERI for example, is then expected to perform that additional loop because the pointers values don’t compare equal.

Provenance checks

Posted Oct 16, 2024 17:09 UTC (Wed) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (3 responses)

Yes, in CHERI-inspired systems, such as ARM MTE, the approximation to provenance carried in the pointer will result in cmpxchg() failures, and the resulting retry will fix things up. But there are a finite number of bits, so after some number of ABA cycles the same bit pattern will recur.

In other words, things like ARM MTE can reduce the probability of ABA, but they cannot eliminate it.

Note that defining cmpxchg() to always account perfectly for provenance is a no-go because often the provenance is unknown. Yes, there are tools such as Miri that might be able to figure it out, but we might not want such tools to be on the normal edit-build-run cycle.

Provenance checks

Posted Oct 17, 2024 13:57 UTC (Thu) by DemiMarie (subscriber, #164188) [Link] (2 responses)

One can eliminate ABA by ensuring that address space is not reused while there is a valid capability that could be used to access it. This also eliminates heap use-after-free vulnerabilities.

Provenance checks

Posted Oct 17, 2024 15:35 UTC (Thu) by intelfx (subscriber, #130118) [Link] (1 responses)

> while there is a valid capability that could be used to access it

How is this supposed to be tracked (in a way that does not amount to a GC implementation, for obvious reasons)?

Capability GC

Posted Oct 18, 2024 22:01 UTC (Fri) by DemiMarie (subscriber, #164188) [Link]

This does require GC, but in a lot of cases that is okay, especially since the GC can be made concurrent. See Microsoft’s Cornucopia Reloaded for details.


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds