An introduction to lockless algorithms
Lockless algorithms are of interest for the Linux kernel when traditional locking primitives either cannot be used or are not fast enough. For this reason they come up every now and then on LWN; one of the most recent mentions, which prompted me to write this article series, was last July. Topics that arise even more frequently are read-copy-update (RCU — these articles from 2007 are still highly relevant), reference counting, and ways of wrapping lockless primitives into higher-level, more easily understood APIs. These articles will delve into the concepts behind lockless algorithms and how they are used in the kernel.
It has been said that "it takes a special kind of mind to really understand the memory model". Indeed, the Linux kernel memory model (and in particular Documentation/memory-barriers.txt) can be used to frighten small children, and the same is probably true of just the words "acquire" and "release".
At the same time, mechanisms like RCU and seqlocks are in such widespread use in the kernel that almost every developer will sooner or later encounter fundamentally lockless programming interfaces. For this reason, it is a good idea to equip yourself with at least a basic understanding of lockless primitives. Throughout this series I will describe what acquire and release semantics are really about, and present five relatively simple patterns that alone can cover most uses of the primitives.
Acquire, release, and "happens before"
In order to make the big step from the (relative) comfort of synchronization primitives to lockless programming, we shall first take a look at why locks work in the first place. Usually, this is taught in terms of mutual exclusion: locks prevent multiple threads of execution from reading or writing the same data concurrently. But what does "concurrently" really mean? And what happens when thread T is done with that data and thread U starts using it?
In order to answer these questions, we can turn to a theoretical framework that Leslie Lamport established in his 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System". According to the paper, the events in a distributed system can be ordered according to whether an event P happens before another event Q:
- The ordering of events is total within a single thread of execution. In layman's terms, for any two events from the same thread you can always say which came first and which came second.
- If two events do not happen within a single thread of execution, event P happens before event Q if event P is a message send and event Q is the corresponding message receive.
- The relation is transitive. Therefore, if event P happens before event Q and event Q happens before event R, then event P happens before R.
The "happens before" relation is a partial ordering: it is possible to have two events P and Q such that neither happens before the other. When this happens, the two events are concurrent. Remember how a lock prevents concurrent accesses to the same data structure? That is because, when you protect a data structure with a lock, all accesses to the data structure form a total ordering, as if they came from a single thread. Lamport's framework also provides a basic idea of what happens when a lock is handed off from one thread to another: some kind of "message passing" ensures that the unlock operation of thread T "happens before" the lock operation of thread U.
As it turns out, this is not just theory: in order to ensure that their caches are coherent, CPUs exchange messages over buses such as Intel's QPI or AMD's HyperTransport. However, this level of detail is definitely too much for our purposes. Instead, we will generalize the "happens before" definition to cover any kind of synchronization primitive.
Lamport's fundamental insight was that synchronization happens when two threads operate with symmetric operations on the same data structure. In our generalization, we will list pairs of operations (such as sending and receiving a message through the same queue) that synchronize one thread with another. Furthermore, we will classify the operations in each pair as either release or acquire; equivalently we can say that they "have release (or acquire) semantics".
Within a pair, the release operation synchronizes with the corresponding acquire operation. "Synchronizing" means that, whenever one thread performs a release operation and another thread performs the corresponding acquire operation, the "happens before" relation grows edges from the releasing thread to the acquiring thread. "Happens before" remains a partial order in the general case, but it now spans two threads, or more than two, thanks to transitivity. More formally:
- The ordering of operations is total within a single thread.
- If an operation P with release semantics synchronizes with an operation Q with acquire semantics, then operation P happens before operation Q, even if they happen in different threads.
- Just like before, the relation is transitive and defines a partial ordering of operations.
The old definition follows by giving release semantics to message send and acquire semantics to message receive. A message send synchronizes the sending thread with the thread (or threads) that receives the message. We can also rephrase our previous findings according to this new definition: for locks to work, unlocking must have release semantics and must synchronize with locking—which in turn has acquire semantics. No matter if the lock is contended or not, the resulting "happens before" edges ensure a smooth hand-off from one thread to the other.
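The unlock-release/lock-acquire pairing can be made concrete with a toy spinlock. The following is a hypothetical illustration built on C11 atomics, not the kernel's real spinlock; the type and function names are invented for this sketch:

```c
#include <stdatomic.h>

/* Toy test-and-set spinlock; an illustrative sketch only. */
typedef struct {
    atomic_flag flag;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void toy_spin_lock(toy_spinlock *l)
{
    /* Acquire semantics: once we own the lock, everything the previous
     * holder wrote (published by its release in toy_spin_unlock) is
     * visible to us. */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;   /* busy-wait until the lock is free */
}

static void toy_spin_unlock(toy_spinlock *l)
{
    /* Release semantics: our writes inside the critical section happen
     * before whatever the next successful toy_spin_lock() does. */
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

Whether the lock is contended or not, each unlock's release synchronizes with the next lock's acquire, which is exactly the "happens before" edge described above.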
Acquire and release semantics may seem like an abstract concept, but they truly provide simple explanations of many common multithreaded programming practices. For example, consider two user-space threads accessing a global variable s:
    thread 1                              thread 2
    --------------------------------      ------------------------
    s = "hello";
    pthread_create(&t, NULL, t2, NULL);
                                          puts(s);
                                          s = "world";
    pthread_join(t, NULL);
    puts(s);
Are the accesses to the variable safe? Can thread 2 assume that s will read as "hello", and likewise can thread 1 assume that s will be "world" after pthread_join()? The answer is affirmative, and we can explain why in terms of acquire and release semantics:
- pthread_create() has release semantics and synchronizes with the start of thread 2 (which has acquire semantics). Therefore, anything written before a thread is created can be safely accessed from the thread.
- Exiting thread 2 has release semantics and synchronizes with pthread_join() (which has acquire semantics). Therefore, anything the thread writes before exiting can be safely accessed after pthread_join().
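Those two guarantees can be seen in a complete userspace sketch of the example above (the run_example() wrapper is mine, added so the final value can be inspected):

```c
#include <pthread.h>
#include <stdio.h>

static const char *s;

/* Thread 2: can safely read what thread 1 wrote before pthread_create(). */
static void *t2(void *arg)
{
    (void)arg;
    puts(s);            /* prints "hello" */
    s = "world";
    return NULL;        /* thread exit has release semantics */
}

/* Thread 1: create the thread, wait for it, then read its last write. */
static const char *run_example(void)
{
    pthread_t t;

    s = "hello";
    pthread_create(&t, NULL, t2, NULL);   /* release: publishes "hello" */
    pthread_join(t, NULL);                /* acquire: sees "world" */
    return s;                             /* "world" */
}
```

Calling run_example() always returns "world": the join's acquire semantics guarantee that thread 2's final store is visible.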
Note that data is flowing from one thread to the other with no lock involved: congratulations, you have made it through the first example of lockless programming. To sum up:
- If the programmer wants thread 2 to "see the effects of" everything that happened so far in thread 1, the two threads need to synchronize with each other: this is done with a release operation in thread 1 and an acquire operation in thread 2.
- Knowing which APIs provide acquire/release semantics lets you write code that relies on the ordering provided by those APIs.
Having understood how release and acquire semantics work for high-level synchronization primitives, we can now consider them in the context of individual memory accesses.
The message-passing pattern
In the previous paragraph we have seen how the acquire and release semantics of pthread_create() and pthread_join() allow the creator of a thread to exchange information with that thread and vice versa. We will now see how this kind of communication can happen in a lockless manner while threads run.
If the message is a simple scalar value, for example a boolean, it could be read and written directly to a memory location. However, consider what happens if the message is a pointer, as in the following example:
    thread 1                              thread 2
    --------------------------------      ------------------------
    a.x = 1;
    message = &a;                         datum = message;
                                          if (datum != NULL)
                                            printk("%d\n", datum->x);
If message is initially NULL, thread 2 will read either NULL or &a, we don't know which. The problem is that, even if
datum = message;
were to read &a, that assignment is still not synchronized against the assignment in thread 1:
message = &a;

Therefore, there is no edge in the "happens before" relation connecting the two threads:
    a.x = 1;                     datum = message;
       |                            |
       |       happens before       |
       v                            v
    message = &a;                datum->x
Because the two threads of execution are disconnected, there is still no guarantee that datum->x will read as 1; we do not know whether the assignment to a.x happens before that read. For that guarantee to hold, the store and the load must be endowed with release and acquire semantics respectively.
To that end, we have the "store-release" and "load-acquire" operations. A store-release operation P, in addition to writing to a memory location, synchronizes with a load-acquire operation Q if Q reads the value that was written by P. Here is a fixed version of the above code, using Linux's smp_store_release() and smp_load_acquire():
    thread 1                              thread 2
    --------------------------------      ------------------------
    a.x = 1;
    smp_store_release(&message, &a);      datum = smp_load_acquire(&message);
                                          if (datum != NULL)
                                            printk("%x\n", datum->x);
With this change, if datum is &a we can affirm that the store happened before the load. (I am assuming for simplicity that only one thread can write &a to message. Saying "thread 2 reads the value written by thread 1" does not refer to the specific bit pattern that goes in memory, it really means that thread 1's store is the last whose effect is visible to thread 2). The relation looks like this now:
    a.x = 1;
       |
       v
    smp_store_release(&message, &a);  ----->  datum = smp_load_acquire(&message);
                                                 |
                                                 v
                                              datum->x
And everything works. Because of transitivity, whenever thread 2 reads the value written by thread 1, everything that thread 1 did up to the store-release will also be visible to thread 2 after the load-acquire. Note that, unlike the pthread_join() case, "synchronizes with" does not mean that thread 2 "waits for" thread 1 to do the write. The above drawing only applies if thread 2 happens to read the value that thread 1 has written.
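In portable userspace code, the same pattern is spelled with C11 atomics. Here is a hypothetical sketch (the struct and function names are invented for illustration); unlike the snippet above, the consumer spins until the pointer appears, just so the outcome is deterministic:

```c
#include <stdatomic.h>
#include <pthread.h>

struct payload { int x; };

static struct payload a;
static struct payload *_Atomic message;   /* initially NULL */

static void *producer(void *arg)
{
    (void)arg;
    a.x = 1;                                    /* plain write... */
    /* ...published by a store-release, together with everything
     * written before it. */
    atomic_store_explicit(&message, &a, memory_order_release);
    return NULL;
}

/* Spin until the load-acquire observes the pointer, then read through it. */
static int consume(void)
{
    struct payload *datum;

    while ((datum = atomic_load_explicit(&message, memory_order_acquire)) == NULL)
        ;
    return datum->x;                            /* guaranteed to be 1 */
}

static int run_message_passing(void)
{
    pthread_t p;
    int v;

    pthread_create(&p, NULL, producer, NULL);
    v = consume();
    pthread_join(p, NULL);
    return v;
}
```

The acquire load synchronizes with the release store exactly as smp_load_acquire() does with smp_store_release(), so consume() can never observe the pointer without also observing a.x = 1.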
In the Linux kernel, the above code will often be written in a slightly different manner:
    thread 1                              thread 2
    --------------------------------      ------------------------
    a.x = 1;
    smp_wmb();
    WRITE_ONCE(message, &a);              datum = READ_ONCE(message);
                                          smp_rmb();
                                          if (datum != NULL)
                                            printk("%x\n", datum->x);
In this case, the release and acquire semantics are provided by the memory barriers smp_wmb() and smp_rmb(). Memory barriers also have acquire and release semantics, but they are a bit more complicated to reason about than simple loads and stores. We will get back to them when we talk about seqlocks.
Regardless of whether one uses load-acquire/store-release or smp_rmb()/smp_wmb(), this is an extremely common pattern, and one that should be understood well. Among its uses we find:
- All sorts of ring buffers. Each entry of the ring buffer often points to other data; usually, there are also head/tail locations that contain an index in the ring buffer. The producer side will use store-release operations, synchronizing with load-acquire operations in the consumer.
- RCU. As far as the compiler is concerned, the familiar rcu_dereference() and rcu_assign_pointer() APIs are similar to load-acquire and store-release operations. Thanks to some assumptions that are true for all processors except the Alpha, rcu_dereference() can be compiled to a regular load; still, rcu_assign_pointer() synchronizes with rcu_dereference() as if it were a load-acquire operation.
- Publishing pointers into an array. In this (modified) KVM snippet, if kvm_get_vcpu() sees the incremented kvm->online_vcpus, the associated entry in the array will be valid:
    kvm_vm_ioctl_create_vcpu()                 kvm_get_vcpu()
    -----------------------------------------  -----------------------------------------------
    kvm->vcpus[kvm->online_vcpus] = vcpu;      if (idx < smp_load_acquire(&kvm->online_vcpus))
    smp_store_release(&kvm->online_vcpus,          return kvm->vcpus[idx];
                      kvm->online_vcpus + 1);  return NULL;
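The ring-buffer use mentioned above can be sketched as a minimal single-producer/single-consumer queue. This is an illustrative sketch with C11 atomics standing in for smp_load_acquire()/smp_store_release(); the names and the fixed size are made up:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define RING_SIZE 16u               /* power of two */

struct ring {
    void *slots[RING_SIZE];
    _Atomic unsigned int head;      /* written only by the producer */
    _Atomic unsigned int tail;      /* written only by the consumer */
};

/* Producer: fill the slot first, then publish it with a store-release. */
static bool ring_push(struct ring *r, void *item)
{
    unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return false;               /* full */
    r->slots[head % RING_SIZE] = item;
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer: load-acquire the head; then the slot contents are visible. */
static bool ring_pop(struct ring *r, void **item)
{
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;               /* empty */
    *item = r->slots[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The indices run free and are compared by subtraction, so full and empty can be distinguished without wasting a slot; each side store-releases the index it owns, synchronizing with the other side's load-acquire.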
Apart from the mechanics of the load-acquire/store-release operations, there is another aspect of the message-passing pattern that you should ponder: it is a single producer algorithm. If there are multiple writers, they must be protected against each other by other means, for example with a mutex. Lockless algorithms do not exist in a void; they are but one part of the concurrent programming toolbox, and they work best when combined with other, more traditional tools.
This is just the beginning of an extended series on lockless algorithms.
The next installment will look further at how atomic memory
operations can be ordered, and look into how memory barriers are at
the heart of both the "seqcounts" mechanism and the Linux scheduler.
Index entries for this article:
- Kernel: Lockless algorithms
- GuestArticles: Bonzini, Paolo
Posted Feb 20, 2021 0:12 UTC (Sat)
by mpr22 (subscriber, #60784)
[Link] (2 responses)
"The ordering of events is total within a single thread of execution. In layman's terms, for any two events from the same thread you can always say which came first and which came second."
Posted Feb 20, 2021 0:25 UTC (Sat)
by pebolle (guest, #35204)
[Link] (1 responses)
and
> The ordering of operations is total within a single thread.
Totally clear.
Posted Feb 20, 2021 7:21 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link]
1) you can always say which critical section came first and which came second (lock waits for the other critical section to complete, if there's one), and
2) accesses within a critical section come from a single thread, and there is always a total order (defined in the article) within a thread.
Posted Feb 20, 2021 2:32 UTC (Sat)
by Cyberax (✭ supporter ✭, #52523)
[Link] (22 responses)
This is opposed to a partially ordered set. A good example of a partially ordered set is a genealogical tree. From the information within the tree you can tell that some people are older than others (a father is always older than a son) but in some cases there's not enough information (e.g. you can be younger or older than your cousin).
Posted Feb 20, 2021 9:34 UTC (Sat)
by pebolle (guest, #35204)
[Link] (21 responses)
That helps.
I'm mathematically challenged. So, for example, I found it very annoying in programming exercises that recursion would always be taught with Fibonacci numbers. Who cares about that? It wasn't until I once had to walk through directories and filenames that recursion started to make sense to me.
Posted Feb 21, 2021 12:13 UTC (Sun)
by pebolle (guest, #35204)
[Link] (9 responses)
I just double checked - I was afraid that my memory let me down again - but recursion does work for Fibonacci. At least, that's what the web and the only introductory programming book I have on my shelves tells me.
Perhaps we disagree on the meaning of "recursion" and "work"?
Posted Feb 21, 2021 12:29 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (6 responses)
Which will blow your stack for large values of f :-)
I hope I've got it right ... :-)
Cheers,
Posted Feb 21, 2021 13:08 UTC (Sun)
by dtlin (subscriber, #36537)
[Link]
The problem with the recursive solution is the exponential runtime; the stack usage is linear.
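That trade-off is easy to see in a small illustrative comparison (the function names are mine):

```c
#include <stdint.h>

/* Naive recursion, straight from the definition: exponential time,
 * but only linear stack depth (the deepest call chain follows
 * fib(n-1) -> fib(n-2) -> ... all the way down). */
static uint64_t fib_naive(unsigned int n)
{
    return n < 2 ? n : fib_naive(n - 1) + fib_naive(n - 2);
}

/* Iterative version: linear time, constant space. */
static uint64_t fib_iter(unsigned int n)
{
    uint64_t a = 0, b = 1;

    while (n--) {
        uint64_t t = a + b;
        a = b;
        b = t;
    }
    return a;
}
```

Both compute the same sequence; only the cost differs, which is why the naive recursion takes forever long before it exhausts the stack.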
Posted Feb 21, 2021 15:24 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (1 responses)
The start of the conditional should be
if (f <= 2) return 1;
Cheers,
Posted Feb 21, 2021 16:16 UTC (Sun)
by yann.morin.1998 (guest, #54333)
[Link]
Not necessarily. The generalised Fibonacci sequence starts with F(0) = 0 and F(1) = 1 [0]. So your initial condition was (not in)correct.
Posted Feb 21, 2021 23:41 UTC (Sun)
by neilbrown (subscriber, #359)
[Link]
Only if you try to implement it on a computer.
Posted Mar 5, 2021 0:13 UTC (Fri)
by qzPhUZ7af5M6kjJi3tDgMaVq (guest, #140267)
[Link]
https://wiki.haskell.org/The_Fibonacci_sequence
In a Haskell interpreter:
Prelude> let fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
(20,899 digit number omitted... it took about 2 seconds to compute it)
Posted Feb 24, 2021 23:49 UTC (Wed)
by NYKevin (subscriber, #129325)
[Link] (7 responses)
Of course, if you *really* want to motivate memoization, you probably want to talk about something more interesting, like Hashlife. I remember reading a lengthy series of articles on Eric Lippert's blog where he spends much of the series painstakingly squeezing out every ounce of performance from the iterative approach (vectorization, bit-twiddling, lookup tables, etc.), and then he switches to Hashlife ("Gosper's algorithm"), with hardly any weird optimizations at all beyond basic algorithm design and memoization, using *garbage collected objects* to represent everything, and suddenly the whole thing is orders of magnitude faster* for large-enough N. My conclusion: Using the right algorithm is far, far more effective than heavily optimizing the wrong algorithm.
* Strictly speaking, it is faster when solving the problem "Figure out what the board will look like at generation N." If the problem is instead "Figure out what the board will look like at every generation from 1 through N," then Hashlife is much less effective. So using it for an interactive UI that sequentially displays one generation at a time is probably not a great idea. But if you want to advance directly to generation 2^27 or something like that, Hashlife is basically the only game in town.
Posted Mar 2, 2021 20:07 UTC (Tue)
by jezuch (subscriber, #52988)
[Link] (6 responses)
Fibonacci numbers are defined by a recursive function which has an exact solution, so you can calculate every number in O(1) time (assuming no bignums are needed). So it's doubly (or triply) a bad example to introduce recursion to new programmers ;)
Posted Mar 11, 2021 9:56 UTC (Thu)
by anselm (subscriber, #2796)
[Link] (4 responses)
If your platform only supports integers up to a certain maximum (say 2^64-1), it is O(1) because it is straightforward to create a lookup table for all n where F(n) < 2^64 and use that rather than the recursive definition.
Posted Mar 11, 2021 19:28 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link] (3 responses)
If you restrict permissible inputs to a predetermined finite range then every algorithm becomes O(1). That's not a very useful definition, though. And while a 93-entry table suffices to represent all Fibonacci numbers less than 2^64, many languages have built-in arbitrary-precision integer types limited only by available memory.
Posted Mar 12, 2021 10:49 UTC (Fri)
by anselm (subscriber, #2796)
[Link] (1 responses)
Obviously. I think the point here is that Fibonacci numbers are often used as a prime example for the use of recursion, but in practice, calculating Fibonacci numbers through a 1:1 implementation of their recursive definition is not what one would actually ever do. As such it gives people silly ideas. It would be better to come up with examples where there are more compelling arguments for using recursion in the first place, or at the very least put a big health warning on the Fibonacci example.
Posted Feb 21, 2021 21:31 UTC (Sun)
by thoughtpolice (subscriber, #87455)
[Link] (2 responses)
For example, many people define division as a function that takes two parameters, 'n' and 'd', and computes n / d. But most people would agree that d=0 is an invalid parameter: your program will actually fault. It's nonsensical.[1] So division is not total, then: you can give it an input and it can fail. But multiplication is total: it doesn't matter what two integers you give it. You can always multiply them, and there is always a sensible result. Thus, being "total", or "totality", is a property of algorithms/functions/actions you can perform on some objects (those objects being the set of all possible parameters, commonly called "the input domain").
There is another operator that you often use when programming, and it is called "is x less than y", or <. You use it to figure out how things are ordered. Is 1 less than 2? It turns out, this is very similar to the question "does a thing happen before another thing?" and it's a very important question. They are both a way of asking what order things occurred in, just for different kinds of "things". But, as an algorithm, ordering is not always total: sometimes you just can't tell if A happened before B, or if A is less than B, or not. It might not make sense for the objects in question.
I'll also say that partiality and totality, and especially partially ordered sets (sometimes called "posets"), are an extremely, extremely fundamental component of computer science and mathematics. They're very important, and very well worth understanding, even if it's only at a high level.
[1] It's only nonsensical to some people or computers. On some CPUs, such as RISC-V for example, division by zero is well defined, as n/0=0. This is fine, and not really problematic. At the end of the day, math is what human beings make of it. For rigorous mathematics, n/0=0 might be heresy. But for computers, for humans using them, it's maybe OK to let that slide.
Posted Feb 25, 2021 1:47 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (1 responses)
* Git history is a poset (under the relationship "commit X is an ancestor of commit Y").
In short: A poset is a thing that lets you have some objects come "logically before" other objects, for whatever definition of "logically before" is appropriate.
Then, a totally ordered set is just a poset with the additional requirement that for every X and Y, either X comes before Y or Y comes before X (or X = Y). In the DAG interpretation, this is what you would get if you require that the DAG forms a simple chain A -> B -> C ... -> Z, without any branching or whatnot allowed. So, to continue our examples:
* Git history is totally ordered, if you never create any branches or merges, never use detached HEAD mode, and never create a second root commit (i.e. a commit without parents, other than the first commit in the repository).
Posted Feb 20, 2021 0:12 UTC (Sat)
by mss (subscriber, #138799)
[Link] (2 responses)
It is worth noting that the Linux kernel memory model is very similar to (or, one can say, based on) the C11/C++11 memory model - after all, both were based on the underlying behavior of real-world computer systems.
However, since the C/C++ working groups had the benefit of designing an atomic synchronization API from scratch, the resulting API seems cleaner and more consistent to me than the kernel API.
I also find Mark John Batty's PhD dissertation called "The C11 and C++11 Concurrency Model" a very good read on the subject.
The aforementioned dissertation is available here: https://www.cs.kent.ac.uk/people/staff/mjb211/docs/toc.pdf
Posted Feb 20, 2021 16:11 UTC (Sat)
by mss (subscriber, #138799)
[Link]
By the way, you said in a comment under the July article to steer clear of the C11 "sequentially consistent" loads and stores since they are incredibly easy to get wrong.
For me personally the most surprising thing with C11 atomics is that a relaxed operation on a variable plus a barrier of X memory order (before or after that operation) is not the same thing as just performing the same operation but with the X memory order, even though the semantics are very close.
In general, the C11 atomics seem to be tailored more for using atomic operations of particular memory order rather than barriers, while most of the kernel code seems to prefer using ordinary memory accesses with barriers instead.
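That difference between "relaxed operation plus fence" and "operation with an ordering" can be made concrete. In C11, a release fence followed by a relaxed store gives that store release semantics (and in fact orders all preceding accesses against all subsequent atomic stores, which is at least as strong, though not identical). A hypothetical sketch of the two spellings:

```c
#include <stdatomic.h>

static _Atomic int ready;
static int data;

/* Spelling 1: an ordered store; only this store is a release operation. */
static void publish_with_ordered_store(void)
{
    data = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Spelling 2: a release fence plus a relaxed store.  The fence orders
 * all preceding memory accesses against all subsequent atomic stores,
 * not just this one, so it is at least as strong as spelling 1 but not
 * the same thing. */
static void publish_with_fence(void)
{
    data = 42;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Reader side, barrier style: relaxed load plus an acquire fence. */
static int try_consume(void)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed)) {
        atomic_thread_fence(memory_order_acquire);
        return data;        /* 42 whenever ready was observed as 1 */
    }
    return -1;              /* nothing published yet */
}
```

The barrier style is the closer analogue of the kernel's smp_wmb()/smp_rmb() usage; the ordered-operation style is what the C11 API nudges you toward.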
Posted Feb 20, 2021 0:18 UTC (Sat)
by pebolle (guest, #35204)
[Link] (54 responses)
I'm still dumb: what makes this insightful?
Posted Feb 20, 2021 0:56 UTC (Sat)
by gus3 (guest, #61103)
[Link] (48 responses)
"For any op <, =, >, if A op B, and B op C, then A op C." That is to say:
-- If A < B, and B < C, then A < C
Referring to the statement you quoted, it could be re-stated thus:
"Given actions A, B, and C, the times tA, tB, and tC at which they take place, and any op <, =, >: if tA op tB, and tB op tC, then tA op tC."
That may seem thick, but instead of just A, B, and C, I've just substituted locations on a time-line instead of a number line.
It may seem obvious, and it kind of is. But stating applicable axioms explicitly is standard practice in rigorous maths discussions.
Posted Feb 20, 2021 2:55 UTC (Sat)
by excors (subscriber, #95769)
[Link] (5 responses)
That's contrary to the everyday human concept of time, where you would say every event happens at a precise number of seconds past midnight and you can always work out which happened first. That concept is problematic to implement in computers, because it's simply impractical for physically separated processors to have perfectly synchronised nanosecond-precision clocks and to attach timestamps to every message they share. So we throw away that concept of time and time-lines and define a brand new concept (called "happens before", but could equally be called "op" or "→") from scratch.
The axioms are trying to define that new concept with the minimal set of requirements on where there must be a well-defined ordering between events, so it's just barely strong enough to achieve the goal (in this case to define some useful kind of synchronisation between concurrent threads). And in this particular case, transitivity is one property that is required else the mathematical proofs stop working, so it has to be stated explicitly in the definition.
Posted Feb 20, 2021 18:12 UTC (Sat)
by Wol (subscriber, #4433)
[Link] (3 responses)
Whether two things are sequential or not depends on whether one occurs INSIDE the light-cone of the other. These acquire-release semantics force the two light cones to intersect, so anything after the intersect is visible inside both light cones so you can conclude they are sequential.
And the light-cone thing also explains why things can only be simultaneous if they occur at the same 4-dimensional co-ordinates, because otherwise they are not in each other's light cone, and therefore there is no mutual concept of time.
Cheers,
Posted Feb 20, 2021 19:14 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link] (2 responses)
For example think of two cars going from A to B along two different roads. Any time a junction joins the two roads you *might* be able to say who is ahead at that moment. However while the roads are running parallel it's possible (but not certain) that you cannot say that. That is because the points along the two routes form a partial ordering.
Posted Feb 21, 2021 12:36 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (1 responses)
The send-receive creates a new light cone for which the origin is visible in both the previous two cones :-)
Cheers,
Posted Feb 23, 2021 1:06 UTC (Tue)
by rgmoore (✭ supporter ✭, #75)
[Link]
From a practical standpoint, it's not just acceptable for there to be no ordering; it's actually desirable for performance. If you can divide up your program into separate tasks that don't have ordering constraints between them, you can run those tasks however makes the best performance sense. This has been the key to all kinds of modern performance advances, like out of order execution, multiprocessing, and what have you. Those things all depend on being able to break up tasks into parts that can be run in different orders. It's the need to synchronize that slows things down.
Posted Feb 21, 2021 6:22 UTC (Sun)
by tialaramex (subscriber, #21167)
[Link] (41 responses)
Not explicitly stating such an assumption can get you in serious trouble if we want to prove things about a system.
For example: As an IoT developer thinking about how two devices you manufacture can securely communicate over an IP network, you might seize upon the Pre-shared Keys (PSKs) in TLS 1.3 (RFC 8446) as a good approach that avoids worrying about certificates or a PKI. It might seem like a common sense assumption that if device #1 ("Alice"), and device #2 ("Bob") share the same key K, that's going to be secure against any attacker that doesn't know K, the document (as published) doesn't suggest there would be any unexpected problems. And indeed mathematical proofs of TLS 1.3 say it does deliver on the claims in its introduction.
Unfortunately those proofs assume something different from you. They assume your two devices would each need their own key, K1 and K2. If you don't use separate keys you need countermeasures not described in the protocol, because of the "Selfie attack". The attack goes like this, Alice asks Bob something (using key K of course), but Mallory (who doesn't know K) just redirects Alice's message back to Alice. Alice thinks this question came from Bob (after all it uses key K which they share) and answers it, now Alice also thinks that answer came from Bob. Mallory has successfully confused Alice despite being unable to either read or forge messages from either Alice or Bob. Under the proof scenario Alice asks using K1, when the message is redirected back to her Alice is puzzled because it isn't encrypted with key K2 from Bob, so that message is rejected and everything is fine.
If this assumption had been made into an explicit _axiom_ that could have highlighted that the TLS document doesn't explain the problem with just having a single key K and likely its authors would have drafted a section explaining why you might need more keys than seem necessary at first glance, and what alternative strategies would work if you must have one key.
A more broadly famous missing assumption is the Axiom of Choice. In the 1800s mathematicians were confident they understood the Theory of Sets - they had established Axioms stating how it works - but alas they'd actually without making it explicit been assuming that in a set of sets (like the set of sets of integers) you can always choose one of the members of each set. This is definitely true for finite sets. But for infinite sets we can't prove it, so, we ought to explicitly make it an Axiom and say so, if we're wrong any results depending on that assumption may be wrong too. Today most mathematicians will choose ZFC, axioms for set theory which include the Axiom of Choice, because it does after all _feel_ intuitively as though you could choose one from each set. But at least you've written down this specific assumption you're making in case it matters some day.
Posted Feb 21, 2021 12:25 UTC (Sun)
by ale2018 (guest, #128727)
[Link] (40 responses)
Thus, among "existing" things you have defined a partial time order. Can one obtain relativity from Newtonian physics by just introducing information exchange?
Posted Feb 21, 2021 12:47 UTC (Sun)
by Wol (subscriber, #4433)
[Link] (39 responses)
Almost certainly not. Newtonian Physics assumes time IS constant from all vantage points - if you see time move ten seconds so will I. Relativity says time APPEARS constant from all vantage points.
If we assume I travel to Alpha Centauri and back at 0.9c, you on Earth will see me take about 9 years to do so. Newton says that means I must experience the same 9 years. But relativity says time will run slower for me, I may only experience 1 year passing.
It's like a multi-core CPU with each core having a different clock ... and time is measured in ticks ...
I don't see how exchanging information will alter anything.
Cheers,
Posted Feb 22, 2021 13:04 UTC (Mon)
by ale2018 (guest, #128727)
[Link] (38 responses)
For one thing, introducing information and the ability to process it breaks time symmetry. You can easily call "past" the side you remember and "future" the side you predict. Time reversal doesn't work any more. At this point you've already left the classical, Newtonian vision, and dropped those misconceptions about time.
Special relativity was introduced as an attempt to account for the speed of light being a universal constant. As a speed limit, it entails that information cannot travel faster than light either. My question was whether it's possible to derive some kind of relativity the opposite way around. That is, assuming that things exist only if you can exchange information with them, and then assuming the partial ordering principle outlined in the article, is it possible to derive a concept similar to the light cone?
Of course, it is not possible to compute the actual speed of light based on pure logic... Or is it?
Posted Feb 22, 2021 13:46 UTC (Mon)
by mathstuf (subscriber, #69389)
[Link] (37 responses)
AFAIR, it's not possible to compute a one-way speed at all. You can only get a round-trip speed (light could be traveling at some factor based on its angle with a "universal axis" and we'd never be able to tell) because you can't communicate back about any time synchronization without being affected by the same limits. So, no, without an axiom that "the universe has no preferred axis for light travel", you get an infinite number of possible solutions for a speed-of-light gradient.
Posted Feb 22, 2021 18:44 UTC (Mon)
by ale2018 (guest, #128727)
[Link]
That's right.
Let me quote Leslie Lamport, from the 1978 article cited by Paolo Bonzini:
The references [1] and [2] are two introductory texts on relativity of those times.
Posted Feb 23, 2021 0:03 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (31 responses)
Oh, and I believe it IS possible to calculate the speed of light based on pure logic ...
You'll have to look it up for yourself, but if you take a bunch of mathematical constants like e, pi, ln and so on, I think they somehow dictate a set of possible values for c, g, and other physical constants. It's too long ago, but I'm sure I've seen the physical constants defined in terms of the mathematical constants. And that's why astronomers are so unhappy with the cosmological constant. Because they can't define it in terms of anything else, it shouldn't exist ...
Cheers,
Posted Feb 23, 2021 0:38 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (26 responses)
It'd also lay a blow to various theories of multiple universes which differ in the physical constants. Also, what is a "fundamental constant"? Is the radius of a proton a constant or a derived value? The size of a helium-4 nucleus? Ratio between the lepton sizes? Even physics is a *bit* arbitrary here.
Posted Feb 23, 2021 1:19 UTC (Tue)
by Wol (subscriber, #4433)
[Link] (25 responses)
Who said anything about fundamental? Both e and pi are mathematical constants, as is the golden ratio. They're all of equal importance.
Things like c, g, permittivity, that stuff, are all physical constants, which should be explainable in terms of maths.
Posted Feb 23, 2021 13:10 UTC (Tue)
by mathstuf (subscriber, #69389)
[Link] (7 responses)
But…why? What is this assumption based upon? Is it a bit of Platonic idealism?
Posted Feb 24, 2021 0:27 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (6 responses)
Cheers,
Posted Feb 24, 2021 21:58 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (5 responses)
If we wanted, we could say that the speed of light is pi/e Wol_lengths/Wol_time intervals. Doesn't mean that those units are any more useful than our numerical value in m/s.
Posted Feb 25, 2021 7:48 UTC (Thu)
by jem (subscriber, #24231)
[Link] (4 responses)
Posted Mar 16, 2021 21:09 UTC (Tue)
by nix (subscriber, #2304)
[Link] (3 responses)
Posted Mar 17, 2021 10:37 UTC (Wed)
by jem (subscriber, #24231)
[Link]
Well, I said derived, not defined.
Posted Mar 18, 2021 10:14 UTC (Thu)
by mgedmin (subscriber, #34497)
[Link] (1 responses)
a cubic *deci*meter (which is also one liter) of water.
Posted Mar 20, 2021 1:24 UTC (Sat)
by nix (subscriber, #2304)
[Link]
(And obviously you're right, unless water has suddenly become much, much denser than lead :) )
Posted Feb 23, 2021 15:24 UTC (Tue)
by rschroev (subscriber, #4164)
[Link] (16 responses)
Posted Feb 24, 2021 0:23 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (1 responses)
I think the equation is G=g.m1.m2/d^2
G is 9.8m/s^2, while g is the gravitational constant, which is believed to be the same everywhere in the universe.
Cheers,
Posted Feb 24, 2021 0:47 UTC (Wed)
by rschroev (subscriber, #4164)
[Link]
Posted Feb 24, 2021 0:31 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (11 responses)
It can *de*scribe it, yes. But I think you'll find it also *pre*scribes it and says "completely different constants and laws doesn't make logical sense".
Cheers,
Posted Feb 24, 2021 1:01 UTC (Wed)
by rschroev (subscriber, #4164)
[Link] (10 responses)
BTW there's a video of Richard Feynman's lecture about that relationship here: https://youtu.be/obCjODeoLVw
Interesting stuff, if you're interested in that kind of stuff.
Posted Feb 24, 2021 18:28 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (9 responses)
Cheers,
Posted Feb 24, 2021 21:54 UTC (Wed)
by mathstuf (subscriber, #69389)
[Link] (3 responses)
Posted Feb 24, 2021 23:22 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (2 responses)
We think it may be four - that's what Einstein thought but that has a whole bunch of problems. If it *is* four, I believe that means string theory is correct as it's the explanation for relativistic singularities.
Or it could be ten or eleven. Anything between five and nine inclusive just doesn't work because we get an explosion of infinities - infinity itself isn't a problem, but there are different sorts of infinity and for reality to work they need to cancel out. For those dimensions they don't. (These universes, if I remember correctly, define mass as the fifth dimension ...)
(That's why I was moaning about computers crashing when you divide by zero. If you declare zero and infinity as non-numbers for which arithmetic doesn't work, you're in trouble. If you say "to make arithmetic work, they swap places on division", then you can do this sort of maths and come up with something that makes sense.)
At the end of the day, we have loads of maths that describes what we see. And that *constrains* what is a plausible universe. We have a local maximum or minimum, don't know which. By adjusting some values, we can force others to impossible values. Plausible universes must have all these values at maximum or minimum, not off the scale or impossible or at some non-equilibrium value.
Cheers,
Posted Feb 24, 2021 23:36 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
Posted Feb 25, 2021 0:05 UTC (Thu)
by nybble41 (subscriber, #55106)
[Link]
To say that division by zero is undefined is mathematically correct and not just an arbitrary computer limitation. There is no proper answer to the question "What number, when multiplied by zero, gives the non-zero product X?", at least not in any system that would uphold basic idioms such as the product of a number and its reciprocal being equal to one. "Infinity" times zero is not equal to any particular finite number X, so that isn't a solution. Depending on the particular forms of the equations which gave rise to the infinity and the zero (or infinitesimal) the product could be another infinity or any real number; it depends on how you phrase the question. $\lim_{x \to 0^+} x \ln \frac{1}{x} = 0$, but $\lim_{x \to 0^+} x \cdot \frac{1}{x} = 1$. In both cases you're multiplying an infinitesimal ($\lim_{x \to 0^+} x$) by an infinity ($\lim_{x \to 0^+} \ln \frac{1}{x}$ or $\lim_{x \to 0^+} \frac{1}{x}$, respectively), but the specific form of the equation changes the result.
Posted Feb 24, 2021 22:24 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
If you want to read something mind-blowing, try the "Clockwork Rocket" series by Greg Egan. It's accompanied by a thesis-sized exploration of its (fictional) physics: http://www.gregegan.net/ORTHOGONAL/ORTHOGONAL.html
Posted Feb 24, 2021 23:26 UTC (Wed)
by Wol (subscriber, #4433)
[Link] (3 responses)
Cheers,
Posted Feb 24, 2021 23:34 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link]
You can also construct a quantum field theory for such a universe, it also would work just fine.
Posted Feb 25, 2021 15:16 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link]
Posted Mar 8, 2021 14:59 UTC (Mon)
by LtWorf (subscriber, #124958)
[Link]
Physics tries to model measurements.
So for example you measure a planet that is orbiting, make an equation, and see if tomorrow the equation and the position are the same (within a certain range of precision).
Before Galileo saw that Jupiter had satellites, they had perfectly fine equations that predicted where everything would be in the sky. The problem arose because new data could not fit the model.
You can absolutely model an orbit of a planet using 3 dimensions, or you can model it in a higher-dimensional space with an equation of a lower degree. Both work. We can't really know which is "exact" if both work.
You can keep adding dimensions and make equations that work, but we don't really know what the "truth" is.
Posted Feb 24, 2021 1:24 UTC (Wed)
by SiB (subscriber, #4048)
[Link] (1 responses)
Posted Feb 24, 2021 8:53 UTC (Wed)
by mpr22 (subscriber, #60784)
[Link]
The first formally adopted definition for the metre itself, proposed in 1791 by the French Academy of Sciences and adopted in 1793 by the National Assembly, was one ten-millionth of the distance from the North Pole to the Equator along the Paris meridian.
Posted Feb 23, 2021 1:06 UTC (Tue)
by mpr22 (subscriber, #60784)
[Link] (3 responses)
There is one particular physical constant, µ_0 (the permeability of free space) which was historically defined (as a consequence of the relationship between electrical current and induced magnetic fields) as exactly 4π × 10^-7 H/m ... but that value is no longer strictly valid (although it's close enough for the vast majority of practical purposes), because the ampere has been redefined in terms of the elementary charge and the second.
Posted Feb 24, 2021 1:00 UTC (Wed)
by SiB (subscriber, #4048)
[Link] (2 responses)
Some years ago in our coffee room at the institute there was a discussion about the possibility that the fine-structure constant α may not have been constant in the past after all. The question was raised, which of the constituents of α=e²/(4πε₀ħc) may be the reason. Well, the answer is e, if at all. α is a constant without units. It is pure physics. It defines the strength of electromagnetic interactions. The universe cares a lot about the value of the fine-structure constant α. In our experimental units it is the charge of the electron e that represents that strength, but in the end, it also just defines units of measure.
Theoretical physics defines the units differently than experimental physics. They mostly use c=4πε₀=ħ=k=1. Everything is measured in eV or eV⁻¹.
From a theoretical point I am a bit unhappy about the new definition of the SI units. They fixed the value of e, although it indirectly represents a quantity the universe cares about. The cost is that µ₀ and ε₀ are now _not_ fixed by definition in the SI, as they were before, like Wol said.
Ideally, all units would be defined by fixing the value of some fundamental constant. We are now almost there. The second is still defined by an artifact; at least it is an atom, so it can be reproduced everywhere in the universe. To fix that we would need to fix the gravitational constant. But metrology defines units by what can be measured with the best precision, and the gravitational constant does not qualify: it has to catch up at least eight orders of magnitude in precision.
Posted Feb 25, 2021 3:46 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link]
The universe may not care about the speed of light, but the stuff in the universe surely does. To borrow from your "theoretical physics normalizes everything to 1" example, the mass of the proton is much, much (numerically) smaller than the charge of the proton, and surely there's some physical significance of that? Or, alternatively, we can say that there's nothing wrong with the proton, and instead it's gravity which is too weak, but that's just reframing the same question in different units. I'm not sure you can ask that question without using units at all.
Ultimately, I suppose the "proper" framing of this question is some nonsensically complicated question about why the relevant quantum fields happen to interfere in exactly the way that they do to give the proton the properties that it has. But you can't even ask that question until you know what gravity is, at a quantum level, and to the best of my knowledge, nobody does.
Posted Feb 25, 2021 4:35 UTC (Thu)
by NYKevin (subscriber, #129325)
[Link] (3 responses)
Ultimately, science is about predicting the outcomes of experiments and observations. If you can't observe it, directly or indirectly, it is beyond the realm of scientific inquiry. The one-way speed of light falls into this hole. Speed is normally defined as distance divided by time, but when we're talking about relativistic speeds, we need to be clear about which time we mean.
So, let's imagine a specific setup. We'll send a pulse of light from point A to point B. So, what is the one-way velocity of the light? It is the distance between A and B, divided by the time it takes for the signal to arrive. But whose time?
It turns out that, not only is there no reasonable way for the scientists at A and B to communicate the fact that the pulse has arrived (short of sending a second pulse and thereby measuring the round-trip time instead), it is actually not meaningful for an event at A and an event at B to happen "simultaneously." There will always be an observer who thinks that A's event happened first, and an observer who thinks that B's event happened first, unless the events are in each other's light cones, in which case everyone will agree that one comes before the other. But you can't get everyone to agree that both events happen simultaneously.
This creates a problem, because no matter where you put the clock, it has to be synchronized with respect to two distant events:
1. The pulse being sent from A.
2. The pulse arriving at B.
You cannot synchronize a clock with respect to an event unless the clock is physically present at that event, because simultaneity is otherwise relative to one's choice of reference frame. So the clock has to be at both A and B. You could imagine two clocks, synchronized with each other, but then you have to decide what you mean by "synchronized," because simultaneity is still relative. There is always a reference frame in which the clocks are not synchronized.
If the pulse were traveling slower than light, then you could move the clock with the pulse. This would give you a metric called the "proper time," and that actually does have some useful properties in this context. In particular, it is independent of reference frame or choice of coordinates, so that the one-way velocity of any slower-than-light object is perfectly well-defined and even measurable. But you can't do that with a pulse of light, because the pulse is traveling at the speed of light, and the clock can't go that fast if it has any mass. If the clock doesn't have any mass, then it cannot function as a clock in the first place, because massless particles do not experience the passage of time.
As a result, there is no operational definition of the one-way speed of light that is compatible with the laws of physics. This is why Einstein, for example, describes it as a "convention" in his special relativity paper, rather than asserting it as fact. It's not fact, it's just a convenient way of labeling the diagram.
Having said all that, there is one thing we can experimentally verify (and have done so): The time it takes for light to travel along a closed inertial path of length L is L/c, as measured at the start/endpoint of the path, regardless of what the path looks like or how it is shaped. So, if there are any variations in the one-way speed of light that depend on angle, they *always* cancel each other out. As a result, if you simply assume that the one-way speed of light is a constant, then you can use that assumption to "synchronize" distant clocks by sending light pulses between them and adjusting for the speed of light delay. Although this "synchronization" is ultimately just a convention, it behaves a lot like you would expect "true" synchronization to behave (at least with respect to the clocks' reference frames, anyway). Einstein describes how this works in his paper, and others have elaborated on the details (in short, the clocks should be in inertial reference frames and not in relative motion).
Posted Feb 25, 2021 22:39 UTC (Thu)
by apoelstra (subscriber, #75205)
[Link] (2 responses)
Could you entangle two clocks?
Posted Feb 25, 2021 22:47 UTC (Thu)
by mathstuf (subscriber, #69389)
[Link] (1 responses)
Posted Feb 25, 2021 22:56 UTC (Thu)
by apoelstra (subscriber, #75205)
[Link]
[1] https://en.wikipedia.org/wiki/Twin_paradox
Posted Feb 20, 2021 4:04 UTC (Sat)
by itsmycpu (guest, #139639)
[Link] (4 responses)
That holds as long as the value read in step3 is the value written in step2 (or a value written after it). Because then you know that step3 > step2, and because of that, step4 > step1.
Posted Feb 20, 2021 6:15 UTC (Sat)
by itsmycpu (guest, #139639)
[Link] (3 responses)
To clarify the not-so-obvious part:
From the point of view of the first thread, step2 > step1 is always true, but without using acquire/release, this doesn't transfer to the second thread. Without that, the second thread may not see step4 > step1.
Why not? For example, the compiler is allowed to produce code for step2 that affects memory before step1, as long as that doesn't change the outcome of the first thread's local computations. The compiler does not have to consider the possible side effect on the second thread, which therefore might see the memory affected by step2 before it is affected by step1.
Only the use of acquire/release requires the compiler to produce code where the update of step1 is written to memory (or a shared cache), and is therefore visible to other threads, before step2. So without that, the second thread cannot assume step4 > step1, even if it thinks it can assume step3 > step2.
I think this is the same thing as the article says, just highlighting the practical aspects.
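In user-space C11 terms this pairing can be sketched as follows (a hedged sketch with my own names; the kernel would use smp_store_release() and smp_load_acquire() rather than <stdatomic.h>). If step3 observes the value stored in step2, the release/acquire pair guarantees that step4 observes the value stored in step1:

```c
#include <stdatomic.h>

int data;                 /* the payload: step1 writes it, step4 reads it */
atomic_int ready;         /* the flag: step2 writes it, step3 reads it */

void producer(void)
{
    data = 42;                                          /* step1 */
    atomic_store_explicit(&ready, 1,
                          memory_order_release);        /* step2 */
}

int consumer(void)
{
    if (atomic_load_explicit(&ready,
                             memory_order_acquire))     /* step3 */
        return data;                                    /* step4: sees 42 */
    return -1;            /* flag not observed yet; try again later */
}
```

Without the release/acquire pair, the compiler (or a weakly ordered CPU) could make the store to data visible after the store to ready, exactly the hazard described above.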
Posted Feb 20, 2021 7:24 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link] (2 responses)
Posted Feb 20, 2021 9:47 UTC (Sat)
by pebolle (guest, #35204)
[Link]
And that goes for everyone taking time to answer my grumpy questions.
When you're not versed in the vocabulary and habits of a certain field it is very hard to determine whether you're reading something profound, something trivial, or just plain nonsense. (I've just finished a book of nearly 300 pages explaining that even physicists can be happily employed while apparently only producing hot air.)
I'll restart the article and hope that I won't get stuck on one of my many, many pet peeves again.
Posted Dec 10, 2021 9:48 UTC (Fri)
by Lurchi (guest, #38509)
[Link]
Posted Feb 20, 2021 3:47 UTC (Sat)
by kurogane (subscriber, #83248)
[Link] (9 responses)
I'm afraid I keep interpreting the message example as code that will only read datum->x once x has already been assigned.
An introduction to lockless algorithms
The point is that recursion *does not* work for computing Fibonacci numbers...
An introduction to lockless algorithms
long fibonacci(long f)
{
    if (f <= 1)
        return f;
    else
        return fibonacci(f - 1) + fibonacci(f - 2);
}
Wol
An introduction to lockless algorithms
Wol
An introduction to lockless algorithms
> if (f <= 2) return 1;
An introduction to lockless algorithms
As a mathematical abstraction, this is perfectly fine and well defined for all f.
An introduction to lockless algorithms
Prelude> fibs !! 100000
An introduction to lockless algorithms
12586269025 additions. That is what I call 'not working'.
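For comparison, an iterative version (a sketch of my own, not from the thread) needs only about f additions, instead of the 12,586,269,025 performed by the naive recursion for f = 50:

```c
/* Iterative Fibonacci: O(f) additions, no exponential recursion. */
long long fibonacci_iter(int f)
{
    long long a = 0, b = 1;   /* fib(0) and fib(1) */
    for (int i = 0; i < f; i++) {
        long long next = a + b;
        a = b;
        b = next;
    }
    return a;                 /* fib(f) */
}
```

The lazy Haskell list in the comment above achieves the same linear behavior by memoizing each element as it is produced.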
An introduction to lockless algorithms
Wol
An introduction to lockless algorithms
It has a closed-form solution but that does not make it O(1).
An introduction to lockless algorithms
* The packages in a (sufficiently powerful) package manager are a poset (under the relationship "package X is a transitive dependency of package Y").
* In languages with OOP and inheritance, the set of classes (or classes and interfaces) is a poset (under the relationship "class/interface X is a direct or indirect subclass/subtype of class/interface Y").
* Packages are pretty much never totally ordered, in real-world cases. But maybe, if you wanted to ensure a totally deterministic installation order, you could insert artificial dependencies to force the package manager to install everything in the exact order you want.
* In languages with single inheritance, each individual class's set of superclasses (not counting interfaces) is totally ordered. But the language as a whole is not, in most reasonable cases.
An introduction to lockless algorithms
It not only includes a more formal description of the model but is also rich with examples describing how various CPU architectures behave and why particular model design decisions were (probably) made.
An introduction to lockless algorithms
I assume you mean that people have misconceptions about the ordering guarantees that sequentially consistent semantics give?
After all, they provide acquire and release semantics, too, so any algorithm that's valid under acq/rel will be valid under sequentially consistent semantics, too (just maybe slower).
An introduction to lockless algorithms
-- If A = B, and B = C, then A = C
-- If A > B, and B > C, then A > C
If tP op tQ, and tQ op tR, then tP op tR."
An introduction to lockless algorithms
Wol
An introduction to lockless algorithms
Wol
An introduction to lockless algorithms
It's perfectly acceptable for there to be no ordering between them (except where the axioms say there must be an ordering).
Axioms
Wol
Axioms
This definition will appear quite natural to the reader familiar with the invariant space-time formulation of special relativity, as described for example in [1] or the first chapter of [2]. In relativity, the ordering of events is defined in terms of messages that could be sent. However, we have taken the more pragmatic approach of only considering messages that actually are sent. We should be able to determine if a system performed correctly by knowing only those events which did occur, without knowing which events could have occurred.
Axioms
Wol
Axioms
Cheers,
Wol
Axioms
Wol
Axioms
Wol
Axioms
g is the gravity of earth, approximately 9.81 m/s².
The equation is F = G⋅m1⋅m2/r². In case of Earth, that's F = G⋅m⋅m_earth/r_earth², with g = G⋅m_earth/r_earth². So in that case we get F = m⋅g.
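As a check, plugging rounded textbook values (my numbers, not from the comment) into g = G⋅m_earth/r_earth² does reproduce the familiar 9.8 m/s²:

```c
/* Surface gravity from Newton's law of gravitation: g = G*m/r^2.
 * Example values (approximate): G = 6.674e-11 m^3/(kg*s^2),
 * m_earth = 5.972e24 kg, r_earth = 6.371e6 m, giving g ~ 9.82 m/s^2. */
double surface_gravity(double G, double m, double r)
{
    return G * m / (r * r);
}
```

The constants above are rounded reference values, so the result lands near 9.8 rather than exactly on it.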
Axioms
Wol
Axioms
And some comments about it: https://medium.com/cantors-paradise/richard-feynman-on-th...
Axioms
Wol
Axioms
Wol
Axioms
You can construct a universe with multiple time axes, with many more dimensions, and so on. The math will be internally consistent.
Axioms
Wol
Axioms
Who told you that? A six-dimensional classic (Newtonian) universe works just fine. Sure, you won't have stable orbits but apart from that it's OK.
Axioms
2. The pulse arriving at B.
Axioms
An introduction to lockless algorithms
>
> As long as the value read in step3 is the value written in step2 (or a value written after that). Because then you know that step3 > step2, and because of that, step4 > step1.
An introduction to lockless algorithms
    a.x = 1;                 datum = message;
       |                        |
       |     happens before     |
       v                        v
    message = &a;            datum->x
I think the left "happens before" arrow should not be there (the right one of course is correct, due to the data dependency).
An introduction to lockless algorithms
    thread 1                          thread 2
    --------------------------------  ------------------------
    a.x = 1;
    message = &a;                     datum = message;
                                      if (datum != NULL)
                                          printk("%d\n", datum->x);
That is, thread 2 will either do nothing, if the "message" (and hence "datum") pointer is still NULL, or receive a pointer to "a" after its x member has already been assigned.
Is the point that thread 2 was supposed to printk(), i.e. wait until datum is non-NULL? When I first saw this code it seemed natural to me that thread 2 was meant as a 'just do nothing if the object was not ready' thread.
Or is the point that there might be reordering, i.e. that at the CPU-instruction level the stores and loads might not be ordered as the C code above is?
Posted Feb 20, 2021 7:17 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link] (8 responses)
In theory the reordering could be on both sides, in practice (except on the Alpha) only the release side might be assigning a.x after message. Even on non-Alpha processors you need to inform the compiler not to do any reordering.
However you could have for example "ready=1" on the producer side and "if (ready) printk("%x", a.x)" on the consumer side. In that case both sides need to be protected against reordering at the processor level (in addition to the compiler level).
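That flag pattern can be sketched in C11 terms (a hedged sketch with my own names; the kernel equivalents would be smp_wmb()/smp_rmb() or the store-release/load-acquire forms). Both sides need an ordering constraint: the producer between the data and the flag, the consumer between the flag and the data:

```c
#include <stdatomic.h>

struct { int x; } a;      /* the payload from the example */
atomic_int ready;         /* the "ready" flag */

void produce(void)
{
    a.x = 1;
    /* Order the payload store before the flag store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consume(void)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed)) {
        /* Order the flag load before the payload load. */
        atomic_thread_fence(memory_order_acquire);
        return a.x;
    }
    return -1;            /* not published yet */
}
```

Unlike the pointer-publishing example, there is no data dependency here from the flag to a.x, which is why the consumer side needs its own barrier on weakly ordered processors.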
Posted Feb 20, 2021 8:58 UTC (Sat)
by kurogane (subscriber, #83248)
[Link] (6 responses)
Thanks Paolo. If the processor instructions don't necessarily follow the higher-level language flow shown in the example, then I get it, I believe. Well, could you confirm? The instruction flow in the processor could end up like the below, rather than as the C programmer mentally pictures it? (Focusing just on thread 1, as that is the more likely jeopardy-maker.) I think I finally understand what memory barriers are: temporal ones, not spatial.
Posted Feb 20, 2021 9:13 UTC (Sat)
by pbonzini (subscriber, #60935)
[Link] (5 responses)
On one hand it may not happen on all processors, and details may vary depending on the microarchitecture. On x86 that reordering cannot happen, on ARM and PowerPC it might.
But on the other hand the compiler might always be reordering stuff behind your back. The big insight of the C++ committee was that the same abstraction could cover both processor-level and compiler-level reorderings.
Posted Feb 20, 2021 11:04 UTC (Sat)
by kurogane (subscriber, #83248)
[Link] (1 responses)
I appreciated this confirmation very much.
Posted Feb 20, 2021 20:31 UTC (Sat)
by danobi (subscriber, #102249)
[Link]
Posted Feb 21, 2021 19:13 UTC (Sun)
by khim (subscriber, #9252)
[Link] (2 responses)
Tiny correction: on x86 it can and will happen. You need memory barriers there, too.
After reading the aforementioned article I was confused for some time: memory barriers arrived on x86 with SSE… yet SMP had been supported for more than a decade by that time! Some investigation showed that any instruction with
Posted Feb 21, 2021 19:16 UTC (Sun)
by khim (subscriber, #9252)
[Link]
Posted Feb 21, 2021 20:51 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link]
Posted Mar 20, 2021 19:18 UTC (Sat)
by guzh (subscriber, #140551)
[Link]
Posted Feb 21, 2021 12:38 UTC (Sun)
by ale2018 (guest, #128727)
[Link] (1 responses)
If pointers are 64 bits, does that statement imply that the cache is updated in minimal chunks of that size? I mean, why can't it read just the lower part of &a?
Posted Feb 21, 2021 12:53 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link]
The processor itself will load data from memory in cache lines that will usually be 32, 64 or 128 *bytes* long.
Posted Feb 21, 2021 18:50 UTC (Sun)
by andresfreund (subscriber, #69562)
[Link] (3 responses)
Posted Feb 21, 2021 20:52 UTC (Sun)
by pbonzini (subscriber, #60935)
[Link] (2 responses)
Posted Feb 21, 2021 22:29 UTC (Sun)
by andresfreund (subscriber, #69562)
[Link] (1 responses)
I don't know how many times I've now spent several hours staring at the read barrier + write barrier != full barrier thing swapping it back into my brain. And how often I couldn't find anything good to tell people to read.
Posted Feb 25, 2021 18:25 UTC (Thu)
by Alan.Stern (subscriber, #12437)
[Link]
Example: Suppose the source code says:

    write(x)
    wmb()
    rmb()
    read(y)

Then the CPU has to execute the write before the wmb, and it has to execute the read after the rmb. But (depending on the architecture) nothing prevents it from reordering the operations to:

    rmb()
    read(y)
    write(x)
    wmb()

(In fact this actually happens on x86, where rmb and wmb are both no-ops and reads can be executed before earlier writes.)

However, with a full barrier:

    write(x)
    mb()
    read(y)

no reordering is possible, and the write *must* be executed before the read.
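In C11 terms (a loose mapping of my own; the kernel's mb() is not literally a C11 fence) only a sequentially consistent fence keeps a store ordered before a later load:

```c
#include <stdatomic.h>

atomic_int x, y;

/* write(x); full barrier; read(y): the seq_cst fence forbids the
 * store-load reordering that a wmb()+rmb() pair would still allow. */
int write_then_read(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);    /* write(x) */
    atomic_thread_fence(memory_order_seq_cst);             /* full barrier */
    return atomic_load_explicit(&y, memory_order_relaxed); /* read(y) */
}
```

With two threads each running this pattern on the other's variables, the full barrier is what rules out the outcome where both loads return 0 after both stores.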
Posted Feb 21, 2021 21:10 UTC (Sun)
by thoughtpolice (subscriber, #87455)
[Link] (1 responses)
Posted Feb 26, 2021 0:37 UTC (Fri)
by jschrod (subscriber, #1646)
[Link]
Yes, and that's why the book "Concurrency: the Works of Leslie Lamport" was published by ACM in October 2019. https://dl.acm.org/doi/book/10.1145/3335772
There are very few computer scientists who are as important and for whom ACM has published such a book. That said, I'm still waiting for the book "the works of Don Knuth" that goes beyond his work on TAOCP and TeX and also values his introduction of attribute grammars and loads of other principles. His recent opinion piece https://dl.acm.org/doi/10.1145/3442377 in CACM about the way computing history should be researched is also worth a read, even and especially if one doesn't agree with his approach.
Posted Mar 3, 2021 6:46 UTC (Wed)
by comicfans (subscriber, #117233)
[Link]
Posted Mar 3, 2021 22:57 UTC (Wed)
by prauld (subscriber, #39414)
[Link]
Posted Mar 7, 2021 14:58 UTC (Sun)
by jcm (subscriber, #18262)
[Link] (1 responses)
I would also personally encourage readers to look elsewhere than the x86 QPI/UPI and HT stuff and toward an openly documented protocol with public documentation of its implementation. The Arm AMBA specs are available, as is their implementation in the Coherent Mesh Network CMN IP. Far more useful than the Intel/AMD alternatives.
Posted Mar 7, 2021 14:59 UTC (Sun)
by jcm (subscriber, #18262)
[Link]
Posted Mar 7, 2021 15:16 UTC (Sun)
by jcm (subscriber, #18262)
[Link]
1. Happens before can be thought of as "if it happens". When there are two threads sharing a variable, and there is order through barriers or acquire/release, you're still checking the value of the shared variable. No thread waits for another, it's just that *if* you see an update to the shared variable, you're guaranteed that you also see all that came before. This article is quite good at pointing that out, but many others don't really highlight this part.
2. The difference between the smp_store_release/smp_load_acquire and the WRITE_ONCE/READ_ONCE with barriers is that the latter uses "full" barriers while the former can be thought of as using "half" barriers that only order in one direction for each half of the acquire/release pair, which allows certain other unrelated loads and stores to pass into/out of the critical section doing the shared variable update.
Posted Mar 25, 2021 12:22 UTC (Thu)
by naveenmv7 (guest, #141627)
[Link] (1 responses)
Why is this true? Can't pthread_join be implemented using semaphores internally?
Posted Apr 6, 2021 15:56 UTC (Tue)
by pbonzini (subscriber, #60935)
[Link]
Posted Oct 11, 2022 4:50 UTC (Tue)
by dafnaf (subscriber, #158251)
[Link]
An introduction to lockless algorithms
    thread 1                            thread 2
    --------------------------------    ------------------------
    message = &a;                       ...
    ...
    a.x = 1;                            ...
> On x86 that reordering cannot happen, on ARM and PowerPC it might.
The lock prefix acts as a memory barrier on x86. But if there are no memory barriers at all… yes, it does happen on x86, too.
After looking at that article again I realized that Jeff added an addendum at the end which reveals all the gory details about lock and xchg. So there is no mystery anymore…
And only if the load *actually* happens after the store can we say "a.x = 1" happens before "datum->x".
(If thread 2 runs really fast and completes before thread 1 starts, it sure won't get the proper value in x.)
    write(x)                write(x)
    wmb()
    rmb()                   mb()
    read(y)                 read(y)

On the left, wmb() only orders writes against writes and rmb() only orders reads against reads, so the read(y) can still be executed before the write(x); on the right, the full barrier forbids that reordering. There are other ways in which a full barrier is stronger than the combination of a read barrier plus a write barrier -- but at least this gets the point across.
Using Lamport's approach when reasoning about concurrency is very powerful. It's much more rigorous and applicable than you might expect, and it can explain many behaviors easily at a glance that you might otherwise struggle over. For instance, in a lockless memory allocator, you need to grab a pointer from a freelist head and claim it for the caller (in the implementation of alloc), and then you need to update the freelist pointer to point to the next free block before you're done (very simplified).
Therefore the allocation is split into two distinct actions. The system, however, may preempt your thread after a block is claimed, but before the freelist head is updated. Then another thread could also allocate and erroneously pull the same block, which can't be allowed.
The traditional, simplistic solution to this is to pack a counter value together with every freelist address. When you perform any operation on the freelist, including claiming blocks or updating pointers, you use a CAS2 operation ("compare-and-swap, for two words of memory") to both update a value and also increment the counter next to it. You can then check the counter value against a known value to detect preemption. If the counter value is g at the beginning and no thread interrupted you, then the result you get back at the end, when you read the counter, should be g+1, i.e. no other threads performed any actions between your own, and so what you did was seen as atomic. If it's any other value, you've been preempted, the address you have in your hand is claimed by someone else, and you must start over.
But this can all be stated another way: every object needs a Lamport counter associated with it. By examining the counter, you can determine if you are responsible for a causal action identified by the counter n, as well as n+1. If you are in fact responsible for both of these actions, then n by definition occurred before n+1, and thus no other threads were able to see the indeterminate state of the object. In this way, a Lamport counter is a kind of concrete algorithm for ensuring atomic broadcast amongst threads.
Frankly, I consider "Time, Clocks, and the Ordering of Events in a Distributed System" a mandatory text for any modern software engineer. Practically every text written by Lamport is extremely important to read...
Really worth reading.