
Read-mostly research in 2015

December 16, 2015

This article was contributed by Paul McKenney

[Editor's note: this article was co-written by Paul McKenney and Aravinda Prasad].

It has been just over one year since the last LWN article on read-mostly research, but there have already been a number of interesting papers related to RCU and other read-mostly topics. This article gives a quick summary of them.

Validation

Joseph Tassarotti (Carnegie Mellon University), Derek Dreyer (Max Planck Institute for Software Systems), and Viktor Vafeiadis (also MPI-SWS) carried out a manual proof of correctness of the quiescent-state-based reclamation (QSBR) variant of user-space RCU. This paper modeled rcu_dereference() with C11 memory_order_acquire loads rather than the memory_order_consume loads intended for that purpose. However, given the current parlous state of memory_order_consume [PDF], this shortcut is quite understandable. Their paper is titled “Verifying Read-Copy-Update in a Logic for Weak Memory [PDF]” and appeared in the 2015 Proceedings of the 36th annual ACM SIGPLAN conference on Programming Language Design and Implementation. Another researcher asked me if I felt that their assumptions were adequate, and as previously reported here, I replied that, since they found no bugs, their assumptions clearly must be unrealistic. In the authors' defense, the proof is highly non-trivial, so the lack of bugs was not due to lack of effort.
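
In C11 terms, the modeling shortcut amounts to something like the following sketch (my illustration, not the paper's notation and not the userspace-RCU API):

    #include <stdatomic.h>

    /* rcu_dereference() as intended (consume) versus as modeled in the
     * paper (acquire); illustrative macros only. */
    #define rcu_dereference_intended(p) \
        atomic_load_explicit(&(p), memory_order_consume)
    #define rcu_dereference_as_modeled(p) \
        atomic_load_explicit(&(p), memory_order_acquire)

Acquire is the stronger of the two orderings, and current compilers implement consume by promoting it to acquire anyway, which is part of why the shortcut is understandable.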

Iftekhar Ahmed of Oregon State University gave a presentation to the Scalability microconference of the 2015 Linux Plumbers Conference describing a “mutant” technique for verifying test suites that he is applying to Linux-kernel RCU. Each mutant is a copy of the RCU code, but with a single change, or “mutation”. For example, a given statement might be deleted, a comparison operator might be changed, or a constant might be changed. The question is then whether rcutorture will complain about the mutant. Mutants that rcutorture does not complain about are said to have “survived”.
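
For instance, a mutation that changes a single comparison operator might look like this (a made-up fragment, not actual RCU code):

    /* Original guard: */
    if (curr_gp_count >= needed_gp_count)
        complete_grace_period();

    /* Mutant with the comparison operator changed: */
    if (curr_gp_count > needed_gp_count)
        complete_grace_period();

If rcutorture never complains about such a mutant, then either the mutated line is unreachable or redundant, or the test suite has a hole.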

Of course, some mutants will never produce a complaint. For example, a mutation that removes a BUG_ON() or that changes (say) “while (1)” to “while (2)” produces a correct program. Some manual inspection of surviving mutants is therefore required. However, such manual inspection has been rewarded by the finding of not one but two holes in rcutorture, one of which was hiding a real bug—in Tiny RCU, believe it or not. [Full disclosure: I am collaborating with Iftekhar and his professors, Alex Groce and Carlos Jensen, and am co-author on a resulting paper [PDF].]

Quick Quiz 1: Wait a minute!!! How can you possibly create a logic expression that represents all executions of a parallel program???
Answer
The past year also saw the release of version 5 of the University of Oxford C Bounded Model Checker (CBMC), led by Daniel Kröning. CBMC takes a C program as input, and creates a logic expression that takes the program's inputs as inputs, and that evaluates to true if some combination of those inputs can trigger an assert or result in an array-out-of-bounds condition. This new version introduces multiprocessor capability (including some weak-memory support), and is capable of automatically verifying some trivial RCU implementations, including the Linux kernel's Tiny RCU. It is early days for CBMC's multiprocessor feature, but its appearance is a welcome development.
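
As a toy example of the kind of property CBMC checks (my example, nothing to do with RCU), consider:

    #include <assert.h>

    int buf[4];

    /* CBMC treats the parameter as unconstrained input and searches for
     * a value that violates an assertion or an array bound. */
    int lookup(unsigned int i)
    {
        if (i > 4)              /* Off-by-one bug: should be i >= 4. */
            return -1;
        return buf[i];          /* Out-of-bounds access when i == 4. */
    }

Running something like “cbmc --bounds-check toy.c --function lookup” should then produce a counterexample trace in which i equals 4 (exact flags may vary by CBMC version).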

Using and implementing RCU

Mike Ash posted a description of an RCU-like primitive in Apple's Objective-C runtime. Interestingly enough, this approach identifies read-side critical sections via designated code ranges, which means that it must scan all CPUs' program counters in order to identify grace periods. This approach is notable as a distinct method of achieving zero-overhead read-side critical sections, albeit one that poses interesting practical challenges, both for large read-side critical sections that call many functions and for reliably sampling other CPUs' program counters without unduly degrading realtime response.

Editor's note: for those wondering how an algorithm that takes locks can be "lock-free," you're not alone. According to Paul, in common academic usage, "lock-free" means "at least one thread is guaranteed to make progress." Algorithms that do not take locks at all are, instead, "lockless."

Pedro Ramalhete and Andreia Correia took a much simpler approach, using reader-writer locking to implement RCU, resulting in what they call poor-man's URCU [PDF]. Although this is by no means the first lock-based implementation of RCU, it is, as far as I know, the first lock-based implementation that boasts lock-free readers. Their trick is to use not one but two reader-writer locks. Readers loop doing a read-side trylock on each of these two locks in turn, exiting the loop upon successful acquisition of either of the two locks. Writers are serialized by another lock, and simply write-acquire and write-release the two reader-writer locks in succession. If a writer is delayed while write-holding one of these two locks, readers can still make progress by acquiring the other lock. Although poor man's RCU seems unlikely to become a production-quality implementation of RCU, it is worth studying in its own right. After all, it is not every day that someone achieves non-blocking forward-progress guarantees with what is usually considered to be a blocking primitive!
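
A minimal user-space sketch of the two-lock scheme, assuming POSIX threads (the names are mine, not the authors'):

    #include <pthread.h>

    static pthread_rwlock_t gp_lock[2] = {
        PTHREAD_RWLOCK_INITIALIZER, PTHREAD_RWLOCK_INITIALIZER
    };
    static pthread_mutex_t writer_mutex = PTHREAD_MUTEX_INITIALIZER;

    /* Returns the index of the lock acquired, for use by the unlock. */
    static int poorman_read_lock(void)
    {
        for (;;) {
            if (pthread_rwlock_tryrdlock(&gp_lock[0]) == 0)
                return 0;
            if (pthread_rwlock_tryrdlock(&gp_lock[1]) == 0)
                return 1;
            /* Both trylocks failed, so a writer currently write-holds
             * one lock and is acquiring the other; retry. */
        }
    }

    static void poorman_read_unlock(int idx)
    {
        pthread_rwlock_unlock(&gp_lock[idx]);
    }

    /* Plays the role of synchronize_rcu(): write-acquiring and then
     * releasing each lock in turn waits for all pre-existing readers,
     * each of which read-holds one of the two locks. */
    static void poorman_synchronize(void)
    {
        pthread_mutex_lock(&writer_mutex);
        pthread_rwlock_wrlock(&gp_lock[0]);
        pthread_rwlock_unlock(&gp_lock[0]);
        pthread_rwlock_wrlock(&gp_lock[1]);
        pthread_rwlock_unlock(&gp_lock[1]);
        pthread_mutex_unlock(&writer_mutex);
    }

Note that a reader's loop terminates as soon as either trylock succeeds, which is exactly where the lock-free forward-progress guarantee comes from: at most one of the two locks can be write-held at any instant.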

Maya Arbel and Adam Morrison, both of Technion, wrote a paper titled “Predicate RCU: An RCU for Scalable Concurrent Updates [PDF]”. This title might come as quite a surprise to those of us for whom RCU has long provided eminently scalable updates. However, Arbel and Morrison were working with an internal tree whose non-leaf nodes can contain data, described here [PDF]. It turns out that deleting an item from an internal tree is more complex than in an external tree where only leaf nodes contain data. This added complexity is especially vexing because RCU-protected readers cannot be excluded, which is a particular problem when those readers are finding the spot in which to do an insertion. This interaction between complex deletions and RCU lookup-insertions is handled by holding locks across grace periods, which can degrade both performance and scalability. Of course, holding locks across grace periods places those grace periods on the critical path, which motivated the authors to work hard to reduce grace-period duration, hence predicate RCU.

Predicate RCU allows updaters to wait only on those readers that are involved with the update in question, which should shorten grace periods. This should in turn reduce the penalty for holding locks across grace periods. However, there is no free lunch: shorter grace periods mean fewer updates per grace period and thus higher per-update overhead. This effect is anything but subtle: a 2004 USENIX paper notes that RCU has been observed satisfying more than 1,000 updates with a single grace period while running an unremarkable workload. Nevertheless, if you must hold a lock across a grace period, a shorter grace period is going to be a good thing, as can be seen in the Linux kernel's synchronize_net() function, which uses synchronize_rcu_expedited() when called with the networking layer's RTNL lock held.
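
The kernel's implementation is short enough to quote (lightly abridged from net/core/dev.c, approximately as of v4.4):

    /* Wait for an RCU grace period, but expedite it if the RTNL lock
     * is held, so that other RTNL users are not delayed. */
    void synchronize_net(void)
    {
        might_sleep();
        if (rtnl_is_locked())
            synchronize_rcu_expedited();
        else
            synchronize_rcu();
    }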

Quick Quiz 2: Given the performance penalties, shouldn't someone stop researchers from holding locks across grace periods?
Answer
Therefore, given that Arbel and Morrison need to hold locks across RCU grace periods, they need short grace periods. To this end, predicate-RCU readers associate themselves with an algorithm-specific value. For example, readers of a hash table might associate themselves with the lookup key, which would allow updaters to wait only for those readers whose keys hashed to the hash bucket being updated. This is in some ways similar to the Linux kernel's SRCU, where the srcu_struct serves as the value associating readers and updaters. Either way, given that updaters only need to wait on a small subset of the readers, one would expect grace periods to elapse more quickly, which would in turn be expected to reduce the penalty for waiting for a grace period while holding a lock. Their performance results meet this expectation: Although they do not achieve linear scalability, shorter grace periods do improve their performance. Of course, avoiding waiting for grace periods while holding locks would likely improve performance and scalability even more!
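
A hedged sketch of the SRCU analogy, using one srcu_struct per hash bucket (an illustrative choice; the bucket and element types and the helper functions are hypothetical, while the SRCU APIs are the kernel's own):

    #include <linux/srcu.h>
    #include <linux/slab.h>

    struct bucket;                                              /* hypothetical */
    struct elem;                                                /* hypothetical */
    extern int bucket_lookup(struct bucket *b, int key);        /* hypothetical */
    extern struct elem *bucket_unlink(struct bucket *b, int key); /* hypothetical */

    DEFINE_SRCU(bucket_srcu);       /* imagine one of these per bucket */

    int reader(struct bucket *b, int key)
    {
        int idx, ret;

        idx = srcu_read_lock(&bucket_srcu);
        ret = bucket_lookup(b, key);
        srcu_read_unlock(&bucket_srcu, idx);
        return ret;
    }

    void updater(struct bucket *b, int key)
    {
        struct elem *old = bucket_unlink(b, key);

        /* Waits only for readers of this srcu_struct, that is,
         * only for readers of this bucket. */
        synchronize_srcu(&bucket_srcu);
        kfree(old);
    }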

Developers who assume that academics ignore their work will be happy to see that this paper cites a couple of LWN articles.

Yujie Liu (Lehigh University), Victor Luchangco (Oracle Labs), and Michael Spear (also Lehigh) wrote a 2013 paper titled “Mindicators: A Scalable Approach to Quiescence [PDF]”. This paper presses the scalable non-zero indicator (SNZI) [$PDF] technique into service as a grace-period mechanism, and compares it to several other approaches. The intent appears to be to use this mechanism to implement transactional memory (TM), which also appears to require low-latency grace periods, in part courtesy of TM's linearizability requirements, which in turn seem to limit scalability. They do call out the relationship to RCU.

Alexander Matveev (MIT), Nir Shavit (MIT and Tel-Aviv University), Pascal Felber (University of Neuchâtel), and Patrick Marlier (also University of Neuchâtel) recently published a Symposium on Operating Systems Principles (SOSP) paper titled “Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming [PDF]”, which can be thought of as a software transactional memory (STM) extension that includes explicitly marked read-only transactions. However, unlike RCU readers, these read-only transactions are guaranteed to see a point-in-time snapshot of the union of all RLU-protected data structures across multiple traversals. Of course, this does impose significant additional overhead on RLU updaters, as they acknowledge in their Figure 7, which shows RLU updaters being 2-5 times slower than RCU updaters. That said, this figure shows a benchmark that favors RCU rather heavily. With or without the point-in-time snapshots, I believe that their realization of the importance of explicitly marking read-only operations is a great step forward. Updates to shared variables are also explicitly marked, providing performance benefits over pure STM similar to those of SwissTM [PDF].
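
To give the flavor of explicitly marked read-only sections, here is a loose C rendering of the RLU reader idiom; the rlu_*-prefixed declarations below are modeled only approximately on the paper's pseudocode and are not the real library's API, and node_t is hypothetical.

    /* Hypothetical declarations standing in for the RLU runtime. */
    typedef struct rlu_thread rlu_thread_t;
    typedef struct node { int key; struct node *next; } node_t;
    extern void rlu_reader_lock(rlu_thread_t *self);
    extern void rlu_reader_unlock(rlu_thread_t *self);
    extern void *rlu_dereference(rlu_thread_t *self, void *p);

    /* An explicitly marked read-only traversal: unlike an RCU reader,
     * the whole walk is guaranteed to see one point-in-time snapshot. */
    int list_contains(rlu_thread_t *self, node_t *head, int key)
    {
        int found = 0;
        node_t *p;

        rlu_reader_lock(self);
        for (p = rlu_dereference(self, head); p != NULL;
             p = rlu_dereference(self, p->next))
            if (p->key == key) {
                found = 1;
                break;
            }
        rlu_reader_unlock(self);
        return found;
    }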

Quick Quiz 3: If RCU does not provide readers a guaranteed consistent snapshot of the data structure, how can anyone successfully use it?
Answer
Their technique scales reasonably well up to 16 CPUs; however, this is a very small system for modern non-mobile workloads. They do have one graph (uppermost graph in Figure 8) that goes up to 80 CPUs, but this shows poor scalability. My first thought was that this poor scalability was due to their single global counter that is atomically incremented on each update, but this did not make sense because RLU is outperforming RCU. Instead, the culprit seems to be the need to hold locks across grace periods. Because RCU optimizes for low per-update overhead at the expense of grace-period latency, and because synchronize_rcu() was used (instead of call_rcu() or synchronize_rcu_expedited()), holding locks across grace periods hurts RCU even more than it hurts RLU. Once again, I recommend either releasing locks before waiting for grace periods or using the asynchronous call_rcu() primitive: both approaches avoid degraded scalability. In the (thankfully rare) Linux-kernel cases where it is absolutely necessary to wait for grace periods while holding a mutex, expedited primitives such as synchronize_rcu_expedited() can be used, though these are not particularly good for realtime applications (with the exception of synchronize_srcu_expedited()).
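
In kernel terms, the two recommended patterns look roughly like this (the table and element types and the unlink_element() and free_element_cb() helpers are hypothetical; the locking and RCU APIs are the kernel's own):

    #include <linux/rcupdate.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct table { spinlock_t lock; /* ... */ };
    struct elem { struct rcu_head rcu; /* ... */ };
    extern struct elem *unlink_element(struct table *t, int key); /* hypothetical */
    extern void free_element_cb(struct rcu_head *rhp);            /* hypothetical */

    /* Pattern 1: release the lock before waiting for the grace period. */
    void remove_then_wait(struct table *t, int key)
    {
        struct elem *old;

        spin_lock(&t->lock);
        old = unlink_element(t, key);   /* unlink from RCU-visible structure */
        spin_unlock(&t->lock);
        synchronize_rcu();              /* no lock held while waiting */
        kfree(old);
    }

    /* Pattern 2: asynchronous reclamation, so that nobody waits at all. */
    void remove_async(struct table *t, int key)
    {
        struct elem *old;

        spin_lock(&t->lock);
        old = unlink_element(t, key);
        call_rcu(&old->rcu, free_element_cb);  /* callback frees old */
        spin_unlock(&t->lock);
    }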

Their performance testing includes both user-space and Linux-kernel scenarios. In the Linux-kernel scenarios shown in Figure 9, their list-traversal code beat that of the Linux kernel by a surprisingly large margin. In an impressive display of good sportsmanship, one of the authors (Marlier) located the Linux-kernel performance bottleneck and submitted a fix that causes the Linux kernel's lists to outperform those of the paper. The problem was a single non-atomic store and load to an unshared location in the running task's stack, with no memory barriers. It appears that current microprocessors' pipelines can be a bit slow to handle a load from a location that the current CPU just stored to. Patrick eliminated this store and load, and his patch was accepted during the v4.4 merge window.

As far as I know, this is the first academic use of the rcutorture test suite to test an alternative RCU-like implementation, which is a nice milestone. Those wanting some more detailed discussion on RLU, including graphics showing scalability issues on larger systems, can find it on this page.

At the end of the paper, the authors express hope that RLU will be used both in kernel and in user-space programs. This of course raises the question of what situations would be best suited for an RLU solution. Two possibilities come to mind:

Quick Quiz 4: Shouldn't RLU be tried on complex RCU uses such as the Linux kernel's VFS dentry cache-walk code?
Answer
  • Situations where RCU readers are used in conjunction with sequence locking. These situations are already paying the complexity and performance costs of retries, so RLU's disadvantages might be less of a problem here (see the sketch after this list).

  • Situations where a problematic reader-writer lock has proven difficult to convert to RCU. Perhaps RLU's snapshotted readers might better handle some of these situations.
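
For reference, the first item's RCU-plus-sequence-locking pattern looks roughly like this (the stats structure is hypothetical; the seqcount and RCU APIs are the kernel's own):

    #include <linux/rcupdate.h>
    #include <linux/seqlock.h>

    struct stats {
        seqcount_t seq;         /* updaters bump this around writes */
        u64 a, b;
    };

    u64 read_consistent_sum(struct stats *s)
    {
        unsigned int start;
        u64 sum;

        rcu_read_lock();
        do {
            start = read_seqcount_begin(&s->seq);
            sum = s->a + s->b;  /* may race with an update... */
        } while (read_seqcount_retry(&s->seq, start));  /* ...so retry */
        rcu_read_unlock();
        return sum;
    }

Readers here already pay for retries, so RLU's additional update-side costs might be an acceptable trade.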

That said, I strongly suspect that successful application of RLU will require the authors to carefully separate those semantics that are actually required from those semantics that are merely fashionable.

It is easy to get irritated at academics' insistence on linearizability, given the large performance and scalability penalties they pay to achieve it. On the other hand, they do use deferral to improve performance of RLU, and perhaps further work along these lines will persuade them to let go of linearizability.

Frans Kaashoek presented on the history of parallelism and operating systems [PDF] during the SOSP 2015 History Day Workshop to mark SOSP's 50th anniversary. Frans devoted a full slide of a 33-slide deck to RCU, which should be a point of pride for the many developers who have applied RCU within the Linux kernel. (Yes, I am happy on behalf of my work on the RCU infrastructure, but let's face it, the infrastructure is profoundly uninteresting without its many uses.)

Peter Denning also presented at SOSP 2015 History Day, but on OS Foundations [PDF]. Those who know me will not be surprised to hear that Peter's last bullet on his slide 39 resonated with me: “Theory follows practice”.

So what have I been doing? For my part, I have continued my work enabling complex atomic updates to RCU-protected data structures with minimal copying, which I recently presented [PDF] at linux.conf.au. This round dealt with some bottlenecks in user-mode memory allocators, relearning the lesson that Glibc malloc() is not at all scalable. I have also been working to repair C11's and C++11's memory_order_consume feature, which just might be nearly done. I rewrote the Linux kernel's synchronize_rcu_expedited() and synchronize_sched_expedited() to reduce OS jitter and, for good measure, added kernel parameters to suppress expedited grace periods entirely. I also started work on design documentation for the Linux kernel's RCU implementation, starting with its requirements. Finally, I have upgraded rcutorture in an attempt to keep up with the Linux kernel's huge installed base.

Conclusions

It is very good to see continued academic interest in read-mostly techniques such as RCU. I very much hope that the mutual learning process continues, and that it benefits Linux developers and their users!

Acknowledgments

We all owe thanks to Srivatsa Bhat, K. Gopinath, Orran Krieger, Pedro Ramalhete, Mike Ash, Derek Dreyer, Joe Tassarotti, Iftekhar Ahmed, Adam Morrison, and Hagit Attiya for helping to make this human-readable. I am grateful to Jim Wasko for his support of this effort.

Answers to Quick Quizzes

Quick Quiz 1: Wait a minute!!! How can you possibly create a logic expression that represents all executions of a parallel program?

Answer: CBMC converts the parallel program into single static assignment (SSA) form, which results in each variable having a separate instance for each assignment to it. Each read from that variable is then wired up to one of the instances that it might have read its value from, and all combinations are represented in the resulting logic expression.
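
A much-simplified illustration of the renaming (my sketch, not CBMC's actual output):

    /* Source fragment:        SSA form:             */
    /* x = a + 1;              x1 = a0 + 1;          */
    /* x = x * 2;              x2 = x1 * 2;          */
    /* assert(x > 0);          assert(x2 > 0);       */

    /* If another thread also assigns x, each read of x in this thread
     * is instead wired to every instance it might observe, guarded by
     * fresh boolean choice variables in the formula, for example:
     *
     *     r1 == (saw_other_write ? x_other1 : x2)
     *
     * so that the SAT solver explores all interleavings at once. */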

Of course the resulting logic expression might be quite large, but modern computers and modern SAT solvers have grown quite capable. For example, in 1990, a world-class SAT solver might be able to handle 100 variables, that is, three 32-bit variables with four bits left over. In contrast, in 2015, I have solved a 1.8 million variable SAT problem on my laptop.

The solution took a full ten seconds.

Back to Quick Quiz 1.

Quick Quiz 2: Given the performance penalties, shouldn't someone stop researchers from holding locks across grace periods?

Answer: No.

First, please note that holding locks (or, in the Linux kernel, mutexes) across grace periods makes perfect sense in some cases, for example, in the synchronize_net() example given earlier. In addition, in some cases, holding locks across grace periods on rarely executed slow paths can greatly reduce complexity. Second, even though I recommend strongly against holding locks across grace periods on fast paths, it is quite possible that researchers exploring this technique will nevertheless come up with something useful. That said, I do indeed suspect that they would make better progress if they moved their grace periods outside of locks. Third, waiting for highly optimized grace periods is not likely to be a big problem on small systems. (Pretty amazing that a 64-CPU system can now be considered “small”, isn't it?) Fourth and finally, it is their job to figure out what they work on. I would no more wish to dictate what they do than I would wish them to dictate what I do.

Back to Quick Quiz 2.

Quick Quiz 3: If RCU does not provide readers a guaranteed consistent snapshot of the data structure, how can anyone successfully use it???

Answer: It turns out that RCU does in fact guarantee a consistent snapshot—at zero cost—in some important special cases. Perhaps the most common case is a data structure where updaters are restricted to adding new elements or removing old ones, in other words, where in-place updates are prohibited. If each reader only ever looks up a single element, then readers will automatically see a consistent snapshot of that single element—even if a given reader looks up a different element each time it executes.
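
A minimal sketch of this add/remove-only pattern using the kernel's RCU list primitives (the item structure and the update-side locking are hypothetical boilerplate):

    #include <linux/rculist.h>
    #include <linux/slab.h>

    struct item {
        int key;
        int value;                  /* never modified in place */
        struct list_head list;
        struct rcu_head rcu;
    };

    static LIST_HEAD(items);        /* updates serialized by a lock, not shown */

    int lookup_value(int key)
    {
        struct item *p;
        int ret = -1;

        rcu_read_lock();
        list_for_each_entry_rcu(p, &items, list) {
            if (p->key == key) {
                ret = p->value;     /* element is fully present or fully absent */
                break;
            }
        }
        rcu_read_unlock();
        return ret;
    }

    void remove_item(struct item *p)    /* caller holds the update-side lock */
    {
        list_del_rcu(&p->list);
        kfree_rcu(p, rcu);          /* free after a grace period elapses */
    }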

There are many other use cases where RCU provides consistent snapshots for free, and quite a few of them may be found in the Linux kernel. However, it is also the case that consistency guarantees are overrated. After all, the finite speed of light sharply limits the degree of consistency that a given data structure can usefully maintain with its corresponding external real-world state. Given that the whole point of many data structures is to track external state, internal consistency all too often becomes nothing more than an expensive fiction.

Back to Quick Quiz 3.

Quick Quiz 4: Shouldn't RLU be tried on complex RCU uses such as the Linux kernel's VFS dentry cache-walk code?

Answer: That might well be the case. I therefore have already pointed one of the authors (Matveev) at Neil Brown's excellent LWN series here, here, and here. Matveev's initial response was as follows: “The dcache/dentry + RCU + various locks is really a complex structure... ” I take this as a good sign, in that Matveev should not be blinded by overconfidence.

Back to Quick Quiz 4.



Read-mostly research in 2015

Posted Dec 19, 2015 16:54 UTC (Sat) by nix (subscriber, #2304) [Link] (5 responses)

Wonderful article, as usual. Lots of interesting papers to read through over Christmas! :)

It is, at least, now *possible* to fix glibc malloc. Until a few years ago, all but the most trivial changes were precluded by the existence of malloc_get_state()/malloc_set_state(), which was used by exactly one program: Emacs. (In theory it could be used by other dumpy programs like Lisp interpreters, but in practice it was just Emacs). Now Emacs is the single biggest consumer of memory on most of my machines but even *I* think this is ridiculous: I'm glad it's changed, and Emacs now uses its own rather worse malloc() but frees up glibc's for future improvement :)

(aside: while I was most interested in the existence-bits stuff in the lca slides, I will not be able to unsee slide 16. As far as I'm concerned, all the dataflows on my machines' buses now have angry faces painted on the front of them.)

Read-mostly research in 2015

Posted Dec 19, 2015 18:22 UTC (Sat) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (4 responses)

Glad you liked it! And glad to hear that glibc malloc() now has the potential to be fixed!

On your aside, do systems with angry data run faster? ;-)

Read-mostly research in 2015

Posted Dec 20, 2015 1:06 UTC (Sun) by nix (subscriber, #2304) [Link] (3 responses)

Systems with angry data run hotter. (This is the real reason for the heat problems plaguing CPU designers these days -- insufficiently placated electrons.)

The existence-bits stuff -- I was very proud, ten years ago now, of implementing what I now realise was about a third of that idea, only much less well thought out, less general, and doing just enough for the one data structure I needed it for. But that was a different employer and not open source and I no longer work there, thank goodness, so I *cannot fix it* to use your idea and this burns my soul.

Read-mostly research in 2015

Posted Dec 20, 2015 18:23 UTC (Sun) by PaulMcKenney (✭ supporter ✭, #9624) [Link] (2 responses)

Interesting! So perhaps significant energy savings would accrue if the many organizations processing world news focused on feel-good stories? And who would have guessed that the flood of too-cute cat photos on various social-media sites might be solving an important global problem!!!

In your defense, there is a lot to be said for doing only that which is needed for the job at hand. And your experience is encouraging in that I was simply solving a stated problem with no assurance that the solution would have any practical value. I am guessing that you are constrained from sharing any details about the data structure in question, but if you can talk about it, I suspect that others would also be interested.

And I bet that there is a lot more that can be done with existence structures!

Read-mostly research in 2015

Posted Dec 21, 2015 23:06 UTC (Mon) by nix (subscriber, #2304) [Link] (1 responses)

I can share details because I cannot imagine that anyone much cares (believe me, it was not business-critical and in fact with modern memory hierarchies is probably a bad idea!)

I was optimizing a hash table based on a splay tree variant: we had very large hash tables accessed by hundreds of threads simultaneously, but where only a few dozen entries were really frequently accessed, and I wanted stable iteration order, so rebalancing a bucket hash became monstrously difficult (I hate rebalancing anyway, it's hard to debug and faults in it lead to horrible intermittent bugs). I could have "cached the cache" with a subsidiary per-CPU hash, but this was in userspace so per-CPU stuff gets really rather difficult. I could have done per-accessor caches, but that makes cache invalidation a pig. So the idea was to use a binary tree (keyed by the hash function output, of course), rotating entries closer to the top of the tree on every access, making the next access faster. But multiple concurrent accesses were routine, and obviously if you want low latencies avoiding lock contention is even more important, and these hash tables were big enough that locks at every node were completely impractical: so I used something very similar to your existence structures (I called it a 'tie' because it 'tied down' a node for as long as it was there) to hold nodes in a consistent state, still accessible via their old route, while the rotation was going on, then release them later, once it was certain that no lookups that might have started before the rotation were still underway: each accessing thread had a pool of pre-prepared existence structures, so I could set them up with a couple of pointer assignments: necessary, because even internal nodes between the two rotation points needed to be tied down this way. One access to a rarely-used node could trigger dozens of ties.

At the time I didn't know about RCU -- when I learned about it I felt like a right idiot :) you'll notice I even had grace periods, though I didn't call them that and they were totally tied to the implementation of my hashing machinery (I had a reference counter of live accesses in the hash *, the owner of the root node, and my "grace periods" ended when the count fell to zero).

In hindsight this was probably a terrible data structure: it was very write-heavy (every read of something not near the top of the tree triggered multiple writes) and write-heaviness is bad on modern machines. But, hey, it worked! (And previous approaches, that used locks rather than the tie approach, were far slower). One major benefit from my perspective was that every conditional in the implementation was more or less guaranteed to have both branches executed quite often -- unlike a rebalancing approach, there were no big blocks of rarely-executed code in this one for bugs to build up in.

Read-mostly research in 2015

Posted Dec 22, 2015 6:10 UTC (Tue) by PaulMcKenney (✭ supporter ✭, #9624) [Link]

I agree that modern systems are much less forgiving of frequent concurrent writes than was older hardware, but your approach does sound quite appropriate for the hardware at hand back in the day. So good show, and thank you for the info!

Read-mostly research in 2015

Posted Dec 21, 2015 17:24 UTC (Mon) by blackwood (guest, #44174) [Link] (3 responses)

Doesn't the poor-man's URCU have a fairness problem? I have no idea whether that's part of the academic lock-free definition, but for catastrophic amounts of bad luck where a reader just always trylocks the lock a writer is holding right at that moment, and gets stalled in-between, it won't make forward progress. And even when the underlying locking primitive guarantees some form of fairness, that won't help, since the new locking primitive is implemented using trylocks on top of whatever's there. To fix this there would need to be a new "trylock 2 locks at once, acquire at most 1" primitive, and I have no idea how that could even be implemented without some form of retrying using existing cpu instructions.

Read-mostly research in 2015

Posted Dec 21, 2015 17:29 UTC (Mon) by andresfreund (subscriber, #69562) [Link] (2 responses)

There's wait-free (simplified: each thread is guaranteed to make progress), lock-free (at least one thread makes progress) and obstruction free (the system as a whole can progress if some thread is stuck at any point). It sounds like you're looking for "wait-free"?

Read-mostly research in 2015

Posted Dec 21, 2015 18:41 UTC (Mon) by blackwood (guest, #44174) [Link] (1 responses)

Yeah, I guess lock-free indeed has a very unusual (to me) definition in academia. Doesn't the "at least one thread makes progress" trivially hold if you never block while holding the lock? Of course that doesn't work in userspace, where the scheduler can kick you off the cpu anytime it feels like, but in the kernel spin_lock_irq seems to fit this. Probably some twist somewhere, so is there some reading guide/canonical definitions of this stuff available somewhere?

Read-mostly research in 2015

Posted Dec 21, 2015 20:24 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

> Yeah I guess lock-free indeed has a very unusual (to me) definition in academics

Note I'm not an academic, nor really an expert in this stuff. I'm reading relevant literature because it's getting more and more relevant for stuff I work on.

From reading papers, it seems that the exact definitions change between papers, and not just in the phrasing.

It sounds a bit like your definition is what the article's note defined as lockless?

> Doesn't the "at least one thread makes progress" trivially hold if you never block while holding the lock?

Well, that was just the gist of the definition ;) I think it usually includes a) a fixed limit on the number of steps until the operation finishes, and b) that a somehow-stuck participant may not lock up the system.

