Read-mostly research in 2015
It has been just over one year since the last LWN article on read-mostly research, and it is good to see that a number of interesting papers related to RCU and other read-mostly topics have appeared since then. This article gives a quick summary of them.
Validation
Joseph Tassarotti (Carnegie Mellon University), Derek Dreyer (Max Planck Institute for Software Systems), and Viktor Vafeiadis (also MPI-SWS) carried out a manual proof of correctness of the quiescent-state-based reclamation (QSBR) variant of user-space RCU. This paper modeled rcu_dereference() with C11 memory_order_acquire loads rather than the memory_order_consume loads intended for that purpose; however, given the current parlous state of memory_order_consume [PDF], this shortcut is quite understandable. Their paper is titled “Verifying Read-Copy-Update in a Logic for Weak Memory [PDF]” and appeared in the Proceedings of the 36th Annual ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2015). Another researcher asked me if I felt that their assumptions were adequate, and as previously reported here, I replied that, since they found no bugs, their assumptions clearly must be unrealistic. In the authors' defense, the proof is highly non-trivial, so the lack of bugs was not due to lack of effort.
Iftekhar Ahmed of Oregon State University gave a presentation to the Scalability microconference of the 2015 Linux Plumbers Conference describing a mutation-based technique for evaluating test suites that he is applying to Linux-kernel RCU. Each mutant is a copy of the RCU code with a single change, or “mutation”: for example, a given statement might be deleted, a comparison operator might be changed, or a constant might be changed. The question is then whether rcutorture will complain about the mutant; mutants that rcutorture does not complain about are said to have “survived”.
Of course, some mutants will never produce a complaint. For example, a mutation that removes a BUG_ON() or that changes (say) “while (1)” to “while (2)” produces a correct program. Some manual inspection of surviving mutants is therefore required. However, such manual inspection has been rewarded by the finding of not one but two holes in rcutorture, one of which was hiding a real bug—in Tiny RCU, believe it or not. [Full disclosure: I am collaborating with Iftekhar and his professors, Alex Groce and Carlos Jensen, and am co-author on a resulting paper [PDF].]
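To make the idea concrete, here is a hypothetical C fragment (invented for this article, not taken from actual RCU source) with three of the mutation kinds described above marked in comments:

    #include <stdbool.h>

    /* Hypothetical original code; not actual RCU source.  Each comment
     * names one single-change mutant that a mutation tool might generate. */
    static bool batch_full(int nqueued)
    {
            const int limit = 16;           /* mutant 1: change 16 to 17 */

            if (nqueued < 0)                /* mutant 2: delete this statement */
                    return false;
            return nqueued >= limit;        /* mutant 3: change ">=" to ">" */
    }

If the test suite never happens to present a queue depth of exactly 16, mutant 3 survives, which points at exactly the sort of hole in the tests that inspection of survivors can reveal.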
Quick Quiz 1: Wait a minute!!! How can you possibly create a logic expression that represents all executions of a parallel program?
Answer
Using and implementing RCU
Mike Ash posted a description of an RCU-like primitive in Apple's Objective-C runtime. Interestingly enough, this approach identifies read-side critical sections via designated code ranges, which means that it must scan all CPUs' program counters in order to identify grace periods. This is a distinct method of achieving zero-overhead read-side critical sections, albeit one that poses some practical challenges for large read-side critical sections that call many functions, as well as for reliably sampling other CPUs' program counters without undue degradation of realtime response.
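A minimal sketch of my understanding of this technique appears below; it is not Apple's code. The symbols cs_begin/cs_end and the sample_thread_pc() callback are hypothetical stand-ins: the former bound the read-side code range, and the latter would use a platform thread-suspend facility (for example, Mach thread_get_state()) to sample another thread's program counter.

    #include <stdbool.h>
    #include <stdint.h>

    /* Readers are "in a critical section" whenever their program counter
     * lies inside the designated code range; they execute no extra
     * instructions whatsoever to announce themselves. */
    extern char cs_begin[], cs_end[];

    static bool pc_inside_read_side(uintptr_t pc)
    {
            return pc >= (uintptr_t)cs_begin && pc < (uintptr_t)cs_end;
    }

    /* A grace period is complete once every thread has been observed at
     * least once with its PC outside the read-side code range. */
    void wait_for_readers(uintptr_t (*sample_thread_pc)(int tid), int nthreads)
    {
            for (int tid = 0; tid < nthreads; tid++)
                    while (pc_inside_read_side(sample_thread_pc(tid)))
                            ;       /* re-sample until seen outside the range */
    }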
Maya Arbel and Adam Morrison, both of Technion, wrote a paper titled “Predicate RCU: An RCU for Scalable Concurrent Updates [PDF]”. This title might come as quite a surprise to those of us for whom RCU has long provided eminently scalable updates. However, Arbel and Morrison were working with an internal tree whose non-leaf nodes can contain data, described here [PDF]. It turns out that deleting an item from an internal tree is more complex than in an external tree where only leaf nodes contain data. This added complexity is especially vexing because RCU-protected readers cannot be excluded, which is a particular problem when those readers are finding the spot in which to do an insertion. This interaction between complex deletions and RCU lookup-insertions is handled by holding locks across grace periods, which can degrade both performance and scalability. Of course, holding locks across grace periods places those grace periods on the critical path, which motivated the authors to work hard to reduce grace-period duration, hence predicate RCU.
Predicate RCU allows updaters to wait only on those readers that are involved with the update in question, which should shorten grace periods. This should in turn reduce the penalty for holding locks across grace periods. However, there is no free lunch: shorter grace periods mean fewer updates per grace period and thus higher per-update overhead. This effect is anything but subtle: a 2004 USENIX paper notes that RCU has been observed satisfying more than 1,000 updates with a single grace period while running an unremarkable workload. Nevertheless, if you must hold a lock across a grace period, a shorter grace period is going to be a good thing, as can be seen in the Linux kernel's synchronize_net() function, which uses synchronize_rcu_expedited() when called with the networking layer's RTNL lock held.
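For reference, the synchronize_net() pattern looks roughly like the following (paraphrased; see net/core/dev.c for the authoritative version):

    /* Paraphrased from the Linux kernel's net/core/dev.c of that era. */
    void synchronize_net(void)
    {
            might_sleep();
            if (rtnl_is_locked())
                    synchronize_rcu_expedited();    /* RTNL held: keep the grace period short */
            else
                    synchronize_rcu();              /* otherwise, a normal grace period is fine */
    }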
Quick Quiz 2: Given the performance penalties, shouldn't someone stop researchers from holding locks across grace periods?
Answer
Developers who assume that academics ignore their work will be happy to see that this paper cites a couple of LWN articles.
Yujie Liu (Lehigh University), Victor Luchangco (Oracle Labs), and Michael Spear (also Lehigh) wrote a 2013 paper titled “Mindicators: A Scalable Approach to Quiescence [PDF]”. This paper presses the scalable non-zero indicator (SNZI) [PDF] technique into service as a grace-period mechanism, and compares it to several other approaches. The intent appears to be to use this mechanism to implement transactional memory (TM), which also appears to require low-latency grace periods, in part courtesy of TM's linearizability requirements, which in turn seems to limit scalability. They do call out the relationship to RCU.
Alexander Matveev (MIT), Nir Shavit (MIT and Tel-Aviv University), Pascal Felber (University of Neuchâtel), and Patrick Marlier (also University of Neuchâtel) recently published a Symposium on Operating Systems Principles (SOSP) paper titled “Read-Log-Update: A Lightweight Synchronization Mechanism for Concurrent Programming [PDF]”, which can be thought of as a software transactional memory (STM) extension that includes explicitly marked read-only transactions. However, unlike RCU readers, these read-only transactions are guaranteed to see a point-in-time snapshot of the union of all RLU-protected data structures across multiple traversals. Of course, this does impose significant additional overhead on RLU updaters, as they acknowledge in their Figure 7, which shows RLU updaters being 2-5 times slower than RCU updaters. That said, this figure shows a benchmark that favors RCU rather heavily. With or without the point-in-time snapshots, I believe that their realization of the importance of explicitly marking read-only operations is a great step forward. Updates to shared variables are also explicitly marked, providing performance benefits over pure STM similar to those of SwissTM [PDF].
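To make the “explicitly marked” point concrete, here is a sketch of RLU-style reader and updater sections. The identifiers are modeled on the paper's API, but treat the exact names and signatures as assumptions rather than as the library's actual interface:

    /* Hypothetical prototypes modeled on the RLU paper's API. */
    struct rlu_thread;
    struct node { struct node *next; int val; };

    void  rlu_reader_lock(struct rlu_thread *self);
    void  rlu_reader_unlock(struct rlu_thread *self);  /* for updaters: commits the write log */
    void *rlu_dereference(struct rlu_thread *self, void *p);
    int   rlu_try_lock(struct rlu_thread *self, struct node **np); /* redirects *np to a log copy */
    void  rlu_abort(struct rlu_thread *self);          /* drop the log and retry */

    /* Reader: an explicitly marked read-only section; all of its
     * rlu_dereference() calls see a single point-in-time snapshot. */
    int read_first_val(struct rlu_thread *self, struct node *head)
    {
            struct node *p;
            int v;

            rlu_reader_lock(self);
            p = rlu_dereference(self, head->next);
            v = p ? p->val : -1;
            rlu_reader_unlock(self);
            return v;
    }

    /* Updater: lock the object, then write to the logged copy. */
    void update_first_val(struct rlu_thread *self, struct node *head, int v)
    {
            struct node *p;

            rlu_reader_lock(self);
            p = rlu_dereference(self, head->next);
            if (p && !rlu_try_lock(self, &p)) {
                    rlu_abort(self);        /* conflict: caller retries */
                    return;
            }
            if (p)
                    p->val = v;             /* goes to the per-thread log via the copy */
            rlu_reader_unlock(self);        /* commit: wait for readers, write back */
    }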
Quick Quiz 3: If RCU does not provide readers a guaranteed consistent snapshot of the data structure, how can anyone successfully use it???
Answer
Their performance testing includes both user-space and Linux-kernel scenarios. In the Linux-kernel scenarios shown in Figure 9, their list-traversal code beat that of the Linux kernel by a surprisingly large margin. In an impressive display of good sportsmanship, one of the authors (Marlier) located the Linux-kernel performance bottleneck and submitted a fix that causes the Linux kernel's lists to outperform those of the paper. The problem was a single non-atomic store and load to an unshared location in the running task's stack, with no memory barriers. It appears that current microprocessors' pipelines can be a bit slow to handle a load from a location that the current CPU just stored to. Patrick eliminated this store and load, and his patch was accepted during the v4.4 merge window.
As far as I know, this is the first academic use of the rcutorture test suite to test an alternative RCU-like implementation, which is a nice milestone. Those wanting some more detailed discussion on RLU, including graphics showing scalability issues on larger systems, can find it on this page.
At the end of the paper, the authors express hope that RLU will be used both in kernel and in user-space programs. This of course raises the question of what situations would be best suited for an RLU solution. Two possibilities come to mind:
- Situations where RCU readers are used in conjunction with sequence locking. These situations are already paying the complexity and performance costs of retries, so these RLU disadvantages might be less of a problem in this case. (A sketch of this pattern appears after this list.)
- Situations where a problematic reader-writer lock has proven difficult to convert to RCU. Perhaps RLU's snapshotted readers might better handle some of these situations.
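The first situation's pattern looks like the following kernel-flavored sketch, in which foo_seq, struct foo, and compute_from() are invented for illustration. The reader loops until read_seqretry() reports that no update raced with it, so retries, and the inconsistencies that force them, are already part of the deal:

    static DEFINE_SEQLOCK(foo_seq);

    int read_foo(struct foo __rcu **fpp)
    {
            unsigned int seq;
            int val;

            do {
                    seq = read_seqbegin(&foo_seq);
                    rcu_read_lock();
                    val = compute_from(rcu_dereference(*fpp));
                    rcu_read_unlock();
            } while (read_seqretry(&foo_seq, seq));
            return val;
    }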
Quick Quiz 4: Shouldn't RLU be tried on complex RCU uses such as the Linux kernel's VFS dentry cache-walk code?
Answer

That said, I strongly suspect that successful application of RLU will require the authors to carefully separate those semantics that are actually required from those semantics that are merely fashionable.
It is easy to get irritated at academics' insistence on linearizability, given the large performance and scalability penalties they pay to achieve it. On the other hand, they do use deferral to improve performance of RLU, and perhaps further work along these lines will persuade them to let go of linearizability.
Frans Kaashoek presented on the history of parallelism and operating systems [PDF] during the SOSP 2015 History Day Workshop to mark SOSP's 50th anniversary. Frans devoted a full slide of a 33-slide deck to RCU, which should be a point of pride for the many developers who have applied RCU within the Linux kernel. (Yes, I am happy on behalf of my work on the RCU infrastructure, but let's face it, the infrastructure is profoundly uninteresting without its many uses.)
Peter Denning also presented during the SOSP 2015 History Day workshop, but on OS Foundations [PDF]. Those who know me will not be surprised to hear that Peter's last bullet on his slide 39 resonated with me: “Theory follows practice”.
So what have I been doing? For my part, I have continued my work enabling complex atomic updates to RCU-protected data structures with minimal copying, which I recently presented [PDF] at linux.conf.au. This round dealt with some bottlenecks in user-mode memory allocators, relearning the lesson that Glibc malloc() is not at all scalable. I have also been working to repair C11's and C++11's memory_order_consume feature, which just might be nearly done. I rewrote the Linux kernel's synchronize_rcu_expedited() and synchronize_sched_expedited() to reduce OS jitter and, for good measure, added kernel parameters to suppress expedited grace periods entirely. I also started work on design documentation for the Linux kernel's RCU implementation, starting with its requirements. Finally, I have upgraded rcutorture in an attempt to keep up with the Linux kernel's huge installed base.
Conclusions
It is very good to see continued academic interest in read-mostly techniques such as RCU. I very much hope that the mutual learning process continues, and that it benefits Linux developers and their users!
Acknowledgments
We all owe thanks to Srivatsa Bhat, K. Gopinath, Orran Krieger, Pedro Ramalhete, Mike Ash, Derek Dreyer, Joe Tassarotti, Iftekhar Ahmed, Adam Morrison, and Hagit Attiya for helping to make this human-readable. I am grateful to Jim Wasko for his support of this effort.
Answers to Quick Quizzes
Quick Quiz 1: Wait a minute!!! How can you possibly create a logic expression that represents all executions of a parallel program?
Answer: CBMC converts the parallel program into single static assignment (SSA) form, which results in each variable having a separate instance for each assignment to it. Each read from that variable is then wired up to one of the instances that it might have read its value from, and all combinations are represented in the resulting logic expression.
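For example (an invented fragment, not taken from CBMC's documentation), SSA conversion splits a twice-assigned variable into numbered instances and selects among them where control flow joins:

    #include <stdbool.h>

    /* Before SSA: the read of x at the return might see either assignment. */
    int before_ssa(bool c)
    {
            int x = 1;
            if (c)
                    x = 2;
            return x;
    }

    /* After SSA: each assignment gets its own instance of x, and a "phi"
     * expression selects whichever instance reaches the join point. */
    int after_ssa(bool c)
    {
            int x1 = 1;             /* first assignment */
            int x2 = 2;             /* second assignment */
            int x3 = c ? x2 : x1;   /* phi(x1, x2) */
            return x3;
    }

In the parallel case, each read is instead wired up to the full set of writes that it might have observed, which is what makes the resulting logic expression so large.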
Of course the resulting logic expression might be quite large, but modern computers and modern SAT solvers have grown quite capable. For example, in 1990, a world-class SAT solver might be able to handle 100 variables, that is, three 32-bit variables with four bits left over. In contrast, in 2015, I have solved a 1.8 million variable SAT problem on my laptop.
The solution took a full ten seconds.
Quick Quiz 2: Given the performance penalties, shouldn't someone stop researchers from holding locks across grace periods?
Answer: No.
First, please note that holding locks (or, in the Linux kernel, mutexes) across grace periods makes perfect sense in some cases, for example, in the synchronize_net() example given earlier. In addition, in some cases, holding locks across grace periods on rarely executed slow paths can greatly reduce complexity. Second, even though I recommend strongly against holding locks across grace periods on fast paths, it is quite possible that researchers exploring this technique will nevertheless come up with something useful. That said, I do indeed suspect that they would make better progress if they moved their grace periods outside of locks. Third, waiting for highly optimized grace periods is not likely to be a big problem on small systems. (Pretty amazing that a 64-CPU system can now be considered “small”, isn't it?) Fourth and finally, it is their job to figure out what they work on. I would no more wish to dictate what they do than I would wish them to dictate what I do.
Quick Quiz 3: If RCU does not provide readers a guaranteed consistent snapshot of the data structure, how can anyone successfully use it???
Answer: It turns out that RCU does in fact guarantee a consistent snapshot—at zero cost—in some important special cases. Perhaps the most common case is a data structure where updaters are restricted to adding new elements or removing old ones, in other words, where in-place updates are prohibited. If each reader only ever looks up a single element, then readers will automatically see a consistent snapshot of that single element—even if a given reader looks up a different element each time it executes.
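In kernel-flavored C, this special case is just the familiar RCU lookup shown below (struct foo, foo_head, and the key/data fields are invented for illustration). Because no element is ever updated in place, whatever element the reader finds is guaranteed to be internally consistent:

    struct foo {
            struct list_head list;
            int key;
            int data;
    };

    static LIST_HEAD(foo_head);     /* elements are added or removed, never updated in place */

    int lookup_foo(int key, int *data)
    {
            struct foo *p;
            int ret = -ENOENT;

            rcu_read_lock();
            list_for_each_entry_rcu(p, &foo_head, list) {
                    if (p->key == key) {
                            *data = p->data;  /* stable: fields never change after publication */
                            ret = 0;
                            break;
                    }
            }
            rcu_read_unlock();
            return ret;
    }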
There are many other use cases where RCU provides consistent snapshots for free, and quite a few of them may be found in the Linux kernel. However, it is also the case that consistency guarantees are overrated. After all, the finite speed of light sharply limits the degree of consistency that a given data structure can usefully maintain with its corresponding external real-world state. Given that the whole point of many data structures is to track external state, internal consistency all too often becomes nothing more than an expensive fiction.
Quick Quiz 4: Shouldn't RLU be tried on complex RCU uses such as the Linux kernel's VFS dentry cache-walk code?
Answer: That might well be the case. I therefore have already pointed one of the authors (Matveev) at Neil Brown's excellent LWN series here, here, and here. Matveev's initial response was as follows: “The dcache/dentry + RCU + various locks is really a complex structure... ” I take this as a good sign, in that Matveev should not be blinded by overconfidence.
Index entries for this article
Kernel: Read-copy-update
GuestArticles: McKenney, Paul E.
Posted Dec 19, 2015 16:54 UTC (Sat)
by nix (subscriber, #2304)
[Link] (5 responses)
It is, at least, now *possible* to fix glibc malloc. Until a few years ago, all but the most trivial changes were precluded by the existence of malloc_get_state()/malloc_set_state(), which was used by exactly one program: Emacs. (In theory it could be used by other dumpy programs like Lisp interpreters, but in practice it was just Emacs.) Now Emacs is the single biggest consumer of memory on most of my machines, but even *I* think this is ridiculous: I'm glad it's changed, and Emacs now uses its own, rather worse, malloc() but frees up glibc's for future improvement :)
(aside: while I was most interested in the existence-bits stuff in the lca slides, I will not be able to unsee slide 16. As far as I'm concerned, all the dataflows on my machines' buses now have angry faces painted on the front of them.)
Posted Dec 19, 2015 18:22 UTC (Sat)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (4 responses)
On your aside, do systems with angry data run faster? ;-)
Posted Dec 20, 2015 1:06 UTC (Sun)
by nix (subscriber, #2304)
[Link] (3 responses)
The existence-bits stuff -- I was very proud, ten years ago now, of implementing what I now realise was about a third of that idea, only much less well thought out, less general, and doing just enough for the one data structure I needed it for. But that was a different employer and not open source and I no longer work there, thank goodness, so I *cannot fix it* to use your idea and this burns my soul.
Posted Dec 20, 2015 18:23 UTC (Sun)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link] (2 responses)
In your defense, there is a lot to be said for doing only that which is needed for the job at hand. And your experience is encouraging in that I was simply solving a stated problem with no assurance that the solution would have any practical value. I am guessing that you are constrained from sharing any details about the data structure in question, but if you can talk about it, I suspect that others would also be interested.
And I bet that there is a lot more that can be done with existence structures!
Posted Dec 21, 2015 23:06 UTC (Mon)
by nix (subscriber, #2304)
[Link] (1 responses)
I was optimizing a hash table based on a splay tree variant: we had very large hash tables accessed by hundreds of threads simultaneously, but where only a few dozen entries were really frequently accessed, and I wanted stable iteration order, so rebalancing a bucket hash became monstrously difficult (I hate rebalancing anyway, it's hard to debug and faults in it lead to horrible intermittent bugs). I could have "cached the cache" with a subsidiary per-CPU hash, but this was in userspace so per-CPU stuff gets really rather difficult. I could have done per-accessor caches, but that makes cache invalidation a pig. So the idea was to use a binary tree (keyed by the hash function output, of course), rotating entries closer to the top of the tree on every access, making the next access faster. But multiple concurrent accesses were routine, and obviously if you want low latencies avoiding lock contention is even more important, and these hash tables were big enough that locks at every node were completely impractical: so I used something very similar to your existence structures (I called it a 'tie' because it 'tied down' a node for as long as it was there) to hold nodes in a consistent state, still accessible via their old route, while the rotation was going on, then release them later, once it was certain that no lookups that might have started before the rotation were still underway: each accessing thread had a pool of pre-prepared existence structures, so I could set them up with a couple of pointer assignments: necessary, because even internal nodes between the two rotation points needed to be tied down this way. One access to a rarely-used node could trigger dozens of ties.
At the time I didn't know about RCU -- when I learned about it I felt like a right idiot :) you'll notice I even had grace periods, though I didn't call them that and they were totally tied to the implementation of my hashing machinery (I had a reference counter of live accesses in the hash *, the owner of the root node, and my "grace periods" ended when the count fell to zero).
In hindsight this was probably a terrible data structure: it was very write-heavy (every read of something not near the top of the tree triggered multiple writes) and write-heaviness is bad on modern machines. But, hey, it worked! (And previous approaches, that used locks rather than the tie approach, were far slower). One major benefit from my perspective was that every conditional in the implementation was more or less guaranteed to have both branches executed quite often -- unlike a rebalancing approach, there were no big blocks of rarely-executed code in this one for bugs to build up in.
Posted Dec 22, 2015 6:10 UTC (Tue)
by PaulMcKenney (✭ supporter ✭, #9624)
[Link]
Posted Dec 21, 2015 17:24 UTC (Mon)
by blackwood (guest, #44174)
[Link] (3 responses)
Posted Dec 21, 2015 17:29 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link] (2 responses)
Posted Dec 21, 2015 18:41 UTC (Mon)
by blackwood (guest, #44174)
[Link] (1 responses)
Posted Dec 21, 2015 20:24 UTC (Mon)
by andresfreund (subscriber, #69562)
[Link]
Note I'm not an academic, nor really an expert in this stuff. I'm reading the relevant literature because it's getting more and more relevant for stuff I work on.
From reading papers it seems that the exact definitions change between papers, and not just in the phrasing.
It sounds a bit like your definition is what the article's note defined as lockless?
> Doesn't the "at least one thread makes progress" trivially hold if you never block while holding the lock?
Well, that was just the gist of the definition ;) I think it usually includes a) a fixed limit on the number of steps until the operation finishes, and b) that a somehow-stuck participant may not lock up the system.
Read-mostly research in 2015
Interesting! So perhaps significant energy savings would accrue if the many organizations processing world news focused on feel-good stories? And who would have guessed that the flood of too-cute cat photos on various social-media sites might be solving an important global problem!!!