Imprecise counters to avoid issues with contended atomicModifyIORef
#21
Comments
Have a look at https://ghc.haskell.org/trac/ghc/ticket/8972#comment:1. It looks like contended …

I can't post on trac for some reason right now (the spam filter dislikes me), but related tickets that complement the unboxed reference idea are https://ghc.haskell.org/trac/ghc/ticket/8157 and https://ghc.haskell.org/trac/ghc/ticket/7883#comment:16.

atomicModifyIORef has a whole busy-wait loop structure that probably adds a bunch of overhead too.

The fetch-and-add primop exposed in …

Yeah, cache-line-width padding is a VERY important detail.
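The fetch-and-add primop mentioned above can be wrapped in a small counter type roughly like this. This is a sketch, not the thread's actual code: the `FaaCounter` name and helpers are illustrative, and it assumes a 64-bit machine word and GHC's `fetchAddIntArray#` from GHC.Exts.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
-- Sketch of a counter built on the fetch-and-add primop.
-- Assumptions: 64-bit Int (hence the 8# byte allocation) and
-- fetchAddIntArray# as exposed in GHC.Exts.
import GHC.Exts
import GHC.IO (IO(..))

data FaaCounter = FaaCounter (MutableByteArray# RealWorld)

-- Allocate a one-word byte array and zero it.
newFaaCounter :: IO FaaCounter
newFaaCounter = IO $ \s ->
  case newByteArray# 8# s of
    (# s1, mba #) -> case writeIntArray# mba 0# 0# s1 of
      s2 -> (# s2, FaaCounter mba #)

-- Atomically add 1 to the word at index 0.
incFaa :: FaaCounter -> IO ()
incFaa (FaaCounter mba) = IO $ \s ->
  case fetchAddIntArray# mba 0# 1# s of
    (# s1, _ #) -> (# s1, () #)

readFaa :: FaaCounter -> IO Int
readFaa (FaaCounter mba) = IO $ \s ->
  case readIntArray# mba 0# s of
    (# s1, i #) -> (# s1, I# i #)

main :: IO ()
main = do
  c <- newFaaCounter
  mapM_ (\_ -> incFaa c) [1 .. 100 :: Int]
  print =<< readFaa c
```

A single fetch-and-add avoids the compare-and-swap retry loop inside atomicModifyIORef, but all cores still bounce the same cache line, which is what the per-capability scheme below addresses.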
[Since @tibbe wanted to be kept in the loop.] We're currently testing whether these ideas are ready for our production system and will then integrate them back into upstream. Here's a quick summary of what we're doing:

Problem: We use a lot of EKG counters to keep track of various code paths in our system. Each counter is incremented using atomicModifyIORef, and our binaries usually run with 4 or more hardware threads. We noticed in microbenchmarks that these counters add non-trivial overhead, most likely because of cache-line contention and the busy-waiting loop used by atomicModifyIORef. We could use plain modifyIORef instead, but that would likely get less and less precise as more cores are involved.

Proposed Solution: The basic idea is to use one counter per capability and sum them up when reading the current value. This is still a bit imprecise, but probably much less so than the plain IORef approach. So we represent a counter as a byte array, and each capability writes to a different part of that array. These writes don't have to be atomic. For a slight improvement in performance we also make sure that two cores don't share the same cache line; the other slots of a cache line can be used for other counters. So the array looks as follows:
The per-capability counters for counter c1 are at offsets 0, 1 * stride + 0, 2 * stride + 0, and so on, where stride is the cache-line width measured in counter slots. The current value of the counter is the sum of all of these per-capability counters. Since we cannot read all the values at the same instant, we have a race condition, but that's why they're imprecise counters. Due to this imprecision they should be displayed as imprecise counters in the UI as well.