Imprecise counters to avoid issues with contended `atomicModifyIORef` · Issue #21 · haskell-github-trust/ekg · GitHub

Imprecise counters to avoid issues with contended atomicModifyIORef #21


Open
nominolo opened this issue Mar 7, 2014 · 5 comments

Comments

@nominolo
nominolo commented Mar 7, 2014

[Since @tibbe wanted to be kept in the loop.] We're currently testing whether these ideas are ready for our production system; if they are, we'll integrate them back upstream. Here's a quick summary of what we're doing:

Problem: We use a lot of EKG counters to keep track of various code paths in our system. Each counter is incremented using atomicModifyIORef, and our binaries usually run with 4 or more hardware threads. We noticed in microbenchmarks that these counters add non-trivial overhead, most likely because of cache-line contention and the busy-waiting loop used by atomicModifyIORef. We could use the non-atomic modifyIORef instead, but that would lose more and more updates as more cores are involved.
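For context, the increment pattern being described is roughly the following (a sketch, not ekg's actual code; the helper names are made up). Every increment goes through a compare-and-swap loop on one shared IORef, so all cores fight over the same cache line:

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef)

-- One shared counter cell; atomicModifyIORef' is the strict variant,
-- the lazy atomicModifyIORef additionally builds up thunks.
newSharedCounter :: IO (IORef Int)
newSharedCounter = newIORef 0

-- Every call spins in a CAS loop until its update wins.
incSharedCounter :: IORef Int -> IO ()
incSharedCounter ref = atomicModifyIORef' ref (\n -> (n + 1, ()))
```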

Proposed Solution: The basic idea is to use one counter per capability and sum them up when reading the current value. This is still somewhat imprecise, but probably much less so than the plain modifyIORef approach. So a counter is represented as a byte array, and each capability writes to a different part of that array. These writes don't have to be atomic. For a slight performance improvement we also make sure that two cores don't share the same cache line; the remaining slots of each cache line can be used for other counters. So the array looks as follows:

| capability 0      | capability 1      | capability 2      |
+-----------------------------------------------------------+
| c1 | c2 | c3 | c4 | c1 | c2 | c3 | c4 | c1 | c2 | c3 | c4 |
+-----------------------------------------------------------+
|<----- stride ---->|

The per-capability slots for counter c1 are at offsets 0, 1 * stride + 0, 2 * stride + 0, and so on. The current value of the counter is the sum of all of these per-capability slots. Since we cannot read all slots at the same instant, reads race with concurrent increments, but that's exactly why these are imprecise counters.
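A minimal, self-contained sketch of this idea in Haskell (assuming 64-byte cache lines, 8-byte Ints, and the primitive package; names like StripedCounter are illustrative, not the proposed ekg API). For simplicity it gives each capability one padded cache line holding a single counter, rather than packing several counters per line as in the diagram above:

```haskell
{-# LANGUAGE BangPatterns #-}
module StripedCounter
  ( StripedCounter
  , newStripedCounter
  , incCounter
  , readCounter
  ) where

import Control.Concurrent (getNumCapabilities, myThreadId, threadCapability)
import Control.Monad.Primitive (RealWorld)
import Data.Primitive.ByteArray
  (MutableByteArray, newByteArray, readByteArray, setByteArray, writeByteArray)

-- One 64-byte cache line holds 8 Ints; each capability owns one line.
intsPerLine :: Int
intsPerLine = 8

data StripedCounter = StripedCounter !(MutableByteArray RealWorld) !Int

newStripedCounter :: IO StripedCounter
newStripedCounter = do
  caps <- getNumCapabilities
  arr  <- newByteArray (caps * intsPerLine * 8)
  setByteArray arr 0 (caps * intsPerLine) (0 :: Int)  -- zero every slot
  return (StripedCounter arr caps)

-- Non-atomic increment of the slot owned by the current capability.
-- An update can be lost if the thread migrates between the read and the
-- write; that is the imprecision this design accepts.
incCounter :: StripedCounter -> IO ()
incCounter (StripedCounter arr caps) = do
  tid      <- myThreadId
  (cap, _) <- threadCapability tid
  let i = (cap `mod` caps) * intsPerLine  -- guard against caps growing at runtime
  n <- readByteArray arr i :: IO Int
  writeByteArray arr i (n + 1)

-- Sum the per-capability slots one after another; concurrent increments
-- may or may not be observed, so the result is approximate.
readCounter :: StripedCounter -> IO Int
readCounter (StripedCounter arr caps) = go 0 0
  where
    go !acc cap
      | cap >= caps = return acc
      | otherwise   = do
          n <- readByteArray arr (cap * intsPerLine) :: IO Int
          go (acc + n) (cap + 1)
```

Whether threadCapability is cheap enough for a hot increment path is something the real implementation would need to measure.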

Due to this imprecision they should be displayed as imprecise counters in the UI as well.

@tibbe
Collaborator
tibbe commented Apr 8, 2014

Have a look at https://ghc.haskell.org/trac/ghc/ticket/8972#comment:1. It looks like contended IORefs scale very poorly. I will try to use, in ekg, the C implementation I gave in that ticket. It should make things up to orders of magnitude faster.

@cartazio
cartazio commented Apr 8, 2014

I can't post on Trac for some reason right now (the spam filter dislikes me), but related tickets that complement the unboxed reference idea are https://ghc.haskell.org/trac/ghc/ticket/8157 and https://ghc.haskell.org/trac/ghc/ticket/7883#comment:16

@cartazio
cartazio commented Apr 8, 2014

atomicModifyIORef has a whole busy-wait loop structure that probably adds a bunch of overhead too.

@jberryman

The fetch-and-add primop exposed in atomic-primops tends to perform very well (on x86) if you want a slightly simpler solution. Be sure to make your own counter that's padded onto its own cache line.
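For anyone following along, a minimal sketch of that approach, assuming the Data.Atomics.Counter API from atomic-primops (newCounter / incrCounter / readCounter); the cache-line padding mentioned above is up to the user and is not shown here:

```haskell
import Data.Atomics.Counter (incrCounter, newCounter, readCounter)

main :: IO ()
main = do
  c <- newCounter 0      -- a single mutable cell backed by fetch-and-add
  _ <- incrCounter 1 c   -- lock-free increment; returns the new value
  n <- readCounter c
  print n
```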

@cartazio

Yeah, cache-line-width padding is a VERY important detail.
