perf: timestampCache causes GC-related slowdown #16796

Closed
petermattis opened this issue Jun 29, 2017 · 3 comments
Labels: C-performance (Perf of queries or internals. Solution not expected to change functional behavior.)

@petermattis
Collaborator

Running a read-only workload on a single node cluster shows performance degradation from ~19k ops/sec to a steady state of ~13.5k ops/sec. Investigation eventually pointed to the timestampCache. Specifically, if timestampCache.AddRequest is disabled, steady state performance is ~18k ops/sec. The workload under investigation is:

kv --read-percent 100 --concurrency 16 --splits 100 --write-seq 100000

Interestingly, the manipulation of the timestampCache.requests btree by AddRequest doesn't seem to be the problem. If AddRequest is tweaked to add and immediately delete the request, performance is good.
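For concreteness, the tweak looks roughly like the following sketch, which assumes a google/btree-backed requests tree and illustrative types (not the actual CockroachDB code):

```go
package storage

import "github.com/google/btree"

// requestItem is an illustrative stand-in for the entries kept in
// timestampCache.requests, ordered here by an insertion sequence number.
type requestItem struct {
	seq uint64
	// span, reads, timestamp, ... elided
}

func (r *requestItem) Less(than btree.Item) bool {
	return r.seq < than.(*requestItem).seq
}

type timestampCache struct {
	requests *btree.BTree
}

// AddRequest, tweaked for the experiment: insert the request and then delete
// it immediately, so the btree is still exercised on every call but nothing
// is retained afterwards.
func (tc *timestampCache) AddRequest(item *requestItem) {
	tc.requests.ReplaceOrInsert(item)
	tc.requests.Delete(item)
}
```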

Part of the problem seems to be the btree. Replacing the btree with a fixed-size ring buffer of approximately the same size results in steady state performance of ~15k ops/sec. The ring buffer experiment isn't a drop-in replacement; more work would be required to flesh out its functionality, and that additional functionality might end up negating the modest improvement it provides.
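A minimal sketch of the ring buffer variant, assuming requests simply overwrite the oldest slot once the buffer wraps (the real replacement would also need the lookup/expansion logic mentioned above):

```go
package storage

// cacheRequest stands in for the real type; fields elided.
type cacheRequest struct {
	// span, reads, timestamp, ... elided
}

// requestRing is a fixed-size ring buffer of requests: once full, each new
// request overwrites the oldest entry, so the memory footprint stays constant
// and no per-insert tree nodes are allocated.
type requestRing struct {
	buf  []cacheRequest
	next int  // slot the next request will be written to
	full bool // true once the buffer has wrapped at least once
}

func newRequestRing(size int) *requestRing {
	return &requestRing{buf: make([]cacheRequest, size)}
}

func (r *requestRing) add(req cacheRequest) {
	r.buf[r.next] = req
	r.next++
	if r.next == len(r.buf) {
		r.next = 0
		r.full = true
	}
}
```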

Another experiment was to zero the cacheRequest before inserting it. This brought steady state performance back up to ~17k ops/sec.
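In sketch form (again with stand-in types), this experiment amounts to clearing the request before the cache retains it; that throws away the spans the cache actually needs, so it only serves to isolate the GC cost of holding them:

```go
package storage

// Stand-ins for the real types; fields and the btree are elided.
type cacheRequest struct {
	// span, reads, timestamp, ... elided
}

type timestampCache struct {
	requests []cacheRequest // stand-in for the retained requests
}

func (tc *timestampCache) AddRequest(req cacheRequest) {
	tc.requests = append(tc.requests, req)
}

// addRequestZeroed models the experiment: retain a zeroed request so the
// cache holds no heap pointers back into request memory.
func (tc *timestampCache) addRequestZeroed(req cacheRequest) {
	req = cacheRequest{}
	tc.AddRequest(req)
}
```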

My current suspicion is that the non-zeroed cacheRequests are causing additional GC pressure. Enabling GODEBUG=gctrace=1 shows:

                          GC CPU
master                    13%
disabled timestampCache    7%
zeroed cacheRequest        8%

Note that there are two fields in cacheRequest that zeroing affects with this workload: cacheRequest.span and cacheRequest.reads. I'm not sure what to do here. We need the spans in order to later expand the requests into the interval tree. It is surprising to me that merely holding on to these keys is causing additional GC CPU usage. Suggestions are welcome.
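For reference, a simplified view of the two fields in question (illustrative; the real cacheRequest has more fields). Each RSpan/Span key is backed by a []byte, so every retained request keeps several small heap allocations alive for the GC to find and trace:

```go
package storage

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// Simplified sketch of the pointer-carrying fields of cacheRequest.
type cacheRequest struct {
	span  roachpb.RSpan  // start/end key of the request: two []byte-backed keys
	reads []roachpb.Span // one span (two more []byte keys) per read in the batch
	// timestamp, txn details, ... elided
}
```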

@petermattis petermattis self-assigned this Jun 29, 2017
@bdarnell
Contributor

Try zeroing the span and reads fields independently. I would expect that zeroing span wouldn't make a difference for GC costs (since it just points to bytes), but zeroing reads would (since it's a slice containing pointers).

How much of the Go memory was used by the timestamp cache in these tests? It's usually the biggest single consumer of memory on the Go side, in which case it might not be surprising that eliminating it cuts GC costs nearly in half. If that's the case, our options are limited. We could allocate a big slab of memory, copy the keys into it, and store indexes into that slab instead of the pointers in roachpb.Span. Or we could move the whole timestamp cache into C++.
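A rough sketch of the slab idea, with illustrative names: keys are copied into one large []byte and addressed by offset/length, so the GC sees a single pointer-free buffer instead of one allocation per key.

```go
package storage

// keyRef locates a key inside the slab by offset and length instead of
// holding a pointer to its bytes.
type keyRef struct {
	off, length uint32
}

// keySlab is one large byte buffer that cached keys are copied into. The GC
// only has to trace the single buf slice, not an allocation per key.
type keySlab struct {
	buf []byte
}

// add copies key into the slab and returns a reference to it.
func (s *keySlab) add(key []byte) keyRef {
	off := uint32(len(s.buf))
	s.buf = append(s.buf, key...)
	return keyRef{off: off, length: uint32(len(key))}
}

// get returns the bytes of a previously added key.
func (s *keySlab) get(r keyRef) []byte {
	return s.buf[r.off : r.off+r.length]
}
```

Cached requests would then hold keyRefs rather than roachpb.Spans, trading a copy on insert (plus slab eviction/rotation logic) for far fewer pointers for the GC to chase.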

@petermattis
Collaborator Author
                  ops/sec   GC CPU
zero none         14909.6   12%
zero all          17503.2    8%
zero span,reads   16906.9    9%
zero reads        16705.9   10%
zero span         15345.6   12%

"zero all" corresponds to req = cacheRequest{}. Strange that doing so is different than req.span, req.reads = roachpb.RSpan{}, nil.

Go memory was in the 300-600 MB range during these tests. The timestamp cache was at capacity (i.e. using the full 64 MB), but it is a bit difficult to tell how much Go memory was being used by the timestamp cache. Presumably somewhat more than 64 MB due to the btree and other overhead that isn't accounted for. Moving the cache to C++ would get rid of the GC issues, but would introduce cgo overhead.

@petermattis
Collaborator Author

As a sanity check on the feasibility of moving the timestamp cache to C++: TimestampCache.AddRequest takes ~2us and the various TimestampCache calls in applyTimestampCache consume ~5us. The overhead of a cgo call is 60ns (i.e. 0.06us), roughly 3% of a single AddRequest, so the cgo overhead would be negligible here.
