
tscache: introduce skiplist-backed TimestampCache #19508


Conversation

nvanbenschoten
Contributor

This is still a major work in progress, but I wanted to publish it in case others want
to play around with it as well. I'm going to continue experimenting with this over the
next few days.

The change creates a new implementation of tscache.Cache that's backed by a concurrent
skiplist. It uses a fork of https://github.com/andy-kimball/tscache, which contains a similar
timestamp cache built on top of an arena-based skiplist (https://github.com/andy-kimball/arenaskl).
A couple of important modifications were needed to make the structure usable in CockroachDB; those live over in https://github.com/nvanbenschoten/tscache. The actual changes here
are minimal, amounting to a wrapper type around the library and the removal of now-unnecessary locking.

Nothing here is polished, but that wasn't the intention for now. The important part is that all tests pass with the new tscache.Cache implementation plugged in!
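For readers skimming before the code settles: the wrapper amounts to adapting the library's concurrent skiplist to the tscache.Cache interface and dropping the mutex the interval-tree implementation needed. A minimal sketch of that shape, with hypothetical type and method names (the real interface and library API differ):

```go
package tscache

// Timestamp and TxnID stand in for hlc.Timestamp and the UUID type.
type Timestamp struct {
	WallTime int64
	Logical  int32
}
type TxnID [16]byte

// intervalSkl stands in for the forked library's lock-free,
// arena-backed interval skiplist.
type intervalSkl struct{ /* arena, fixed-size pages, ... */ }

func (s *intervalSkl) AddRange(from, to []byte, ts Timestamp, id TxnID) { /* sketch */ }
func (s *intervalSkl) LookupRange(from, to []byte) (Timestamp, TxnID) {
	return Timestamp{}, TxnID{} // sketch
}

// sklImpl adapts the skiplist to the Cache interface. The skiplist
// is already safe for concurrent use, so no wrapper mutex is needed.
type sklImpl struct {
	skl *intervalSkl
}

// Add records that [from, to) was read or written at ts by txn id.
func (c *sklImpl) Add(from, to []byte, ts Timestamp, id TxnID) {
	c.skl.AddRange(from, to, ts, id)
}

// GetMax returns the maximum timestamp recorded over [from, to).
func (c *sklImpl) GetMax(from, to []byte) (Timestamp, TxnID) {
	return c.skl.LookupRange(from, to)
}
```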

Preliminary benchmarking shows impressive results. The effect of the arena allocator is obvious here. The properties of the skiplist also let us avoid the request interval map we maintain in the current code, which I predict is also contributing to the speedup. The following lists some (hastily gathered) performance diffs:

kv, 100% reads

kv --read-percent 100 --concurrency 16 --splits 100 --write-seq 100000

  • GCTrace time dropped from 7% to 0%
  • Throughput increased by 14%
  • 99th percentile latency dropped by 40-60%

kv, 80% reads

kv --read-percent 80 --concurrency 16 --splits 100 --write-seq 100000

  • Throughput increased by 11%
  • 99th percentile latency dropped by 20%

tpcc

default settings

  • Throughput increased by 30%
  • 99th percentile latency dropped by 50-60%

cc @petermattis @spencerkimball @andy-kimball

@nvanbenschoten nvanbenschoten requested a review from a team October 25, 2017 07:26
@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member
tbg commented Oct 25, 2017

Very exciting work, @nvanbenschoten! Does the defaultCacheSize parameter (64MB, which btw sounds tiny to me) allow for approximately the same number of entries in the cache, or are we evicting much earlier now? I assume you ran all benchmarks long enough that they would actually rotate the caches (since the cache supposedly gets slower as it fills up, due to how the skiplist is implemented)? It would be interesting to run the benchmarks with a large cache size such as 512MB (or even a few gigs, just to see how far we can push it). It would also be interesting to see how the low watermark rises with this cache compared to the other one.

PS I think you may have accidentally amended one of Andy's commits (Enhance benchmarks) in tscache:
✂-1


Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 6 of 6 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/storage/helpers_test.go, line 335 at r3 (raw file):

// GetTimestampCacheLowWater returns the timestamp cache low water mark.
func (r *Replica) GetTimestampCacheLowWater() hlc.Timestamp {
	// r.store.tsCacheMu.Lock()

Such a satisfying diff.


Comments from Reviewable

@petermattis
Collaborator

Very cool. Did you come up with a solution for keeping entries in the cache for MinTSCacheWindow seconds? I was thinking about this just this morning. (Haven't looked at the code yet.)

@nvanbenschoten
Contributor Author

Does the defaultCacheSize parameter (64MB, which btw sounds tiny to me) allow for approximately the same number of entries in the cache or are we evicting much earlier now?

I haven't verified this, but they should allow for roughly the same number of entries, because the cacheEntrySize we currently use for accounting covers roughly the same per-entry footprint that the skiplist's arena will (keys, hlc.Timestamp, txnID, entry overhead).

I also agree that 64MB sounds tiny. I think we'll want to size this as some fraction of available memory because that will also scale with the expected amount of load. Anyone benchmarking this may want to play around with increasing this by an order of magnitude or so.
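As a concrete (hypothetical) version of that sizing policy; the 1% fraction is made up for illustration, and only the 64MB floor comes from the current default:

```go
package tscache

// cacheSize sizes the timestamp cache as a fraction of system
// memory, with the current 64MB default acting as a floor so that
// small machines still get a usable cache.
func cacheSize(totalSysMemBytes int64) int64 {
	const fraction = 0.01  // hypothetical: 1% of system memory
	const floor = 64 << 20 // 64MB, today's defaultCacheSize
	if sz := int64(float64(totalSysMemBytes) * fraction); sz > floor {
		return sz
	}
	return floor
}
```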

Did you come up with a solution for keeping entries in the cache for MinTSCacheWindow seconds?

@spencerkimball and I were discussing this last night. I was thinking that there could be some form of a dynamic fixedCache count instead of us maintaining only two. We could then add some policy to rotateCaches to only release older fixedCaches once their maxWallTime is older than now - MinTSCacheWindow. We'd probably also want to have some minimum count that the number of fixedCaches can shrink down to (maybe just keep it at 2?).

@petermattis
Collaborator

@spencerkimball and I were discussing this last night. I was thinking that there could be some form of a dynamic fixedCache count instead of us maintaining only two. We could then add some policy to rotateCaches to only release older fixedCaches once their maxWallTime is older than now - MinTSCacheWindow. We'd probably also want to have some minimum count that the number of fixedCaches can shrink down to (maybe just keep it at 2?).

Heh, that's exactly what I was thinking. We'd have to maintain the max timestamp per fixedCache, but that seems relatively straightforward.
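Tracking that per-page maximum can be done without a lock by ratcheting it atomically on every add; a sketch, assuming a fixedCache page with a hypothetical maxWallTime field:

```go
package tscache

import "sync/atomic"

// fixedCache stands in for one of the cache's fixed-size pages.
type fixedCache struct {
	maxWallTime int64 // highest wall time ever written to this page
}

// ratchetMaxWallTime raises the page's max wall time to ts if ts is
// newer. A CAS loop makes this safe for concurrent writers.
func (c *fixedCache) ratchetMaxWallTime(ts int64) {
	for {
		cur := atomic.LoadInt64(&c.maxWallTime)
		if ts <= cur || atomic.CompareAndSwapInt64(&c.maxWallTime, cur, ts) {
			return
		}
	}
}
```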

@tbg
Member
tbg commented Oct 25, 2017

@spencerkimball and I were discussing this last night. I was thinking that there could be some form of a dynamic fixedCache count instead of us maintaining only two. We could then add some policy to rotateCaches to only release older fixedCaches once their maxWallTime is older than now - MinTSCacheWindow. We'd probably also want to have some minimum count that the number of fixedCaches can shrink down to (maybe just keep it at 2?).

How would the lookup work in that scenario? You'd check the first cache, and if you fall through to the low watermark but you'd rather have something smaller, you check the next level, until you bottom out? That doesn't seem unreasonable, given that requests near the write front wouldn't usually have to check anything but the first level.

@nvanbenschoten
Contributor Author

We'd have to maintain the max timestamp per fixedCache

You'd check the first cache, and if you fall through to the low watermark but you'd rather have something smaller, you check the next level, until you bottom out? That doesn't seem unreasonable, given that requests near the write front wouldn't usually have to check anything but the first level.

Actually, that's exactly what https://github.com/andy-kimball/tscache already does!
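For concreteness, here is the fall-through lookup tbg described, as a sketch with hypothetical names; the real library's pages and value encoding are more involved:

```go
package tscache

type Timestamp struct {
	WallTime int64
	Logical  int32
}

// page stands in for one skiplist page of the cache.
type page struct{ /* skiplist data */ }

// lookup reports the timestamp stored for key on this page, if any.
func (p *page) lookup(key []byte) (Timestamp, bool) {
	return Timestamp{}, false // sketch
}

type layeredCache struct {
	pages   []*page   // ordered newest to oldest
	floorTS Timestamp // low watermark covering all evicted history
}

// lookupMax checks pages from newest to oldest and returns the first
// hit. Requests near the write front rarely look past the first page;
// a full miss bottoms out at the low watermark.
func (c *layeredCache) lookupMax(key []byte) Timestamp {
	for _, p := range c.pages {
		if ts, ok := p.lookup(key); ok {
			return ts
		}
	}
	return c.floorTS
}
```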

@petermattis
Collaborator

Finally got around to looking at the implementation. Very neat!

It is a bit wasteful that every read/write span for a request gets a copy of the timestamp and the transaction ID, especially given that the data is immutable once it is stored in the arena. For a given request, it would be a significant space savings if we could store the timestamp/txnID once in the arena and then have each node store the offset. I think this could be done, though there are definite challenges. Certainly not worthwhile exploring in the near term.

@nvanbenschoten
Contributor Author

It is a bit wasteful that every read/write span for a request gets a copy of the timestamp and the transaction ID, especially given that the data is immutable once it is stored in the arena. For a given request, it would be a significant space savings if we could store the timestamp/txnID once in the arena and then have each node store the offset.

@petermattis that's a good idea which could certainly be done, and it may even be something we'd want to look into sooner rather than later. In the case where all nodes have a key and a gap value, it would result in a ~21% space saving (the per-node value footprint drops from 2*28 = 56 bytes to 4*4+28 = 44 bytes). The savings become larger if the offset points to the timestamp and txnID together: ~36% (2*4+28 = 36 bytes), though this comes with a few extra complications. Furthermore, I think these savings would actually start to compound as spans subsumed and ratcheted others' key and gap values to their same timestamps and txnIDs. The problem is that to take advantage of this compounding, we'd need to remove unused timestamps and txnIDs from the arena. Because the arena is immutable, this wouldn't be trivial. For now, I think we should leave this as is, but I'll add a TODO if we get far enough that this becomes something we should start to think about.
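Making the arithmetic above explicit (assuming a 12-byte hlc.Timestamp, i.e., an 8-byte wall time plus a 4-byte logical counter, and a 16-byte txnID, so a 28-byte value, with 4-byte arena offsets):

```go
package main

import "fmt"

const (
	tsSize    = 12                 // hlc.Timestamp: int64 wall + int32 logical
	txnIDSize = 16                 // UUID
	valSize   = tsSize + txnIDSize // 28 bytes per (timestamp, txnID) value
	offSize   = 4                  // 32-bit arena offset
)

func main() {
	inline := 2 * valSize // key value + gap value stored inline: 56 bytes

	// Variant 1: four offsets per node (key ts, key txnID, gap ts,
	// gap txnID) plus one shared 28-byte copy in the arena.
	v1 := 4*offSize + valSize // 44 bytes

	// Variant 2: two offsets per node, each pointing at a combined
	// (timestamp, txnID) pair, plus one shared copy.
	v2 := 2*offSize + valSize // 36 bytes

	fmt.Printf("inline=%dB  v1=%dB (%.0f%% saved)  v2=%dB (%.0f%% saved)\n",
		inline,
		v1, 100*(1-float64(v1)/float64(inline)),
		v2, 100*(1-float64(v2)/float64(inline)))
	// Output: inline=56B  v1=44B (21% saved)  v2=36B (36% saved)
}
```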

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 9, 2017
This change adds a retention policy to `intervalSkl` so that it never evicts
pages that contain intervals within a certain window of recency. This will allow
the structure to exhibit the `MinRetentionWindow` semantics that `tscache.Cache`
implementations require.

We do this by replacing the `earlier` and `later` `sklPages` with a linked list
of `sklPages` that can dynamically grow when necessary.

See cockroachdb#19508 (comment)
for more discussion.
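A sketch of the policy this commit describes, assuming each page tracks its own maxWallTime as discussed above (names hypothetical):

```go
package tscache

import (
	"container/list"
	"time"
)

// MinRetentionWindow is how long entries must remain readable.
const MinRetentionWindow = 10 * time.Second

type sklPage struct {
	maxWallTime int64 // newest wall time written to this page
}

type intervalSkl struct {
	pages    *list.List // front = newest page, back = oldest
	minPages int        // never shrink below this (e.g. 2)
}

// rotate pushes a fresh page and releases trailing pages, but only
// once every entry on them has aged out of the retention window.
func (s *intervalSkl) rotate(nowWall int64) {
	s.pages.PushFront(&sklPage{})
	cutoff := nowWall - MinRetentionWindow.Nanoseconds()
	for s.pages.Len() > s.minPages {
		back := s.pages.Back()
		if back.Value.(*sklPage).maxWallTime >= cutoff {
			break // oldest page still has recent entries; keep it
		}
		s.pages.Remove(back)
	}
}
```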
@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from cc4d267 to 2686d91 Compare November 10, 2017 21:44
@nvanbenschoten
Contributor Author

I've rebased this on top of #19974, so it's now using the intervalSkl from #19672. The two commits local to this PR are still works in progress (and not ready to be carefully reviewed), but they expose a COCKROACH_USE_SKL_TSCACHE environment variable to switch between tscache.Cache implementations.

@petermattis I did some testing with this using kv. I'm not seeing the txn retry errors anymore, so I'm thinking that we were right about them being due to one of the two race conditions addressed in #19672.
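The switch itself can be as simple as reading the variable once at construction time. A sketch; only the COCKROACH_USE_SKL_TSCACHE name comes from this PR, and the constructors stand in for the real ones (CockroachDB also has envutil helpers the real code would likely use instead of os.Getenv):

```go
package tscache

import "os"

// Cache is a stand-in for the real tscache.Cache interface.
type Cache interface{ /* Add, GetMax, ... */ }

func newTreeImpl() Cache { return nil /* sketch: interval-tree impl */ }
func newSklImpl() Cache  { return nil /* sketch: skiplist impl */ }

// New returns the skiplist-backed implementation when the
// COCKROACH_USE_SKL_TSCACHE environment variable is set, and the
// existing interval-tree implementation otherwise.
func New() Cache {
	if os.Getenv("COCKROACH_USE_SKL_TSCACHE") != "" {
		return newSklImpl()
	}
	return newTreeImpl()
}
```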

@petermattis
Collaborator

@nvanbenschoten Are you waiting on review of this or is there still more work to do here?


Review status: 0 of 11 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


Comments from Reviewable

@nvanbenschoten
Contributor Author

No, I'm not waiting on a review at the moment. Before pushing this forward, I want to improve the randomized testing we perform across Cache implementations to ensure that they always give the same results, especially under concurrent workloads.
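The shape of such a test is a randomized differential test: drive both implementations with one stream of operations and fail on any divergence. A sketch with hypothetical constructors and methods; the real version would also run operations concurrently:

```go
package tscache

import (
	"math/rand"
	"testing"
)

// randSpan returns a random span over a tiny keyspace; small
// keyspaces maximize overlap between operations, which is where
// implementations are most likely to diverge.
func randSpan(rng *rand.Rand) (from, to []byte) {
	a, b := byte(rng.Intn(26))+'a', byte(rng.Intn(26))+'a'
	if a > b {
		a, b = b, a
	}
	return []byte{a}, []byte{b + 1}
}

// TestImplsIdentical feeds an identical random stream of adds and
// lookups to both implementations. newTreeImpl, newSklImpl, Add, and
// GetMax stand in for the real Cache constructors and methods.
func TestImplsIdentical(t *testing.T) {
	rng := rand.New(rand.NewSource(1)) // fixed seed for reproducibility
	tree, skl := newTreeImpl(), newSklImpl()
	for i := 0; i < 10000; i++ {
		from, to := randSpan(rng)
		if rng.Intn(2) == 0 {
			ts := Timestamp{WallTime: rng.Int63()}
			tree.Add(from, to, ts)
			skl.Add(from, to, ts)
		} else if got, want := skl.GetMax(from, to), tree.GetMax(from, to); got != want {
			t.Fatalf("divergence on [%q, %q): skl=%v tree=%v", from, to, got, want)
		}
	}
}
```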

@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from 2686d91 to e702f49 Compare November 16, 2017 01:10
@nvanbenschoten
Contributor Author

This has been rebased off of #20087 and is ready to be reviewed once that gets resolved.

@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch 2 times, most recently from 1985b71 to 587d44d Compare November 16, 2017 05:31
@nvanbenschoten
Contributor Author

@bdarnell would it be possible to stress this (using the COCKROACH_USE_SKL_TSCACHE env var) under Jepsen for a day or two?

@bdarnell
Contributor

We're not set up to easily run multiple configurations of the jepsen tests, but you could temporarily add a commit to this PR to default that variable to true and then run a one-off build on this branch.

@cuongdo
Contributor
cuongdo commented Nov 16, 2017

@bdarnell aren't there current failures in Jepsen that might be confusing to sort through while testing this PR?

@bdarnell
Contributor

Yes, but I'm assuming we don't want to wait for all of those to be resolved before testing this change. You'll just have to look at the logs (I can help with this) to see whether a failure is an inconsistency or some other automation/glue problem (all the current problems are the latter).

@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from 8f38921 to 0860410 Compare November 16, 2017 17:35
@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from 0860410 to cf06ace Compare November 16, 2017 23:06
@nvanbenschoten nvanbenschoten requested a review from a team November 16, 2017 23:06
@nvanbenschoten
Contributor Author

This has been rebased off master and is now ready for review.

It got most of the way through Jepsen before an SSH failure. I'm re-running the two tests that didn't finish.

@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch 2 times, most recently from 9289a7a to cf53406 Compare November 17, 2017 04:38
@tbg
Member
tbg commented Nov 17, 2017

:lgtm: but remove [DNM] from commits. Exciting!


Reviewed 9 of 10 files at r5, 3 of 3 files at r6.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


build/jepsen-common.sh, line 54 at r6 (raw file):

tests=(
    #"bank"

Just leaving a comment to double-check that this is reverted before merge.


pkg/storage/tscache/cache_test.go, line 515 at r6 (raw file):

	defer leaktest.AfterTest(t)()

	// Run one subtest using a real clock to generate timestampts and

Timestampts :trollface:
Do you need to update the commit here? I think you fixed this in the other PR.
Not reviewing the rest of the test, since I've seen it already. Please let me know if anything changed.


Comments from Reviewable

This change uses the `intervalSkl` structure to create a new `sklImpl` type that
implements the `Cache` interface.
@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from cf53406 to 45ad7d5 Compare November 17, 2017 19:43
@nvanbenschoten
Contributor Author

TFTR! I think you mistook a new test as being the same as an existing one though.


Review status: 8 of 10 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/storage/tscache/cache_test.go, line 515 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Timestampts :trollface:
Do you need to update the commit here? I think you fixed this in the other PR.
Not reviewing the rest of the test, since I've seen it already. Please let me know if anything changed.

This comment was copied from TestIntervalSklConcurrentVsSequential and TestIntervalSklConcurrency, which is why it had the same spelling error (fixed now!). This is a new test though. It's testing that treeImpl and sklImpl return the same results when accessed concurrently, which is a little different from the test that makes sure the intervalSkl returns the same results when accessed sequentially and when accessed concurrently.


build/jepsen-common.sh, line 54 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Just leaving a comment to double-check that this is reverted before merge.

Done.


Comments from Reviewable

@nvanbenschoten nvanbenschoten changed the title [DNM] tscache: experiment with skiplist-backed TimestampCache tscache: introduce skiplist-backed TimestampCache Nov 17, 2017
@tbg
Member
tbg commented Nov 21, 2017

:lgtm:


Reviewed 2 of 9 files at r4, 4 of 4 files at r7.
Review status: :shipit: all files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@nvanbenschoten nvanbenschoten merged commit c21246c into cockroachdb:master Nov 22, 2017
@nvanbenschoten nvanbenschoten deleted the nvnabenschoten/skiplistTSCache branch November 22, 2017 01:55