
tscache: introduce skiplist-backed TimestampCache #19508


Conversation

nvanbenschoten
Contributor

This is still a major work in progress, but I wanted to publish it in case others want
to play around with it as well. I'm going to continue experimenting with this over the
next few days.

The change creates a new implementation of tscache.Cache that's backed by a concurrent
skiplist. It uses a fork of https://github.com/andy-kimball/tscache, which contains a similar
timestamp cache built on top of an arena-based skiplist (https://github.com/andy-kimball/arenaskl).
A couple of important modifications were needed to make the structure usable in CockroachDB; those live over in https://github.com/nvanbenschoten/tscache. The actual changes here
are minimal, amounting to a wrapper type around the library and the removal of now-unnecessary locking.

Nothing here is polished, but that wasn't the intention for now. The important part is that all tests pass with the new tscache.Cache implementation plugged in!
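For readers skimming before the code settles: the wrapper amounts to adapting the library's concurrent skiplist to the tscache.Cache interface and dropping the mutex the interval-tree implementation needed. A minimal sketch of that shape, with hypothetical type and method names (the real interface and library API differ):

```go
package tscache

// Timestamp and TxnID stand in for hlc.Timestamp and the UUID type.
type Timestamp struct {
	WallTime int64
	Logical  int32
}
type TxnID [16]byte

// intervalSkl stands in for the forked library's lock-free,
// arena-backed interval skiplist.
type intervalSkl struct{ /* arena, fixed-size pages, ... */ }

func (s *intervalSkl) AddRange(from, to []byte, ts Timestamp, id TxnID) { /* sketch */ }
func (s *intervalSkl) LookupRange(from, to []byte) (Timestamp, TxnID) {
	return Timestamp{}, TxnID{} // sketch
}

// sklImpl adapts the skiplist to the Cache interface. The skiplist
// is already safe for concurrent use, so no wrapper mutex is needed.
type sklImpl struct {
	skl *intervalSkl
}

// Add records that [from, to) was read or written at ts by txn id.
func (c *sklImpl) Add(from, to []byte, ts Timestamp, id TxnID) {
	c.skl.AddRange(from, to, ts, id)
}

// GetMax returns the maximum timestamp recorded over [from, to).
func (c *sklImpl) GetMax(from, to []byte) (Timestamp, TxnID) {
	return c.skl.LookupRange(from, to)
}
```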

Preliminary benchmarking shows impressive results. The effect of the arena allocator is obvious here. The properties of the skiplist also let us avoid the request interval map we maintain in the current code, which I predict is also contributing to the speedup. The following lists some (hastily gathered) performance diffs:

kv, 100% reads

kv --read-percent 100 --concurrency 16 --splits 100 --write-seq 100000

  • GCTrace time dropped from 7% to 0%
  • Throughput increased by 14%
  • 99th percentile latency dropped by 40-60%

kv, 80% reads

kv --read-percent 80 --concurrency 16 --splits 100 --write-seq 100000

  • Throughput increased by 11%
  • 99th percentile latency dropped by 20%

tpcc

default settings

  • Throughput increased by 30%
  • 99th percentile latency dropped by 50-60%

cc @petermattis @spencerkimball @andy-kimball

@nvanbenschoten nvanbenschoten requested a review from a team October 25, 2017 07:26
@cockroach-teamcity
Member

This change is Reviewable

@tbg
Member
tbg commented Oct 25, 2017

Very exciting work, @nvanbenschoten! Does the defaultCacheSize parameter (64MB, which btw sounds tiny to me) allow for approximately the same number of entries in the cache, or are we evicting much earlier now? I assume you ran all benchmarks long enough that they would actually rotate the caches (since the cache supposedly gets slower as it fills up, due to how the skiplist is implemented)? It would be interesting to run the benchmarks with a large cache size such as 512MB (or even a few gigs, just to see how far we can push it). It would also be interesting to see how the low watermark rises with this cache compared to the other one.

PS I think you may have accidentally amended one of Andy's commits (Enhance benchmarks) in tscache:
✂-1


Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 6 of 6 files at r3.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


pkg/storage/helpers_test.go, line 335 at r3 (raw file):

// GetTimestampCacheLowWater returns the timestamp cache low water mark.
func (r *Replica) GetTimestampCacheLowWater() hlc.Timestamp {
	// r.store.tsCacheMu.Lock()

Such a satisfying diff.


Comments from Reviewable

@petermattis
Collaborator

Very cool. Did you come up with a solution for keeping entries in the cache for MinTSCacheWindow seconds? I was thinking about this just this morning. (Haven't looked at the code yet.)

@nvanbenschoten
Contributor Author

Does the defaultCacheSize parameter (64MB, which btw sounds tiny to me) allow for approximately the same number of entries in the cache or are we evicting much earlier now?

I haven't verified this, but they should allow for roughly the same number of entries, because the cacheEntrySize we currently use for accounting covers roughly the same per-entry footprint that the skiplist's arena will (keys, hlc.Timestamp, txnID, entry overhead).

I also agree that 64MB sounds tiny. I think we'll want to size this as some fraction of available memory because that will also scale with the expected amount of load. Anyone benchmarking this may want to play around with increasing this by an order of magnitude or so.
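As a concrete (hypothetical) version of that sizing policy; the 1% fraction is made up for illustration, and only the 64MB floor comes from the current default:

```go
package tscache

// cacheSize sizes the timestamp cache as a fraction of system
// memory, with the current 64MB default acting as a floor so that
// small machines still get a usable cache.
func cacheSize(totalSysMemBytes int64) int64 {
	const fraction = 0.01  // hypothetical: 1% of system memory
	const floor = 64 << 20 // 64MB, today's defaultCacheSize
	if sz := int64(float64(totalSysMemBytes) * fraction); sz > floor {
		return sz
	}
	return floor
}
```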

Did you come up with a solution for keeping entries in the cache for MinTSCacheWindow seconds?

@spencerkimball and I were discussing this last night. I was thinking that there could be some form of a dynamic fixedCache count instead of us maintaining only two. We could then add some policy to rotateCaches to only release older fixedCaches once their maxWallTime is older than now - MinTSCacheWindow. We'd probably also want to have some minimum count that the number of fixedCaches can shrink down to (maybe just keep it at 2?).

@petermattis
Collaborator

@spencerkimball and I were discussing this last night. I was thinking that there could be some form of a dynamic fixedCache count instead of us maintaining only two. We could then add some policy to rotateCaches to only release older fixedCaches once their maxWallTime is older than now - MinTSCacheWindow. We'd probably also want to have some minimum count that the number of fixedCaches can shrink down to (maybe just keep it at 2?).

Heh, that's exactly what I was thinking. We'd have to maintain the max timestamp per fixedCache, but that seems relatively straightforward.
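Tracking that per-page maximum can be done without a lock by ratcheting it atomically on every add; a sketch, assuming a fixedCache page with a hypothetical maxWallTime field:

```go
package tscache

import "sync/atomic"

// fixedCache stands in for one of the cache's fixed-size pages.
type fixedCache struct {
	maxWallTime int64 // highest wall time ever written to this page
}

// ratchetMaxWallTime raises the page's max wall time to ts if ts is
// newer. A CAS loop makes this safe for concurrent writers.
func (c *fixedCache) ratchetMaxWallTime(ts int64) {
	for {
		cur := atomic.LoadInt64(&c.maxWallTime)
		if ts <= cur || atomic.CompareAndSwapInt64(&c.maxWallTime, cur, ts) {
			return
		}
	}
}
```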

@tbg
Member
tbg commented Oct 25, 2017

@spencerkimball and I were discussing this last night. I was thinking that there could be some form of a dynamic fixedCache count instead of us maintaining only two. We could then add some policy to rotateCaches to only release older fixedCaches once their maxWallTime is older than now - MinTSCacheWindow. We'd probably also want to have some minimum count that the number of fixedCaches can shrink down to (maybe just keep it at 2?).

How would the lookup work in that scenario? You'd check the first cache, and if you fall through to the low watermark but you'd rather have something smaller, you check the next level, until you bottom out? That doesn't seem unreasonable, given that requests near the write front wouldn't usually have to check anything but the first level.

@nvanbenschoten
Contributor Author

We'd have to maintain the max timestamp per fixedCache

You'd check the first cache, and if you fall through to the low watermark but you'd rather have something smaller, you check the next level, until you bottom out? That doesn't seem unreasonable, given that requests near the write front wouldn't usually have to check anything but the first level.

Actually, that's exactly what https://github.com/andy-kimball/tscache already does!
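For concreteness, here is the fall-through lookup tbg described, as a sketch with hypothetical names; the real library's pages and value encoding are more involved:

```go
package tscache

type Timestamp struct {
	WallTime int64
	Logical  int32
}

// page stands in for one skiplist page of the cache.
type page struct{ /* skiplist data */ }

// lookup reports the timestamp stored for key on this page, if any.
func (p *page) lookup(key []byte) (Timestamp, bool) {
	return Timestamp{}, false // sketch
}

type layeredCache struct {
	pages   []*page   // ordered newest to oldest
	floorTS Timestamp // low watermark covering all evicted history
}

// lookupMax checks pages from newest to oldest and returns the first
// hit. Requests near the write front rarely look past the first page;
// a full miss bottoms out at the low watermark.
func (c *layeredCache) lookupMax(key []byte) Timestamp {
	for _, p := range c.pages {
		if ts, ok := p.lookup(key); ok {
			return ts
		}
	}
	return c.floorTS
}
```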

@petermattis
Collaborator

Finally got around to looking at the implementation. Very neat!

It is a bit wasteful that every read/write span for a request gets a copy of the timestamp and the transaction ID, especially given that the data is immutable once it is stored in the arena. For a given request, it would be a significant space savings if we could store the timestamp/txnID once in the arena and then have each node store the offset. I think this could be done, though there are definite challenges. Certainly not worthwhile exploring in the near term.

@nvanbenschoten
Contributor Author

It is a bit wasteful that every read/write span for a request gets a copy of the timestamp and the transaction ID, especially given that the data is immutable once it is stored in the arena. For a given request, it would be a significant space savings if we could store the timestamp/txnID once in the arena and then have each node store the offset.

@petermattis that's a good idea which could certainly be done, and it may even be something we'd want to look into sooner rather than later. In the case where all nodes have a key and a gap value, it would result in a ~21% space saving (the per-node value footprint drops from 2*28 = 56 bytes to 4*4+28 = 44 bytes). The savings become larger if the offset points to the timestamp and txnID together: ~36% (2*4+28 = 36 bytes), though this comes with a few extra complications. Furthermore, I think these savings would actually start to compound as spans subsumed and ratcheted others' key and gap values to their same timestamps and txnIDs. The problem is that to take advantage of this compounding, we'd need to remove unused timestamps and txnIDs from the arena. Because the arena is immutable, this wouldn't be trivial. For now, I think we should leave this as is, but I'll add a TODO if we get far enough that this becomes something we should start to think about.
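Making the arithmetic above explicit (assuming a 12-byte hlc.Timestamp, i.e., an 8-byte wall time plus a 4-byte logical counter, and a 16-byte txnID, so a 28-byte value, with 4-byte arena offsets):

```go
package main

import "fmt"

const (
	tsSize    = 12                 // hlc.Timestamp: int64 wall + int32 logical
	txnIDSize = 16                 // UUID
	valSize   = tsSize + txnIDSize // 28 bytes per (timestamp, txnID) value
	offSize   = 4                  // 32-bit arena offset
)

func main() {
	inline := 2 * valSize // key value + gap value stored inline: 56 bytes

	// Variant 1: four offsets per node (key ts, key txnID, gap ts,
	// gap txnID) plus one shared 28-byte copy in the arena.
	v1 := 4*offSize + valSize // 44 bytes

	// Variant 2: two offsets per node, each pointing at a combined
	// (timestamp, txnID) pair, plus one shared copy.
	v2 := 2*offSize + valSize // 36 bytes

	fmt.Printf("inline=%dB  v1=%dB (%.0f%% saved)  v2=%dB (%.0f%% saved)\n",
		inline,
		v1, 100*(1-float64(v1)/float64(inline)),
		v2, 100*(1-float64(v2)/float64(inline)))
	// Output: inline=56B  v1=44B (21% saved)  v2=36B (36% saved)
}
```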

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 9, 2017
This change adds a retention policy to `intervalSkl` so that it never evicts
pages that contain intervals within a certain window of recency. This will allow
the structure to exhibit the `MinRetentionWindow` semantics that `tscache.Cache`
implementations require.

We do this by replacing the `earlier` and `later` `sklPages` with a linked list
of `sklPages` that can dynamically grow when necessary.

See cockroachdb#19508 (comment)
for more discussion.
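A sketch of the policy this commit describes, assuming each page tracks its own maxWallTime as discussed above (names hypothetical):

```go
package tscache

import (
	"container/list"
	"time"
)

// MinRetentionWindow is how long entries must remain readable.
const MinRetentionWindow = 10 * time.Second

type sklPage struct {
	maxWallTime int64 // newest wall time written to this page
}

type intervalSkl struct {
	pages    *list.List // front = newest page, back = oldest
	minPages int        // never shrink below this (e.g. 2)
}

// rotate pushes a fresh page and releases trailing pages, but only
// once every entry on them has aged out of the retention window.
func (s *intervalSkl) rotate(nowWall int64) {
	s.pages.PushFront(&sklPage{})
	cutoff := nowWall - MinRetentionWindow.Nanoseconds()
	for s.pages.Len() > s.minPages {
		back := s.pages.Back()
		if back.Value.(*sklPage).maxWallTime >= cutoff {
			break // oldest page still has recent entries; keep it
		}
		s.pages.Remove(back)
	}
}
```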
@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from cc4d267 to 2686d91 Compare November 10, 2017 21:44
@nvanbenschoten
Contributor Author

I've rebased this on top of #19974, so it's now using the intervalSkl from #19672. The two commits local to this PR are still works in progress (and not ready to be carefully reviewed), but they expose a COCKROACH_USE_SKL_TSCACHE environment variable to switch between tscache.Cache implementations.

@petermattis I did some testing with this using kv. I'm not seeing the txn retry errors anymore, so I'm thinking that we were right about them being due to one of the two race conditions addressed in #19672.
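The switch itself can be as simple as reading the variable once at construction time. A sketch; only the COCKROACH_USE_SKL_TSCACHE name comes from this PR, and the constructors stand in for the real ones (CockroachDB also has envutil helpers the real code would likely use instead of os.Getenv):

```go
package tscache

import "os"

// Cache is a stand-in for the real tscache.Cache interface.
type Cache interface{ /* Add, GetMax, ... */ }

func newTreeImpl() Cache { return nil /* sketch: interval-tree impl */ }
func newSklImpl() Cache  { return nil /* sketch: skiplist impl */ }

// New returns the skiplist-backed implementation when the
// COCKROACH_USE_SKL_TSCACHE environment variable is set, and the
// existing interval-tree implementation otherwise.
func New() Cache {
	if os.Getenv("COCKROACH_USE_SKL_TSCACHE") != "" {
		return newSklImpl()
	}
	return newTreeImpl()
}
```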

@petermattis
Collaborator

@nvanbenschoten Are you waiting on review of this or is there still more work to do here?


Review status: 0 of 11 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


Comments from Reviewable

@nvanbenschoten
Contributor Author

No, I'm not waiting on a review at the moment. Before pushing this forward, I want to improve the randomized testing we perform across Cache implementations to ensure that they always give the same results, especially under concurrent workloads.
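The shape of such a test is a randomized differential test: drive both implementations with one stream of operations and fail on any divergence. A sketch with hypothetical constructors and methods; the real version would also run operations concurrently:

```go
package tscache

import (
	"math/rand"
	"testing"
)

// randSpan returns a random span over a tiny keyspace; small
// keyspaces maximize overlap between operations, which is where
// implementations are most likely to diverge.
func randSpan(rng *rand.Rand) (from, to []byte) {
	a, b := byte(rng.Intn(26))+'a', byte(rng.Intn(26))+'a'
	if a > b {
		a, b = b, a
	}
	return []byte{a}, []byte{b + 1}
}

// TestImplsIdentical feeds an identical random stream of adds and
// lookups to both implementations. newTreeImpl, newSklImpl, Add, and
// GetMax stand in for the real Cache constructors and methods.
func TestImplsIdentical(t *testing.T) {
	rng := rand.New(rand.NewSource(1)) // fixed seed for reproducibility
	tree, skl := newTreeImpl(), newSklImpl()
	for i := 0; i < 10000; i++ {
		from, to := randSpan(rng)
		if rng.Intn(2) == 0 {
			ts := Timestamp{WallTime: rng.Int63()}
			tree.Add(from, to, ts)
			skl.Add(from, to, ts)
		} else if got, want := skl.GetMax(from, to), tree.GetMax(from, to); got != want {
			t.Fatalf("divergence on [%q, %q): skl=%v tree=%v", from, to, got, want)
		}
	}
}
```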

@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from 2686d91 to e702f49 Compare November 16, 2017 01:10
@nvanbenschoten
Contributor Author

This has been rebased off of #20087 and is ready to be reviewed once that gets resolved.

@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch 2 times, most recently from 1985b71 to 587d44d Compare November 16, 2017 05:31
@nvanbenschoten
Contributor Author

@bdarnell would it be possible to stress this (using the COCKROACH_USE_SKL_TSCACHE env var) under Jepsen for a day or two?

@bdarnell
Contributor

We're not set up to easily run multiple configurations of the jepsen tests, but you could temporarily add a commit to this PR to default that variable to true and then run a one-off build on this branch.

@cuongdo
Contributor
cuongdo commented Nov 16, 2017

@bdarnell aren't there current failures in Jepsen that might be confusing to sort through while testing this PR?

@bdarnell
Contributor

Yes, but I'm assuming we don't want to wait for all of those to be resolved before testing this change. You'll just have to look at the logs (I can help with this) to see whether a failure is an inconsistency or some other automation/glue problem (all the current problems are the latter).

@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from 8f38921 to 0860410 Compare November 16, 2017 17:35
@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from 0860410 to cf06ace Compare November 16, 2017 23:06
@nvanbenschoten nvanbenschoten requested a review from a team November 16, 2017 23:06
@nvanbenschoten
Contributor Author

This has been rebased off master and is now ready for review.

It got most of the way through Jepsen before an SSH failure. I'm re-running the two tests that didn't finish.

@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch 2 times, most recently from 9289a7a to cf53406 Compare November 17, 2017 04:38
@tbg
Member
tbg commented Nov 17, 2017

:lgtm: but remove [DNM] from commits. Exciting!


Reviewed 9 of 10 files at r5, 3 of 3 files at r6.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


build/jepsen-common.sh, line 54 at r6 (raw file):

tests=(
    #"bank"

Just leaving a comment to double-check that this is reverted before merge.


pkg/storage/tscache/cache_test.go, line 515 at r6 (raw file):

	defer leaktest.AfterTest(t)()

	// Run one subtest using a real clock to generate timestampts and

Timestampts :trollface:
Do you need to update the commit here? I think you fixed this in the other PR.
Not reviewing the rest of the test, since I've seen it already. Please let me know if anything changed.


Comments from Reviewable

This change uses the `intervalSkl` structure to create a new `sklImpl` type that
implements the `Cache` interface.
@nvanbenschoten nvanbenschoten force-pushed the nvnabenschoten/skiplistTSCache branch from cf53406 to 45ad7d5 Compare November 17, 2017 19:43
@nvanbenschoten
Contributor Author

TFTR! I think you mistook a new test as being the same as an existing one though.


Review status: 8 of 10 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/storage/tscache/cache_test.go, line 515 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Timestampts :trollface:
Do you need to update the commit here? I think you fixed this in the other PR.
Not reviewing the rest of the test, since I've seen it already. Please let me know if anything changed.

This comment was copied from TestIntervalSklConcurrentVsSequential and TestIntervalSklConcurrency, which is why it had the same spelling error (fixed now!). This is a new test though. It's testing that treeImpl and sklImpl return the same results when accessed concurrently, which is a little different from the test that makes sure the intervalSkl returns the same results when accessed sequentially and when accessed concurrently.


build/jepsen-common.sh, line 54 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Just leaving a comment to double-check that this is reverted before merge.

Done.


Comments from Reviewable

@nvanbenschoten nvanbenschoten changed the title [DNM] tscache: experiment with skiplist-backed TimestampCache tscache: introduce skiplist-backed TimestampCache Nov 17, 2017
@tbg
Member
tbg commented Nov 21, 2017

:lgtm:


Reviewed 2 of 9 files at r4, 4 of 4 files at r7.
Review status: :shipit: all files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@nvanbenschoten nvanbenschoten merged commit c21246c into cockroachdb:master Nov 22, 2017
@nvanbenschoten nvanbenschoten deleted the nvnabenschoten/skiplistTSCache branch November 22, 2017 01:55