[WIP] Remote Write V2 Gogoproto Replacement with Vtprotobuf #16026
Conversation
…; adjust genproto script and generate the protos. no other changes - will not compile
… vtprotobuf optimizations yet
Still TODO at this point:
Update: PProf for CPU Time on
Thanks, I would say let's keep the patch IF it improves things (can we measure without this patch?). Looking very quickly at one profile you gave us, I noticed 3 things:
Thanks @bwplotka! Will take another round at this when I am back from vacation.

Re: Pooling - Pooling timeseries and samples performed better on these benchmarks than vtprotobuf without any pooling. I will evaluate those again individually, as well as see if prombench shows any different characteristics.

Re: Measuring Allocs - `BuildWriteRequestV2` shows slower CPU time even with 0 allocs in most of the cases. Will chase down the time spent in `Size` more here, plus the effects of the LabelsRef patch optimization.
Hello from the bug-scrub! @francoposa do you think you will come back to this?
@bboreham I think, based on the conversation in CNCF Slack and my own research on the Mimir side, that there is little confidence we can ever really get the performance we want without the non-nullable fields - either from gogoproto, or from figuring something else out / creating a separate generator for Prometheus' use cases.
Addresses #11908
Scope
This PR migrates the Remote Write V2 protos to use `vtprotobuf` in place of `gogoproto`.

I have made an effort to keep the changeset as small as possible here, just to prove out approximately equal performance to gogoproto.
Performance Benchmarks
These benchmark results look a bit different from the ones in another PR about using vtprotobuf, but I believe they are more accurate.

TL;DR: accurate benchmarks require far more iterations than Go will choose by default.

Do not let Go choose its own benchtime iterations for these benchmarks. Letting Go choose the iteration count, or setting benchtime to some low value, will show wildly different results than forcing the benchmarks to run longer. Only at very low iteration counts do we see vtprotobuf showing large improvements over the main branch.

For CPU time, the differences between main-branch and vtprotobuf performance shrink drastically as we extend the benchmark duration.

For memory allocations, the various forms of object and buffer pooling in these codepaths mean the allocation results never stabilize: they continually decrease as pool reuse is averaged out over more iterations, so we have to be sure to compare similar iteration counts across branches.
Method
All benchmarks were run for 20 seconds using the script collapsed below.
Results
I have included results from:
- b74cebf6 (main)
- 88a1da27 (HEAD)
- 46c65285 (HEAD~1)
I included benchmarks for two different commits, as the most recent commit added pooled object usage for exemplars, histograms, and their bucketspans - this appears to show slightly worse performance for some benchmark results.
I still need to try to diff the flamegraphs, but my assumption here is that with the addition of the latest commit, we pay the price of running through teardown & reset processes for those object pools, but because the benchmark does not utilize histograms and exemplars in the timeseries, we never get the chance to measure the advantage of having the pooled objects in the first place.
benchout-main-b74cebf6-20s.txt
benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt
benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt
Benchstat comparison across all three is collapsed below
...and uploaded as a file:
benchstat-main-vs-46c65285-vs-88a1da27-20s.txt
Interpretation
The obvious concerns here are CPU time for `BuildV2WriteRequest` and, to a lesser degree, allocations for `SampleSend`.

The counterpoint to those concerns would be that, despite the statistical significance, there may not be practical significance (e.g. does a <1KiB difference in B/op matter for `SampleSend`?).

A big TODO could be to diff the CPU flame graphs for `BuildV2WriteRequest` and see if there are any obvious ways to improve this branch's performance.

Things That Did Not Help (or made performance worse)
- `google.golang.org/grpc/mem`'s `tieredBufferPool`
- `UnmarshalVTUnsafe` in place of `UnmarshalVT`
- `writev2.Request` object