[WIP] Remote Write V2 Gogoproto Replacement with Vtprotobuf by francoposa · Pull Request #16026 · prometheus/prometheus · GitHub

[WIP] Remote Write V2 Gogoproto Replacement with Vtprotobuf #16026


Status: Open — wants to merge 7 commits into base: main
Conversation

@francoposa commented Feb 12, 2025

Addresses #11908

Scope

This PR migrates Remote Write V2 protos to use vtprotobuf in place of gogoproto.
I have kept the changeset as small as possible here, just enough to prove out approximately equal performance to gogoproto.

Performance Benchmarks

These benchmark results look a bit different from the ones in another PR about using vtprotobuf, but I believe they are more accurate.

TL;DR: accurate benchmarks require far more iterations than Go will choose by default — do not let Go pick its own benchtime iterations for these benchmarks.

Letting Go choose the benchmark iterations, or setting benchtime to some low value, shows wildly different results than forcing the benchmarks to run longer. Only at very low iteration counts does vtprotobuf show large improvements over the main branch.
For CPU time, the differences between main-branch and vtprotobuf performance shrink drastically as the benchmark duration is extended.
For memory allocations, because these codepaths use various forms of object and buffer pooling, the allocation results never stabilize: they continually decrease as pool reuse is averaged out over more iterations, so we must compare runs with similar iteration counts across branches.

Method

All benchmarks were run for 20 seconds using the script below
#!/usr/bin/env bash

set -Eeuxo pipefail

if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <count> <output_file>"
  exit 1
fi

COUNT=$1
OUTPUT_FILE=$2

go test \
  -v \
  -test.bench='^(BenchmarkSampleSend|BenchmarkStoreSeries|BenchmarkBuildWriteRequest|BenchmarkBuildV2WriteRequest|BenchmarkBuildTimeSeries|BenchmarkStreamReadEndpoint|BenchmarkRemoteWriteHandler)$' \
  -test.run='^$' \
  -test.benchmem \
  -test.benchtime=20s \
  -test.timeout=120m \
  -test.count="$COUNT" \
  ./... | tee -a "$OUTPUT_FILE"

Results

I have included results from:

  1. the main branch at the current head, b74cebf6
  2. this branch at its current head, 88a1da27
  3. this branch at HEAD~1, 46c65285

I included benchmarks for two different commits because the most recent commit added pooled object usage for exemplars, histograms, and their bucket spans, which appears to slightly worsen some benchmark results.
I still need to diff the flamegraphs, but my assumption is that with the latest commit we pay the cost of running the teardown and reset paths for those object pools, while the benchmark's timeseries contain no histograms or exemplars, so we never get the chance to measure the advantage of having the pooled objects in the first place.
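The pooling pattern in question looks roughly like the sketch below (the type and field names are hypothetical stand-ins, not the actual prompb types): objects come from a sync.Pool, are reset to zero length while keeping slice capacity, and are returned after the request is built. The reset loop is the teardown cost described above — it runs even when the pooled slices were never used.

```go
package main

import (
	"fmt"
	"sync"
)

// TimeSeries is a hypothetical stand-in for the generated writev2 type.
type TimeSeries struct {
	LabelsRefs []uint32
	Samples    []float64
}

// Reset clears the object for reuse while keeping the underlying
// slice capacity, which is the whole point of pooling.
func (ts *TimeSeries) Reset() {
	ts.LabelsRefs = ts.LabelsRefs[:0]
	ts.Samples = ts.Samples[:0]
}

var tsPool = sync.Pool{
	New: func() any { return &TimeSeries{} },
}

// buildAndRelease simulates one build/teardown cycle and reports the
// capacity of the pooled slice; capacity survives Reset, so a reused
// object can often be filled without reallocating.
func buildAndRelease(n int) int {
	ts := tsPool.Get().(*TimeSeries)
	for i := 0; i < n; i++ {
		ts.Samples = append(ts.Samples, float64(i))
	}
	c := cap(ts.Samples)
	ts.Reset()
	tsPool.Put(ts)
	return c
}

func main() {
	fmt.Println(buildAndRelease(100) > 0) // true
}
```

In a benchmark whose timeseries carry no exemplars or histograms, only the Reset/Put overhead for those fields is measured, never the saved allocations.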
benchout-main-b74cebf6-20s.txt
benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt
benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt

Benchstat comparison across all three is shown below

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 9 PRO 6950H with Radeon Graphics     
                                             │ benchout-main-b74cebf6-20s.txt │ benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt │ benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt │
                                             │             sec/op             │              sec/op                vs base                │                         sec/op                          vs base                │
SampleSend/prometheus.WriteRequest-16                             4.683m ± 2%                         4.619m ± 1%   -1.38% (p=0.001 n=10)                                              4.598m ± 2%   -1.82% (p=0.009 n=10)
SampleSend/io.prometheus.write.v2.Request-16                      2.881m ± 4%                         2.975m ± 2%   +3.27% (p=0.011 n=10)                                              2.992m ± 1%   +3.83% (p=0.015 n=10)
StoreSeries/plain-16                                              219.8µ ± 8%                         214.9µ ± 0%   -2.21% (p=0.000 n=10)                                              216.7µ ± 1%   -1.39% (p=0.000 n=10)
StoreSeries/externalLabels-16                                     897.7µ ± 1%                         884.7µ ± 6%   -1.45% (p=0.015 n=10)                                              878.7µ ± 1%   -2.11% (p=0.000 n=10)
StoreSeries/relabel-16                                            730.4µ ± 7%                         758.3µ ± 5%        ~ (p=0.436 n=10)                                              718.7µ ± 0%   -1.59% (p=0.000 n=10)
StoreSeries/externalLabels+relabel-16                             1.203m ± 1%                         1.189m ± 0%   -1.15% (p=0.000 n=10)                                              1.207m ± 2%        ~ (p=0.912 n=10)
BuildWriteRequest/2_instances-16                                  19.71µ ± 3%                         19.14µ ± 5%        ~ (p=0.075 n=10)                                              19.88µ ± 0%        ~ (p=0.165 n=10)
BuildWriteRequest/10_instances-16                                 92.56µ ± 3%                         92.51µ ± 1%        ~ (p=0.529 n=10)                                              94.97µ ± 0%   +2.61% (p=0.023 n=10)
BuildWriteRequest/1k_instances-16                                 950.0µ ± 2%                         931.6µ ± 1%   -1.94% (p=0.000 n=10)                                              971.1µ ± 1%   +2.22% (p=0.002 n=10)
BuildV2WriteRequest/2_instances-16                                16.35µ ± 2%                         17.59µ ± 1%   +7.60% (p=0.000 n=10)                                              19.85µ ± 1%  +21.43% (p=0.000 n=10)
BuildV2WriteRequest/10_instances-16                               77.85µ ± 3%                         84.92µ ± 2%   +9.07% (p=0.000 n=10)                                              94.17µ ± 1%  +20.97% (p=0.000 n=10)
BuildV2WriteRequest/1k_instances-16                               746.0µ ± 1%                         846.1µ ± 1%  +13.42% (p=0.000 n=10)                                              925.6µ ± 1%  +24.07% (p=0.000 n=10)
BuildTimeSeries-16                                                1.967m ± 1%                         2.051m ± 3%   +4.27% (p=0.000 n=10)                                              1.964m ± 0%        ~ (p=0.190 n=10)
StreamReadEndpoint-16                                             27.44µ ± 1%                         27.90µ ± 2%   +1.69% (p=0.001 n=10)                                              27.81µ ± 3%   +1.35% (p=0.003 n=10)
RemoteWriteHandler-16                                             7.851µ ± 1%                         7.780µ ± 1%        ~ (p=0.051 n=10)                                              7.767µ ± 5%        ~ (p=0.404 n=10)
geomean                                                           259.5µ                              264.6µ        +1.98%                                                             270.6µ        +4.28%

                                             │ benchout-main-b74cebf6-20s.txt │ benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt │ benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt │
                                             │              B/op              │               B/op                vs base                 │                         B/op                          vs base                  │
SampleSend/prometheus.WriteRequest-16                          42.20Ki ± 8%                         42.53Ki ± 6%       ~ (p=0.684 n=10)                                             42.97Ki ± 6%        ~ (p=0.853 n=10)
SampleSend/io.prometheus.write.v2.Request-16                   11.38Ki ± 1%                         11.65Ki ± 5%  +2.40% (p=0.043 n=10)                                             12.31Ki ± 4%   +8.16% (p=0.000 n=10)
StoreSeries/plain-16                                           272.0Ki ± 0%                         272.0Ki ± 0%  -0.01% (p=0.000 n=10)                                             272.0Ki ± 0%   -0.01% (p=0.000 n=10)
StoreSeries/externalLabels-16                                  897.7Ki ± 0%                         897.7Ki ± 0%  -0.00% (p=0.001 n=10)                                             897.7Ki ± 0%   -0.00% (p=0.000 n=10)
StoreSeries/relabel-16                                         773.0Ki ± 0%                         773.0Ki ± 0%       ~ (p=0.645 n=10)                                             773.0Ki ± 0%   -0.01% (p=0.000 n=10)
StoreSeries/externalLabels+relabel-16                          866.9Ki ± 0%                         866.9Ki ± 0%  -0.01% (p=0.000 n=10)                                             866.9Ki ± 0%   -0.00% (p=0.033 n=10)
BuildWriteRequest/2_instances-16                                 80.00 ± 0%                           80.00 ± 0%       ~ (p=1.000 n=10) ¹                                             80.00 ± 0%        ~ (p=1.000 n=10) ¹
BuildWriteRequest/10_instances-16                                82.00 ± 0%                           82.00 ± 0%       ~ (p=1.000 n=10) ¹                                             82.00 ± 0%        ~ (p=1.000 n=10) ¹
BuildWriteRequest/1k_instances-16                                332.5 ± 1%                           329.0 ± 2%  -1.05% (p=0.005 n=10)                                               338.5 ± 0%   +1.80% (p=0.001 n=10)
BuildV2WriteRequest/2_instances-16                               0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/10_instances-16                              0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/1k_instances-16                              56.50 ± 3%                           60.50 ± 9%  +7.08% (p=0.000 n=10)                                               71.50 ± 1%  +26.55% (p=0.000 n=10)
BuildTimeSeries-16                                             2.528Mi ± 0%                         2.528Mi ± 0%  -0.00% (p=0.000 n=10)                                             2.528Mi ± 0%   -0.00% (p=0.000 n=10)
StreamReadEndpoint-16                                          23.68Ki ± 0%                         23.67Ki ± 0%       ~ (p=0.053 n=10)                                             23.67Ki ± 0%   -0.00% (p=0.001 n=10)
RemoteWriteHandler-16                                          2.985Ki ± 0%                         2.985Ki ± 0%       ~ (p=1.000 n=10)                                             2.985Ki ± 0%        ~ (p=0.536 n=10)
geomean                                                                     ²                                     +0.60%                ²                                                          +2.36%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                             │ benchout-main-b74cebf6-20s.txt │ benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt │ benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt │
                                             │           allocs/op            │            allocs/op              vs base                 │                      allocs/op                        vs base                  │
SampleSend/prometheus.WriteRequest-16                            81.50 ± 4%                           89.00 ± 4%  +9.20% (p=0.000 n=10)                                               90.00 ± 4%  +10.43% (p=0.000 n=10)
SampleSend/io.prometheus.write.v2.Request-16                     88.00 ± 1%                           95.50 ± 5%  +8.52% (p=0.000 n=10)                                              101.00 ± 4%  +14.77% (p=0.000 n=10)
StoreSeries/plain-16                                             606.0 ± 0%                           606.0 ± 0%       ~ (p=1.000 n=10) ¹                                             606.0 ± 0%        ~ (p=1.000 n=10) ¹
StoreSeries/externalLabels-16                                   1.606k ± 0%                          1.606k ± 0%       ~ (p=1.000 n=10) ¹                                            1.606k ± 0%        ~ (p=1.000 n=10) ¹
StoreSeries/relabel-16                                          4.607k ± 0%                          4.607k ± 0%       ~ (p=1.000 n=10)                                              4.607k ± 0%        ~ (p=1.000 n=10)
StoreSeries/externalLabels+relabel-16                           4.607k ± 0%                          4.607k ± 0%       ~ (p=1.000 n=10) ¹                                            4.607k ± 0%        ~ (p=0.474 n=10)
BuildWriteRequest/2_instances-16                                 1.000 ± 0%                           1.000 ± 0%       ~ (p=1.000 n=10) ¹                                             1.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildWriteRequest/10_instances-16                                1.000 ± 0%                           1.000 ± 0%       ~ (p=1.000 n=10) ¹                                             1.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildWriteRequest/1k_instances-16                                1.000 ± 0%                           1.000 ± 0%       ~ (p=1.000 n=10) ¹                                             1.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/2_instances-16                               0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/10_instances-16                              0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/1k_instances-16                              0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildTimeSeries-16                                              39.75k ± 0%                          39.75k ± 0%  -0.00% (p=0.000 n=10)                                              39.75k ± 0%   -0.00% (p=0.000 n=10)
StreamReadEndpoint-16                                            235.0 ± 0%                           235.0 ± 0%       ~ (p=1.000 n=10) ¹                                             235.0 ± 0%        ~ (p=1.000 n=10) ¹
RemoteWriteHandler-16                                            34.00 ± 0%                           34.00 ± 0%       ~ (p=1.000 n=10) ¹                                             34.00 ± 0%        ~ (p=1.000 n=10) ¹
geomean                                                                     ²                                     +1.14%                ²                                                          +1.59%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                    │ benchout-main-b74cebf6-20s.txt │ benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt │ benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt │
                                    │       compressedSize/op        │        compressedSize/op          vs base                 │                   compressedSize/op                    vs base                 │
BuildWriteRequest/2_instances-16                         1.933k ± 0%                        1.933k ± 0%       ~ (p=1.000 n=10) ¹                                             1.933k ± 0%       ~ (p=1.000 n=10) ¹
BuildWriteRequest/10_instances-16                        8.156k ± 0%                        8.156k ± 0%       ~ (p=1.000 n=10) ¹                                             8.156k ± 0%       ~ (p=1.000 n=10) ¹
BuildWriteRequest/1k_instances-16                        78.22k ± 0%                        78.22k ± 0%       ~ (p=1.000 n=10) ¹                                             78.22k ± 0%       ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/2_instances-16                       1.517k ± 0%                        1.517k ± 0%       ~ (p=1.000 n=10) ¹                                             1.517k ± 0%       ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/10_instances-16                      5.338k ± 0%                        5.338k ± 0%       ~ (p=1.000 n=10) ¹                                             5.338k ± 0%       ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/1k_instances-16                      48.80k ± 0%                        48.80k ± 0%       ~ (p=1.000 n=10) ¹                                             48.80k ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                  8.871k                             8.871k       +0.00%                                                              8.871k       +0.00%
¹ all samples are equal

...and uploaded as a file:
benchstat-main-vs-46c65285-vs-88a1da27-20s.txt

Interpretation

The obvious concerns here are CPU time for BuildV2WriteRequest and, to a lesser degree, allocations for SampleSend.
The counterpoint to those concerns is that despite the statistical significance, there may be no practical significance (e.g., does a <1KiB difference in B/op matter for SampleSend?).

A big TODO could be to diff the CPU flame graphs for BuildV2WriteRequest and see if there are any obvious ways to improve this branch's performance.

Things That Did Not Help

(or made performance worse)

  • some brief naive attempts at more advanced buffer pooling, such as using google.golang.org/grpc/mem's tieredBufferPool
  • using UnmarshalVTUnsafe in place of UnmarshalVT.
  • applying pooling to the top-level writev2.Request object
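For context on the UnmarshalVTUnsafe point: the unsafe variant avoids copies by aliasing string fields to the input buffer rather than allocating, roughly as in this standalone sketch (an illustration of the trade-off, not the generated code itself). The aliased form is only safe if the buffer outlives the message and is never reused, which is exactly the property buffer pooling violates.

```go
package main

import (
	"fmt"
	"unsafe"
)

// copyString is what UnmarshalVT effectively does for string fields:
// an allocation plus a copy, after which the buffer may be reused freely.
func copyString(b []byte) string {
	return string(b)
}

// aliasString is what UnmarshalVTUnsafe effectively does: no allocation,
// but the string shares memory with b and is corrupted if b is mutated.
func aliasString(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	buf := []byte("series_name")
	copied := copyString(buf)
	aliased := aliasString(buf)

	buf[0] = 'X' // simulate the caller reusing the buffer

	fmt.Println(copied)  // series_name
	fmt.Println(aliased) // Xeries_name
}
```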

@francoposa francoposa changed the title Remote Write V2 Gogoproto Replacement with Vtprotobuf [WIP] Remote Write V2 Gogoproto Replacement with Vtprotobuf Feb 12, 2025
@francoposa (Author) commented Feb 12, 2025

Still TODO at this point:

  • Fix up TestDecodeWriteV2Request
  • Adjust protobuf build for buf.build
  • Adjust protobuf build & CI stuff to use newest protoc (29.x)
  • Investigate CPU Time diff on BenchmarkBuildV2WriteRequest
  • Some more extensive benchmarking?

@francoposa (Author) commented Feb 12, 2025

Update: PProf for CPU Time on BenchmarkBuildV2WriteRequest

The profiles on each branch are zipped up here:
CPU-profile-BenchmarkBuildV2WriteRequest.zip

Some of the difference seems to be in encoding LabelsRefs, as using the vtprotobuf code without any manual patching lost this optimization from prompb/io/prometheus/write/v2/custom.go:

if len(m.LabelsRefs) > 0 {
		// This is the trick: encode the varints in reverse order to make it easier
		// to do it in place. Then reverse the whole thing.

It would be a question for the maintainers whether continuing to maintain this patch is worth it.


Applying this patch:

diff --git a/prompb/io/prometheus/write/v2/types_vtproto.pb.go b/prompb/io/prometheus/write/v2/types_vtproto.pb.go
index ec1f5cf0b..38024ae9a 100644
--- a/prompb/io/prometheus/write/v2/types_vtproto.pb.go
+++ b/prompb/io/prometheus/write/v2/types_vtproto.pb.go
@@ -12,6 +12,7 @@ import (
 	protoimpl "google.golang.org/protobuf/runtime/protoimpl"
 	io "io"
 	math "math"
+	"slices"
 	sync "sync"
 	unsafe "unsafe"
 )
@@ -411,22 +412,21 @@ func (m *TimeSeries) MarshalToSizedBufferVT(dAtA []byte) (int, error) {
 		}
 	}
 	if len(m.LabelsRefs) > 0 {
-		var pksize2 int
-		for _, num := range m.LabelsRefs {
-			pksize2 += protohelpers.SizeOfVarint(uint64(num))
-		}
-		i -= pksize2
-		j1 := i
+		var j10 int
+		start := i
 		for _, num := range m.LabelsRefs {
 			for num >= 1<<7 {
-				dAtA[j1] = uint8(uint64(num)&0x7f | 0x80)
+				dAtA[i-1] = uint8(uint64(num)&0x7f | 0x80)
 				num >>= 7
-				j1++
+				i--
+				j10++
 			}
-			dAtA[j1] = uint8(num)
-			j1++
+			dAtA[i-1] = uint8(num)
+			i--
+			j10++
 		}
-		i = protohelpers.EncodeVarint(dAtA, i, uint64(pksize2))
+		slices.Reverse(dAtA[i:start])
+		i = protohelpers.EncodeVarint(dAtA, i, uint64(j10))
 		i--
 		dAtA[i] = 0xa
 	}

Results with LabelsRef Patch

Running BenchmarkBuildV2WriteRequest with this patch yields slightly better benchstat results, but the change does not get us all the way back to baseline.
Benchstat results here use main as the baseline vs. this branch's current head @ 88a1da27 + the above patch:

BuildV2WriteRequest/2_instances-16     16.35µ ± 2%                18.65µ ± 1%  +14.04% (p=0.000 n=10)
BuildV2WriteRequest/10_instances-16    77.85µ ± 3%                89.64µ ± 1%  +15.14% (p=0.000 n=10)
BuildV2WriteRequest/1k_instances-16    746.0µ ± 1%                892.0µ ± 2%  +19.57% (p=0.000 n=10)

@bwplotka (Member):

Running BenchmarkBuildV2WriteRequest results in the slightly better benchstat results, but that change does not get us all the way back to baseline.
Benchstat results here use main as the baseline vs. this branch's current head @ 88a1da2 + the above patch:

Thanks, I would say let's keep the patch IF it improves things (can we measure without this patch?)

Looking very quickly at one profile you gave us, I noticed 3 things:

  • Why is pooling taking so much time? Using sync pools for everything in Remote Write might be too much — it might add overhead with no value (we would need concurrent requests being built right after release; is that even realistic? Remote Write requests are made every ~few seconds at minimum). I would turn off all pooling and slowly add it back in separate PRs. Pooling in Go is tricky (I personally rarely use it).
  • Why is Size taking so much time?
  • I would measure allocations. Optimizing a large number of allocations impacts CPU time a lot (alloc time, and GC time, which is not visible in macrobenchmarks).

[attached screenshot: CPU profile flame graph]

@francoposa (Author):

Thanks @bwplotka! I will take another round at this when I am back from vacation.

Re: Pooling - Pooling timeseries and samples performed better on these benchmarks than vtprotobuf without any pooling. I will evaluate those again individually, as well as see if prombench shows any different characteristics.

Re: Measuring Allocs - BuildWriteRequestV2 shows slower CPU time even with 0 allocs in most of the cases. Will chase down the time spent in Size more here, plus the effects of the LabelsRef patch optimization.

@github-actions github-actions bot added the stale label Apr 15, 2025
@bboreham (Member):

Hello from the bug-scrub! @francoposa do you think you will come back to this?

@francoposa (Author):

Hello from the bug-scrub! @francoposa do you think you will come back to this?

@bboreham Based on the conversation in CNCF Slack and my own research on the Mimir side, I think there is little confidence that we can ever really get the performance we want without the non-nullable fields, whether from gogoproto or from figuring something else out, such as a separate generator for Prometheus' use cases.

@github-actions github-actions bot removed the stale label Apr 27, 2025
@github-actions github-actions bot added the stale label Jun 28, 2025