[WIP] Remote Write V2 Gogoproto Replacement with Vtprotobuf by francoposa · Pull Request #16026 · prometheus/prometheus · GitHub

[WIP] Remote Write V2 Gogoproto Replacement with Vtprotobuf #16026


Status: Open — wants to merge 7 commits into base: main
Conversation

@francoposa commented Feb 12, 2025

Addresses #11908

Scope

This PR migrates Remote Write V2 protos to use vtprotobuf in place of gogoproto.
I have kept the changeset as small as possible here, just enough to prove out approximately equal performance to gogoproto.

Performance Benchmarks

These benchmark results look a bit different from the ones in another PR about using vtprotobuf, but I believe they are more accurate.

TL;DR: accurate benchmarks require far more iterations than Go will choose by default — do not let Go pick its own benchtime iterations for these benchmarks.

Letting Go choose the benchmark iterations, or setting benchtime to some low value, shows wildly different results than forcing the benchmarks to run longer. Only at very low iteration counts does vtprotobuf show large improvements over the main branch.
For CPU time, the differences between main-branch and vtprotobuf performance shrink drastically as the benchmark duration is extended.
For memory allocations, because these codepaths use various forms of object and buffer pooling, the allocation results never stabilize: they continually decrease as pool reuse is averaged out over more iterations, so we must compare runs with similar iteration counts across branches.

Method

All benchmarks were run for 20 seconds using the script below
#!/usr/bin/env bash

set -Eeuxo pipefail

if [ "$#" -ne 2 ]; then
  echo "Usage: $0 <count> <output_file>"
  exit 1
fi

COUNT=$1
OUTPUT_FILE=$2

go test \
  -v \
  -test.bench='^(BenchmarkSampleSend|BenchmarkStoreSeries|BenchmarkBuildWriteRequest|BenchmarkBuildV2WriteRequest|BenchmarkBuildTimeSeries|BenchmarkStreamReadEndpoint|BenchmarkRemoteWriteHandler)$' \
  -test.run='^$' \
  -test.benchmem \
  -test.benchtime=20s \
  -test.timeout=120m \
  -test.count="$COUNT" \
  ./... | tee -a "$OUTPUT_FILE"

Results

I have included results from:

  1. the main branch at the current head, b74cebf6
  2. this branch at its current head, 88a1da27
  3. this branch at HEAD~1, 46c65285

I included benchmarks for two different commits because the most recent commit added pooled object usage for exemplars, histograms, and their bucket spans, which appears to slightly worsen some benchmark results.
I still need to diff the flamegraphs, but my assumption is that with the latest commit we pay the cost of running the teardown and reset paths for those object pools, while the benchmark's timeseries contain no histograms or exemplars, so we never get the chance to measure the advantage of having the pooled objects in the first place.
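The pooling pattern in question looks roughly like the sketch below (the type and field names are hypothetical stand-ins, not the actual prompb types): objects come from a sync.Pool, are reset to zero length while keeping slice capacity, and are returned after the request is built. The reset loop is the teardown cost described above — it runs even when the pooled slices were never used.

```go
package main

import (
	"fmt"
	"sync"
)

// TimeSeries is a hypothetical stand-in for the generated writev2 type.
type TimeSeries struct {
	LabelsRefs []uint32
	Samples    []float64
}

// Reset clears the object for reuse while keeping the underlying
// slice capacity, which is the whole point of pooling.
func (ts *TimeSeries) Reset() {
	ts.LabelsRefs = ts.LabelsRefs[:0]
	ts.Samples = ts.Samples[:0]
}

var tsPool = sync.Pool{
	New: func() any { return &TimeSeries{} },
}

// buildAndRelease simulates one build/teardown cycle and reports the
// capacity of the pooled slice; capacity survives Reset, so a reused
// object can often be filled without reallocating.
func buildAndRelease(n int) int {
	ts := tsPool.Get().(*TimeSeries)
	for i := 0; i < n; i++ {
		ts.Samples = append(ts.Samples, float64(i))
	}
	c := cap(ts.Samples)
	ts.Reset()
	tsPool.Put(ts)
	return c
}

func main() {
	fmt.Println(buildAndRelease(100) > 0) // true
}
```

In a benchmark whose timeseries carry no exemplars or histograms, only the Reset/Put overhead for those fields is measured, never the saved allocations.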
benchout-main-b74cebf6-20s.txt
benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt
benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt

Benchstat comparison across all three is shown below

goos: linux
goarch: amd64
pkg: github.com/prometheus/prometheus/storage/remote
cpu: AMD Ryzen 9 PRO 6950H with Radeon Graphics     
                                             │ benchout-main-b74cebf6-20s.txt │ benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt │ benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt │
                                             │             sec/op             │              sec/op                vs base                │                         sec/op                          vs base                │
SampleSend/prometheus.WriteRequest-16                             4.683m ± 2%                         4.619m ± 1%   -1.38% (p=0.001 n=10)                                              4.598m ± 2%   -1.82% (p=0.009 n=10)
SampleSend/io.prometheus.write.v2.Request-16                      2.881m ± 4%                         2.975m ± 2%   +3.27% (p=0.011 n=10)                                              2.992m ± 1%   +3.83% (p=0.015 n=10)
StoreSeries/plain-16                                              219.8µ ± 8%                         214.9µ ± 0%   -2.21% (p=0.000 n=10)                                              216.7µ ± 1%   -1.39% (p=0.000 n=10)
StoreSeries/externalLabels-16                                     897.7µ ± 1%                         884.7µ ± 6%   -1.45% (p=0.015 n=10)                                              878.7µ ± 1%   -2.11% (p=0.000 n=10)
StoreSeries/relabel-16                                            730.4µ ± 7%                         758.3µ ± 5%        ~ (p=0.436 n=10)                                              718.7µ ± 0%   -1.59% (p=0.000 n=10)
StoreSeries/externalLabels+relabel-16                             1.203m ± 1%                         1.189m ± 0%   -1.15% (p=0.000 n=10)                                              1.207m ± 2%        ~ (p=0.912 n=10)
BuildWriteRequest/2_instances-16                                  19.71µ ± 3%                         19.14µ ± 5%        ~ (p=0.075 n=10)                                              19.88µ ± 0%        ~ (p=0.165 n=10)
BuildWriteRequest/10_instances-16                                 92.56µ ± 3%                         92.51µ ± 1%        ~ (p=0.529 n=10)                                              94.97µ ± 0%   +2.61% (p=0.023 n=10)
BuildWriteRequest/1k_instances-16                                 950.0µ ± 2%                         931.6µ ± 1%   -1.94% (p=0.000 n=10)                                              971.1µ ± 1%   +2.22% (p=0.002 n=10)
BuildV2WriteRequest/2_instances-16                                16.35µ ± 2%                         17.59µ ± 1%   +7.60% (p=0.000 n=10)                                              19.85µ ± 1%  +21.43% (p=0.000 n=10)
BuildV2WriteRequest/10_instances-16                               77.85µ ± 3%                         84.92µ ± 2%   +9.07% (p=0.000 n=10)                                              94.17µ ± 1%  +20.97% (p=0.000 n=10)
BuildV2WriteRequest/1k_instances-16                               746.0µ ± 1%                         846.1µ ± 1%  +13.42% (p=0.000 n=10)                                              925.6µ ± 1%  +24.07% (p=0.000 n=10)
BuildTimeSeries-16                                                1.967m ± 1%                         2.051m ± 3%   +4.27% (p=0.000 n=10)                                              1.964m ± 0%        ~ (p=0.190 n=10)
StreamReadEndpoint-16                                             27.44µ ± 1%                         27.90µ ± 2%   +1.69% (p=0.001 n=10)                                              27.81µ ± 3%   +1.35% (p=0.003 n=10)
RemoteWriteHandler-16                                             7.851µ ± 1%                         7.780µ ± 1%        ~ (p=0.051 n=10)                                              7.767µ ± 5%        ~ (p=0.404 n=10)
geomean                                                           259.5µ                              264.6µ        +1.98%                                                             270.6µ        +4.28%

                                             │ benchout-main-b74cebf6-20s.txt │ benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt │ benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt │
                                             │              B/op              │               B/op                vs base                 │                         B/op                          vs base                  │
SampleSend/prometheus.WriteRequest-16                          42.20Ki ± 8%                         42.53Ki ± 6%       ~ (p=0.684 n=10)                                             42.97Ki ± 6%        ~ (p=0.853 n=10)
SampleSend/io.prometheus.write.v2.Request-16                   11.38Ki ± 1%                         11.65Ki ± 5%  +2.40% (p=0.043 n=10)                                             12.31Ki ± 4%   +8.16% (p=0.000 n=10)
StoreSeries/plain-16                                           272.0Ki ± 0%                         272.0Ki ± 0%  -0.01% (p=0.000 n=10)                                             272.0Ki ± 0%   -0.01% (p=0.000 n=10)
StoreSeries/externalLabels-16                                  897.7Ki ± 0%                         897.7Ki ± 0%  -0.00% (p=0.001 n=10)                                             897.7Ki ± 0%   -0.00% (p=0.000 n=10)
StoreSeries/relabel-16                                         773.0Ki ± 0%                         773.0Ki ± 0%       ~ (p=0.645 n=10)                                             773.0Ki ± 0%   -0.01% (p=0.000 n=10)
StoreSeries/externalLabels+relabel-16                          866.9Ki ± 0%                         866.9Ki ± 0%  -0.01% (p=0.000 n=10)                                             866.9Ki ± 0%   -0.00% (p=0.033 n=10)
BuildWriteRequest/2_instances-16                                 80.00 ± 0%                           80.00 ± 0%       ~ (p=1.000 n=10) ¹                                             80.00 ± 0%        ~ (p=1.000 n=10) ¹
BuildWriteRequest/10_instances-16                                82.00 ± 0%                           82.00 ± 0%       ~ (p=1.000 n=10) ¹                                             82.00 ± 0%        ~ (p=1.000 n=10) ¹
BuildWriteRequest/1k_instances-16                                332.5 ± 1%                           329.0 ± 2%  -1.05% (p=0.005 n=10)                                               338.5 ± 0%   +1.80% (p=0.001 n=10)
BuildV2WriteRequest/2_instances-16                               0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/10_instances-16                              0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/1k_instances-16                              56.50 ± 3%                           60.50 ± 9%  +7.08% (p=0.000 n=10)                                               71.50 ± 1%  +26.55% (p=0.000 n=10)
BuildTimeSeries-16                                             2.528Mi ± 0%                         2.528Mi ± 0%  -0.00% (p=0.000 n=10)                                             2.528Mi ± 0%   -0.00% (p=0.000 n=10)
StreamReadEndpoint-16                                          23.68Ki ± 0%                         23.67Ki ± 0%       ~ (p=0.053 n=10)                                             23.67Ki ± 0%   -0.00% (p=0.001 n=10)
RemoteWriteHandler-16                                          2.985Ki ± 0%                         2.985Ki ± 0%       ~ (p=1.000 n=10)                                             2.985Ki ± 0%        ~ (p=0.536 n=10)
geomean                                                                     ²                                     +0.60%                ²                                                          +2.36%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                             │ benchout-main-b74cebf6-20s.txt │ benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt │ benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt │
                                             │           allocs/op            │            allocs/op              vs base                 │                      allocs/op                        vs base                  │
SampleSend/prometheus.WriteRequest-16                            81.50 ± 4%                           89.00 ± 4%  +9.20% (p=0.000 n=10)                                               90.00 ± 4%  +10.43% (p=0.000 n=10)
SampleSend/io.prometheus.write.v2.Request-16                     88.00 ± 1%                           95.50 ± 5%  +8.52% (p=0.000 n=10)                                              101.00 ± 4%  +14.77% (p=0.000 n=10)
StoreSeries/plain-16                                             606.0 ± 0%                           606.0 ± 0%       ~ (p=1.000 n=10) ¹                                             606.0 ± 0%        ~ (p=1.000 n=10) ¹
StoreSeries/externalLabels-16                                   1.606k ± 0%                          1.606k ± 0%       ~ (p=1.000 n=10) ¹                                            1.606k ± 0%        ~ (p=1.000 n=10) ¹
StoreSeries/relabel-16                                          4.607k ± 0%                          4.607k ± 0%       ~ (p=1.000 n=10)                                              4.607k ± 0%        ~ (p=1.000 n=10)
StoreSeries/externalLabels+relabel-16                           4.607k ± 0%                          4.607k ± 0%       ~ (p=1.000 n=10) ¹                                            4.607k ± 0%        ~ (p=0.474 n=10)
BuildWriteRequest/2_instances-16                                 1.000 ± 0%                           1.000 ± 0%       ~ (p=1.000 n=10) ¹                                             1.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildWriteRequest/10_instances-16                                1.000 ± 0%                           1.000 ± 0%       ~ (p=1.000 n=10) ¹                                             1.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildWriteRequest/1k_instances-16                                1.000 ± 0%                           1.000 ± 0%       ~ (p=1.000 n=10) ¹                                             1.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/2_instances-16                               0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/10_instances-16                              0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/1k_instances-16                              0.000 ± 0%                           0.000 ± 0%       ~ (p=1.000 n=10) ¹                                             0.000 ± 0%        ~ (p=1.000 n=10) ¹
BuildTimeSeries-16                                              39.75k ± 0%                          39.75k ± 0%  -0.00% (p=0.000 n=10)                                              39.75k ± 0%   -0.00% (p=0.000 n=10)
StreamReadEndpoint-16                                            235.0 ± 0%                           235.0 ± 0%       ~ (p=1.000 n=10) ¹                                             235.0 ± 0%        ~ (p=1.000 n=10) ¹
RemoteWriteHandler-16                                            34.00 ± 0%                           34.00 ± 0%       ~ (p=1.000 n=10) ¹                                             34.00 ± 0%        ~ (p=1.000 n=10) ¹
geomean                                                                     ²                                     +1.14%                ²                                                          +1.59%                ²
¹ all samples are equal
² summaries must be >0 to compute geomean

                                    │ benchout-main-b74cebf6-20s.txt │ benchout-vtproto-pool-timeseries-samples-46c65285-20s.txt │ benchout-vtproto-pool-timeseries-samples-exemplars-histograms-88a1da27-20s.txt │
                                    │       compressedSize/op        │        compressedSize/op          vs base                 │                   compressedSize/op                    vs base                 │
BuildWriteRequest/2_instances-16                         1.933k ± 0%                        1.933k ± 0%       ~ (p=1.000 n=10) ¹                                             1.933k ± 0%       ~ (p=1.000 n=10) ¹
BuildWriteRequest/10_instances-16                        8.156k ± 0%                        8.156k ± 0%       ~ (p=1.000 n=10) ¹                                             8.156k ± 0%       ~ (p=1.000 n=10) ¹
BuildWriteRequest/1k_instances-16                        78.22k ± 0%                        78.22k ± 0%       ~ (p=1.000 n=10) ¹                                             78.22k ± 0%       ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/2_instances-16                       1.517k ± 0%                        1.517k ± 0%       ~ (p=1.000 n=10) ¹                                             1.517k ± 0%       ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/10_instances-16                      5.338k ± 0%                        5.338k ± 0%       ~ (p=1.000 n=10) ¹                                             5.338k ± 0%       ~ (p=1.000 n=10) ¹
BuildV2WriteRequest/1k_instances-16                      48.80k ± 0%                        48.80k ± 0%       ~ (p=1.000 n=10) ¹                                             48.80k ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                  8.871k                             8.871k       +0.00%                                                              8.871k       +0.00%
¹ all samples are equal

...and uploaded as a file:
benchstat-main-vs-46c65285-vs-88a1da27-20s.txt

Interpretation

The obvious concerns here are CPU time for BuildV2WriteRequest and, to a lesser degree, allocations for SampleSend.
The counterpoint to those concerns is that despite the statistical significance, there may be no practical significance (e.g., does a <1KiB difference in B/op matter for SampleSend?).

A big TODO could be to diff the CPU flame graphs for BuildV2WriteRequest and see if there are any obvious ways to improve this branch's performance.

Things That Did Not Help

(or made performance worse)

  • some brief naive attempts at more advanced buffer pooling, such as using google.golang.org/grpc/mem's tieredBufferPool
  • using UnmarshalVTUnsafe in place of UnmarshalVT.
  • applying pooling to the top-level writev2.Request object
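For context on the UnmarshalVTUnsafe point: the unsafe variant avoids copies by aliasing string fields to the input buffer rather than allocating, roughly as in this standalone sketch (an illustration of the trade-off, not the generated code itself). The aliased form is only safe if the buffer outlives the message and is never reused, which is exactly the property buffer pooling violates.

```go
package main

import (
	"fmt"
	"unsafe"
)

// copyString is what UnmarshalVT effectively does for string fields:
// an allocation plus a copy, after which the buffer may be reused freely.
func copyString(b []byte) string {
	return string(b)
}

// aliasString is what UnmarshalVTUnsafe effectively does: no allocation,
// but the string shares memory with b and is corrupted if b is mutated.
func aliasString(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	buf := []byte("series_name")
	copied := copyString(buf)
	aliased := aliasString(buf)

	buf[0] = 'X' // simulate the caller reusing the buffer

	fmt.Println(copied)  // series_name
	fmt.Println(aliased) // Xeries_name
}
```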

@francoposa francoposa changed the title Remote Write V2 Gogoproto Replacement with Vtprotobuf [WIP] Remote Write V2 Gogoproto Replacement with Vtprotobuf Feb 12, 2025
@francoposa (Author) commented Feb 12, 2025

Still TODO at this point:

  • Fix up TestDecodeWriteV2Request
  • Adjust protobuf build for buf.build
  • Adjust protobuf build & CI stuff to use newest protoc (29.x)
  • Investigate CPU Time diff on BenchmarkBuildV2WriteRequest
  • Some more extensive benchmarking?

@francoposa (Author) commented Feb 12, 2025

Update: PProf for CPU Time on BenchmarkBuildV2WriteRequest

The profiles on each branch are zipped up here:
CPU-profile-BenchmarkBuildV2WriteRequest.zip

Some of the difference seems to be in encoding LabelsRefs, as using the vtprotobuf code without any manual patching lost this optimization from prompb/io/prometheus/write/v2/custom.go:

if len(m.LabelsRefs) > 0 {
		// This is the trick: encode the varints in reverse order to make it easier
		// to do it in place. Then reverse the whole thing.

It would be a question for the maintainers whether continuing to maintain this patch is worth it.


Applying this patch:

diff --git a/prompb/io/prometheus/write/v2/types_vtproto.pb.go b/prompb/io/prometheus/write/v2/types_vtproto.pb.go
index ec1f5cf0b..38024ae9a 100644
--- a/prompb/io/prometheus/write/v2/types_vtproto.pb.go
+++ b/prompb/io/prometheus/write/v2/types_vtproto.pb.go
@@ -12,6 +12,7 @@ import (
 	protoimpl "google.golang.org/protobuf/runtime/protoimpl"
 	io "io"
 	math "math"
+	"slices"
 	sync "sync"
 	unsafe "unsafe"
 )
@@ -411,22 +412,21 @@ func (m *TimeSeries) MarshalToSizedBufferVT(dAtA []byte) (int, error) {
 		}
 	}
 	if len(m.LabelsRefs) > 0 {
-		var pksize2 int
-		for _, num := range m.LabelsRefs {
-			pksize2 += protohelpers.SizeOfVarint(uint64(num))
-		}
-		i -= pksize2
-		j1 := i
+		var j10 int
+		start := i
 		for _, num := range m.LabelsRefs {
 			for num >= 1<<7 {
-				dAtA[j1] = uint8(uint64(num)&0x7f | 0x80)
+				dAtA[i-1] = uint8(uint64(num)&0x7f | 0x80)
 				num >>= 7
-				j1++
+				i--
+				j10++
 			}
-			dAtA[j1] = uint8(num)
-			j1++
+			dAtA[i-1] = uint8(num)
+			i--
+			j10++
 		}
-		i = protohelpers.EncodeVarint(dAtA, i, uint64(pksize2))
+		slices.Reverse(dAtA[i:start])
+		i = protohelpers.EncodeVarint(dAtA, i, uint64(j10))
 		i--
 		dAtA[i] = 0xa
 	}

Results with LabelsRef Patch

Running BenchmarkBuildV2WriteRequest with this patch yields slightly better benchstat results, but the change does not get us all the way back to baseline.
Benchstat results here use main as the baseline vs. this branch's current head @ 88a1da27 + the above patch:

BuildV2WriteRequest/2_instances-16     16.35µ ± 2%                18.65µ ± 1%  +14.04% (p=0.000 n=10)
BuildV2WriteRequest/10_instances-16    77.85µ ± 3%                89.64µ ± 1%  +15.14% (p=0.000 n=10)
BuildV2WriteRequest/1k_instances-16    746.0µ ± 1%                892.0µ ± 2%  +19.57% (p=0.000 n=10)

@bwplotka (Member):

Running BenchmarkBuildV2WriteRequest results in the slightly better benchstat results, but that change does not get us all the way back to baseline.
Benchstat results here use main as the baseline vs. this branch's current head @ 88a1da2 + the above patch:

Thanks, I would say let's keep the patch IF it improves things (can we measure without this patch?)

Looking very quickly at one profile you gave us, I noticed 3 things:

  • Why is pooling taking so much time? Using sync pools for everything in Remote Write might be too much — it might add overhead with no value (we would need concurrent requests being built right after release; is that even realistic? Remote Write requests are made every ~few seconds at minimum). I would turn off all pooling and slowly add it back in separate PRs. Pooling in Go is tricky (I personally rarely use it).
  • Why is Size taking so much time?
  • I would measure allocations. Optimizing a large number of allocations impacts CPU time a lot (alloc time, and GC time, which is not visible in macrobenchmarks).

[attached screenshot: CPU profile flame graph]

@francoposa (Author):

Thanks @bwplotka! I will take another round at this when I am back from vacation.

Re: Pooling - Pooling timeseries and samples performed better on these benchmarks than vtprotobuf without any pooling. I will evaluate those again individually, as well as see if prombench shows any different characteristics.

Re: Measuring Allocs - BuildWriteRequestV2 shows slower CPU time even with 0 allocs in most of the cases. Will chase down the time spent in Size more here, plus the effects of the LabelsRef patch optimization.

@github-actions github-actions bot added the stale label Apr 15, 2025
@bboreham (Member):

Hello from the bug-scrub! @francoposa do you think you will come back to this?

@francoposa (Author):

Hello from the bug-scrub! @francoposa do you think you will come back to this?

@bboreham Based on the conversation in CNCF Slack and my own research on the Mimir side, I think there is little confidence that we can ever really get the performance we want without the non-nullable fields, whether from gogoproto or from figuring something else out, such as a separate generator for Prometheus' use cases.

@github-actions github-actions bot removed the stale label Apr 27, 2025
@github-actions github-actions bot added the stale label Jun 28, 2025