POC: new native histogram extrapolation idea by krajorama · Pull Request #16791 · prometheus/prometheus · GitHub

POC: new native histogram extrapolation idea #16791


Draft · wants to merge 1 commit into base: main
Conversation

krajorama (Member)

Compensate for under-/over-estimating the left-hand side (start side) of buckets.
If the bucket-level interpolation is inside the duration to start, then extrapolate only to this extra duration on the left. Otherwise, if the duration to the overall zero (derived from the overall count) is inside the bucket's zero time, then the bucket's estimate might be too high; calculate the increase from zero through the start time to get the increase from the start to the first sample.

Related to #15976 and #16192 .

For now it does not handle counter resets between the first and second samples, nor the case where the rate calculation needs to match schemas.
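A condensed, runnable sketch of how the left and right factors combine, assembled from the code fragments quoted in the review comments on this PR. The function name, signature, and example values are illustrative only, not the PR's actual code.

```go
package main

import "fmt"

// extrapolatedValue sketches how a per-bucket result is scaled by a
// left-side factor (limited to the duration the bucket is known to
// have grown before the first sample) plus the usual right-side
// extrapolation factor. Variable names mirror the review fragments;
// the exact wiring in the PR may differ.
func extrapolatedValue(resultValue, bucketDurationToStart, durationToEnd,
	extrapolateToInterval, sampledInterval float64) float64 {
	// Left side: only extrapolate over the duration this bucket is
	// known to have been growing towards the range start.
	factorLeft := bucketDurationToStart / sampledInterval
	// Right side: the usual extrapolation past the last sample.
	factorRight := (extrapolateToInterval + durationToEnd) / sampledInterval
	return resultValue * (factorLeft + factorRight)
}

func main() {
	// Arbitrary example: a 100s sampled interval, extrapolating 25s past
	// the last sample and crediting 25s of growth before the first.
	fmt.Println(extrapolatedValue(100, 25, 25, 100, 100)) // 150
}
```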


Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@beorn7 (Member) left a comment


A few minor remarks and some real thoughts. Still working on this…

extrapolateToIntervalRight := extrapolateToInterval + durationToEnd
factorRight := extrapolateToIntervalRight / sampledInterval

factor := (extrapolateToInterval + durationToStart + durationToEnd) / sampledInterval

Suggested change
factor := (extrapolateToInterval + durationToStart + durationToEnd) / sampledInterval
factor := (durationToStart + extrapolateToIntervalRight) / sampledInterval

Simpler and easier to understand IMHO.
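Since extrapolateToIntervalRight is defined above as extrapolateToInterval + durationToEnd, the suggested expression is algebraically identical to the original; a quick check with arbitrary sample values:

```go
package main

import "fmt"

func main() {
	// Arbitrary sample values for the durations involved.
	extrapolateToInterval, durationToStart, durationToEnd, sampledInterval := 100.0, 10.0, 20.0, 100.0
	extrapolateToIntervalRight := extrapolateToInterval + durationToEnd

	// Original expression vs. the suggested rewrite.
	original := (extrapolateToInterval + durationToStart + durationToEnd) / sampledInterval
	suggested := (durationToStart + extrapolateToIntervalRight) / sampledInterval
	fmt.Println(original == suggested) // true
}
```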

Comment on lines +235 to +236
resultHistogram.Count *= factor
resultHistogram.Sum *= factor

This could just as well be first or last, right? (I'm confused that it sits between the zero bucket and the regular buckets.)

// zero.
factorLeft := bucketDurationToStart / sampledInterval
if isRate {
factorRight /= ms.Range.Seconds()

Suggested change
factorRight /= ms.Range.Seconds()
factorLeft /= ms.Range.Seconds()

(not exercised in the test)

} else {
resultHistogram.Mul(factor)

extrapolateBucket := func(firstValue, resultValue float64) float64 {

So here is my fundamental concern: This does change individual buckets without changing the count. However, with the test case below (even after my proposed change), it all seems to cancel out so that in the end, the sum of the buckets is still equal to the count. I will try a bit to see if I can find an example where that doesn't work, or even better, to formally prove that this works out generally and not just by coincidence.

if isRate {
factorRight /= ms.Range.Seconds()
}
return resultValue*factorLeft + resultValue*factorRight

Suggested change
return resultValue*factorLeft + resultValue*factorRight
return resultValue * (factorLeft + factorRight)

I find this easier to parse (and less to calculate ;).

load 1m
metric {{schema:0 count:15.0 sum:25.0 buckets:[5 10]}} {{schema:0 count:2490.0 sum:75.0 buckets:[15 2475]}}x55

eval instant at 55m increase(metric[90m])

Here the last sample in the range coincides with the boundary of the range. Let's change this to have something in between, e.g.

Suggested change
eval instant at 55m increase(metric[90m])
eval instant at 54m30s increase(metric[90m])

(Spoiler: The values below change, but somehow the buckets still sum up to the count.)

// timeline) than the overall count, avoid underestimating.
// Note if we don't know the duration to Zero, then that's NaN and
// this is skipped.
compensateLeft := durationToStart / durationToZero

This strikes me as weird. This is 1 whenever the count is corrected to avoid extrapolation below zero, and it is less than 1 by a certain amount if the count is not corrected at all…
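The point can be illustrated numerically, assuming (as in Prometheus's existing extrapolation logic) that durationToStart is clamped to durationToZero when the latter is smaller; the concrete values here are made up:

```go
package main

import "fmt"

func main() {
	// Case 1: durationToStart was clamped to durationToZero (the count
	// was "corrected" to avoid extrapolating below zero). The ratio is
	// exactly 1, so compensateLeft changes nothing.
	durationToStart, durationToZero := 30.0, 30.0
	fmt.Println(durationToStart / durationToZero) // 1

	// Case 2: durationToZero is much larger than durationToStart, so no
	// correction happened at all, yet compensateLeft scales the
	// left-side increase down to 0.25.
	durationToStart, durationToZero = 30.0, 120.0
	fmt.Println(durationToStart / durationToZero) // 0.25
}
```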

@beorn7 (Member) commented Jul 2, 2025

Maybe my brain is too fried now, but I report what I have:

If I modify the test to read:

load 1m
  metric {{schema:0 count:15.0 sum:25.0 buckets:[5 10]}} {{schema:0 count:2490.0 sum:75.0 buckets:[15 2475]}}x55

eval instant at 54m30s increase(metric[54m35s])

(note the empty result line), I get the following:

error in eval increase(metric[54m35s]) (line 1384): unexpected metric {} in result, has value {{count:2501.736111111111 sum:50.54012345679012 counter_reset_hint:gauge buckets:[11.365740740740742 2491.628086419753]}}

So this is expected, at first glance. But note that the histogram the test got is inconsistent: Bucket sum is 11.365740740740742 + 2491.628086419753 = 2502.9938271604938, but count is 2501.736111111111.
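The inconsistency is easy to confirm numerically (values copied from the test output above):

```go
package main

import "fmt"

func main() {
	// Bucket values and count from the unexpected test result above.
	bucketSum := 11.365740740740742 + 2491.628086419753
	count := 2501.736111111111
	// The buckets overshoot the overall count by roughly 1.26.
	fmt.Println(bucketSum, bucketSum-count)
}
```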

So that was what I was trying to demonstrate. To complete the test, I wrote the following:

load 1m
  metric {{schema:0 count:15.0 sum:25.0 buckets:[5 10]}} {{schema:0 count:2490.0 sum:75.0 buckets:[15 2475]}}x55

eval instant at 54m30s increase(metric[54m35s])
  {} {{count:2501.736111111111 sum:50.54012345679012 counter_reset_hint:gauge buckets:[11.365740740740742 2491.628086419753]}}

But now I get

error evaluating query "increase(metric[54m35s])" (line 1384) in range mode: extrapolatedRate: Cannot handle different number of positive buckets

🤯

@beorn7 (Member) commented Jul 2, 2025

Now I get it: The error only occurs in range mode, and if the instant mode is already failing, it doesn't even try the range mode.

Which leaves us with the question of why it doesn't work in range mode. That shouldn't be a problem.

I'll stop for today and resume thinking tomorrow.

@beorn7 (Member) commented Jul 2, 2025

I have a hunch now that this essentially implements "Insert artificial zero sample at the zero point based on the overall count" in an indirect way… Will verify tomorrow.
