POC: new native histogram extrapolation idea by krajorama · Pull Request #16791 · prometheus/prometheus · GitHub

POC: new native histogram extrapolation idea #16791


Draft · wants to merge 1 commit into base: main
Conversation

krajorama (Member)

Compensate for under-/over-estimating the left-hand side (start side) of buckets.
If the bucket-level interpolation is inside the duration to start, then extrapolate only to this extra duration on the left. Otherwise, if the duration to the overall zero (derived from the overall count) is inside the bucket's zero time, then the bucket's estimate might be too high; calculate the increase from zero through the start time to get the increase from the start to the first sample.

Related to #15976 and #16192 .

For now it does not handle counter resets between the first and second samples, nor the case where the rate calculation needs to match schemas.
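A condensed, runnable sketch of how the left and right factors combine, assembled from the code fragments quoted in the review comments on this PR. The function name, signature, and example values are illustrative only, not the PR's actual code.

```go
package main

import "fmt"

// extrapolatedValue sketches how a per-bucket result is scaled by a
// left-side factor (limited to the duration the bucket is known to
// have grown before the first sample) plus the usual right-side
// extrapolation factor. Variable names mirror the review fragments;
// the exact wiring in the PR may differ.
func extrapolatedValue(resultValue, bucketDurationToStart, durationToEnd,
	extrapolateToInterval, sampledInterval float64) float64 {
	// Left side: only extrapolate over the duration this bucket is
	// known to have been growing towards the range start.
	factorLeft := bucketDurationToStart / sampledInterval
	// Right side: the usual extrapolation past the last sample.
	factorRight := (extrapolateToInterval + durationToEnd) / sampledInterval
	return resultValue * (factorLeft + factorRight)
}

func main() {
	// Arbitrary example: a 100s sampled interval, extrapolating 25s past
	// the last sample and crediting 25s of growth before the first.
	fmt.Println(extrapolatedValue(100, 25, 25, 100, 100)) // 150
}
```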


Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@beorn7 (Member) left a comment


A few minor remarks and some real thoughts. Still working on this…

extrapolateToIntervalRight := extrapolateToInterval + durationToEnd
factorRight := extrapolateToIntervalRight / sampledInterval

factor := (extrapolateToInterval + durationToStart + durationToEnd) / sampledInterval

Suggested change
factor := (extrapolateToInterval + durationToStart + durationToEnd) / sampledInterval
factor := (durationToStart + extrapolateToIntervalRight) / sampledInterval

Simpler and easier to understand IMHO.
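Since extrapolateToIntervalRight is defined above as extrapolateToInterval + durationToEnd, the suggested expression is algebraically identical to the original; a quick check with arbitrary sample values:

```go
package main

import "fmt"

func main() {
	// Arbitrary sample values for the durations involved.
	extrapolateToInterval, durationToStart, durationToEnd, sampledInterval := 100.0, 10.0, 20.0, 100.0
	extrapolateToIntervalRight := extrapolateToInterval + durationToEnd

	// Original expression vs. the suggested rewrite.
	original := (extrapolateToInterval + durationToStart + durationToEnd) / sampledInterval
	suggested := (durationToStart + extrapolateToIntervalRight) / sampledInterval
	fmt.Println(original == suggested) // true
}
```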

Comment on lines +235 to +236
resultHistogram.Count *= factor
resultHistogram.Sum *= factor

This could just as well be first or last, right? (I'm confused that it sits between the zero bucket and the regular buckets.)

// zero.
factorLeft := bucketDurationToStart / sampledInterval
if isRate {
factorRight /= ms.Range.Seconds()

Suggested change
factorRight /= ms.Range.Seconds()
factorLeft /= ms.Range.Seconds()

(not exercised in the test)

} else {
resultHistogram.Mul(factor)

extrapolateBucket := func(firstValue, resultValue float64) float64 {

So here is my fundamental concern: This does change individual buckets without changing the count. However, with the test case below (even after my proposed change), it all seems to cancel out so that in the end, the sum of the buckets is still equal to the count. I will try a bit to see if I can find an example where that doesn't work, or even better, to formally prove that this works out generally and not just by coincidence.

if isRate {
factorRight /= ms.Range.Seconds()
}
return resultValue*factorLeft + resultValue*factorRight

Suggested change
return resultValue*factorLeft + resultValue*factorRight
return resultValue * (factorLeft + factorRight)

I find this easier to parse (and less to calculate ;).

load 1m
metric {{schema:0 count:15.0 sum:25.0 buckets:[5 10]}} {{schema:0 count:2490.0 sum:75.0 buckets:[15 2475]}}x55

eval instant at 55m increase(metric[90m])

Here the last sample in the range coincides with the boundary of the range. Let's change this to have something in between, e.g.

Suggested change
eval instant at 55m increase(metric[90m])
eval instant at 54m30s increase(metric[90m])

(Spoiler: The values below change, but somehow the buckets still sum up to the count.)

// timeline) than the overall count, avoid underestimating.
// Note if we don't know the duration to Zero, then that's NaN and
// this is skipped.
compensateLeft := durationToStart / durationToZero

This strikes me as weird. This is 1 whenever the count is corrected to avoid extrapolation below zero, and it is less than 1 by a certain amount if the count is not corrected at all…
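The point can be illustrated numerically, assuming (as in Prometheus's existing extrapolation logic) that durationToStart is clamped to durationToZero when the latter is smaller; the concrete values here are made up:

```go
package main

import "fmt"

func main() {
	// Case 1: durationToStart was clamped to durationToZero (the count
	// was "corrected" to avoid extrapolating below zero). The ratio is
	// exactly 1, so compensateLeft changes nothing.
	durationToStart, durationToZero := 30.0, 30.0
	fmt.Println(durationToStart / durationToZero) // 1

	// Case 2: durationToZero is much larger than durationToStart, so no
	// correction happened at all, yet compensateLeft scales the
	// left-side increase down to 0.25.
	durationToStart, durationToZero = 30.0, 120.0
	fmt.Println(durationToStart / durationToZero) // 0.25
}
```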

@beorn7 (Member) commented Jul 2, 2025

Maybe my brain is too fried now, but I report what I have:

If I modify the test to read:

load 1m
  metric {{schema:0 count:15.0 sum:25.0 buckets:[5 10]}} {{schema:0 count:2490.0 sum:75.0 buckets:[15 2475]}}x55

eval instant at 54m30s increase(metric[54m35s])

(note the empty result line), I get the following:

error in eval increase(metric[54m35s]) (line 1384): unexpected metric {} in result, has value {{count:2501.736111111111 sum:50.54012345679012 counter_reset_hint:gauge buckets:[11.365740740740742 2491.628086419753]}}

So this is expected, at first glance. But note that the histogram the test got is inconsistent: Bucket sum is 11.365740740740742 + 2491.628086419753 = 2502.9938271604938, but count is 2501.736111111111.
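The inconsistency is easy to confirm numerically (values copied from the test output above):

```go
package main

import "fmt"

func main() {
	// Bucket values and count from the unexpected test result above.
	bucketSum := 11.365740740740742 + 2491.628086419753
	count := 2501.736111111111
	// The buckets overshoot the overall count by roughly 1.26.
	fmt.Println(bucketSum, bucketSum-count)
}
```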

So that was what I was trying to demonstrate. To complete the test, I wrote the following:

load 1m
  metric {{schema:0 count:15.0 sum:25.0 buckets:[5 10]}} {{schema:0 count:2490.0 sum:75.0 buckets:[15 2475]}}x55

eval instant at 54m30s increase(metric[54m35s])
  {} {{count:2501.736111111111 sum:50.54012345679012 counter_reset_hint:gauge buckets:[11.365740740740742 2491.628086419753]}}

But now I get

error evaluating query "increase(metric[54m35s])" (line 1384) in range mode: extrapolatedRate: Cannot handle different number of positive buckets

🤯

@beorn7 (Member) commented Jul 2, 2025

Now I get it: The error only occurs in range mode, and if the instant mode is already failing, it doesn't even try the range mode.

Which leaves us with the question of why it doesn't work in range mode. That shouldn't be a problem.

I'll stop for today and resume thinking tomorrow.

@beorn7 (Member) commented Jul 2, 2025

I have a hunch now that this essentially implements "Insert artificial zero sample at the zero point based on the overall count" in an indirect way… Will verify tomorrow.
