POC: new native histogram extrapolation idea #16791
Conversation
Compensate for under-/overestimating the left-hand side (start side) of buckets. If the bucket-level interpolation is inside the duration to start, then extrapolate only to this extra duration on the left. Otherwise, if the duration to overall zero (from the overall count) is inside the bucket's zero time, then the bucket's estimate might be too high; calculate the increase from zero through the start time to get the increase from the start to the first sample. Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
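To make the proposed left/right split concrete, here is a minimal self-contained sketch (my own illustration with made-up names, not the PR's code). The right-hand factor is the usual extrapolation to the end of the range; the left-hand factor is kept separate so that a per-bucket caller can cap the start-side duration at the bucket's own zero point:

```go
package main

import "fmt"

// extrapolate splits the increase extrapolation into a left (start-side)
// and a right (end-side) part. observed is the increase between the first
// and the last sample; all durations are in seconds. Since
// factorLeft+factorRight == (sampledInterval+durationToStart+durationToEnd)/sampledInterval,
// passing the full durationToStart reproduces the classic single factor.
func extrapolate(observed, sampledInterval, durationToStart, durationToEnd float64) float64 {
	factorRight := (sampledInterval + durationToEnd) / sampledInterval
	factorLeft := durationToStart / sampledInterval
	return observed * (factorLeft + factorRight)
}

func main() {
	// Overall count: 2475 observed over 55m (3300s), 20s to the zero point
	// on the left, last sample exactly on the range boundary on the right.
	fmt.Println(extrapolate(2475, 3300, 20, 0)) // 2490
	// A bucket whose own zero point allows only 5s of left-side extrapolation.
	fmt.Println(extrapolate(10, 3300, 5, 0)) // ~10.015
}
```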
A few minor remarks and some real thoughts. Still working on this…
extrapolateToIntervalRight := extrapolateToInterval + durationToEnd
factorRight := extrapolateToIntervalRight / sampledInterval

factor := (extrapolateToInterval + durationToStart + durationToEnd) / sampledInterval
Suggested change:
- factor := (extrapolateToInterval + durationToStart + durationToEnd) / sampledInterval
+ factor := (durationToStart + extrapolateToIntervalRight) / sampledInterval
Simpler and easier to understand IMHO. (It is algebraically identical, since extrapolateToIntervalRight is defined as extrapolateToInterval + durationToEnd.)
resultHistogram.Count *= factor
resultHistogram.Sum *= factor
This could as well be first or last, right? (I'm confused that it is between the zero bucket and the regular buckets.)
// zero.
factorLeft := bucketDurationToStart / sampledInterval
if isRate {
	factorRight /= ms.Range.Seconds()
Suggested change:
- factorRight /= ms.Range.Seconds()
+ factorLeft /= ms.Range.Seconds()
(not exercised in the test, which only evaluates increase, i.e. isRate is false)
} else {
	resultHistogram.Mul(factor)

extrapolateBucket := func(firstValue, resultValue float64) float64 {
So here is my fundamental concern: This does change individual buckets without changing the count. However, with the test case below (even after my proposed change), it all seems to cancel out so that in the end, the sum of the buckets is still equal to the count. I will try a bit to see if I can find an example where that doesn't work, or even better, to formally prove that this works out generally and not just by coincidence.
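One way to probe this mechanically (a hypothetical helper of mine, not part of the PR) is to sum the bucket increases of the resulting FloatHistogram and compare against its Count:

```go
import (
	"math"

	"github.com/prometheus/prometheus/model/histogram"
)

// bucketsSumToCount reports whether the buckets of h (negative, zero, and
// positive, via AllBucketIterator) still add up to h.Count, within a
// tolerance for floating-point error.
func bucketsSumToCount(h *histogram.FloatHistogram, tolerance float64) bool {
	var sum float64
	for it := h.AllBucketIterator(); it.Next(); {
		sum += it.At().Count
	}
	return math.Abs(sum-h.Count) <= tolerance
}
```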
if isRate {
	factorRight /= ms.Range.Seconds()
}
return resultValue*factorLeft + resultValue*factorRight
Suggested change:
- return resultValue*factorLeft + resultValue*factorRight
+ return resultValue * (factorLeft + factorRight)
I find this easier to parse (and less to calculate ;).
load 1m
	metric {{schema:0 count:15.0 sum:25.0 buckets:[5 10]}} {{schema:0 count:2490.0 sum:75.0 buckets:[15 2475]}}x55

eval instant at 55m increase(metric[90m])
Here the last sample in the range coincides with the boundary of the range. Let's change this to have something in between, e.g.
Suggested change:
- eval instant at 55m increase(metric[90m])
+ eval instant at 54m30s increase(metric[90m])
(Spoiler: The values below change, but somehow the buckets still sum up to the count.)
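For reference, the arithmetic at the original eval time, assuming mainline extrapolatedRate clamping: the samples sit at 0, 1m, …, 55m, so sampledInterval = 3300s with an average step of 60s. durationToStart = 35m is first clamped to half a step (30s) and then to durationToZero = 3300s · 15/2475 = 20s, while durationToEnd = 0 because the last sample coincides with the range boundary. The count increase is therefore 2475 · (3300 + 20 + 0)/3300 = 2490, i.e. exactly the last sample's count. Moving the eval time to 54m30s makes durationToEnd = 30s, so the right-hand extrapolation actually contributes.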
// timeline) than the overall count, avoid underestimating.
// Note if we don't know the duration to zero, then that's NaN and
// this is skipped.
compensateLeft := durationToStart / durationToZero
This strikes me as weird. This is 1 whenever the count is corrected to avoid extrapolation below zero, and it is less than 1 by a certain amount if the count is not corrected at all…
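For context, mainline extrapolatedRate caps the start-side duration at the estimated zero point, which explains this behavior (a sketch from memory of promql/functions.go; treat the exact expressions as an assumption):

```go
// Estimate how long before the first sample the counter was at zero,
// from the first sample's value and the total increase over the range.
durationToZero := sampledInterval * (firstValue / delta)
if durationToZero < durationToStart {
	// Never extrapolate below zero: cap the start-side duration.
	durationToStart = durationToZero
}
// Afterwards, durationToStart/durationToZero == 1 exactly when the cap
// fired, and < 1 when no correction was needed -- the observation above.
```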
Maybe my brain is too fried now, but I report what I have: If I modify the test to read:
(note the empty result line), I get the following:
So this is expected, at first glance. But note that the histogram the test got is inconsistent: Bucket sum is 11.365740740740742 + 2491.628086419753 = 2502.9938271604938, but count is 2501.736111111111. So that was what I was trying to demonstrate. To complete the test, I wrote the following:
But now I get
🤯
Now I get it: The error only occurs in range mode, and if the instant mode is already failing, it doesn't even try the range mode. Which leaves us with the question why it doesn't work in range mode. That shouldn't be a problem. I'll stop for today and resume thinking tomorrow.
I have a hunch now that this essentially implements "Insert artificial zero sample at the zero point based on the overall count" in an indirect way… Will verify tomorrow.
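For reference, inserting an artificial zero sample would look roughly like this (a hypothetical sketch; HPoint and FloatHistogram are the real promql/histogram types, everything else is assumed):

```go
// Prepend a synthetic empty histogram at the estimated zero point and let
// the unmodified extrapolation run over the extended series. Interpolating
// from this zero sample to the first real sample is what the per-bucket
// compensation above seems to approximate.
zeroT := points[0].T - int64(durationToZero*1000) // durationToZero in seconds
synthetic := promql.HPoint{T: zeroT, H: &histogram.FloatHistogram{}} // all-zero histogram
points = append([]promql.HPoint{synthetic}, points...)
```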
Compensate for under-/overestimating the left-hand side (start side) of buckets.
If the bucket-level interpolation is inside the duration to start, then extrapolate only to this extra duration on the left. Otherwise, if the duration to overall zero (from the overall count) is inside the bucket's zero time, then the bucket's estimate might be too high; calculate the increase from zero through the start time to get the increase from the start to the first sample.
Related to #15976 and #16192.
For now it doesn't work with counter resets between the first and second sample, or when rate needs to match schemas.