-
Notifications
You must be signed in to change notification settings - Fork 9.6k
promql: fix topk error on NaN argument for non-existing series #16725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
promql: fix topk error on NaN argument for non-existing series #16725
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for this, let's see what others think about it
promql/engine.go
Outdated
@@ -3277,6 +3277,17 @@ func (ev *evaluator) aggregationK(e *parser.AggregateExpr, fParam float64, input | |||
} | |||
} | |||
|
|||
switch op { | |||
case parser.TOPK, parser.BOTTOMK, parser.LIMITK: | |||
if !convertibleToInt64(fParam) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned in https://cloud-native.slack.com/archives/C01AUBA4PFE/p1749800759431499?thread_ts=1749797556.935899&cid=C01AUBA4PFE, I think we want to align their behaviors' with quantile's, as quantile still does some pre-checks here
Line 1452 in 4aee718
if params.HasAnyNaN() { |
Still, there is a difference; quantile uses annotations for that, but here we error, which still isn't consistent IMHO.
Maybe switching to annotations will make the operators more resilient to missing data/gaps in fParam as was discussed here #16404 (comment)
WDYT @beorn7 @NeerajGartia21 about this
(this is not a blocker of the PR, we can discuss this somewhere else, I'm ok to just restore the old behavior in this PR (if others think we should of course))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That pre-check would only be effective if it was a constant param right?
Edit: ah I see, fParams already captures that the param is a timeseries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current behavior of both topk
and quantile
for handling NaN
is appropriate given the nature of their arguments. Since NaN
is a floating-point value, it's technically a valid input for quantile
, even if it's outside the expected [0, 1] range, so returning NaN
with a warning (as quantile
does) makes sense.
On the other hand, topk
expects an integer, and NaN
isn't representable as one. So in this case, throwing an error feels correct, we can't meaningfully proceed with the operation.
@beorn7 might have other views on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current behavior makes sense.
I mean, all of these cases are nonsensical edge cases, so the precise behavior doesn't really matter. We just have to show a defined behavior. Erroring out for values that aren't representable as an integer, even after rounding is defined (even though the "overflow" wording isn't quite right for -Inf and NaN). If we wanted to return a result plus an annotations, we needed to state more things, like overflow returns all elements, and underflow and NaN returns no elements or something. Sound a bit too involved for a problem that is irrelevant in practice.
50447ec
to
80d4329
Compare
promql/engine.go
Outdated
@@ -1433,13 +1433,23 @@ func (ev *evaluator) rangeEvalAgg(ctx context.Context, aggExpr *parser.Aggregate | |||
if params.Max() < 1 { | |||
return nil, annos | |||
} | |||
if !convertibleToInt64(params.Max()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if params.Max()
is minInt64-1
we cannot really say it overflows
int64.
let's have a generic message for both checks or let's use the v <= maxInt64
and v >= minInt64
directly for more control on what we check for.
I'd also use Parameter
or sth else instead of Scalar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because overflow and underflow read weird if its NaN; lets maybe write "Parameter value is not convertible to int64" and check in one condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good overall. I just have a few suggestions related to the tests.
Signed-off-by: Michael Hoffmann <mhoffmann@cloudflare.com>
80d4329
to
cf0b039
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, but maybe let's make use of the new msg:
syntax to test the precise message in the tests. I think it makes sense in this case.
@@ -388,6 +388,15 @@ eval_info instant at 50m bottomk by (instance) (1, {__name__=~"http_requests(_hi | |||
{__name__="http_requests", group="production", instance="1", job="api-server"} 200 | |||
{__name__="http_requests", group="production", instance="2", job="api-server"} NaN | |||
|
|||
eval instant at 50m topk(NaN, non_existent) | |||
expect fail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, I would prefer to test the exact message, like @NeerajGartia21 suggested (same below):
expect fail | |
expect fail msg: Ratio value is NaN |
@@ -222,6 +222,9 @@ eval_warn instant at 50m histogram_quantile(NaN, testhistogram_bucket) | |||
{start="positive"} NaN | |||
{start="negative"} NaN | |||
|
|||
eval instant at 50m histogram_quantile(NaN, non_existent) | |||
expect warn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here as well:
expect warn | |
expect warn msg: PromQL warning: quantile value should be between 0 and 1, got NaN |
topk(NaN, non_existent)
used to return an error, it stopped after https://github.com/prometheus/prometheus/pull/16404/files.This is a minor issue, but broke thanos regression tests