Ensure target_info and traces_target_info are consistent with the instrumentation state #1861


Merged
grcevski merged 8 commits into grafana:main from target_info on Apr 29, 2025

Conversation

grcevski (Contributor) commented:

So far we haven't paid much attention to the quality of the information present in target_info and traces_target_info. Essentially, as long as certain information was there, e.g. certain labels, we didn't care if the data was correctly stored or removed.

There were a couple of issues:

  1. The OTEL side of traces_target_info handling was wrong. We would increment the gauge once for span metrics and a second time for service graph metrics, and then, in the place where it was supposed to be decremented, we'd increment it yet again.
  2. If a Pod in K8S has multiple processes, we couldn't really distinguish which process the target info is for. Multiple instrumented instances would map to the same target info, so we couldn't use target_info to tell whether a certain service is alive or not.
  3. The target info in OTEL was never expired, but it was in Prometheus. The expiry of these metrics is problematic: if we had an expiry interval of 60 seconds for metrics and a service saw no traffic in that period, its target info would disappear, which made it look as if the service wasn't instrumented at all, rather than simply not processing any traffic.

To mitigate these issues, I'm changing how we process the target_info and traces_target_info metrics so that they are driven by the instrumentation events. Namely:

  1. I made the PID of the process part of the Service.UID (@mariomac I hope this is OK). This way, even on Kubernetes, we can tell when a Pod has launched multiple processes.
  2. I created a small pipeline that consumes the events from the discovery pipeline and pushes process events, such as process created and process terminated. This pipeline feeds into the OTEL and Prometheus application metrics, and we create target_info and traces_target_info based on these events, as well as clean them up when the process is terminated (see the sketch after this list).
  3. The target info metrics are never expired; they are explicitly cleaned up, or they disappear if Beyla is restarted.
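
To make the event-driven flow concrete, here is a minimal sketch of the idea. It is not Beyla's actual code; the names ProcessEvent, targetInfoAttributes and watchProcessEvents are made up for illustration. Process lifecycle events drive an OTel UpDownCounter for target_info, so the series appears as soon as a process is instrumented and is decremented when it terminates, independently of traffic:

package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// ProcessEvent is a hypothetical stand-in for the events produced by the
// discovery pipeline: one when a process is instrumented, one when it terminates.
type ProcessEvent struct {
	Created     bool
	ServiceName string
	Namespace   string
	PID         int
}

// targetInfoAttributes builds the label set once per event, including the PID so
// that multiple processes inside one Pod map to distinct target_info series.
func targetInfoAttributes(ev ProcessEvent) []attribute.KeyValue {
	return []attribute.KeyValue{
		attribute.String("service.name", ev.ServiceName),
		attribute.String("service.namespace", ev.Namespace),
		attribute.Int("process.pid", ev.PID),
	}
}

// watchProcessEvents drives target_info from instrumentation events rather than traffic.
func watchProcessEvents(ctx context.Context, events <-chan ProcessEvent) error {
	meter := otel.Meter("example")
	targetInfo, err := meter.Int64UpDownCounter("target_info")
	if err != nil {
		return err
	}
	for ev := range events {
		attrs := metric.WithAttributes(targetInfoAttributes(ev)...)
		if ev.Created {
			targetInfo.Add(ctx, 1, attrs) // process instrumented: the series appears immediately
		} else {
			targetInfo.Add(ctx, -1, attrs) // process terminated: explicit cleanup instead of expiry
		}
	}
	return nil
}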

@grcevski grcevski requested a review from a team as a code owner April 25, 2025 19:14
ctx context.Context
service *svc.Attrs
provider *metric.MeterProvider
resourceAttributes []attribute.KeyValue
grcevski (Contributor Author) commented:

I must remember the original resource attributes from when the service is created, for clean-up purposes. When a service is removed, the attributes might be different, for example if we have lost the K8S data by then.
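
A rough illustration of why the snapshot matters, with hypothetical names rather than the actual Beyla fields: the attribute set is captured once when the service appears, and the very same slice is replayed on removal, so the decrement always targets the series that was incremented, even if the Kubernetes metadata has been lost in the meantime.

package example

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// serviceMetrics is a hypothetical holder for one instrumented process.
type serviceMetrics struct {
	resourceAttributes []attribute.KeyValue      // snapshot taken when the service is created
	targetInfo         metric.Int64UpDownCounter // shared target_info instrument
}

func (m *serviceMetrics) start(ctx context.Context) {
	// Create the target_info series from the snapshot.
	m.targetInfo.Add(ctx, 1, metric.WithAttributes(m.resourceAttributes...))
}

func (m *serviceMetrics) stop(ctx context.Context) {
	// Replay the exact same snapshot on removal: recomputing the attributes here could
	// yield a different label set (e.g. K8s metadata already gone) and would leave the
	// original series dangling instead of removing it.
	m.targetInfo.Add(ctx, -1, metric.WithAttributes(m.resourceAttributes...))
}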

}

// New Instrumenter, given a Config
func New(ctx context.Context, ctxInfo *global.ContextInfo, config *beyla.Config) (*Instrumenter, error) {
setupFeatureContextInfo(ctx, ctxInfo, config)

tracesInput := msg.NewQueue[[]request.Span](msg.ChannelBufferLen(config.ChannelBufferLen))
processEventsInput := msg.NewQueue[exec.ProcessEvent](msg.ChannelBufferLen(10))
grcevski (Contributor Author) commented:

I'm not sure if this is the best place to inject this additional small pipeline. @mariomac

mariomac (Contributor) replied:

Yeah! I think so. I guess we can always move it to another part later if we need to expand access to it.

@@ -95,30 +98,89 @@ func KubeDecoratorProvider(
}
}

func ProcessEventDecoratorProvider(
grcevski (Contributor Author) commented:

@mariomac Essentially, I needed a decorator that decorates with either K8S or host-based metadata. It was easiest to place it here, but I'm thinking there might be a better place for it. Perhaps, if we ever expand decorators to also fetch information from the known cloud vendors, we might need something like a generic metadata decorator that decorates consistently depending on what's available.

What I'm wondering is whether it makes sense to unify the read_decorator and the k8s decorator into a metadata decorator (a rough sketch of that idea follows below).
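
To make that idea concrete, here is a purely hypothetical sketch of what such a unified metadata decorator could look like; none of these names (MetadataDecorator, kubeStore, Decorate) exist in the PR, it only illustrates the dispatch between K8s and host metadata:

package example

import "context"

// processEvent is a hypothetical stand-in for the event being decorated.
type processEvent struct {
	PID    int
	Labels map[string]string
}

// kubeStore abstracts a Kubernetes metadata cache; nil means we are not running on K8s.
type kubeStore interface {
	PodMetadataForPID(pid int) (map[string]string, bool)
}

// MetadataDecorator enriches events from whatever metadata source is available:
// Kubernetes when a store is present, plain host metadata otherwise.
type MetadataDecorator struct {
	kube     kubeStore
	hostName string
}

func (d *MetadataDecorator) Decorate(_ context.Context, ev *processEvent) {
	if d.kube != nil {
		if meta, ok := d.kube.PodMetadataForPID(ev.PID); ok {
			for k, v := range meta {
				ev.Labels[k] = v
			}
			return
		}
	}
	// Fall back to host-based decoration when no K8s metadata is available.
	ev.Labels["host.name"] = d.hostName
}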

mariomac (Contributor) replied:

If there aren't data dependencies or anything that needs to be coordinated, maybe keeping two different nodes makes the code simpler and easier to test/debug.

grcevski (Contributor Author) replied:

OK, sounds good. I'll split this code and move the logic for enriching the non-K8s metadata to where the read_decorator is today.

codecov bot commented Apr 25, 2025

Codecov Report

Attention: Patch coverage is 83.98268% with 37 lines in your changes missing coverage. Please review.

Project coverage is 73.97%. Comparing base (24d2e4f) to head (ac21313).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
pkg/export/prom/prom.go 70.58% 19 Missing and 1 partial ⚠️
pkg/export/otel/metrics.go 80.88% 10 Missing and 3 partials ⚠️
pkg/transform/k8s.go 92.45% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1861      +/-   ##
==========================================
- Coverage   74.20%   73.97%   -0.24%     
==========================================
  Files         177      177              
  Lines       19232    19395     +163     
==========================================
+ Hits        14272    14347      +75     
- Misses       4227     4313      +86     
- Partials      733      735       +2     
Flag Coverage Δ
integration-test 57.39% <58.00%> (+0.14%) ⬆️
k8s-integration-test 55.23% <78.35%> (-0.39%) ⬇️
oats-test 35.70% <35.06%> (+0.17%) ⬆️
unittests 46.01% <44.15%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.


@mariomac mariomac (Contributor) left a comment:


Mostly LGTM! I only have a concern/question about the automatic generation of target_info.

@@ -259,6 +264,7 @@ type Metrics struct {
serviceGraphServer *Expirer[*request.Span, instrument.Float64Histogram, float64]
serviceGraphFailed *Expirer[*request.Span, instrument.Int64Counter, int64]
serviceGraphTotal *Expirer[*request.Span, instrument.Int64Counter, int64]
targetInfo instrument.Int64UpDownCounter
mariomac (Contributor) commented:

target_info is automatically created by the collector from the metrics' resource attributes, when the OTEL metrics are converted to Prometheus. Might we end up having duplicate target_info entries?

grcevski (Contributor Author) replied:

I experimented with this quite a bit, and it seems that if we store exactly the same things as the resource attributes, we are good. That's why I record the resource attributes on the Metrics type, so that we get exactly the same values. The only way we'd get a duplicate is if the Span resource attributes differ from the ProcessEvent attributes, but since I'm now using the same code for both, after refactoring how the metadata is set, that shouldn't happen unless the metadata changes after the initial launch. And if the metadata does change, we'd get duplicates no matter what, even with Spans: some early ones would be stored with the old resource attributes.

Essentially, I made the following test. I launched the OTel demo without running any transactions, and we get maybe 12 or so services in target_info. These are created naturally because of Kafka messages, since those flow all the time even when there's no workload. Then I ran a few transactions and the number of services went up to 20+.

After the change we end up at 20+ from the beginning. I also tested by scaling everything in the OTel demo down to 0 and waiting around 5 minutes for Prometheus to remove the deleted services.
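
As a hedged illustration of that argument (made-up function, not Beyla's actual wiring): the same attribute slice feeds both the SDK resource, from which the collector synthesizes target_info, and the explicitly emitted target_info series, so the two cannot diverge into duplicates as long as the metadata doesn't change after launch.

package example

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
)

// newServiceMeter builds the MeterProvider resource and the explicit target_info
// series from the very same attribute slice.
func newServiceMeter(ctx context.Context, attrs []attribute.KeyValue) (metric.Meter, error) {
	// The collector derives its target_info from these resource attributes.
	res := resource.NewWithAttributes("https://opentelemetry.io/schemas/1.21.0", attrs...) // schema version is illustrative
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithResource(res))
	meter := provider.Meter("example")

	targetInfo, err := meter.Int64UpDownCounter("target_info")
	if err != nil {
		return nil, err
	}
	// Emitting with the same attrs keeps the explicit series identical to the derived one.
	targetInfo.Add(ctx, 1, metric.WithAttributes(attrs...))
	return meter, nil
}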

@grcevski grcevski added the port-to-otel-ebpf-inst label (PRs that need to be ported to otel-ebpf-instrumentation) on Apr 28, 2025
@mariomac mariomac (Contributor) left a comment:


👏🏻🚀

@grcevski grcevski merged commit 2e3889f into grafana:main Apr 29, 2025
1 check passed
@grcevski grcevski deleted the target_info branch April 29, 2025 14:25
Labels
port-to-otel-ebpf-inst PRs that need to be ported to otel-ebpf-instrumentation
2 participants