Ensure target_info and traces_target_info are consistent with the instrumentation state #1861


Merged
grcevski merged 8 commits into grafana:main from target_info on Apr 29, 2025

Conversation

grcevski (Contributor) commented:

So far we haven't paid much attention to the quality of the information present in target_info and traces_target_info. Essentially, as long as certain information was there, e.g. certain labels, we didn't care if the data was correctly stored or removed.

There were a couple of issues:

  1. The OTEL side of traces_target_info handling was wrong. We would increment the gauge once for span metrics and a second time for service graph metrics, and then, in the place where it was supposed to be decremented, we'd increment it yet again.
  2. If a Pod in K8S has multiple processes, we couldn't really distinguish which process the target info is for. Multiple instrumented instances would map to the same target info, so we couldn't use target_info to tell whether a certain service is alive or not.
  3. The target info in OTEL was never expired, but it was in Prometheus. The expiry of these metrics is problematic: if we had an expiry interval of 60 seconds for metrics and a service saw no traffic in that period, its target info would disappear, which made it look as if the service wasn't instrumented at all, rather than simply not processing any traffic.

To mitigate these issues, I'm changing how we process the target_info and traces_target_info metrics so that they are driven by the instrumentation events. Namely:

  1. I made the PID of the process part of the Service.UID (@mariomac I hope this is OK). This way, even on Kubernetes, we can tell when a Pod has launched multiple processes.
  2. I created a small pipeline that consumes the events from the discovery pipeline and pushes process events, such as process created and process terminated. This pipeline feeds into the OTEL and Prometheus application metrics, and we create target_info and traces_target_info based on these events, as well as clean them up when the process is terminated (see the sketch after this list).
  3. The target info metrics are never expired; they are explicitly cleaned up, or they disappear if Beyla is restarted.
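
To make the event-driven flow concrete, here is a minimal sketch of the idea. It is not Beyla's actual code; the names ProcessEvent, targetInfoAttributes and watchProcessEvents are made up for illustration. Process lifecycle events drive an OTel UpDownCounter for target_info, so the series appears as soon as a process is instrumented and is decremented when it terminates, independently of traffic:

package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// ProcessEvent is a hypothetical stand-in for the events produced by the
// discovery pipeline: one when a process is instrumented, one when it terminates.
type ProcessEvent struct {
	Created     bool
	ServiceName string
	Namespace   string
	PID         int
}

// targetInfoAttributes builds the label set once per event, including the PID so
// that multiple processes inside one Pod map to distinct target_info series.
func targetInfoAttributes(ev ProcessEvent) []attribute.KeyValue {
	return []attribute.KeyValue{
		attribute.String("service.name", ev.ServiceName),
		attribute.String("service.namespace", ev.Namespace),
		attribute.Int("process.pid", ev.PID),
	}
}

// watchProcessEvents drives target_info from instrumentation events rather than traffic.
func watchProcessEvents(ctx context.Context, events <-chan ProcessEvent) error {
	meter := otel.Meter("example")
	targetInfo, err := meter.Int64UpDownCounter("target_info")
	if err != nil {
		return err
	}
	for ev := range events {
		attrs := metric.WithAttributes(targetInfoAttributes(ev)...)
		if ev.Created {
			targetInfo.Add(ctx, 1, attrs) // process instrumented: the series appears immediately
		} else {
			targetInfo.Add(ctx, -1, attrs) // process terminated: explicit cleanup instead of expiry
		}
	}
	return nil
}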

@grcevski grcevski requested a review from a team as a code owner April 25, 2025 19:14
ctx context.Context
service *svc.Attrs
provider *metric.MeterProvider
resourceAttributes []attribute.KeyValue
grcevski (Contributor Author) commented:

I must remember the original resource attributes from when the service is created, for clean-up purposes. When a service is removed, the attributes might be different, for example if we have lost the K8S data by then.
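
A rough illustration of why the snapshot matters, with hypothetical names rather than the actual Beyla fields: the attribute set is captured once when the service appears, and the very same slice is replayed on removal, so the decrement always targets the series that was incremented, even if the Kubernetes metadata has been lost in the meantime.

package example

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// serviceMetrics is a hypothetical holder for one instrumented process.
type serviceMetrics struct {
	resourceAttributes []attribute.KeyValue      // snapshot taken when the service is created
	targetInfo         metric.Int64UpDownCounter // shared target_info instrument
}

func (m *serviceMetrics) start(ctx context.Context) {
	// Create the target_info series from the snapshot.
	m.targetInfo.Add(ctx, 1, metric.WithAttributes(m.resourceAttributes...))
}

func (m *serviceMetrics) stop(ctx context.Context) {
	// Replay the exact same snapshot on removal: recomputing the attributes here could
	// yield a different label set (e.g. K8s metadata already gone) and would leave the
	// original series dangling instead of removing it.
	m.targetInfo.Add(ctx, -1, metric.WithAttributes(m.resourceAttributes...))
}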

}

// New Instrumenter, given a Config
func New(ctx context.Context, ctxInfo *global.ContextInfo, config *beyla.Config) (*Instrumenter, error) {
setupFeatureContextInfo(ctx, ctxInfo, config)

tracesInput := msg.NewQueue[[]request.Span](msg.ChannelBufferLen(config.ChannelBufferLen))
processEventsInput := msg.NewQueue[exec.ProcessEvent](msg.ChannelBufferLen(10))
grcevski (Contributor Author) commented:

I'm not sure if this is the best place to inject this additional small pipeline. @mariomac

mariomac (Contributor) replied:

Yeah! I think so. I guess we can always move it to another part later if we need to expand access to it.

@@ -95,30 +98,89 @@ func KubeDecoratorProvider(
}
}

func ProcessEventDecoratorProvider(
grcevski (Contributor Author) commented:

@mariomac Essentially, I needed a decorator that decorates with either K8S or host-based metadata. It was easiest to place it here, but I'm thinking there might be a better place for it. Perhaps, if we ever expand decorators to also fetch information from the known cloud vendors, we might need something like a generic metadata decorator that decorates consistently depending on what's available.

What I'm wondering is whether it makes sense to unify the read_decorator and the k8s decorator into a metadata decorator (a rough sketch of that idea follows below).
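
To make that idea concrete, here is a purely hypothetical sketch of what such a unified metadata decorator could look like; none of these names (MetadataDecorator, kubeStore, Decorate) exist in the PR, it only illustrates the dispatch between K8s and host metadata:

package example

import "context"

// processEvent is a hypothetical stand-in for the event being decorated.
type processEvent struct {
	PID    int
	Labels map[string]string
}

// kubeStore abstracts a Kubernetes metadata cache; nil means we are not running on K8s.
type kubeStore interface {
	PodMetadataForPID(pid int) (map[string]string, bool)
}

// MetadataDecorator enriches events from whatever metadata source is available:
// Kubernetes when a store is present, plain host metadata otherwise.
type MetadataDecorator struct {
	kube     kubeStore
	hostName string
}

func (d *MetadataDecorator) Decorate(_ context.Context, ev *processEvent) {
	if d.kube != nil {
		if meta, ok := d.kube.PodMetadataForPID(ev.PID); ok {
			for k, v := range meta {
				ev.Labels[k] = v
			}
			return
		}
	}
	// Fall back to host-based decoration when no K8s metadata is available.
	ev.Labels["host.name"] = d.hostName
}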

mariomac (Contributor) replied:

If there aren't data dependencies or anything that needs to be coordinated, maybe keeping two different nodes makes the code simpler and easier to test/debug.

grcevski (Contributor Author) replied:

OK, sounds good. I'll split this code and move the logic for enriching the non-K8s metadata to where the read_decorator is today.

codecov bot commented Apr 25, 2025

Codecov Report

Attention: Patch coverage is 83.98268% with 37 lines in your changes missing coverage. Please review.

Project coverage is 73.97%. Comparing base (24d2e4f) to head (ac21313).
Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
pkg/export/prom/prom.go 70.58% 19 Missing and 1 partial ⚠️
pkg/export/otel/metrics.go 80.88% 10 Missing and 3 partials ⚠️
pkg/transform/k8s.go 92.45% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1861      +/-   ##
==========================================
- Coverage   74.20%   73.97%   -0.24%     
==========================================
  Files         177      177              
  Lines       19232    19395     +163     
==========================================
+ Hits        14272    14347      +75     
- Misses       4227     4313      +86     
- Partials      733      735       +2     
Flag Coverage Δ
integration-test 57.39% <58.00%> (+0.14%) ⬆️
k8s-integration-test 55.23% <78.35%> (-0.39%) ⬇️
oats-test 35.70% <35.06%> (+0.17%) ⬆️
unittests 46.01% <44.15%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.


@mariomac mariomac (Contributor) left a comment:


Mostly LGTM! I only have a concern/question about the automatic generation of target_info.

@@ -259,6 +264,7 @@ type Metrics struct {
serviceGraphServer *Expirer[*request.Span, instrument.Float64Histogram, float64]
serviceGraphFailed *Expirer[*request.Span, instrument.Int64Counter, int64]
serviceGraphTotal *Expirer[*request.Span, instrument.Int64Counter, int64]
targetInfo instrument.Int64UpDownCounter
mariomac (Contributor) commented:

target_info is automatically created by the collector from the metrics' resource attributes, when the OTEL metrics are converted to Prometheus. Might we end up having duplicate target_info entries?

grcevski (Contributor Author) replied:

I experimented with this quite a bit, and it seems that if we store exactly the same things as the resource attributes, we are good. That's why I record the resource attributes on the Metrics type, so that we get exactly the same values. The only way we'd get a duplicate is if the Span resource attributes differ from the ProcessEvent attributes, but since I'm now using the same code for both, after refactoring how the metadata is set, that shouldn't happen unless the metadata changes after the initial launch. And if the metadata does change, we'd get duplicates no matter what, even with Spans: some early ones would be stored with the old resource attributes.

Essentially, I made the following test. I launched the OTel demo without running any transactions, and we get maybe 12 or so services in target_info. These are created naturally because of Kafka messages, since those flow all the time even when there's no workload. Then I ran a few transactions and the number of services went up to 20+.

After the change we end up at 20+ from the beginning. I also tested by scaling everything in the OTel demo down to 0 and waiting around 5 minutes for Prometheus to remove the deleted services.
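
As a hedged illustration of that argument (made-up function, not Beyla's actual wiring): the same attribute slice feeds both the SDK resource, from which the collector synthesizes target_info, and the explicitly emitted target_info series, so the two cannot diverge into duplicates as long as the metadata doesn't change after launch.

package example

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
)

// newServiceMeter builds the MeterProvider resource and the explicit target_info
// series from the very same attribute slice.
func newServiceMeter(ctx context.Context, attrs []attribute.KeyValue) (metric.Meter, error) {
	// The collector derives its target_info from these resource attributes.
	res := resource.NewWithAttributes("https://opentelemetry.io/schemas/1.21.0", attrs...) // schema version is illustrative
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithResource(res))
	meter := provider.Meter("example")

	targetInfo, err := meter.Int64UpDownCounter("target_info")
	if err != nil {
		return nil, err
	}
	// Emitting with the same attrs keeps the explicit series identical to the derived one.
	targetInfo.Add(ctx, 1, metric.WithAttributes(attrs...))
	return meter, nil
}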

@grcevski grcevski added the port-to-otel-ebpf-inst label (PRs that need to be ported to otel-ebpf-instrumentation) on Apr 28, 2025
@mariomac mariomac (Contributor) left a comment:


👏🏻🚀

@grcevski grcevski merged commit 2e3889f into grafana:main Apr 29, 2025
1 check passed
@grcevski grcevski deleted the target_info branch April 29, 2025 14:25
Labels
port-to-otel-ebpf-inst PRs that need to be ported to otel-ebpf-instrumentation
2 participants