add support for prometheus metrics listener in engine by sipsma · Pull Request #10555 · dagger/dagger · GitHub

add support for prometheus metrics listener in engine #10555


Merged
merged 6 commits into dagger:main from prometheus-metrics on Jun 24, 2025

Conversation

@sipsma (Contributor) commented Jun 9, 2025

Starting out with just a very simple Prometheus gauge for the number of connected clients + a few high-level metrics on the local cache (total disk space + number of entries). This is meant to support some cloud-related use cases.

Current usage:

  • ATM you have to set an env var on the engine container to configure the listener
    • Example: _EXPERIMENTAL_DAGGER_METRICS_ADDR: 0.0.0.0:9090
    • Also added _EXPERIMENTAL_DAGGER_METRICS_CACHE_UPDATE_INTERVAL to configure the interval at which cache metrics are updated; defaults to 5m
  • After that, Prometheus metrics are available over HTTP at /metrics
    • Example: curl http://<engine-endpoint>:9090/metrics
  • The metrics added (an illustrative sketch of the wiring follows this list):
    • Current number of connected clients, e.g. dagger_connected_clients 1
      • This is updated synchronously when metrics are requested
    • Current total disk space consumed by the local cache in bytes, e.g. dagger_local_cache_total_disk_size_bytes 24000
      • This is only updated every 5m in order to limit any possible perf impact of holding buildkit disk usage locks on every metrics request
    • Current total number of entries in the local cache, e.g. dagger_local_cache_entry_count 42
      • Same as above, only updated every 5m
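
To make the moving pieces concrete, here is a minimal sketch of how such a listener could be wired up with github.com/prometheus/client_golang. This is illustrative only, not the code in this PR: the queryCacheUsage helper, the connectedClients counter, and the hard-coded 5m tick are assumptions standing in for the engine's real internals.

```go
package main

import (
	"log"
	"net/http"
	"os"
	"sync/atomic"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// connectedClients stands in for the engine's real client tracking; it would be
// incremented/decremented as client sessions attach and detach.
var connectedClients atomic.Int64

// queryCacheUsage stands in for buildkit's disk-usage query inside the engine.
func queryCacheUsage() (sizeBytes, entries int64) { return 0, 0 }

func serveMetrics() error {
	addr := os.Getenv("_EXPERIMENTAL_DAGGER_METRICS_ADDR")
	if addr == "" {
		return nil // listener disabled unless the env var is set
	}

	reg := prometheus.NewRegistry()

	// Updated synchronously on every scrape via a GaugeFunc.
	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "dagger_connected_clients",
		Help: "Current number of connected clients.",
	}, func() float64 { return float64(connectedClients.Load()) }))

	// Cache gauges are refreshed on an interval rather than per scrape, to avoid
	// holding buildkit disk-usage locks on every metrics request.
	diskSize := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dagger_local_cache_total_disk_size_bytes",
		Help: "Total disk space consumed by the local cache, in bytes.",
	})
	entryCount := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dagger_local_cache_entry_count",
		Help: "Total number of entries in the local cache.",
	})
	reg.MustRegister(diskSize, entryCount)

	go func() {
		for range time.Tick(5 * time.Minute) { // interval would come from the env var
			size, count := queryCacheUsage()
			diskSize.Set(float64(size))
			entryCount.Set(float64(count))
		}
	}()

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	return http.ListenAndServe(addr, mux)
}

func main() {
	if err := serveMetrics(); err != nil {
		log.Fatal(err)
	}
}
```

Scraping the endpoint then returns the three gauges in the standard Prometheus text exposition format.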

@gerhard One thing that occurred to me while writing this is that we could instead try using OTEL metrics, which is appealing in that it could unify more with the rest of our telemetry stack.

  • We do already support OTEL metrics and exporting those to cloud, but they are "per-client" and integrated with the rest of our per-client OTEL spans/logs.
    • For this new use case, we'd want to export metrics for the engine "as a whole"
  • This would be more of a "push" model, whereas exposing a Prometheus endpoint on the engine is more "pull"
  • Potential pros of this approach:
    1. Integrates more with the rest of our existing cloud OTEL stack (persistence of data for each engine in the backend db, etc.)
    2. We could plausibly use the engine pushing these metrics as a healthcheck (if it's not pushing engine-wide metrics, something probably went wrong)
    3. The "push" model may be simpler in terms of networking; i.e. it's often easier to have the engine reach out to the cloud than it is for us to reach the engine in arbitrary networking environments (which isn't a problem today but may become more of a headache in the longer term)
  • Potential cons:
    1. More complicated to get going

Given its extreme simplicity, I'm also totally fine starting w/ this Prometheus listener approach to get us going and then iterating from there, possibly moving to OTEL metrics down the line. Just wanted to raise this now.

Update on the above: we decided to go with Prometheus for now since it's the simplest to get going, with the intention of moving to OTEL metrics as we iterate over time. Given this, I'm leaving the configuration of the metrics as an env var prefixed with _EXPERIMENTAL.
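
For reference, here is a rough sketch of what the push-model alternative discussed above might look like with the OTel Go SDK's periodic OTLP exporter. None of this is in the PR; the collector endpoint, the 30s interval, and the currentClientCount helper are all assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// currentClientCount is a hypothetical stand-in for the engine's client tracking.
func currentClientCount() int64 { return 0 }

func main() {
	ctx := context.Background()

	// In the push model the engine reaches out to a collector endpoint instead
	// of waiting to be scraped.
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("otel-collector.example.com:4317"), // assumed endpoint
	)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(30*time.Second))),
	)

	meter := provider.Meter("dagger.io/engine")
	_, err = meter.Int64ObservableGauge("dagger_connected_clients",
		metric.WithDescription("Current number of connected clients."),
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			o.Observe(currentClientCount())
			return nil
		}),
	)
	if err != nil {
		log.Fatal(err)
	}

	// ... the engine would run here; the reader exports in the background.
	time.Sleep(time.Minute)
	if err := provider.Shutdown(ctx); err != nil {
		log.Fatal(err)
	}
}
```

The trade-off is exactly the one noted above: the engine needs outbound connectivity and an exporter pipeline, but nothing needs to reach into the engine's network to collect the data.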


Draft TODOs:

  • Figure out above question of OTEL vs. Prometheus (will discuss with Gerhard)
    • Gonna use Prometheus listener as is for now to get going quickly, will iterate most likely towards OTEL over time
  • Support configuration of metrics endpoint as engine config file setting + command line param (rather than just an env var)
    • Decided to just stick w/ a _EXPERIMENTAL_* env var for now since we're not sure of the lifespan of this; it may be more of a stepping stone
  • Clarify if the /healthz endpoint is needed for Prometheus
    • Windsurf added this for me, have not yet looked into whether it's really needed
    • Didn't see that it was needed, so removed it for now

@sipsma sipsma requested a review from gerhard June 9, 2025 19:46
sipsma added 3 commits June 10, 2025 12:43
Starting out with just a very simple gauge for the number of connected
clients.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma force-pushed the prometheus-metrics branch from 5fe17f2 to 1f3079a on June 10, 2025 19:43
@sipsma sipsma marked this pull request as ready for review June 10, 2025 19:48
@gerhard (Member) commented Jun 17, 2025

So good to see this come together @sipsma!

One thing that occurred to me while writing this is that we could try to instead use OTEL metrics, which is appealing in that it could unify more with the rest of our telemetry stack.

Using OTel metrics is a great idea. Yes, we should absolutely do that in addition to exposing this Prometheus /metrics endpoint.

There is complexity associated with depending on the /metrics interface, even if we build the Grafana dashboard or share some Prometheus alerts. Wiring everything together, keeping up with the updates, combining with logs, mapping it to operations that run within the Engine, etc. is just too much work. Enthusiasts are free to do it, but anyone who wants the polished end-result, which is being improved on constantly, will opt for Dagger Cloud. Another option is to plug the OTel data into an LLM, but that brings its own set of complexities.

On top of the above, we both know that there is a lot more to this than what can be exposed via /metrics. OTel allows us to attach a lot more contextual meaning, which makes the data a lot more insightful. In conclusion: OTel as a more detailed & insightful addition on top of this new & experimental /metrics endpoint makes the most sense to me.

Since I started typing this, the PR moved along a bit, so I'm going to do one final review before we merge & ship it, preferably in the next release. cc @jedevc

@sipsma sipsma added this to the v0.18.11 milestone Jun 17, 2025
We have a very simple load-test client that makes the metrics vary and
shows how everything comes together. To run it:

    dagger -m modules/metrics run up

Grafana will be available on http://localhost:3000 (default username &
password are admin:admin), and it will already be wired with
Prometheus & a Dagger Dev Engine that has `/metrics` enabled.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
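
As a rough illustration of the idea (this is not the modules/metrics code; the image tag, loop count, and output handling are arbitrary), a load-test client built on the Dagger Go SDK could repeatedly connect, run a trivial pipeline, and disconnect, so that dagger_connected_clients and the cache gauges have something to show:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()
	for i := 0; i < 10; i++ {
		if err := runOnce(ctx, i); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}

func runOnce(ctx context.Context, i int) error {
	// Each Connect shows up as a connected client on the engine.
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		return err
	}
	defer client.Close()

	// A trivial exec that also populates the local cache with a new entry per run.
	out, err := client.Container().
		From("alpine:3.20").
		WithExec([]string{"sh", "-c", fmt.Sprintf("echo run-%d", i)}).
		Stdout(ctx)
	if err != nil {
		return err
	}
	fmt.Print(out)
	return nil
}
```

Each iteration opens a fresh engine session, so the connected-clients gauge ticks up and down while new cache entries accumulate.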
@gerhard gerhard force-pushed the prometheus-metrics branch from 352b468 to 6aa3c16 on June 24, 2025 18:41
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@gerhard (Member) left a comment


This now looks good to me!

The only thing left is to add a super simple Grafana dashboard that visualises the 3 Dagger Engine metrics. We don't need to block the release on that. What we have so far is a good enough start:

dagger -m modules/metrics run up

And then just go to Drilldown in the local Grafana to see this:

[screenshot: Grafana Drilldown view showing the Dagger Engine metrics]

@sipsma sipsma force-pushed the prometheus-metrics branch from 127b443 to 7955219 on June 24, 2025 19:00
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma force-pushed the prometheus-metrics branch from 7955219 to 00abe87 on June 24, 2025 19:22
@sipsma sipsma merged commit eac70cb into dagger:main Jun 24, 2025
56 of 57 checks passed
gerhard added a commit to gerhard/dagger that referenced this pull request Jun 27, 2025
It visualises and explains the first three (3) Dagger Engine metrics:
1. dagger_connected_clients
2. dagger_local_cache_entries
3. dagger_local_cache_total_disk_size_bytes

To use this, run:

    dagger -m github.com/dagger/dagger/modules/metrics call run up

Then open up Grafana in your browser: http://localhost:3000 (default
user is `admin` & pass is also `admin`)

This is a follow-up to:
- dagger#10555

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
gerhard added a commit that referenced this pull request Jun 27, 2025, with the same commit message as above (referencing #10555).