add support for prometheus metrics listener in engine by sipsma · Pull Request #10555 · dagger/dagger · GitHub

add support for prometheus metrics listener in engine #10555


Merged
merged 6 commits into dagger:main from prometheus-metrics on Jun 24, 2025

Conversation

@sipsma (Contributor) commented Jun 9, 2025

Starting out with just a very simple Prometheus gauge for the number of connected clients + a few high-level metrics on the local cache (total disk space + number of entries). This is meant to support some cloud-related use cases.

Current usage:

  • ATM you have to set an env var on the engine container to configure the listener
    • Example: _EXPERIMENTAL_DAGGER_METRICS_ADDR: 0.0.0.0:9090
    • Also added _EXPERIMENTAL_DAGGER_METRICS_CACHE_UPDATE_INTERVAL to configure the interval at which cache metrics are updated; defaults to 5m
  • After that, Prometheus metrics are available over HTTP at /metrics
    • Example: curl http://<engine-endpoint>:9090/metrics
  • The metrics added (an illustrative sketch of the wiring follows this list):
    • Current number of connected clients, e.g. dagger_connected_clients 1
      • This is updated synchronously when metrics are requested
    • Current total disk space consumed by the local cache in bytes, e.g. dagger_local_cache_total_disk_size_bytes 24000
      • This is only updated every 5m in order to limit any possible perf impact of holding buildkit disk usage locks on every metrics request
    • Current total number of entries in the local cache, e.g. dagger_local_cache_entry_count 42
      • Same as above, only updated every 5m
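
To make the moving pieces concrete, here is a minimal sketch of how such a listener could be wired up with github.com/prometheus/client_golang. This is illustrative only, not the code in this PR: the queryCacheUsage helper, the connectedClients counter, and the hard-coded 5m tick are assumptions standing in for the engine's real internals.

```go
package main

import (
	"log"
	"net/http"
	"os"
	"sync/atomic"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// connectedClients stands in for the engine's real client tracking; it would be
// incremented/decremented as client sessions attach and detach.
var connectedClients atomic.Int64

// queryCacheUsage stands in for buildkit's disk-usage query inside the engine.
func queryCacheUsage() (sizeBytes, entries int64) { return 0, 0 }

func serveMetrics() error {
	addr := os.Getenv("_EXPERIMENTAL_DAGGER_METRICS_ADDR")
	if addr == "" {
		return nil // listener disabled unless the env var is set
	}

	reg := prometheus.NewRegistry()

	// Updated synchronously on every scrape via a GaugeFunc.
	reg.MustRegister(prometheus.NewGaugeFunc(prometheus.GaugeOpts{
		Name: "dagger_connected_clients",
		Help: "Current number of connected clients.",
	}, func() float64 { return float64(connectedClients.Load()) }))

	// Cache gauges are refreshed on an interval rather than per scrape, to avoid
	// holding buildkit disk-usage locks on every metrics request.
	diskSize := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dagger_local_cache_total_disk_size_bytes",
		Help: "Total disk space consumed by the local cache, in bytes.",
	})
	entryCount := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dagger_local_cache_entry_count",
		Help: "Total number of entries in the local cache.",
	})
	reg.MustRegister(diskSize, entryCount)

	go func() {
		for range time.Tick(5 * time.Minute) { // interval would come from the env var
			size, count := queryCacheUsage()
			diskSize.Set(float64(size))
			entryCount.Set(float64(count))
		}
	}()

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	return http.ListenAndServe(addr, mux)
}

func main() {
	if err := serveMetrics(); err != nil {
		log.Fatal(err)
	}
}
```

Scraping the endpoint then returns the three gauges in the standard Prometheus text exposition format.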

@gerhard One thing that occurred to me while writing this is that we could instead try using OTEL metrics, which is appealing in that it could unify more with the rest of our telemetry stack.

  • We do already support OTEL metrics and exporting those to cloud, but they are "per-client" and integrated with the rest of our per-client OTEL spans/logs.
    • For this new use case, we'd want to export metrics for the engine "as a whole"
  • This would be more of a "push" model, whereas exposing a Prometheus endpoint on the engine is more "pull"
  • Potential pros of this approach:
    1. Integrates more with the rest of our existing cloud OTEL stack (persistence of data for each engine in the backend db, etc.)
    2. We could plausibly use the engine pushing these metrics as a healthcheck (if it's not pushing engine-wide metrics, something probably went wrong)
    3. The "push" model may be simpler in terms of networking; i.e. it's often easier to have the engine reach out to the cloud than it is for us to reach the engine in arbitrary networking environments (which isn't a problem today but may become more of a headache in the longer term)
  • Potential cons:
    1. More complicated to get going

Given its extreme simplicity, I'm also totally fine starting w/ this Prometheus listener approach to get us going and then iterating from there, possibly moving to OTEL metrics down the line. Just wanted to raise this now.

Update on the above: we decided to go with Prometheus for now since it's the simplest to get going, with the intention of moving to OTEL metrics as we iterate over time. Given this, I'm leaving the configuration of the metrics as an env var prefixed with _EXPERIMENTAL.
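
For reference, here is a rough sketch of what the push-model alternative discussed above might look like with the OTel Go SDK's periodic OTLP exporter. None of this is in the PR; the collector endpoint, the 30s interval, and the currentClientCount helper are all assumptions.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// currentClientCount is a hypothetical stand-in for the engine's client tracking.
func currentClientCount() int64 { return 0 }

func main() {
	ctx := context.Background()

	// In the push model the engine reaches out to a collector endpoint instead
	// of waiting to be scraped.
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("otel-collector.example.com:4317"), // assumed endpoint
	)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(30*time.Second))),
	)

	meter := provider.Meter("dagger.io/engine")
	_, err = meter.Int64ObservableGauge("dagger_connected_clients",
		metric.WithDescription("Current number of connected clients."),
		metric.WithInt64Callback(func(_ context.Context, o metric.Int64Observer) error {
			o.Observe(currentClientCount())
			return nil
		}),
	)
	if err != nil {
		log.Fatal(err)
	}

	// ... the engine would run here; the reader exports in the background.
	time.Sleep(time.Minute)
	if err := provider.Shutdown(ctx); err != nil {
		log.Fatal(err)
	}
}
```

The trade-off is exactly the one noted above: the engine needs outbound connectivity and an exporter pipeline, but nothing needs to reach into the engine's network to collect the data.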


Draft TODOs:

  • Figure out above question of OTEL vs. Prometheus (will discuss with Gerhard)
    • Gonna use Prometheus listener as is for now to get going quickly, will iterate most likely towards OTEL over time
  • Support configuration of metrics endpoint as engine config file setting + command line param (rather than just an env var)
    • Decided to just stick w/ a _EXPERIMENTAL_* env var for now since we're not sure of the lifespan of this; it may be more of a stepping stone
  • Clarify if the /healthz endpoint is needed for Prometheus
    • Windsurf added this for me, have not yet looked into whether it's really needed
    • Didn't see that it was needed, so removed it for now

@sipsma sipsma requested a review from gerhard June 9, 2025 19:46
sipsma added 3 commits June 10, 2025 12:43
Starting out with just a very simple gauge for the number of connected
clients.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma force-pushed the prometheus-metrics branch from 5fe17f2 to 1f3079a on June 10, 2025 19:43
@sipsma sipsma marked this pull request as ready for review June 10, 2025 19:48
@gerhard (Member) commented Jun 17, 2025

So good to see this come together @sipsma!

One thing that occurred to me while writing this is that we could try to instead use OTEL metrics, which is appealing in that it could unify more with the rest of our telemetry stack.

Using OTel metrics is a great idea. Yes, we should absolutely do that in addition to exposing this Prometheus /metrics endpoint.

There is complexity associated with depending on the /metrics interface, even if we build the Grafana dashboard or share some Prometheus alerts. Wiring everything together, keeping up with the updates, combining with logs, mapping it to operations that run within the Engine, etc. is just too much work. Enthusiasts are free to do it, but anyone who wants the polished end-result, which is being improved on constantly, will opt for Dagger Cloud. Another option is to plug the OTel data into an LLM, but that brings its own set of complexities.

On top of the above, we both know that there is a lot more to this than what can be exposed via /metrics. OTel allows us to attach a lot more contextual meaning, which makes the data a lot more insightful. In conclusion: OTel as a more detailed & insightful addition on top of this new & experimental /metrics endpoint makes the most sense to me.

Since I started typing this, the PR moved along a bit, so I'm going to do one final review before we merge & ship it, preferably in the next release. cc @jedevc

@sipsma sipsma added this to the v0.18.11 milestone Jun 17, 2025
We have a very simple load-test client that makes the metrics vary and
shows how everything comes together. To run it:

    dagger -m modules/metrics run up

Grafana will be available on http://localhost:3000 (default username &
password are admin:admin), and it will already be wired with
Prometheus & a Dagger Dev Engine that has `/metrics` enabled.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
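
As a rough illustration of the idea (this is not the modules/metrics code; the image tag, loop count, and output handling are arbitrary), a load-test client built on the Dagger Go SDK could repeatedly connect, run a trivial pipeline, and disconnect, so that dagger_connected_clients and the cache gauges have something to show:

```go
package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()
	for i := 0; i < 10; i++ {
		if err := runOnce(ctx, i); err != nil {
			fmt.Fprintln(os.Stderr, err)
		}
	}
}

func runOnce(ctx context.Context, i int) error {
	// Each Connect shows up as a connected client on the engine.
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		return err
	}
	defer client.Close()

	// A trivial exec that also populates the local cache with a new entry per run.
	out, err := client.Container().
		From("alpine:3.20").
		WithExec([]string{"sh", "-c", fmt.Sprintf("echo run-%d", i)}).
		Stdout(ctx)
	if err != nil {
		return err
	}
	fmt.Print(out)
	return nil
}
```

Each iteration opens a fresh engine session, so the connected-clients gauge ticks up and down while new cache entries accumulate.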
@gerhard gerhard force-pushed the prometheus-metrics branch from 352b468 to 6aa3c16 on June 24, 2025 18:41
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@gerhard (Member) left a comment


This now looks good to me!

The only thing left is to add a super simple Grafana dashboard that visualises the 3 Dagger Engine metrics. We don't need to block the release on that. What we have so far is a good enough start:

dagger -m modules/metrics run up

And then just go to Drilldown in the local Grafana to see this:

[screenshot: Grafana Drilldown view showing the Dagger Engine metrics]

@sipsma sipsma force-pushed the prometheus-metrics branch from 127b443 to 7955219 on June 24, 2025 19:00
Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma sipsma force-pushed the prometheus-metrics branch from 7955219 to 00abe87 on June 24, 2025 19:22
@sipsma sipsma merged commit eac70cb into dagger:main Jun 24, 2025
56 of 57 checks passed
gerhard added a commit to gerhard/dagger that referenced this pull request Jun 27, 2025
It visualises and explains the first three (3) Dagger Engine metrics:
1. dagger_connected_clients
2. dagger_local_cache_entries
3. dagger_local_cache_total_disk_size_bytes

To use this, run:

    dagger -m github.com/dagger/dagger/modules/metrics call run up

Then open up Grafana in your browser: http://localhost:3000 (default
user is `admin` & pass is also `admin`)

This is a follow-up to:
- dagger#10555

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
gerhard added a commit that referenced this pull request Jun 27, 2025, with the same commit message as above (referencing #10555).