add support for prometheus metrics listener in engine #10555
Conversation
Starting out with just a very simple gauge for the number of connected clients.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
So good to see this come together @sipsma!
Using OTel metrics is a great idea. Yes, we should absolutely do that in addition to exposing this Prometheus `/metrics` endpoint. There is complexity associated with depending on the OTel stack for this. On top of the above, we both know that there is a lot more to this than what can be exposed via `/metrics`. Since I started typing this, the PR moved along a bit, so I'm going to do one final review before we merge & ship it, preferably in the next release.

cc @jedevc
We have a very simple load-test client that makes the metrics vary and shows how everything comes together. To run it:

```
dagger -m modules/metrics run up
```

Grafana will be available on http://localhost:3000 (default username & password are admin:admin), and it will already be wired with Prometheus & a Dagger Dev Engine that has `/metrics` enabled.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
This now looks good to me!
The only thing left is to add a super simple Grafana dashboard that visualises the 3 Dagger Engine metrics. We don't need to block the release on that. What we have so far is a good enough start:
```
dagger -m modules/metrics run up
```
And then just go to Drilldown in the local Grafana to see the metrics.
It visualises and explains the first three (3) Dagger Engine metrics:

1. dagger_connected_clients
2. dagger_local_cache_entries
3. dagger_local_cache_total_disk_size_bytes

To use this, run:

```
dagger -m github.com/dagger/dagger/modules/metrics call run up
```

Then open up Grafana in your browser: http://localhost:3000 (default user is `admin` & pass is also `admin`).

This is a follow-up to:

- #10555

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Starting out with just a very simple Prometheus gauge for the number of connected clients + a few high-level metrics on the local cache (total disk space + number of entries). This is meant to support some cloud-related use cases.
Current usage:

- `_EXPERIMENTAL_DAGGER_METRICS_ADDR` (e.g. `0.0.0.0:9090`) enables the metrics listener on that address.
- `_EXPERIMENTAL_DAGGER_METRICS_CACHE_UPDATE_INTERVAL` configures the interval in which cache metrics are updated; defaults to `5m`.

Metrics are exposed on `/metrics`:

```
curl http://<engine-endpoint>:9090/metrics
dagger_connected_clients 1
dagger_local_cache_total_disk_size_bytes 24000
dagger_local_cache_entry_count 42
```
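For context, here is a minimal sketch of how a gauge-based `/metrics` listener like this can be wired up with `prometheus/client_golang`. The `queryLocalCache` helper, the hard-coded address/interval, and the values are illustrative stand-ins, not the actual engine code:

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	connectedClients = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dagger_connected_clients",
		Help: "Number of currently connected clients.",
	})
	cacheDiskSize = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dagger_local_cache_total_disk_size_bytes",
		Help: "Total disk space used by the local cache.",
	})
	cacheEntryCount = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "dagger_local_cache_entry_count",
		Help: "Number of entries in the local cache.",
	})
)

// queryLocalCache stands in for however the engine measures its local cache.
func queryLocalCache() (sizeBytes, entryCount int64) {
	return 24000, 42
}

func main() {
	reg := prometheus.NewRegistry()
	reg.MustRegister(connectedClients, cacheDiskSize, cacheEntryCount)

	// Refresh the cache gauges on an interval, mirroring
	// _EXPERIMENTAL_DAGGER_METRICS_CACHE_UPDATE_INTERVAL (default 5m).
	go func() {
		for range time.Tick(5 * time.Minute) {
			size, count := queryLocalCache()
			cacheDiskSize.Set(float64(size))
			cacheEntryCount.Set(float64(count))
		}
	}()

	// connectedClients.Inc() / .Dec() would be called as client sessions
	// start and end; here it is just set once for the demo.
	connectedClients.Set(1)

	// Serve /metrics on the address given by _EXPERIMENTAL_DAGGER_METRICS_ADDR.
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe("0.0.0.0:9090", nil))
}
```

Gauges fit here because all three values are point-in-time snapshots rather than monotonically increasing counters.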
@gerhard One thing that occurred to me while writing this is that we could try to instead use OTEL metrics, which is appealing in that it could unify more with the rest of our telemetry stack.

Given its extreme simplicity, I'm also totally fine starting w/ this Prometheus listener approach to get us going and then iterate from there, possibly moving to OTEL metrics down the line. Just wanted to raise this now.
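For comparison, a rough sketch of what the connected-clients gauge might look like if it went through the OTel metrics API (`go.opentelemetry.io/otel`) instead. The `registerConnectedClientsGauge` / `getConnected` names are hypothetical, not anything in this PR:

```go
// Hypothetical sketch of the same gauge expressed via the OTel metrics API.
package enginemetrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// registerConnectedClientsGauge registers an observable gauge; getConnected
// is a stand-in for however the engine would count its active clients.
func registerConnectedClientsGauge(getConnected func() int64) error {
	meter := otel.Meter("dagger.io/engine")

	gauge, err := meter.Int64ObservableGauge(
		"dagger_connected_clients",
		metric.WithDescription("Number of currently connected clients."),
	)
	if err != nil {
		return err
	}

	// The callback runs on every collection, whether the MeterProvider is
	// wired to a Prometheus exporter or pushes OTLP to a collector.
	_, err = meter.RegisterCallback(func(ctx context.Context, o metric.Observer) error {
		o.ObserveInt64(gauge, getConnected())
		return nil
	}, gauge)
	return err
}
```

The appeal is that the same instrument could then be exported either in Prometheus format or over OTLP alongside the existing traces and logs.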
Update on the above: we decided to go with Prometheus for now since it's the simplest to get going, with the intention of moving to OTEL metrics as we iterate over time. Given this, leaving the configuration of the metrics as an env var prefixed with `_EXPERIMENTAL`.
Draft TODOs:

- Support configuration of the metrics endpoint as an engine config file setting + command line param (rather than just an env var). Sticking with the `_EXPERIMENTAL_*` env var for now since we are not sure of the lifespan of this; it may be more of a stepping stone.
- Figure out whether a `/healthz` endpoint is needed for Prometheus.
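On the last point: Prometheus itself only needs to scrape `/metrics`, but if a separate liveness endpoint turns out to be useful, a trivial `/healthz` handler on the same listener could look like this hypothetical sketch (not part of this PR):

```go
// Hypothetical sketch: a /healthz liveness handler served alongside /metrics.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Report liveness only; readiness checks would need engine state.
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe("0.0.0.0:9090", mux))
}
```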