Description
What did you do?
Hi all, we're using the Prometheus monitoring stack to monitor our servers. Like most setups, we alert on the up metric to check whether a server is up or down and fire an alert if it has been down for 5m. Below is the rule we're using; it's a common one:
groups:
  - name: server_alerts
    rules:
      - alert: ServerDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Server {{ $labels.instance_name }} is down."
          VALUE: "{{ $value }}"
          LABELS: "{{ $labels }}"
          summary: "Server {{ $labels.instance_name }} is down"
I have 10 servers under monitoring; 2 of them are stopped and started on a schedule every day: they stop at 9PM and start again at 9AM.
When a server is stopped, we receive an alert that it is down, as expected. But when it comes back up the next morning at 9AM, I receive another alert roughly 4 minutes later. Why does this happen? The target shows as UP on the targets page, yet I still receive an alert even though the alert is no longer in the firing state.
How can I change this behaviour?
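For reference, this is the kind of check I can run against Prometheus to see whether the ServerDown alert actually goes pending/firing again around 9AM (localhost:9090 and the instance_name value are placeholders for one of the scheduled servers):

# Inspect the built-in ALERTS series over the startup window; alertstate
# shows whether the alert re-entered pending/firing after the target came back.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertname="ServerDown", instance_name="scheduled-server-1"}[30m]'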
What did you expect to see?
After the initial alert is sent when the server is stopped, I should not receive another alert once it comes back up.
What did you see instead? Under which circumstances?
- I receive an alert again in the morning after 9AM, approximately 4-5 minutes after startup (the delay varies).
- The target shows as UP in the targets section.
- In the alerts tab, the alert for that particular instance is no longer in the firing state (see the Alertmanager check below).
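To cross-check from the Alertmanager side, something like this lists which ServerDown alerts Alertmanager still holds (the URL is a placeholder for our setup):

# List the ServerDown alerts Alertmanager currently knows about.
amtool alert query alertname=ServerDown --alertmanager.url=http://alertmanager:9093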
System information
Ubuntu
Prometheus version
prometheus, version 3.4.1 (branch: HEAD, revision: aea6503d9bbaad6c5faff3ecf6f1025213356c92)
build user: root@16f976c24db1
build date: 20250531-10:44:38
go version: go1.24.3
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
Prometheus configuration file
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    client: TESTING

remote_write:
  - url: "http://thanos:10908/api/v1/receive"

scrape_configs:
  - job_name: "test"
    ec2_sd_configs: &ec2config
      - region: "ap-south-1"
    relabel_configs:
      - source_labels: [__meta_ec2_tag_OS]
        regex: linux
        action: keep
      - source_labels: [__meta_ec2_private_ip]
        regex: '(.*)'
        replacement: '${1}:1784'
        target_label: __address__
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name

  - job_name: 'alertmanager'
    static_configs:
      - targets: ["alertmanager:9093"]

### Rule Files ####
rule_files:
  - "/etc/prometheus/EC2-Alerts.yml"
  - "/etc/prometheus/RDS-Alerts.yml"

#### AlertManager ####
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
      scheme: http
      basic_auth:
        username: "admin"
        password: "XXXXXXXXXXXXXX"
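For what it's worth, this is how the config and the rule files it references can be validated (the path is a placeholder for where the config is mounted in our container):

# Validate the main config; promtool also checks the rule files listed under rule_files.
promtool check config /etc/prometheus/prometheus.yml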
Alertmanager version
alertmanager, version 0.27.0 (branch: HEAD, revision: 0aa3c2aad14cff039931923ab16b26b7481783b5)
build user: root@22cd11f671e9
build date: 20240228-11:51:20
go version: go1.21.7
platform: linux/amd64
tags: netgo
Alertmanager configuration file
route:
  group_by: ['alertname', 'instance_name', 'instance', 'category', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5h
  receiver: 'default-receiver'
  routes:
    - receiver: 'test'
      matchers:
        - alertname="ServerDown"
      repeat_interval: 720h
      group_wait: 30s
      continue: false
    - receiver: 'test'
      group_wait: 10s
      continue: false

receivers:
  - name: 'test'
    webhook_configs:
      - url: 'https://XXXXXXX/Prod/GraphanaWebhook'
        http_config:
          authorization:
            credentials: absd
        send_resolved: true
  - name: 'default-receiver'
    webhook_configs:
      - url: 'https://XXXXXXX/Prod/GraphanaWebhook'
        http_config:
          authorization:
            credentials: absd
        send_resolved: false

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'trouble'
    equal: ['instance', 'category']
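For completeness, this is how the routing for a ServerDown alert can be checked against the config above (the config path and the label values are placeholders):

# Validate the Alertmanager config, then show which route/receiver a
# ServerDown alert with these labels would match.
amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml alertname=ServerDown severity=critical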
Alertmanager is running with --data.retention=730h.
Logs