Description
What did you do?
Hi all, we're using the Prometheus monitoring stack to monitor our servers. Like most setups, we alert on the up metric to check whether a server is up or down and fire an alert if it has been down for 5m. Below is the rule we're using; it's a common one:
groups:
  - name: server_alerts
    rules:
      - alert: ServerDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Server {{ $labels.instance_name }} is down."
          VALUE: "{{ $value }}"
          LABELS: "{{ $labels }}"
          summary: "Server {{ $labels.instance_name }} is down"
I have 10 servers under monitoring; 2 of them are stopped and started on a schedule every day: they stop at 9PM and start again at 9AM.
When a server is stopped, we receive an alert that it is down, as expected. But when it comes back up the next morning at 9AM, I receive another alert roughly 4 minutes later. Why does this happen? The target shows as UP on the targets page, yet I still receive an alert even though the alert is no longer in the firing state.
How can I change this behaviour?
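For reference, this is the kind of check I can run against Prometheus to see whether the ServerDown alert actually goes pending/firing again around 9AM (localhost:9090 and the instance_name value are placeholders for one of the scheduled servers):

# Inspect the built-in ALERTS series over the startup window; alertstate
# shows whether the alert re-entered pending/firing after the target came back.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertname="ServerDown", instance_name="scheduled-server-1"}[30m]'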
What did you expect to see?
After the initial alert is sent when the server is stopped, I should not receive another alert once it comes back up.
What did you see instead? Under which circumstances?
- I receive an alert again in the morning after 9AM, approximately 4-5 minutes after startup (the delay varies).
- The target shows as UP in the targets section.
- In the alerts tab, the alert for that particular instance is no longer in the firing state (see the Alertmanager check below).
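To cross-check from the Alertmanager side, something like this lists which ServerDown alerts Alertmanager still holds (the URL is a placeholder for our setup):

# List the ServerDown alerts Alertmanager currently knows about.
amtool alert query alertname=ServerDown --alertmanager.url=http://alertmanager:9093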
System information
Ubuntu
Prometheus version
prometheus, version 3.4.1 (branch: HEAD, revision: aea6503d9bbaad6c5faff3ecf6f1025213356c92)
build user: root@16f976c24db1
build date: 20250531-10:44:38
go version: go1.24.3
platform: linux/amd64
tags: netgo,builtinassets,stringlabels
Prometheus configuration file
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    client: TESTING

remote_write:
  - url: "http://thanos:10908/api/v1/receive"

scrape_configs:
  - job_name: "test"
    ec2_sd_configs: &ec2config
      - region: "ap-south-1"
    relabel_configs:
      - source_labels: [__meta_ec2_tag_OS]
        regex: linux
        action: keep
      - source_labels: [__meta_ec2_private_ip]
        regex: '(.*)'
        replacement: '${1}:1784'
        target_label: __address__
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name

  - job_name: 'alertmanager'
    static_configs:
      - targets: ["alertmanager:9093"]

### Rule Files ####
rule_files:
  - "/etc/prometheus/EC2-Alerts.yml"
  - "/etc/prometheus/RDS-Alerts.yml"

#### AlertManager ####
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
      scheme: http
      basic_auth:
        username: "admin"
        password: "XXXXXXXXXXXXXX"
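For what it's worth, this is how the config and the rule files it references can be validated (the path is a placeholder for where the config is mounted in our container):

# Validate the main config; promtool also checks the rule files listed under rule_files.
promtool check config /etc/prometheus/prometheus.yml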
Alertmanager version
alertmanager, version 0.27.0 (branch: HEAD, revision: 0aa3c2aad14cff039931923ab16b26b7481783b5)
build user: root@22cd11f671e9
build date: 20240228-11:51:20
go version: go1.21.7
platform: linux/amd64
tags: netgo
Alertmanager configuration file
route:
  group_by: ['alertname', 'instance_name', 'instance', 'category', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5h
  receiver: 'default-receiver'
  routes:
    - receiver: 'test'
      matchers:
        - alertname="ServerDown"
      repeat_interval: 720h
      group_wait: 30s
      continue: false
    - receiver: 'test'
      group_wait: 10s
      continue: false

receivers:
  - name: 'test'
    webhook_configs:
      - url: 'https://XXXXXXX/Prod/GraphanaWebhook'
        http_config:
          authorization:
            credentials: absd
        send_resolved: true
  - name: 'default-receiver'
    webhook_configs:
      - url: 'https://XXXXXXX/Prod/GraphanaWebhook'
        http_config:
          authorization:
            credentials: absd
        send_resolved: false

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'trouble'
    equal: ['instance', 'category']
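For completeness, this is how the routing for a ServerDown alert can be checked against the config above (the config path and the label values are placeholders):

# Validate the Alertmanager config, then show which route/receiver a
# ServerDown alert with these labels would match.
amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml alertname=ServerDown severity=critical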
Alertmanager is running with --data.retention=730h.
Logs