Up metric fires alert after the server is up #16748
Open
@kyouma24

What did you do?

Hi all, we're using the Prometheus monitoring stack to monitor our servers. Like most setups, we alert on the up metric to check whether a server is up or down, and fire an alert if it has been down for 5m. Below is the rule we're using; it's a common one:

groups:
  - name: server_alerts
    rules:
      - alert: ServerDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Server {{ $labels.instance_name }} is down."
          VALUE: "{{ $value }}"
          LABELS: "{{ $labels }}"
          summary: "Server {{ $labels.instance_name }} is down"

I have 10 servers under monitoring; 2 of them have a scheduled stop and start every day. They stop at 9PM and start again at 9AM.

When a server is stopped, we receive the ServerDown alert as expected, but when it comes back up the next morning at 9AM, I receive the alert again approximately 4 minutes later. Why does this happen? The server shows as UP in the Targets page and the alert is no longer in the firing state, yet I still receive the notification.
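For what it's worth, one way I can cross-check this is with Prometheus' built-in ALERTS series; querying it around 9AM should show whether the alert is actually still pending or firing at that point (just a sketch, the alertname matches the rule above):

ALERTS{alertname="ServerDown", alertstate="firing"}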

How can I change this behaviour?

What did you expect to see?

After the initial alert is sent when the server is stopped, I should not receive another alert once the server comes back up.

What did you see instead? Under which circumstances?

• I receive the alert again in the morning shortly after 9AM, approximately 4-5 minutes later (the exact delay varies).
• The target shows as UP in the Targets section.
• In the Alerts tab, the alert for that instance is no longer in the firing state.

System information

Ubuntu

Prometheus version

prometheus, version 3.4.1 (branch: HEAD, revision: aea6503d9bbaad6c5faff3ecf6f1025213356c92)
  build user:       root@16f976c24db1
  build date:       20250531-10:44:38
  go version:       go1.24.3
  platform:         linux/amd64
  tags:             netgo,builtinassets,stringlabels

Prometheus configuration file

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    client: TESTING

remote_write:
- url: "http://thanos:10908/api/v1/receive"

scrape_configs:
  - job_name: "test"
    ec2_sd_configs: &ec2config
      - region: "ap-south-1"
    relabel_configs:
      - source_labels: [__meta_ec2_tag_OS]
        regex: linux
        action: keep
      - source_labels: [__meta_ec2_private_ip]
        regex: '(.*)'
        replacement: '${1}:1784'
        target_label: __address__
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name

  - job_name: 'alertmanager'
    static_configs:
      - targets: ["alertmanager:9093"]
### Rule Files ####
rule_files:
  - "/etc/prometheus/EC2-Alerts.yml"
  - "/etc/prometheus/RDS-Alerts.yml"

#### AlertManager ####
alerting:
  alertmanagers:
    - static_configs:
      - targets: ["alertmanager:9093"]
      scheme: http
      basic_auth:
        username: "admin"
        password: "XXXXXXXXXXXXXX"
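For clarity, after the relabel rules above, the up series for one of these EC2 targets ends up looking roughly like this (the IP and Name tag value below are made-up placeholders), which is where the instance_name label used in the alert annotations comes from:

up{job="test", instance="10.0.1.23:1784", instance_name="app-server-1"}  1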

Alertmanager version

alertmanager, version 0.27.0 (branch: HEAD, revision: 0aa3c2aad14cff039931923ab16b26b7481783b5)
  build user:       root@22cd11f671e9
  build date:       20240228-11:51:20
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo

Alertmanager configuration file

route:
  group_by: ['alertname','instance_name', 'instance', 'category', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5h
  receiver: 'default-receiver'
  routes:
    - receiver: 'test'
      matchers:
        - alertname="ServerDown"
      repeat_interval: 720h
      group_wait: 30s
      continue: false

    - receiver: 'test'
      group_wait: 10s
      continue: false

receivers:
  - name: 'test'
    webhook_configs:
      - url: 'https://XXXXXXX/Prod/GraphanaWebhook'
        http_config:
          authorization:
            credentials: absd
        send_resolved: true

  - name: 'default-receiver'
    webhook_configs:
      - url: 'https://XXXXXXX/Prod/GraphanaWebhook'
        http_config:
          authorization:
            credentials: absd
        send_resolved: false

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'trouble'
    equal: ['instance','category']

Running Alertmanager with --data.retention=730h.
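For reference, a rough sketch of how that retention flag is passed on startup (the config path here is an assumed example, not our exact command line):

alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --data.retention=730h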

Logs

