Alerting to Slack

Having metrics and logs is only useful if something tells you when things go wrong. In the homelab, alerts go to an #alerts channel in Slack through two paths.


Two alerting paths

Prometheus alert rules ──> Alertmanager ──> Slack
Loki alert rules       ──> Alertmanager ──> Slack
Grafana alert rules    ──> Grafana      ──> Slack

Alertmanager handles alerts that come from Prometheus (metric-based) and from Loki's Ruler (log-based). It groups, deduplicates, and routes them.

Grafana Unified Alerting handles alerts created directly in the Grafana UI. You can write a PromQL or LogQL query, set a threshold, and Grafana evaluates it on a schedule.

Both send to the same Slack channel, but through different mechanisms.


Alertmanager setup

Alertmanager is part of kube-prometheus-stack. The Slack webhook URL lives in Doppler and gets synced into the cluster via an ExternalSecret:

Doppler (GRAFANA_SLACK_WEBHOOK_URL)
  ──> ExternalSecret (grafana-slack-webhook)
  ──> Secret in metrics namespace
  ──> Alertmanager reads from the mounted secret
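The ExternalSecret for this flow might look roughly like the following sketch. The resource names match the flow above, but the store name and secret key are assumptions, not the actual manifests:

```yaml
# Sketch only — the ClusterSecretStore name and secretKey are assumptions.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-slack-webhook
  namespace: metrics
spec:
  refreshInterval: 1h            # matches the hourly refresh mentioned later
  secretStoreRef:
    kind: ClusterSecretStore
    name: doppler                # assumed store name
  target:
    name: grafana-slack-webhook  # the Secret created in the metrics namespace
  data:
    - secretKey: webhook-url
      remoteRef:
        key: GRAFANA_SLACK_WEBHOOK_URL
```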

Alertmanager is configured with routing rules based on severity:

  • critical alerts repeat every 1 hour
  • warning alerts repeat every 6 hours
  • InfoInhibitor alerts are silenced (sent to a null receiver)
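Expressed as Alertmanager configuration, that routing tree might look like this sketch. The receiver names and grouping labels are assumptions; the repeat intervals come from the list above:

```yaml
# Sketch of severity-based routing — receiver names are assumptions.
route:
  receiver: slack
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - severity = "critical"
      repeat_interval: 1h
    - matchers:
        - severity = "warning"
      repeat_interval: 6h
    - matchers:
        - alertname = "InfoInhibitor"
      receiver: "null"           # silenced: a receiver with no integrations
receivers:
  - name: slack
    slack_configs:
      - channel: "#alerts"
  - name: "null"
```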

There's also an inhibition rule: if a critical alert is firing, it suppresses the corresponding warning alert for the same alertname and namespace. This avoids duplicate noise.
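In Alertmanager config, that inhibition rule would look something like this:

```yaml
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    # only inhibit when these labels match between the two alerts
    equal: ["alertname", "namespace"]
```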

The Slack message template includes alert name, status, severity, namespace, summary, and description. When an alert resolves, Alertmanager sends a resolution message too (send_resolved: true).
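A sketch of that Slack receiver follows. The exact template wording is an assumption, but the fields (name, status, severity, namespace, summary, description) and send_resolved: true come from the setup described above:

```yaml
receivers:
  - name: slack
    slack_configs:
      - channel: "#alerts"
        send_resolved: true      # notify when an alert resolves too
        title: '{{ .Status | toUpper }}: {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Severity:* {{ .Labels.severity }}
          *Namespace:* {{ .Labels.namespace }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          {{ end }}
```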


Grafana alerting setup

Grafana has its own alerting engine (Unified Alerting). A provisioned contact point called "Slack" is loaded from a ConfigMap at startup. It uses the same webhook URL, injected as an environment variable.
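The provisioned contact point is a small alerting-provisioning file. A sketch, assuming the environment variable is called SLACK_WEBHOOK_URL (the actual name may differ):

```yaml
# Sketch of a provisioned contact point; uid and env var name are assumptions.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: Slack
    receivers:
      - uid: slack-cp
        type: slack
        settings:
          url: $SLACK_WEBHOOK_URL   # expanded from the injected env var
```

In the cluster, a file like this would be mounted from the ConfigMap into Grafana's provisioning/alerting directory.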

Alert rules created in the Grafana UI can query any datasource: Prometheus, Thanos, or Loki. You set an evaluation interval, a pending period (how long the condition must be true before firing), and labels like severity.

Some examples of alerts I run:

  • High CPU usage (> 80% for 5 minutes)
  • Pod crash looping (restarts in the last 15 minutes)
  • High error rate in logs (via Loki query)
  • PVC almost full
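The queries behind alerts like these could look roughly as follows. These are illustrative sketches — actual metric and label names depend on the exporters installed:

```
# High CPU usage (> 80% for 5 minutes) — PromQL over node_exporter
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80

# Pod crash looping — PromQL over kube-state-metrics
increase(kube_pod_container_status_restarts_total[15m]) > 3

# High error rate in logs — LogQL via Loki (selector is an assumption)
sum(rate({namespace="apps"} |= "error" [5m])) > 10

# PVC almost full — PromQL over kubelet volume stats
kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.1
```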


Loki alerting

Loki has a built-in Ruler that evaluates alert rules based on log queries. When a rule fires, it sends the alert to Alertmanager (not directly to Slack). Alertmanager then routes it like any other alert.
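Ruler alert rules use the Prometheus rule-file format with a LogQL expression. A sketch, with an assumed selector and threshold:

```yaml
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate({namespace="apps"} |= "error" [5m])) > 10
        for: 5m                      # must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: Error log rate is elevated
```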

The Ruler also supports recording rules. These turn log patterns into Prometheus metrics via remote_write. For example, counting error logs per service per minute and writing that as a time series to Prometheus. Then you can query it with PromQL, build dashboards, and even alert on it from the metrics side.
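Two sketches of that setup follow: a recording rule that turns error logs into a metric, and the ruler fragment that ships it to Prometheus. The metric name, selector, and URL are assumptions:

```yaml
# Rule file: record error-log rate per service as a time series.
groups:
  - name: log-metrics
    rules:
      - record: service:error_logs:rate1m
        expr: sum by (service) (rate({namespace="apps"} |= "error" [1m]))
---
# Fragment of the Loki config (ruler section) enabling remote_write.
ruler:
  remote_write:
    enabled: true
    client:
      url: http://prometheus.metrics.svc:9090/api/v1/write
```

Note that Prometheus only accepts these writes if its remote-write receiver is enabled (the --web.enable-remote-write-receiver flag).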

Rules are loaded dynamically via ConfigMaps with the label loki_rule: "true". The Loki sidecar watches for these and reloads them automatically.
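A rule-carrying ConfigMap might look like this — the names are assumptions, but the loki_rule: "true" label is what the sidecar matches on:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-alerting-rules
  namespace: metrics
  labels:
    loki_rule: "true"        # the sidecar watches for this label
data:
  rules.yaml: |
    groups:
      - name: homelab-logs
        rules:
          - alert: HighErrorRate
            expr: sum(rate({namespace="apps"} |= "error" [5m])) > 10
            for: 5m
```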


Secrets management

The Slack webhook URL never touches Git. It lives in Doppler, gets synced to a Kubernetes Secret via External Secrets Operator, and is consumed by both Alertmanager (as a mounted file) and Grafana (as an environment variable).

If the webhook URL changes, updating it in Doppler is enough. The ExternalSecret refreshes every hour, and a Grafana restart picks up the new value.


Having two alerting paths sounds redundant, but they serve different purposes. Alertmanager is better for infrastructure alerts defined as code: Prometheus alert rules or Loki log patterns. Grafana is better for ad-hoc alerts you want to create quickly from the UI without writing YAML.

The inhibition rules in Alertmanager are worth setting up early. Without them, a single incident can generate a wall of notifications in Slack, and you stop paying attention.