The Metrics Stack

The metrics foundation in the homelab is kube-prometheus-stack. One Helm chart that installs Prometheus, Grafana, Alertmanager, Node Exporter, kube-state-metrics, and a set of default dashboards and recording rules. It's the fastest way to get real cluster observability.


What gets scraped

Prometheus scrapes everything on a 60-second interval:

Node Exporter runs as a DaemonSet on every node. CPU, memory, disk, network, filesystem. The basics that tell you if a node is healthy.

kube-state-metrics exposes the state of Kubernetes objects. Pod status, deployment replicas, PVC capacity, node conditions. This is where you get metrics like "how many pods are in CrashLoopBackOff" or "is this deployment fully rolled out."

ServiceMonitors from each addon. Loki, Thanos, ArgoCD, MinIO, cert-manager, Envoy Gateway. Each Helm chart creates its own ServiceMonitor, so Prometheus discovers and scrapes them automatically.

PodMonitor for applications. A generic PodMonitor called o11y-apps watches for pods with the o11y.ruiz.sh/metrics: "true" label. It scrapes ports named metrics or http-metrics. This is how application pods opt into metrics collection without needing their own ServiceMonitor.
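A minimal sketch of that PodMonitor (the selector label and port names come from the setup above; the namespace and the any-namespace selector are assumptions):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: o11y-apps
  namespace: monitoring          # assumption: where the stack lives
spec:
  namespaceSelector:
    any: true                    # assumption: watch pods in all namespaces
  selector:
    matchLabels:
      o11y.ruiz.sh/metrics: "true"
  podMetricsEndpoints:
    - port: metrics              # scrapes ports named "metrics"...
    - port: http-metrics         # ...or "http-metrics"
```

Any pod that carries the label and names its port accordingly gets scraped, no per-app ServiceMonitor needed.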


Prometheus config

Prometheus runs on the large tier node (6 vCPUs, 20 GB RAM). Key settings:

  • Retention: 6 days or 8 GB, whichever comes first
  • Scrape interval: 60 seconds
  • WAL compression: enabled
  • Storage: 8 GB PVC on Longhorn (single replica)

Six days of local retention is enough for recent debugging. Anything older goes to Thanos via the Sidecar, which uploads completed 2-hour blocks to MinIO.
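In kube-prometheus-stack values, those settings map onto prometheusSpec roughly like this (a sketch; the storageClassName is an assumption):

```yaml
prometheus:
  prometheusSpec:
    retention: 6d            # time-based limit
    retentionSize: 8GB       # size-based limit; whichever hits first wins
    scrapeInterval: 60s
    walCompression: true     # Snappy-compress WAL segments
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn   # assumption: Longhorn's default class name
          resources:
            requests:
              storage: 8Gi
```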

Remote write receiver is enabled so Loki's recording rules can push derived metrics back into Prometheus.
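Two pieces make that work: the receiver flag on the Prometheus side, and a remote_write client in Loki's ruler config (the in-cluster URL is an assumption based on the operator's standard service name):

```yaml
# kube-prometheus-stack values
prometheus:
  prometheusSpec:
    enableRemoteWriteReceiver: true   # exposes /api/v1/write on Prometheus

# Loki config, ruler section
ruler:
  remote_write:
    enabled: true
    client:
      url: http://prometheus-operated.monitoring.svc:9090/api/v1/write  # assumption
```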


The WAL (Write-Ahead Log)

Every metric that Prometheus scrapes goes to the WAL first, before anything else. The WAL is an append-only log on disk that acts as a buffer between incoming data and the final TSDB blocks.

The flow:

scrape ──> WAL (append-only, on disk) ──> head block (in memory) ──> 2h block (on disk)

When Prometheus scrapes a target, it writes the samples to the WAL immediately. This is fast because it's sequential writes, no indexing. The data also lives in memory as the "head block." Every 2 hours, Prometheus cuts the head block into a compressed TSDB block on disk. At that point, the corresponding WAL segments are no longer needed and get cleaned up.

Why this matters:

Crash recovery. If Prometheus crashes or the pod restarts, the WAL is what lets it recover without losing data. On startup, Prometheus replays the WAL to reconstruct the head block. Without the WAL, any data scraped since the last 2-hour block would be gone.

Thanos dependency. The Thanos Sidecar only uploads completed 2-hour blocks. Data in the WAL and head block hasn't been uploaded yet. This means there's always a window of up to 2 hours of data that only exists locally. If the PVC dies before the block is cut, that window is lost.

WAL compression. Enabled in my config (walCompression: true). Compresses WAL segments with Snappy, which reduces disk usage with minimal CPU overhead. Always worth enabling.

The WAL is not optional. It's a fundamental part of how Prometheus TSDB works. You can't disable it.


Grafana

Grafana runs on the medium tier node. Three datasources are configured:

  • Prometheus for the last 6 days (direct)
  • Thanos for long-term queries (default datasource)
  • Loki for logs, with the X-Scope-OrgID: homelab header for multi-tenancy
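The Loki datasource with the tenant header can be provisioned like this (a sketch in Grafana's datasource provisioning format; the Loki service URL is an assumption):

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway.loki.svc    # assumption: gateway service name
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: homelab          # the tenant ID
```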

Dashboards are loaded automatically via a sidecar that watches for ConfigMaps with the grafana_dashboard: "1" label across all namespaces. Each addon can ship its own dashboard as a ConfigMap.
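Shipping a dashboard is then just a ConfigMap with the right label (a sketch; the name is hypothetical and the dashboard JSON is reduced to a stub):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: minio-dashboard          # assumption: example name
  labels:
    grafana_dashboard: "1"       # the sidecar picks this up in any namespace
data:
  # stub payload; a real dashboard JSON export goes here
  minio.json: |
    { "title": "MinIO", "panels": [] }
```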

Admin credentials come from Doppler via External Secrets. Grafana has a 2 GB PVC for persistence.


Alertmanager

Alertmanager handles alert routing. Alerts from Prometheus rules and Loki rules both land here.

Routing is by severity:

  • critical repeats every hour
  • warning repeats every 6 hours
  • InfoInhibitor goes to a null receiver (silenced)

An inhibition rule suppresses warning alerts when a critical alert is already firing for the same alertname and namespace. This avoids duplicate noise in Slack.
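A sketch of that routing in Alertmanager's own config format (receiver names and the webhook file path are assumptions; the repeat intervals and inhibition match the rules above):

```yaml
route:
  receiver: slack                  # assumption: default receiver name
  routes:
    - matchers: ['severity="critical"']
      repeat_interval: 1h
    - matchers: ['severity="warning"']
      repeat_interval: 6h
    - matchers: ['alertname="InfoInhibitor"']
      receiver: "null"             # silenced
receivers:
  - name: slack
    slack_configs:
      - api_url_file: /etc/alertmanager/secrets/slack-webhook/url  # assumption
  - name: "null"                   # drops everything routed to it
inhibit_rules:
  - source_matchers: ['severity="critical"']
    target_matchers: ['severity="warning"']
    equal: [alertname, namespace]  # suppress warnings duplicating a firing critical
```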

The Slack webhook URL is stored in Doppler, synced via ExternalSecret, and mounted as a file that Alertmanager reads.


Node placement

Workloads are distributed across tiers using nodeSelector:

Component                      Node tier
Prometheus + Thanos Sidecar    large
Grafana                        medium
Alertmanager                   medium
Prometheus Operator            medium
kube-state-metrics             medium
Node Exporter                  all nodes (DaemonSet)
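In chart values, that placement looks roughly like this (the node label key tier is an assumption; Node Exporter gets no selector since the DaemonSet runs everywhere):

```yaml
prometheus:
  prometheusSpec:
    nodeSelector:
      tier: large       # assumption: label key used for node tiers
grafana:
  nodeSelector:
    tier: medium
alertmanager:
  alertmanagerSpec:
    nodeSelector:
      tier: medium
prometheusOperator:
  nodeSelector:
    tier: medium
kube-state-metrics:
  nodeSelector:
    tier: medium
```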

The 60-second scrape interval is a tradeoff. It keeps storage and CPU low, but you lose granularity for short-lived spikes. For a homelab it's fine. In production I'd probably use 15 or 30 seconds for critical services.

Single replica on Longhorn means no redundancy. If the node with the Prometheus PVC dies, you lose local data. Thanos covers the gap since it already has the blocks in MinIO, but there's a window of up to 2 hours of data that might not have been uploaded yet.