Long-term Metrics with Thanos

Prometheus is great for recent data, but it's not designed for long-term storage. In my homelab, Prometheus keeps 6 days of data locally. After that, it's gone. Thanos solves this by uploading Prometheus blocks to object storage and making them queryable.
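As a sketch, the short local retention described above comes down to a single flag on the Prometheus container (the container spec around it is illustrative):

```yaml
# Sketch: Prometheus container args for short local retention.
# Image and paths are illustrative; the retention flag is the real Prometheus flag.
containers:
  - name: prometheus
    image: prom/prometheus
    args:
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=6d   # keep only ~6 days on local disk
```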


How it works

Prometheus writes metrics to disk in 2-hour blocks. A Thanos Sidecar runs in the same pod, watches for completed blocks, and uploads them to MinIO.

Prometheus ──> WAL ──> Block (2h) ──> Sidecar ──> MinIO

On the query side, Thanos Query talks to both the Sidecar (recent data) and the Store Gateway (historical data from MinIO). From Grafana's perspective, it's a single Prometheus-compatible datasource with months of data.
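The upload path above can be sketched as a two-container pod spec. Container names, paths, and the shared volume are illustrative; the flags are standard Thanos Sidecar flags:

```yaml
# Sketch: Thanos Sidecar next to Prometheus in the same pod,
# sharing the TSDB directory via a common volume.
containers:
  - name: prometheus
    args:
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.min-block-duration=2h  # sidecar expects uncompacted 2h blocks
      - --storage.tsdb.max-block-duration=2h
  - name: thanos-sidecar
    image: quay.io/thanos/thanos
    args:
      - sidecar
      - --tsdb.path=/prometheus               # same volume as Prometheus
      - --prometheus.url=http://localhost:9090
      - --objstore.config-file=/etc/thanos/objstore.yml  # MinIO bucket config
```

Local compaction is disabled (min and max block duration pinned to 2h) so the Sidecar always sees complete, immutable blocks to upload; merging is left to the Thanos Compactor.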


The components

  • Sidecar: uploads completed blocks from Prometheus to MinIO
  • Store Gateway: reads historical blocks from MinIO
  • Query: unified query interface across all sources
  • Query Frontend: caching layer for queries
  • Compactor: merges small blocks and downsamples old data
  • Bucket Web: UI to inspect what's in the bucket
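The fan-out from Query to the other components can be sketched as its container args. The service names and ports are illustrative (10901 is the default Thanos gRPC port):

```yaml
# Sketch: Thanos Query talking to both the Sidecar (recent data)
# and the Store Gateway (historical data from MinIO).
args:
  - query
  - --http-address=0.0.0.0:9090            # Prometheus-compatible HTTP API
  - --endpoint=thanos-sidecar:10901        # last ~6 days, straight from Prometheus
  - --endpoint=thanos-store-gateway:10901  # older blocks, served out of MinIO
```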

Retention and downsampling

Not all data needs full resolution forever. The Compactor handles this:

  • Raw: retained 15 days (recent detailed data)
  • 5-minute: retained 30 days (medium-term trends)
  • 1-hour: retained 60 days (long-term trends)

Old data gets downsampled automatically. A query for "CPU usage over the last 2 months" doesn't need per-second granularity; one-hour resolution is enough and uses far less storage.
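The retention policy above maps directly onto Compactor flags. As a sketch (paths are illustrative; the `compact` subcommand and retention flags are real Thanos flags):

```yaml
# Sketch: Compactor args implementing the retention table above.
args:
  - compact
  - --wait                                  # run continuously instead of once
  - --data-dir=/var/thanos/compact          # local workspace for merging blocks
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=15d
  - --retention.resolution-5m=30d
  - --retention.resolution-1h=60d
```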


Why MinIO

Thanos needs S3-compatible object storage to store metrics blocks. In AWS you'd use S3 directly. In a homelab, there's no S3. MinIO fills that gap. It's an open-source object storage server that implements the S3 API and runs as a pod inside the cluster.

I run MinIO with a single 30 GB PVC on Longhorn. It serves two buckets: thanos-metrics for Thanos and loki-logs for Loki. Both tools just see an S3 endpoint and don't care that it's running on the same cluster.

MinIO credentials are stored in Doppler and synced via External Secrets. Thanos and Loki each get their own Secret with the endpoint, access key, and secret key.
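As a sketch, the object storage config each Thanos component mounts might look like the following. The service address is an illustrative in-cluster name, and the key values are placeholders for what External Secrets syncs out of Doppler:

```yaml
# Sketch: objstore.yml pointing Thanos at the in-cluster MinIO.
type: S3
config:
  bucket: thanos-metrics
  endpoint: minio.minio.svc:9000   # illustrative in-cluster service address
  access_key: REPLACE_ME           # placeholder, from the synced Secret
  secret_key: REPLACE_ME           # placeholder, from the synced Secret
  insecure: true                   # plain HTTP inside the cluster
```

Loki's config follows the same shape against the loki-logs bucket, just with its own credentials.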


Storage

The Compactor has its own 10 GB PVC as workspace for merging blocks.

Daily growth is around 300 MB for metrics. With the retention policy above, total storage stays manageable.


Grafana integration

Grafana has two Prometheus-compatible datasources:

  • Prometheus for the last 6 days (direct)
  • Thanos for everything older (via Store Gateway + MinIO)
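Since Thanos Query speaks the Prometheus HTTP API, both entries use the same datasource type. A sketch of the Grafana provisioning file (URLs are illustrative in-cluster addresses):

```yaml
# Sketch: Grafana datasource provisioning with both sources.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
  - name: Thanos
    type: prometheus              # Thanos Query is Prometheus-API compatible
    url: http://thanos-query:9090
    isDefault: true               # default datasource for dashboards
```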

In practice, I use Thanos for almost everything. It covers both recent and historical data in a single query.


Thanos adds complexity, but the tradeoff is worth it. Being able to look at metrics from weeks ago during an investigation is something you don't appreciate until you need it.

The Compactor is the component that surprised me most. It runs quietly in the background, but without it, the bucket fills up with thousands of tiny 2-hour blocks and queries get slow. Compaction and downsampling are what make long-term storage practical.