# Learning Prometheus
When I joined the Observability team, Prometheus was already everywhere. But using it and understanding it are different things. I wanted to close that gap.
So I went back to basics. Read the docs, ran it locally, built dashboards, wrote queries, broke things on purpose until it clicked.
## Why observability matters
Observability isn't just knowing when something is down. It's understanding what your system is doing and why, without guessing.
Logs, metrics, traces. The usual pillars. But the real point is what they enable: answering hard questions under pressure. During an incident, knowing something failed isn't enough. You need to know where it started, what changed, and what the blast radius looks like.
Working in an Observability team means building that foundation for everyone else. That responsibility pushed me to go deeper.
## Prometheus up close
Prometheus is opinionated, in a good way.
It's pull-based. Instead of apps pushing metrics somewhere, Prometheus scrapes targets that expose an endpoint. Feels backwards at first, then you realize it makes discovery and freshness easier to reason about.
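A scrape target is just something Prometheus's config points at. A minimal sketch (the job name and port here are hypothetical; the app only needs to serve a `/metrics` endpoint):

```yaml
# prometheus.yml -- illustrative fragment, not a full config
scrape_configs:
  - job_name: "my-app"                 # hypothetical job name
    scrape_interval: 15s               # how often Prometheus pulls
    static_configs:
      - targets: ["localhost:8080"]    # host:port exposing /metrics
```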
The data model is simple: everything is a time series identified by a metric name plus labels.
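What a scraped endpoint returns is plain text in the exposition format. A made-up counter, to show how the name and labels together identify each series:

```
# HELP http_requests_total Total HTTP requests served.
# TYPE http_requests_total counter
http_requests_total{method="GET", status="200"} 1027
http_requests_total{method="POST", status="500"} 3
```

Each unique label combination is its own time series; the two lines above are two series under one metric name.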
Once you get comfortable with labels, Prometheus stops feeling like a monitoring tool and starts feeling like a queryable dataset of system behavior over time.
## The moving parts
Understanding the architecture makes production setups feel less magical:
- Prometheus Server scrapes, stores, and queries. The TSDB, WAL, and block storage design choices all make sense once you understand the write/query tradeoffs.
- Exporters are how Prometheus talks to systems that don't expose metrics natively. Node Exporter is the classic one.
- Alertmanager handles deduplication, grouping, silencing, and routing. Prometheus fires alerts, Alertmanager makes them manageable.
- Pushgateway exists for short-lived jobs that end before Prometheus can scrape them. Useful, but intentionally the exception.
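To make the Alertmanager part concrete, here is a routing sketch (receiver name, grouping labels, and webhook URL are all hypothetical):

```yaml
# alertmanager.yml -- illustrative routing sketch
route:
  group_by: ["alertname", "job"]    # collapse related firings into one notification
  group_wait: 30s                   # wait briefly so a burst arrives as one group
  repeat_interval: 4h
  receiver: team-default
receivers:
  - name: team-default
    webhook_configs:
      - url: "http://localhost:5001/alerts"   # hypothetical webhook sink
```

This is where deduplication, grouping, and silencing live; Prometheus itself only evaluates rules and fires.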
## PromQL
Not scary, but easy to get wrong without the right mental model. The key is being strict about what you're working with. Instant vectors vs range vectors, and what each function expects.
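The distinction in query form, using a hypothetical counter:

```promql
http_requests_total            # instant vector: the latest sample per series
http_requests_total[5m]        # range vector: 5 minutes of samples per series
rate(http_requests_total[5m])  # rate() takes a range vector, returns an instant vector
```

Most "weird" PromQL errors come down to handing a function the wrong one of these two.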
Things that helped:
rate() vs irate(). rate() smooths over a window, usually the safe default. irate() uses only the last two samples, great for spikes but noisy.
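Side by side, on the same hypothetical counter:

```promql
rate(http_requests_total[5m])    # average per-second rate over the window: smooth, alert-safe
irate(http_requests_total[5m])   # slope of the last two samples only: catches spikes, noisy
```

A reasonable rule of thumb: rate() for alerts and SLO math, irate() only for high-resolution dashboards.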
Histograms. histogram_quantile() is easy to copy from examples and still misunderstand. You need to understand _bucket, the le label, and what Prometheus is actually estimating. Get this wrong and your dashboards lie.
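To see what the estimation actually does, here is a simplified Python sketch of the bucket interpolation behind histogram_quantile(). This mimics, not exactly reproduces, PromQL's implementation; the bucket layout is invented:

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-th quantile from cumulative histogram buckets.

    buckets: list of (le, cumulative_count) sorted by le, ending with the
    +Inf bucket -- the same shape as a Prometheus _bucket series.
    """
    total = buckets[-1][1]
    rank = q * total              # observations at or below the answer
    prev_le, prev_count = 0.0, 0  # lowest bucket assumed to start at 0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le    # can't interpolate into +Inf: last finite bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return le
            # linear interpolation inside the bucket: all we know is the
            # observation landed somewhere between prev_le and le
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return buckets[-1][0]

# 100 observations: 50 under 0.1s, 40 between 0.1 and 0.5s, 10 between 0.5 and 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.9, buckets))  # estimated p90
```

The interpolation is why bucket boundaries matter so much: the estimate can never be more precise than the bucket the quantile lands in.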
Recording rules. Not just a performance trick. They're how you decide which queries are first-class signals. If you're building SLOs, recording rules matter.
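A recording rule file sketch, following the common `level:metric:operations` naming convention (the metric and group names are hypothetical):

```yaml
# rules.yml -- illustrative recording rule
groups:
  - name: slo_rules
    rules:
      - record: job:http_request_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
```

Dashboards and alerts then query `job:http_request_error_ratio:rate5m` directly, so the expensive expression is evaluated once per interval instead of on every panel refresh.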
PromQL got easier once I started using it to answer real questions instead of studying it in isolation.
I started asking better questions during incidents. Not "is it down?" but "what does the latency distribution look like?" or "did error rate spike after the deploy or after a dependency started timing out?"
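The second question maps to a query like this (metric name and label values hypothetical):

```promql
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
```

Graph that ratio around the deploy timestamp and the answer is usually visible at a glance.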
I also got much sharper about cardinality. Labels are powerful, but too many unique combinations will wreck performance and storage. You only learn where to draw that line after you've broken a Prometheus setup a few times.
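One common diagnostic for spotting cardinality trouble is counting series per metric name (this matches every series, so it is itself an expensive query; run it sparingly):

```promql
topk(10, count by (__name__)({__name__=~".+"}))
```

If one metric dominates that list with tens of thousands of series, some label on it, often a user ID, request path, or other unbounded value, is the likely culprit.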
After all this, the PCA (Prometheus Certified Associate) exam felt like a natural checkpoint.