Autoscaling with KEDA

Kubernetes has a built-in HPA (Horizontal Pod Autoscaler), but out of the box it only scales on CPU and memory. I wanted to scale Thanos Query based on actual request rate, which is a Prometheus metric. That's where KEDA comes in.


What KEDA does

KEDA (Kubernetes Event-driven Autoscaling) extends the HPA with custom triggers. Instead of just CPU/memory, you can scale based on Prometheus queries, queue depth, cron schedules, HTTP traffic, and dozens of other sources.

In my case, I use it to scale Thanos Query and Query Frontend based on HTTP request rate. When Grafana dashboards are generating heavy queries, KEDA spins up more replicas. When traffic drops, it scales back down.


The ScaledObjects

Two ScaledObjects, one for each component:

Thanos Query scales from 1 to 3 replicas. It triggers when the HTTP request rate exceeds 5 req/s (measured over 2 minutes) or memory utilization goes above 80%. Cooldown is 5 minutes, so it doesn't flap on short spikes.

Thanos Query Frontend scales from 1 to 2 replicas. Lower threshold (3 req/s) since it's the entry point and hits load first. Cooldown is 10 minutes because it caches results and scaling down too fast would lose the cache.

Both use Prometheus as the trigger source, querying http_requests_total directly from the Prometheus instance in the cluster.
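Roughly, the Query ScaledObject looks like this. This is a sketch, not the exact manifest: the resource names, namespace, Prometheus service address, and label selector are assumptions; the replica counts, thresholds, polling interval, and cooldown come from the description above.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: thanos-query          # assumed name
  namespace: metrics
spec:
  scaleTargetRef:
    name: thanos-query        # assumed Deployment name
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 30         # seconds between trigger checks
  cooldownPeriod: 300         # 5 minutes before scaling back down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.metrics.svc:9090   # assumed service address
        query: sum(rate(http_requests_total{job="thanos-query"}[2m]))  # assumed selector
        threshold: "5"        # req/s
    - type: memory
      metricType: Utilization
      metadata:
        value: "80"           # percent of the memory limit
```

The Query Frontend object is the same shape with maxReplicaCount 2, a threshold of "3", and a cooldownPeriod of 600.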

Scale up is immediate. There's no initialCooldownPeriod or stabilizationWindowSeconds configured, so as soon as KEDA detects a trigger threshold is crossed (checked every 30 seconds), it creates the new pod right away. No waiting to see if the metric drops back down.

Scale down is slower on purpose. The cooldownPeriod (5 minutes for Query, 10 minutes for Query Frontend) governs the scale-down side. One caveat worth knowing: in KEDA, cooldownPeriod strictly applies only when scaling all the way to zero; with minReplicaCount set to 1, scaling between 1 and max is delegated to the underlying HPA, whose default scale-down stabilization window happens to also be 5 minutes. Either way, the effect is the intended one: if traffic drops for a moment and comes back, the extra replicas stay up instead of being killed and recreated.


The stress test

To validate that autoscaling actually works, I built a stress test. It's a CronJob that never runs automatically (scheduled for February 31st). When I want to test, I trigger it manually:

kubectl create job --from=cronjob/thanos-stress-test stress-now -n metrics

The job runs 5 parallel workers for 3 minutes, each sending heavy PromQL queries to Thanos Query Frontend: counting all series, top-k aggregations, and CPU usage breakdowns by pod. These are intentionally expensive queries that force Thanos to work hard.
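A sketch of what that CronJob looks like. The container image, service address, and query are assumptions for illustration; the schedule trick, worker count, duration, and the count-all-series query are from the setup described above.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: thanos-stress-test
  namespace: metrics
spec:
  schedule: "0 0 31 2 *"    # February 31st never exists, so this never fires on its own
  jobTemplate:
    spec:
      parallelism: 5        # 5 workers at once
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: stress
              image: curlimages/curl   # assumed image
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # hammer the Query Frontend with an expensive query for ~3 minutes
                  end=$(( $(date +%s) + 180 ))
                  while [ "$(date +%s)" -lt "$end" ]; do
                    curl -sG http://thanos-query-frontend.metrics.svc:9090/api/v1/query \
                      --data-urlencode 'query=count({__name__=~".+"})' > /dev/null
                  done
```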

While the test runs, I watch the HPA in another terminal:

kubectl get hpa -n metrics -w

Within a minute or two, KEDA detects the spike in request rate and scales up the replicas. After the test ends and the cooldown period passes, it scales back down.

There's also a local script (scripts/hpa-stress-test.sh) that does the same thing via port-forward, useful for testing from outside the cluster with 10 workers.


KEDA config

KEDA runs on the large tier node. Three components: the operator, the metrics adapter (which bridges KEDA metrics to the Kubernetes metrics API), and webhooks for validation.

All three have ServiceMonitors enabled, so Prometheus scrapes KEDA's own metrics. No CPU limits (only memory limits), following the same pattern as the rest of the cluster.


The OOM problem

Autoscaling on request rate has a blind spot. A single heavy query can blow up memory faster than KEDA can react.

The polling interval is 30 seconds and the evaluation window is 2 minutes. If someone runs a query that scans millions of series (like count({__name__=~".+"}) across all time), the pod's memory spikes instantly. By the time KEDA detects the load and spins up a new replica, the original pod already hit its memory limit and got OOM-killed.

That's why both ScaledObjects also have a memory trigger at 80% utilization. It's a safety net. If memory starts climbing before the request rate threshold is reached, KEDA scales up preemptively. It doesn't prevent every OOM (some queries are just too heavy for a single pod), but it reduces the window.

The real fix for this is a combination of things: memory limits that give enough headroom, query timeouts in Thanos Query (set to 5 minutes), and being careful with wildcard queries on large time ranges. Autoscaling helps, but it's not a substitute for query discipline.
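The query timeout is just a flag on the Thanos Query container; a minimal args fragment, assuming the 5-minute value mentioned above:

```yaml
# Thanos Query container args (--query.timeout is an upstream Thanos flag)
args:
  - query
  - --query.timeout=5m   # abort queries that run longer than 5 minutes
```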


Why not just HPA

The built-in HPA with CPU/memory metrics would work for basic cases, but Thanos Query is often nearly idle on CPU while serving expensive queries: most of the time goes to fetching and scanning data in MinIO, not computation. CPU-based scaling would miss these cases entirely.

Scaling on request rate is more accurate for query workloads. If 10 dashboards are loading at the same time, the request rate spikes regardless of CPU usage, and that's when you need more replicas.