kagent: AI Agents for Kubernetes with LM Studio

I wanted an AI assistant that could actually interact with my cluster. Not a chatbot that generates YAML, but something that can query Prometheus, inspect pods, check Helm releases, and reason about what's happening in the cluster. kagent does exactly that.

The interesting part: the LLM doesn't run in the cluster. It runs on a separate machine with a GPU, served by LM Studio. kagent just points to it over the network.


The setup

Two machines involved:

The cluster runs on a Proxmox mini PC with 96 GB RAM. This is where kagent, the agents, and all the Kubernetes tooling live.

The GPU machine is a desktop with an AMD Ryzen 7 9800X3D and an NVIDIA RTX 3080. It runs LM Studio serving the model. Nothing else. It sits on the same network.

I actually tried running everything in the same cluster first. No GPU, pure CPU inference. It worked, technically. But the response times were brutal. A simple question that should take a few seconds was taking minutes. The model was competing with real workloads for CPU and memory, and without a GPU the inference is just painfully slow. It wasn't usable for anything interactive.

Offloading inference to a dedicated machine with a GPU made it a completely different experience. The cluster resources stay free for actual workloads, and the model responds in seconds instead of minutes.


What is kagent

kagent is a cloud-native AI platform built specifically for Kubernetes. It deploys as a set of components:

  • A controller that manages agent lifecycle and orchestration
  • A UI for interacting with agents through a web interface
  • Specialized agents that each have tools and context for specific domains
  • MCP (Model Context Protocol) servers that expose Kubernetes APIs as tools the LLM can call

The agents aren't generic chatbots. Each one has a focused set of tools:

Agent What it does
k8s-agent Inspects pods, deployments, services, events, logs
helm-agent Lists releases, checks values, inspects chart status
observability-agent Queries Prometheus, checks alerting rules, reads metrics
promql-agent Writes and validates PromQL queries

When you ask a question, the agent decides which tools to call, executes them against the real cluster, and uses the results to build a response. It's function calling, not prompt engineering.


Installing with Helm

kagent ships as two Helm charts: kagent-crds for the Custom Resource Definitions and kagent for the actual components. Both come from the same OCI registry.

The wrapper Chart.yaml:

apiVersion: v2
description: kagent - Cloud Native Agentic AI for Kubernetes
name: kagent
version: 0.0.1

dependencies:
  - name: kagent-crds
    version: 0.8.6
    repository: oci://ghcr.io/kagent-dev/kagent/helm
  - name: kagent
    version: 0.8.6
    repository: oci://ghcr.io/kagent-dev/kagent/helm

This follows the same pattern as every other addon in the homelab: a Helm chart wrapper with a config.json that ArgoCD picks up via ApplicationSets.

{
  "name": "kagent",
  "namespace": "kagent",
  "category": "tools"
}

ArgoCD discovers it, creates the Application, and syncs it. No manual helm install.


Pointing to LM Studio

LM Studio exposes an OpenAI-compatible local server, so kagent's OpenAI provider works out of the box — just point the base URL at the GPU machine's IP instead of api.openai.com.

providers:
  default: openai
  openai:
    provider: OpenAI
    model: "qwen3.5-9b-claude-4.6-opus-reasoning-distilled-v2"
    config:
      baseUrl: http://192.168.xx.xx:1234/v1

The model identifier matches the API Identifier set in LM Studio. LM Studio listens on port 1234 by default when the local server is enabled. The GPU machine is on the same LAN, so it's a direct HTTP call — no VPN, no tunneling.

LM Studio also has a built-in local server UI with live log streaming, per-request token counts, and inference stats. Useful for checking whether the model is actually receiving requests and how long it's taking to generate.


The model

The current model is qwen3.5-9b-claude-4.6-opus-reasoning-distilled-v2. It's a 9B parameter model from the Qwen family, distilled to replicate the reasoning patterns of Claude 4.6 Opus. The idea behind reasoning distillation is that a smaller model can learn to produce structured, step-by-step reasoning by training on outputs from a much larger model — in this case, Opus-class reasoning traces.

In practice it means the model approaches multi-step questions more methodically than a base Qwen3 9B would. For kagent's use case — where the agent has to decide which tools to call, interpret the results, and chain multiple calls — that matters.

The configuration in LM Studio:

Setting Value
Context Length 64,000 (model supports up to 262,144)
GPU Offload 13 layers
CPU Thread Pool 32
Flash Attention Enabled
K/V Cache Quantization Q8_0
Unified KV Cache Enabled
Offload KV Cache to GPU Enabled
Estimated Memory 3.25 GB GPU + 7.97 GB total

The RTX 3080 has 10 GB VRAM. With 13 layers offloaded to GPU and the rest handled by the CPU thread pool, the model fits without issue. Flash Attention keeps the memory footprint manageable at the 64K context window.

LM Studio model configuration for qwen3.5-9b

I set the context length to 64K rather than the full 262K. For most Kubernetes interactions — pod descriptions, logs, metric results — 64K is more than enough, and it keeps per-request memory predictable.


Choosing the agents

kagent comes with a lot of agents, but not all of them are relevant for my setup. I enabled four and disabled the rest:

agents:
  k8s-agent:
    enabled: true
  helm-agent:
    enabled: true
  observability-agent:
    enabled: true
  promql-agent:
    enabled: true
  istio-agent:
    enabled: false
  kgateway-agent:
    enabled: false
  argo-rollouts-agent:
    enabled: false
  cilium-policy-agent:
    enabled: false
  cilium-manager-agent:
    enabled: false
  cilium-debug-agent:
    enabled: false

No Istio, no Cilium, no Argo Rollouts in the homelab. No point running agents for things that don't exist in the cluster. Each disabled agent is one less deployment consuming resources.


Exposing the UI

The kagent UI is exposed via Envoy Gateway at kagent.ruiz.sh, following the same pattern as other services in the cluster.

The HTTPRoute:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: kagent
  namespace: kagent
spec:
  parentRefs:
    - name: homelab
      namespace: envoy-gateway-system
      sectionName: https
  hostnames:
    - "kagent.ruiz.sh"
  rules:
    - backendRefs:
        - name: kagent-ui
          port: 8080

One thing that caught me: LLM responses are slow. The default Envoy timeouts would kill the connection before the model finished generating. A BackendTrafficPolicy with generous timeouts fixes that:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: kagent-timeout
  namespace: kagent
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: kagent
  timeout:
    http:
      requestTimeout: 600s
      connectionIdleTimeout: 600s
      maxConnectionDuration: 600s
    tcp:
      connectTimeout: 30s

600 seconds (10 minutes) for request timeout. Sounds excessive, but complex queries where the agent chains multiple tool calls can take a while, especially when the model is reasoning through intermediate steps before calling the next tool.


Resource allocation

Everything runs on the large tier node. No CPU limits (same pattern as the rest of the cluster), only memory limits:

Component Memory request Memory limit
Controller 128Mi 512Mi
UI 128Mi 512Mi
Tools (MCP servers) 128Mi 512Mi
Each agent 128Mi 512Mi
Bundled PostgreSQL 128Mi 256Mi

kagent uses PostgreSQL to store conversation history and agent state. For a homelab, the bundled PostgreSQL with 1 Gi of storage is fine. In production you'd want an external database.


The OpenAI secret

Even though the request goes to LM Studio locally, kagent still expects an OpenAI API key in some code paths. An ExternalSecret syncs it from Doppler:

apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: kagent-openai
  namespace: kagent
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: doppler
    kind: ClusterSecretStore
  target:
    name: kagent-openai
    creationPolicy: Owner
  data:
    - secretKey: OPENAI_API_KEY
      remoteRef:
        key: OPENAI_API_KEY

LM Studio's local server does not require an API key by default — it accepts any string. If you enable API key validation under Developer → API Keys, generate a key there and store it in Doppler. Either way, the existing secret works without changes. This also keeps the option open to switch to a real OpenAI endpoint without redeploying — just change the base URL.


What works well

kagent UI showing an agent interacting with the cluster

The screenshot above is from a real interaction. I asked the k8s-agent to check whether the nodes are healthy, confirm pods are running across all namespaces, look for pressure or taint issues, identify crash loops, pending or failed pods, restart spikes, and summarize any problems with the likely cause. One prompt, and the agent chained multiple tool calls on its own: checked node readiness, scanned all namespaces for pod status, and came back with a structured summary.

The reasoning-distilled model handles these multi-step chains better than I expected from a 9B model. The observability-agent can query Prometheus and explain what the metrics mean. The helm-agent knows which releases are deployed and their status. Ask "is Loki healthy" and the agents will check the pods, verify if metrics are being scraped, and give you a combined picture.


What to watch out for

Latency. A 9B model on a consumer GPU is not GPT-4. Responses take a few seconds, and complex multi-tool queries can take 30-60 seconds. The 600-second timeout exists for a reason.

Hybrid inference. With only 13 layers on the GPU and the rest on CPU, the 3080 isn't doing all the work. The CPU thread pool of 32 threads handles the remaining layers. It's fast enough, but a full GPU offload (or a model that fits entirely in VRAM) would be noticeably faster.

Network dependency. If the GPU machine is off or LM Studio crashes, kagent is useless. There's no fallback configured. The OpenAI secret is already in place, so switching to a real API endpoint as a fallback would be straightforward.


Running AI agents inside the cluster that can actually observe and reason about the cluster state is a different experience from copy-pasting YAML into ChatGPT. The model is smaller and slower, but it has real access to real data. And it runs entirely on hardware I own, on my network, with no data leaving the house.

Sources