MCP Servers and Token Usage in Claude Code

I was about to wire up four MCP servers in Claude Code: Terraform docs, Grafana, AWS docs, Loki. Before flipping them on globally, I wanted to know what enabling them actually costs in tokens per turn. The official docs answered the question more clearly than I expected.

Short version: yes, MCPs cost tokens, but Claude Code uses Anthropic's deferred tool loading, which makes the cost much smaller than the naive worst case. Understanding how it does that is what shapes the right config.


How tools end up in context

Every tool you give Claude (built-in, MCP, custom) has to be described to the model so it can decide when to call it. That description goes into the request as the tools parameter: name, description, JSON schema for arguments. The Anthropic tool use docs are explicit about the cost:

The additional tokens from tool use come from: - The tools parameter in API requests (tool names, descriptions, and schemas) - tool_use content blocks in API requests and responses - tool_result content blocks in API requests

There's also a fixed overhead. On Opus 4.7, enabling tools at all costs 346 tokens for tool_choice: auto. That's the system prompt addition that teaches the model how to emit tool calls, and it's billed on every request in the conversation. The full per-model breakdown lives in the pricing section of the same doc.

A real MCP server can be heavy. A single Grafana MCP exposes a couple dozen tools, each with its own description and schema. Anthropic's tool search docs cite a concrete number for the worst case:

A typical multi-server setup (GitHub, Slack, Sentry, Grafana, Splunk) can consume ~55k tokens in definitions before Claude does any actual work.

55k tokens, paid on every turn, before you've typed a single character of prompt. That's what would happen if every tool were loaded eagerly.


Deferred tools and ToolSearch

The fix Anthropic shipped is defer_loading: true plus a server-side tool_search_tool. Instead of dumping every tool into context up front, the model only sees:

  • Non-deferred tools (the small set you use constantly)
  • A meta-tool called tool_search_tool that lets the model search a catalog by name and description

When the model decides it needs a deferred tool, it calls tool_search_tool with a query (regex or BM25 depending on the variant), gets back 3-5 matching tool_reference blocks, and the API automatically expands those references into full tool definitions inline. The step-by-step flow is documented in detail. The system prompt prefix is never touched, so prompt caching keeps working.

From the docs:

Tool search typically reduces this by over 85%, loading only the 3–5 tools Claude actually needs for a given request.

Claude Code uses this mechanism. The Claude Code MCP docs are direct about it:

Tool search is enabled by default. MCP tools are deferred rather than loaded into context upfront, and Claude uses a search tool to discover relevant ones when a task needs them. Only the tools Claude actually uses enter context. From your perspective, MCP tools work exactly as before.

If you watch the system reminders during a session you'll see your MCP tools listed by name as deferred tools, with a note that calling them directly fails with InputValidationError until their schemas are fetched via ToolSearch.

Two caveats from the same doc:

  • Model support. Tool search needs tool_reference blocks, which means Sonnet 4+ or Opus 4+. Haiku doesn't support it. If you switch to Haiku, MCP tools load upfront.
  • Vertex AI and proxies. Disabled by default on Vertex AI (the beta header isn't accepted) and when ANTHROPIC_BASE_URL points to a non-first-party host (most proxies don't forward tool_reference blocks).

What this means for settings.json

In ~/.claude/settings.json, MCP servers go under mcpServers. The catalog I'm picking from looks roughly like this:

{
  "mcpServers": {
    "terraform": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "hashicorp/terraform-mcp-server:latest"]
    },
    "grafana": {
      "command": "uvx",
      "args": ["mcp-grafana"],
      "env": {
        "GRAFANA_URL": "...",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "..."
      }
    },
    "aws-docs": {
      "command": "uvx",
      "args": ["awslabs.aws-documentation-mcp-server@latest"]
    },
    "loki": {
      "command": "docker",
      "args": ["run", "--rm", "-i", "-e", "LOKI_URL", "-e", "LOKI_TOKEN", "loki-mcp-server"]
    }
  }
}

Each entry adds an MCP server. With tool search default-on, the per-turn cost is roughly:

  • The catalog listing (cheap, only tool names load at session start)
  • The ToolSearch tool definition itself
  • The full schema of any deferred tool the model has actually pulled in this turn

Compared to the 55k-token worst case from the API docs, that's a huge improvement. But it's not free, and the cost compounds across long sessions because every tool definition the model expands stays in context until compaction.


The Claude Code knobs

The Configure tool search section of the doc is the real best-practices guide for Claude Code specifically. Two controls matter.

ENABLE_TOOL_SEARCH (global behavior)

Value Behavior
(unset) Default. All MCP tools deferred and loaded on demand
true Force deferral, including on Vertex AI / proxies
auto Threshold mode: load upfront if tools fit within 10% of context, defer the overflow
auto:<N> Custom threshold, e.g. auto:5 for 5%
false Disable deferral entirely, all tools loaded upfront

For most setups, leaving it unset is correct. auto is interesting if you have a small handful of tools that almost always fit and don't want the search round-trip. false is the "load everything upfront" toggle if you want to measure the difference yourself.

alwaysLoad: true (exempt a server)

If a server's tools should always be visible without a search step, set alwaysLoad: true on that server entry:

{
  "mcpServers": {
    "terraform": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "hashicorp/terraform-mcp-server:latest"],
      "alwaysLoad": true
    }
  }
}

From the doc:

Use this for a small number of tools that Claude needs on every turn, since each upfront tool consumes context that would otherwise be available for your conversation.

Requires Claude Code v2.1.121+. This is the per-server equivalent of "keep 3-5 tools non-deferred" from the API docs.

MAX_MCP_OUTPUT_TOKENS (for verbose returns)

Deferral hides the definition cost. The result cost is separate and Claude Code has its own controls for it:

  • Output warning at 10,000 tokens when any MCP tool result exceeds that
  • Default cap at 25,000 tokens
  • Configurable via MAX_MCP_OUTPUT_TOKENS env var

So if you call a Notion or Drive MCP and it dumps a big document, you'll get a warning rather than silently filling the context.


When to keep an MCP enabled

Combining the API guidance with the Claude Code knobs:

Situation What to do
MCP you rarely use Leave deferred (default), or remove if you never use it
MCP needed every turn (small) alwaysLoad: true on that server
Project-only MCP (Grafana, Loki for one cluster) Configure in project .mcp.json or project .claude/settings.json, not globally
MCP that returns large blobs Leave deferred, optionally raise MAX_MCP_OUTPUT_TOKENS if you legitimately need bigger responses
Running on Haiku, Vertex, or via a proxy Tool search is off, so be much more selective about which MCPs you enable

Two practical rules:

  • Don't enable MCPs you don't use. Even deferred, they show up in the catalog listing on every turn. If you installed something to try once and never came back, remove it.
  • Group by workflow. Project-specific servers go in the repo; only small, read-only documentation servers (Terraform, AWS docs) belong globally.

The kagent post covers a different angle on MCP: exposing cluster APIs as tools for an in-cluster agent. Same protocol, different consumer.


What I ended up with

Two MCPs in the global config: aws-docs and terraform, both small and read-only. Grafana and Loki moved to the homelab repo's .mcp.json since they only matter when I'm working in that cluster. terraform got alwaysLoad: true because I hit it on basically every infra task and the search round-trip was unnecessary friction. The other three stay deferred (the default).

The reason this matters at all is that token cost is silent. There's no error, no warning, no visible degradation when your context fills up with tool definitions you're not using. Just a slightly slower, slightly dumber model that costs slightly more per turn. The fix is small: read what's in mcpServers, ask whether you actually use each one, and move project-specific things to project configs.


Sources

Claude Code specific:

Anthropic API (the underlying primitives):