The MCP Context Tax — Notes from Running 605 Tool Packs
Mounting more MCP servers makes your agent worse, not better. After a year and ~4M tool calls through the Pipeworx gateway, here's what we see about tool count, context bloat, and on-demand routing.
If you’ve followed the MCP discourse for the last six months, you’ve seen the same pattern in every “best MCPs to install” listicle: a 20-server starter kit, each with 5-15 tools, sometimes more. The implicit promise is that more capability is strictly better. The agent has more reach; the developer has more options.
That promise is wrong, and we now have a fairly precise view of why.
Pipeworx runs a hosted MCP gateway with 605 tool packs behind one URL. The catalog spans about 2,761 individual tools across roughly 295 underlying data sources — SEC EDGAR, FRED, BLS, FDA, EPA, USPTO, Zillow, IMDB, Polymarket, and the long tail. In the last 30 days it served just over 4 million requests and roughly 1.5 million actual tool calls across about 22,000 unique monthly visitors. That gives us a reasonable sample to talk about what tool count actually does to agent behavior at the calling site.
The short version: tool count is a tax, not a feature. The right number of tools visible to an agent at any given moment is dramatically lower than the number it could theoretically reach. Mounting all of them up-front is the wrong default. The interesting question is how to surface the right subset.
The mechanics of the tax
Every MCP tool definition lives in the agent’s context window. The schema, the description, the parameter list, the type hints — all of it gets serialized into the system prompt every turn. There is no on-disk version the model consults later. If the agent is to know that a tool exists, the definition has to be present at inference time.
Tool descriptions are not small. A typical MCP tool description in our catalog runs 100 to 500 tokens once you include parameter schemas and type information. The richer ones (compound tools with detailed parameter narratives) can hit 1,000+ tokens individually. Mount 50 servers averaging 10 tools each and you’ve burned 50,000 to 250,000 tokens of context before the agent does anything. Mount 200 and the entire context budget can be saturated with tool descriptions on smaller models.
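The arithmetic here is plain multiplication, and it is worth running for your own stack. A minimal sketch that evaluates it at a few per-tool sizes from the typical range quoted above; the 200,000-token window is an assumed figure for illustration, not a property of any particular model:

```python
def tool_context_tokens(servers: int, tools_per_server: int, tokens_per_tool: int) -> int:
    """Tokens consumed by tool definitions alone, before any user content."""
    return servers * tools_per_server * tokens_per_tool


def fraction_of_window(tool_tokens: int, context_window: int) -> float:
    """Share of the context window taken by tool definitions (can exceed 1.0)."""
    return tool_tokens / context_window


# Assumed window size for illustration only; substitute your model's limit.
ASSUMED_WINDOW = 200_000

for per_tool in (100, 200, 500):
    total = tool_context_tokens(servers=50, tools_per_server=10, tokens_per_tool=per_tool)
    share = fraction_of_window(total, ASSUMED_WINDOW)
    print(f"{per_tool} tokens/tool -> {total} tokens ({share:.0%} of window)")
```

The cost is paid on every turn, so the share of the window matters more than the absolute number.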
This sounds like a quantitative problem, but it actually produces qualitative degradation, and that’s the more interesting failure mode.
The behavioral cliff at ~50 tools
We see it consistently: past somewhere in the 40-60 visible tool range, tool-selection accuracy drops sharply. The model picks the wrong tool more often, hallucinates parameters more often, and — most surprisingly — misroutes prompts toward irrelevant tools that happen to share keyword surface. This isn’t a smooth degradation curve. There’s something cliff-shaped about it.
We’re not the first people to notice this. It’s been discussed on Reddit, in Anthropic’s own guidance on Claude Code MCP, and in various tooling writeups. What having ~4 million tool calls of telemetry has done for us is make it concrete: the cliff exists across model versions, doesn’t disappear by switching to a larger context window, and applies to both Anthropic and OpenAI native function-calling.
What it isn’t, importantly, is “the model can’t handle long context.” Anthropic and OpenAI models both reliably handle hundreds of thousands of tokens of content. It’s specifically the tool selection task that degrades: the model is being asked to do attention-based discrimination across many semantically similar option blocks, and that’s the operation that breaks down.
The implication: optimizing total context length is the wrong lens. Optimizing the size and similarity profile of the visible tool surface is the right one.
Long-tail tools die first
The second pattern is downstream of the first. On any given session, only a small subset of the mounted tool surface gets used. The rest sit in context, consume tokens, contribute to selection noise, and never get called.
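If you log tool calls per session, this is easy to check for your own setup. A minimal sketch, assuming you have a map of mounted tool names to definition sizes and a list of the tools actually called; the tool names and token figures below are hypothetical:

```python
from collections import Counter


def surface_utilization(mounted: dict[str, int], calls: list[str]) -> dict:
    """Summarize how much of a mounted tool surface a session actually used.

    mounted maps tool name -> definition size in tokens (hypothetical figures);
    calls is the list of tool names invoked during the session.
    """
    counts = Counter(calls)
    idle = [name for name in mounted if counts[name] == 0]
    used_count = len(mounted) - len(idle)
    return {
        "used_fraction": used_count / len(mounted) if mounted else 0.0,
        "idle_tools": sorted(idle),
        # Tokens spent every turn on definitions that were never called.
        "idle_tokens_per_turn": sum(mounted[name] for name in idle),
    }


# Hypothetical session: four tools mounted, two ever called.
mounted = {"read_file": 150, "search_repo": 200, "edgar_filing_lookup": 400, "fred_series": 300}
calls = ["read_file", "search_repo", "read_file"]
print(surface_utilization(mounted, calls))
```

A persistently low used fraction is a sign that the idle tools belong behind a filter or router instead of in the default surface.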
Reddit anecdata captures this perfectly: “I installed 12 MCPs in week 1, kept 4 by week 2.” The four that survive are the boring high-frequency stuff — filesystem, repo search, docs lookup, browser control. The eight that get removed aren’t bad tools — they’re occasionally-useful tools that lose the visibility competition to the always-useful ones. When you put a one-off SEC EDGAR query tool next to filesystem in the same context, the model attends to filesystem and forgets EDGAR exists.
This produces a perverse outcome: long-tail tools are individually high-value (the user installed them for a reason) but get collectively deprecated by the architecture. The user removes them because they “never worked.” They worked fine, but their schemas got crowded out.
If your stack is built on the “mount everything” pattern, the rational response is to keep cutting until only the daily-driver tools are left. The long tail dies. Anything that wasn’t useful in the first 30 seconds gets evicted.
What we built to test the alternative
The Pipeworx gateway accepts query parameters that filter the visible tool surface per session. ?task= takes a free-text description (“housing market analysis”) and runs a small embedding model over our catalog to surface the ~20 most relevant tools. ?vertical= takes a named bundle (housing, fintech, pharma, govcon, agri, trade, green) and surfaces the curated set for that domain. Either approach replaces the “mount everything” pattern with “mount the relevant subset.”
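The general shape of embedding-based filtering can be sketched in a few lines. This is an illustration of top-k selection by cosine similarity, not the gateway's actual implementation; the tool names and three-dimensional vectors are toy placeholders standing in for real embeddings:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_tools(query_vec: list[float], catalog: dict[str, list[float]], k: int = 20) -> list[str]:
    """Return the k tool names whose embeddings are most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in catalog.items()]
    # Highest similarity first; ties broken by name so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Toy catalog: hypothetical tool names with made-up embeddings.
catalog = {
    "housing_prices": [1.0, 0.0, 0.0],
    "mortgage_rates": [0.8, 0.6, 0.0],
    "drug_recalls": [0.0, 0.0, 1.0],
}
print(top_k_tools([1.0, 0.2, 0.0], catalog, k=2))
```

Only the returned names would be mounted for the session; everything else stays reachable but invisible.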
There’s also an ask_pipeworx(question) meta-tool that takes natural language and dispatches a tool call internally — the agent never sees the underlying tool surface at all, just gets the answer. That collapses 605 packs into a single tool definition’s worth of context cost when the agent doesn’t care about the routing layer.
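To make the meta-tool pattern concrete, here is a minimal sketch of a single entry point that picks a handler internally. The registry, handlers, and keyword-overlap matching are all hypothetical stand-ins chosen to keep the example self-contained; they are not how ask_pipeworx dispatches:

```python
from typing import Callable


def _housing_handler(question: str) -> str:
    # Placeholder for a real pack call.
    return f"[housing pack] {question}"


def _filings_handler(question: str) -> str:
    # Placeholder for a real pack call.
    return f"[filings pack] {question}"


# Hypothetical registry: pack name -> (trigger keywords, handler).
REGISTRY: dict[str, tuple[set[str], Callable[[str], str]]] = {
    "housing": ({"housing", "home", "rent", "mortgage"}, _housing_handler),
    "filings": ({"sec", "filing", "edgar", "10-k"}, _filings_handler),
}


def ask_gateway(question: str) -> str:
    """The only tool the agent sees; routing happens behind it."""
    words = set(question.lower().split())
    best_handler = None
    best_overlap = 0
    for keywords, handler in REGISTRY.values():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_handler, best_overlap = handler, overlap
    if best_handler is None:
        return "no matching pack"
    return best_handler(question)


print(ask_gateway("latest sec filing for acme"))
```

The agent's context holds one definition (here, ask_gateway) regardless of how many entries the registry has.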
This works. We can see in our telemetry that sessions using ?task= filtering complete more tool calls successfully, with lower per-task token spend, and with measurably different tool-selection distributions versus sessions that don’t filter. The fastest way to describe it: filtering removes the wrong-tool floor. The model isn’t deciding among 605 options anymore — it’s deciding among 20, and the 20 are scoped to actually be relevant.
We’re not claiming this is the only architecture that solves the problem. Gatana’s Code Mode approach — compile tools into executable code with progressive schema disclosure on demand — attacks the same problem from a different angle, and so do several other gateways. Different access patterns; same underlying observation that the visible tool surface needs to be smaller than the total reachable tool surface.
What we don’t know yet
Several things we can’t answer well from our data:
The actual model-dependency of the cliff. We see consistent behavior across Claude Sonnet, Opus, and GPT variants in our usage. We do not have good data on whether smaller open models (Llama 3.x, Qwen, Mistral) have the same cliff, a different cliff, or a much earlier one. We’d guess earlier, but haven’t measured.
Whether routing accuracy degrades on truly ambiguous queries. A query like “show me X data” where X plausibly matches five different packs is the harder case for semantic routing. Our embedding-based approach picks the highest-similarity match; if the top three are within noise, we may be choosing wrong. This is on our roadmap to instrument.
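One way to instrument this is to flag any query whose top few similarity scores sit within a small margin of each other. A minimal sketch; the margin of 0.02 and the window of three are assumed values that would need tuning against real score distributions:

```python
def is_ambiguous(scores: list[float], margin: float = 0.02, window: int = 3) -> bool:
    """True when the top `window` similarity scores fall within `margin` of each other.

    margin and window are illustrative defaults, not measured thresholds.
    """
    ranked = sorted(scores, reverse=True)[:window]
    if len(ranked) < 2:
        return False
    return ranked[0] - ranked[-1] <= margin


# Three near-identical top scores: the router's first pick is close to a coin flip.
print(is_ambiguous([0.81, 0.80, 0.795, 0.30]))
# A clear winner: safe to route to the top match.
print(is_ambiguous([0.90, 0.60, 0.50]))
```

Flagged queries could be logged for review, or the router could surface all of the near-tied candidates instead of committing to one.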
The interplay with compound tools. Some of our highest-value tools (housing_market_snapshot, fintech_company_deep_dive) are compound: they chain 5+ underlying packs into a single call. Routing should arguably surface these instead of the individual packs they wrap, but our current logic doesn’t prefer them. That is probably worth fixing, but we haven’t run the experiment yet.
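A sketch of what such a preference rule might look like: when enough of a compound tool's wrapped packs appear among the routed candidates, swap them for the compound. The wrapped pack names and the threshold of two are hypothetical, and this is an untested idea, not current gateway behavior:

```python
def prefer_compound(
    candidates: list[str],
    compounds: dict[str, set[str]],
    min_covered: int = 2,
) -> list[str]:
    """Replace individual packs with a compound tool that wraps them.

    candidates: ranked, duplicate-free pack names from the router.
    compounds: compound tool name -> set of pack names it wraps (hypothetical mapping).
    min_covered: how many wrapped packs must appear before swapping (assumed threshold).
    """
    result = list(candidates)
    for compound, wrapped in compounds.items():
        covered = [name for name in result if name in wrapped]
        if len(covered) >= min_covered:
            # The compound takes the rank of the best-placed pack it replaces.
            position = result.index(covered[0])
            result = [name for name in result if name not in wrapped]
            result.insert(position, compound)
    return result


compounds = {"housing_market_snapshot": {"zillow_prices", "fred_mortgage", "census_housing"}}
ranked = ["zillow_prices", "fred_mortgage", "imdb_titles", "census_housing"]
print(prefer_compound(ranked, compounds))
```

Collapsing three candidates into one also frees slots in the visible surface for other tools.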
Whether the routing approach holds up at 5,000 packs. It works at 605. We’ll find out at 5,000.
Takeaways for people building agent stacks
A few rules of thumb that we’d defend with the data we have:
- Treat tool surface size as a hyperparameter. It is not free; it actively degrades agent performance past a point. Test what your effective ceiling is for the model you’re using.
- The right number of tools per session is approximately 20, not 200. If you have more reachable than that, the architecture needs to filter, not mount.
- Long-tail tools belong behind on-demand surfaces, not in your default tool list. Mount the boring high-frequency stuff (filesystem, search, repo, docs); route to everything else.
- A single meta-tool that dispatches by intent is the cheapest context-cost architecture available. One tool definition, infinite reach. Worth using even if it’s just for the explicit long-tail queries.
- Stop reading “best MCP servers to install” listicles. They optimize the wrong dimension.
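Testing your effective ceiling, as the first rule of thumb suggests, amounts to a sweep over surface sizes. A minimal harness sketch: select_tool is a placeholder for whatever wraps your real model call, and the oracle below is a toy stand-in that only demonstrates the plumbing, so it says nothing about where any model's accuracy drops:

```python
import random
from typing import Callable

Selector = Callable[[str, list[str]], str]


def selection_accuracy(
    select_tool: Selector,
    cases: list[tuple[str, str]],
    distractors: list[str],
    surface_size: int,
    seed: int = 0,
) -> float:
    """Fraction of (prompt, expected_tool) cases answered correctly at one surface size."""
    rng = random.Random(seed)
    correct = 0
    for prompt, expected in cases:
        pool = [name for name in distractors if name != expected]
        extra = rng.sample(pool, min(surface_size - 1, len(pool)))
        surface = extra + [expected]
        rng.shuffle(surface)  # avoid rewarding position bias
        if select_tool(prompt, surface) == expected:
            correct += 1
    return correct / len(cases) if cases else 0.0


def sweep(select_tool: Selector, cases, distractors, sizes) -> dict[int, float]:
    """Accuracy at each visible-surface size."""
    return {n: selection_accuracy(select_tool, cases, distractors, n) for n in sizes}


def oracle(prompt: str, surface: list[str]) -> str:
    # Toy selector: picks a tool whose name appears in the prompt. Replace with a model call.
    for tool in surface:
        if tool in prompt:
            return tool
    return surface[0]


cases = [("use read_file on notes.txt", "read_file")]
distractors = [f"tool_{i}" for i in range(100)]
print(sweep(oracle, cases, distractors, sizes=[1, 20, 50]))
```

With a real selector and a few hundred labeled prompts, the size at which accuracy falls off is the ceiling for that model.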
The Pipeworx gateway is at gateway.pipeworx.io if you want to look at the surface — free tier, no API key required for most tools. Or browse the catalog at pipeworx.io/registry. Every pack is also published independently on npm as @pipeworx/mcp-<slug> under MIT — the gateway is one delivery channel, not the only one.
The architectural point applies whether you use Pipeworx or not. The visible tool surface needs to be smaller than your reachable tool surface. The interesting design problem of MCP infrastructure right now is figuring out how to deliver that.