Hierarchical navigation isn't the same problem as tool selection
The “tens of thousands of MCP tools and the agent still works” demos floating around right now demonstrate something real, but it isn’t what most people are taking away from them.
There’s a class of demo making the rounds in the MCP community: a server exposing tens of thousands of tools, with an agent successfully navigating them. The framing is usually “scale is solved, just expose everything, agents will figure it out.”
We’ve read through one of these demos closely, and we work on tool selection at a different kind of scale ourselves. The takeaway is more interesting than that framing, and less convenient.
What the demos actually show
The pattern is some cross-product of structured dimensions — regions × accounts × resource categories × operations, or equivalent. Tools are generated programmatically and share a single handler. Tool names encode the path:
get_us_east_1_account_412_compute_instances
list_eu_central_1_account_983_storage_buckets
describe_us_west_2_account_77_network_endpoints
The agent is asked something like “list compute instances in account 412 of us-east-1.” It picks the right tool. It works.
This is a real demonstration and the code is real. But what it actually shows is that hierarchical navigation under a uniform schema scales fine — which is a known property of string matching against well-structured names. The agent isn’t doing semantic disambiguation. It’s pattern-matching account_412 and us_east_1 and compute_instances against the user’s literal phrasing, in a space where exactly one tool name contains all three tokens.
You could replace the LLM with a regex and the demo would still work.
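To make that concrete, here is a minimal sketch of the demo’s selection step with the LLM removed. The name pattern mirrors the examples above; the dimension lists and matching logic are illustrative, not any particular demo’s source.

```python
import itertools
import re

REGIONS = ["us_east_1", "eu_central_1", "us_west_2"]
ACCOUNTS = ["account_412", "account_983", "account_77"]
CATEGORIES = ["compute_instances", "storage_buckets", "network_endpoints"]
OPERATIONS = ["get", "list", "describe"]

# Programmatically generated catalog: one tool name per point in the
# cross-product, all of which would share a single handler.
CATALOG = [
    f"{op}_{region}_{account}_{category}"
    for op, region, account, category in itertools.product(
        OPERATIONS, REGIONS, ACCOUNTS, CATEGORIES
    )
]  # 81 names here; the demos just use longer dimension lists

def select_tool(query: str) -> list[str]:
    """Keep every tool whose name tokens all appear in the normalized query."""
    query_tokens = set(re.sub(r"[^a-z0-9]+", "_", query.lower()).split("_"))
    return [name for name in CATALOG if set(name.split("_")) <= query_tokens]

# Resolves to exactly one of the 81 names, with no semantics involved.
print(select_tool("list compute instances in account 412 of us-east-1"))
```

Token overlap alone narrows the catalog to a single name, because every dimension of the user’s request appears verbatim in exactly one tool name. That holds at 81 tools and it holds at 81,000.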
The actual hard problem
The thing that breaks at scale isn’t tool count. It’s semantic distinctness between tools that don’t share a schema.
Consider what happens when an agent has to choose between:
- get_company_facts (SEC EDGAR: XBRL-tagged financial facts for a public US company)
- entity_profile (a compound tool fanning out across SEC + USPTO + LEI + GDELT + federal contracts for the same company)
- validate_claim (fact-check a natural-language claim against SEC XBRL)
- compare_entities (parallel calls across 2–5 companies)
- fintech_company_deep_dive (a vertical compound tool that overlaps entity_profile for financial companies specifically)
- recent_changes (time-filtered fan-out for “what’s new about company X in the last 30 days”)
All six are reasonable candidates for a query like “tell me about Apple.” None of them share a schema. None of them encode the answer in their name. The agent’s decision rests on understanding the meaning of each tool description — and on weighing tradeoffs about cost, latency, and information density that aren’t apparent from the names alone.
That’s the selection problem real gateways live with. It does not get easier when you add 1,000 more semantically distinct tools. It gets harder, because every new option adds another dimension on which the agent has to make a judgment call.
Why the demos succeed anyway
Two architectural choices in the demo do the heavy lifting, and neither generalizes to a heterogeneous catalog:
1. The hierarchy is encoded in the tool names. Once that’s true, selection collapses to substring matching. There is no semantic step. The agent reads account_412 in the prompt, scans for tool names containing account_412, narrows from there.
2. All tools share one handler. This means the catalog has the information density of a single tool with a structured argument space. The “N thousand tools” framing is roughly equivalent to one tool taking (region, account, category, operation) as arguments. The LLM doesn’t care which surface that argument space is exposed through.
If you flattened the same catalog into one manage_resource(region, account, category, operation) tool, the agent would solve the same task with no difficulty. Tool count is doing no work in the demo. The structured naming is doing all of it.
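For concreteness, the flattened version is just one tool whose JSON Schema enumerates the same dimensions. This is a hand-written sketch following the MCP tool-definition shape (name, description, inputSchema), not code from any demo:

```python
# Sketch of the flattened equivalent: one MCP-style tool definition whose
# structured argument space carries the same information as the whole
# generated catalog. Hand-written illustration, not any demo's source.
MANAGE_RESOURCE_TOOL = {
    "name": "manage_resource",
    "description": "Perform an operation on a cloud resource.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "operation": {"type": "string", "enum": ["get", "list", "describe"]},
            "region": {"type": "string", "enum": ["us_east_1", "eu_central_1", "us_west_2"]},
            "account": {"type": "string"},  # e.g. "412"
            "category": {
                "type": "string",
                "enum": ["compute_instances", "storage_buckets", "network_endpoints"],
            },
        },
        "required": ["operation", "region", "account", "category"],
    },
}
```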
What this looks like at our scale
Pipeworx currently exposes 1,423 tools across ~400 packs (SEC, FDA, FRED, USPTO, EPA, Census, ATTOM, ClinicalTrials, UN Comtrade, +~280 more) through one MCP endpoint. Live stats are at registry.pipeworx.io/stats.json: ~5M requests/month run rate on the gateway zone with 4× MoM growth, and ~20k MCP tool calls in the trailing 30 days from ~500 unique callers.
That’s orders of magnitude smaller than the cross-product demos by count. But every tool maps to a different upstream API, a different schema, a different auth model, and a different reason an agent might want to call it. There is no naming hierarchy to fall back on. Picking get_company_facts vs. entity_profile for a “tell me about Apple” query is a real decision the model has to make on description content alone.
What we found in practice: at this scale, the agent is the bottleneck. Asking a frontier model to scan 1,400 tool descriptions per turn is expensive, noisy, and produces inconsistent picks. The model spends most of its attention budget on cataloging rather than answering.
The solution we settled on is to put a routing layer in front of the catalog. The agent calls one meta-tool — ask_pipeworx("question in plain English") — instead of seeing all 1,423 tools. That meta-tool runs a two-stage selection:
- Embedding-based shortlist. Cosine similarity over pre-computed tool-description embeddings narrows 1,423 tools to the top 10 candidates. Intent boosts (regex on verbs like “fact-check,” “compare,” “tell me about”) nudge the right cross-cutting meta-tools into the shortlist when literal keyword overlap would otherwise lose to a pack tool.
- LLM selector. A small model — currently Haiku 4.5 for free tiers, Sonnet 4.6 for paid — sees only those 10 candidates, picks one (or two for comparison queries), fills the arguments, and returns the result. If execution fails, the gateway auto-retries with the next-best candidate from the same shortlist.
The agent never sees the full catalog directly. It sees one tool — ask_pipeworx — and gets back the answer.
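Condensed to its shape, the pipeline looks roughly like the sketch below. The two stages and the retry behavior are as described above; the toy bag-of-words “embedding,” the boost weight, and every function name are illustrative stand-ins, not Pipeworx’s actual code.

```python
import re
from collections import Counter
from math import sqrt

SHORTLIST_SIZE = 10

TOOLS = [
    {"name": "get_company_facts", "description": "SEC EDGAR XBRL-tagged financial facts for a public US company"},
    {"name": "entity_profile", "description": "Compound profile across SEC, USPTO, LEI, GDELT, and federal contracts"},
    {"name": "validate_claim", "description": "Fact-check a natural-language claim against SEC XBRL"},
    {"name": "compare_entities", "description": "Parallel comparison calls across 2-5 companies"},
    # ...1,400+ more records in the real catalog
]

def embed(text: str) -> Counter:
    """Throwaway bag-of-words stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Intent boosts: verb regexes that nudge cross-cutting meta-tools into the
# shortlist when literal keyword overlap would favor a narrower pack tool.
INTENT_BOOSTS = [
    (re.compile(r"\bfact.?check", re.I), "validate_claim"),
    (re.compile(r"\bcompare\b", re.I), "compare_entities"),
    (re.compile(r"\btell me about\b", re.I), "entity_profile"),
]

def shortlist(question: str) -> list[dict]:
    """Stage 1: cosine similarity over tool descriptions, plus intent boosts."""
    q = embed(question)
    scores = {t["name"]: cosine(q, embed(t["description"])) for t in TOOLS}
    for pattern, tool_name in INTENT_BOOSTS:
        if pattern.search(question) and tool_name in scores:
            scores[tool_name] += 0.25  # a nudge, not an override
    ranked = sorted(TOOLS, key=lambda t: scores[t["name"]], reverse=True)
    return ranked[:SHORTLIST_SIZE]

def execute(tool: dict, question: str) -> str:
    """Stand-in for the LLM selector filling arguments + the upstream call."""
    return f"(would call {tool['name']} for: {question!r})"

def ask_pipeworx(question: str) -> str:
    """Stage 2: a small model sees only the shortlist; a failure falls
    through to the next-best candidate, mirroring the gateway's auto-retry."""
    for tool in shortlist(question):
        try:
            return execute(tool, question)
        except Exception:
            continue
    raise RuntimeError("no shortlist candidate succeeded")

print(ask_pipeworx("tell me about Apple"))
```

Note that the boosts only nudge scores; similarity still does most of the ranking, so a loose regex can’t hijack an unrelated query.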
Three things this design trades off:
- One extra hop of latency for the routing decision (the embedding search is fast; the LLM call is the real cost). Worth it because the alternative is the agent making a bad decision after scanning 1,423 descriptions, then needing two or three more turns to recover.
- The selector model’s quality matters a lot. We moved from a smaller open-weights model to Haiku 4.5 this week specifically because the selector’s tool picks were the limiting factor on result quality. The selection problem doesn’t disappear; it moves to a layer where you can attack it directly — cheaper model, focused prompt, well-defined task.
- The embedding shortlist can be wrong. If the true best tool isn’t in the top 10, the LLM can’t pick it. We backfill with intent-regex boosts for the common cross-cutting meta-tools and watch the telemetry for misses.
This is a different architecture from “expose everything and let the agent figure it out.” At our scale the agent isn’t figuring it out; a dedicated, smaller, cheaper model is, working on a 10-tool problem instead of a 1,423-tool one, while the agent stays focused on the user’s actual task.
The takeaway people should be drawing
The cross-product demos prove something real: hierarchical name spaces scale to absurd cardinalities without trouble. That’s useful if your catalog has that structure — cloud resources by region/account, files in a tree, anything where the path encodes the meaning.
The lesson they don’t prove, but that’s being widely inferred, is “tool count doesn’t matter, just expose everything.” That holds only when your tools share a schema and the hierarchy is in the names. Most real-world catalogs don’t have that property. SEC filings, FDA drug labels, FRED economic series, USPTO patents, and weather forecasts are not points in a uniform tensor. They’re different kinds of things, and choosing between them is a different kind of problem.
If you’re designing an MCP server that will grow past a few dozen tools, the question to ask isn’t “how do I expose more?” It’s “what’s the structure of the selection problem in my catalog, and where in the stack is it cheapest to solve?”
For homogeneous catalogs, naming and substring matching can carry you very far. For heterogeneous ones, you need a layer between the agent and the catalog that does the semantic work — because the agent can’t, and shouldn’t have to.
You can try our version at gateway.pipeworx.io/mcp. Free anonymous tier, no signup. ask_pipeworx("what do you have?") is a reasonable first call.
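If you’d rather script that first call than wire it into a client, here is one way using the official Python MCP SDK’s Streamable HTTP client (pip install mcp). The endpoint URL is the one above; the "question" argument key is an assumption about ask_pipeworx’s input schema, so check the list_tools() output for the real shape.

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main() -> None:
    # Streamable HTTP transport yields read/write streams for the session.
    async with streamablehttp_client("https://gateway.pipeworx.io/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()  # should show ask_pipeworx
            print([t.name for t in tools.tools])
            # "question" is an assumed argument name; see the listed schema.
            result = await session.call_tool(
                "ask_pipeworx", {"question": "what do you have?"}
            )
            print(result.content)

asyncio.run(main())
```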