# We cut our MCP gateway's error rate from 28% to 7% in one afternoon — by actually reading the logs
What we found when we paginated through a week of usage_logs and let the data point us at fixes: arg-validation crashes, upstream rate limits, and agents misrouting between similar packs.
For the past six weeks I’ve been building Pipeworx, an MCP gateway that wires AI agents into 295 live data sources. The thesis is simple: agents are only as useful as the data they can reach, and the friction of integrating each new source — auth, rate limits, response shapes, error handling — is what holds back the whole field.
We log every tool call to a usage_logs table: which pack, which tool, the latency, the status code, the error message if there was one. That table has been quietly growing while I worked on documentation, recipes, README quality, and protocol surface area.
Yesterday afternoon I finally pulled the data.
The 7-day error rate was 27.6% across 1,579 calls. Almost three calls in ten were failing.
This post is the play-by-play of what I found, what I fixed, and what dropped out. 24 hours later, the error rate is down to 6.9% — and every remaining error is a legitimate user/config issue, not a bug.
## The shape of failure
The first thing the data made obvious: errors are heavy-tailed. A few packs are responsible for most of the failures.
| Pack | 7-day calls | Errors | Error rate |
|---|---|---|---|
| dictionary | 162 | 83 | 51% |
| openalex | 92 | 71 | 77% |
| zippopotam | 44 | 42 | 95% |
| noaa | 41 | 39 | 95% |
| edgar | 41 | 18 | 44% |
| fbiwanted | 16 | 14 | 88% |
| | 13 | 12 | 92% |
| nba | 12 | 9 | 75% |
| alphavantage | 8 | 8 | 100% |
Nine packs accounted for almost all of the failure volume. And the failures themselves cluster into four cleanly separable categories.
## Category 1: Code crashes from missing args
Agents pass arguments that don’t quite match the schema. Sometimes the LLM hallucinates a parameter name; sometimes the argument is null; sometimes it’s just omitted. When the pack’s code tries to do `args.ticker.toUpperCase()` and `args.ticker` is undefined, you get a 500 with the world’s worst error message:

```
Cannot read properties of undefined (reading 'toUpperCase')
```
That tells the agent literally nothing. It might as well be a segfault.
Five tools across four packs were doing this:
- `edgar.edgar_company_filings`: `Cannot read properties of null (reading 'trim')`
- `edgar.edgar_ticker_to_cik`: `toUpperCase` on undefined
- `sec-xbrl.get_company_facts`: `replace` on undefined
- `exchangerate.get_rates`: `toUpperCase` on undefined
- `clinicaltrials.ct_get_study`: `trim` on undefined
The fix is dull but effective: validate at the function boundary. One pattern, applied everywhere:
```js
if (typeof tickerOrCik !== 'string' || !tickerOrCik.trim()) {
  throw new Error(
    'Required argument "ticker_or_cik" is missing or empty. ' +
    'Pass a ticker like "AAPL" or a CIK like "320193".'
  );
}
```
Same shape, every call. The crash becomes a structured error that tells the agent exactly what to do. An LLM reading “Required argument ‘ticker_or_cik’ is missing” will fix its next call. An LLM reading “Cannot read properties of undefined” will retry the same broken call.
The lesson: at every public function boundary, validate inputs with a message that names the argument and shows a valid example. This is the cheapest possible reliability win.
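Factored into a helper, the same check becomes a one-liner at every boundary. This is a sketch of the pattern, not Pipeworx's actual code; `requireString` and its exact wording are illustrative:

```js
// Hypothetical helper: validate a required string argument at the function
// boundary and fail with a message that names the argument and shows a
// valid example value.
function requireString(args, name, example) {
  const value = args?.[name];
  if (typeof value !== 'string' || !value.trim()) {
    throw new Error(
      `Required argument "${name}" is missing or empty. ` +
      `Pass a value like "${example}".`
    );
  }
  return value.trim();
}

// At the top of a tool handler:
// const ticker = requireString(args, 'ticker_or_cik', 'AAPL');
```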
## Category 2: Upstream rate limits
OpenAlex was failing 77% of the time. Every error was the same:
```
OpenAlex works search error: 429
```
OpenAlex offers a “polite pool” — identify yourself in the request and you get a much higher rate limit. Our pack already does this, so the polite pool wasn’t the problem. The problem was that we were hitting OpenAlex from a Cloudflare Worker, which means our egress IP is shared with every other CF Workers user hitting OpenAlex. The polite pool helped, but it didn’t eliminate the 429s.
The real fix was at the gateway, not the pack: bump the cache TTL for academic data.
```js
// Academic / scholarly (very stable; aggressive cache absorbs upstream rate limits)
openalex: 3600,
crossref: 3600,
'semantic-scholar': 3600,
pubmed: 3600,
patents: 3600,
```
Research papers don’t change. A query for “transformer neural networks” returns the same top-25 results today as it did an hour ago. Caching for an hour is essentially lossless and absorbs the burst pressure that triggers the rate limit.
The lesson: the cheapest way to deal with upstream rate limits is to call upstream less. Cache aggressively wherever you can — and “aggressively” usually means “by a much larger factor than your gut says is safe.”
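The TTL table plugs into a cache check ahead of every upstream call. A minimal sketch of that shape, assuming a simple in-memory store (a real Worker would more likely use the Cloudflare Cache API); all names here are illustrative:

```js
// Illustrative per-pack TTL cache in front of upstream calls. A cache hit
// means no upstream request at all, which is what absorbs the 429 pressure.
const TTL_SECONDS = { openalex: 3600, crossref: 3600, default: 60 };
const store = new Map(); // key -> { expires, value }

function cachedFetch(pack, key, fetchUpstream, now = Date.now()) {
  const hit = store.get(key);
  if (hit && hit.expires > now) return hit.value;
  // Caching the return value directly (even when it is a Promise) also
  // dedupes concurrent in-flight requests for the same key.
  const value = fetchUpstream();
  const ttlMs = (TTL_SECONDS[pack] ?? TTL_SECONDS.default) * 1000;
  store.set(key, { expires: now + ttlMs, value });
  return value;
}
```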
## Category 3: Invalid input → confusing error
The dictionary pack uses the Free Dictionary API, which only knows common English words. Agents were asking it for:
- “Lenz” (it’s a physicist’s name, not a vocabulary word)
- “self-inductance” (it’s a physics concept)
- “electromagnetic induction” (also physics)
- “Faraday law” (also physics)
The dictionary returned a 404, and we used to throw `Error: Word not found: "Lenz"`. The agent gets a 500 status with an error message that suggests the service is broken, when really the query is just out of scope for this source.
The right shape is to acknowledge that this is a perfectly valid query that has no answer in this particular source, and to point the agent at a better one:
```js
return {
  word,
  found: false,
  hint: word.trim().includes(' ')
    ? 'This dictionary only handles single words; pass one word at a time.'
    : 'Word not in dictionary. Try a different spelling or root form.',
};
```
Same treatment for NOAA’s `get_forecast`, which only covers the US. Agents were passing coordinates for Tokyo and getting 500s:
```js
if (pointRes.status === 404) {
  return {
    location: { lat, lon },
    found: false,
    coverage: 'us-only',
    hint: 'NOAA NWS only covers the United States and territories. ' +
      'For non-US coordinates, use the "weather" pack instead — ' +
      'it has global coverage via Open-Meteo.',
  };
}
```
Same for zippopotam, where postal codes outside its ~60-country coverage now return `{found: false, hint: ...}` instead of 500.

The lesson: a 200 with `{found: false}` is almost always better than a 500. The agent can keep working; the data is structured; you’ve reserved 500 for actually broken things.
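The shape shared by all three fixes can be captured in one wrapper. A sketch, assuming the upstream signals “no answer” with a 404; `softFetch` and its argument names are hypothetical, not the actual pack code:

```js
// Hypothetical wrapper: a 404 from a limited-coverage source becomes a
// structured soft-fail the agent can act on; anything else unexpected
// stays a hard error, so 500 still means "actually broken".
async function softFetch(url, hint, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (res.ok) return { found: true, data: await res.json() };
  if (res.status === 404) return { found: false, hint };
  throw new Error(`Upstream error ${res.status} for ${url}`);
}
```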
## Category 4: Upstream changes you didn’t notice
The most embarrassing category. Four packs were broken because the data sources changed under them and our packs hadn’t been updated:
- reddit: Reddit started 403’ing unauthenticated JSON endpoints. Our pack was hitting `old.reddit.com`, thinking that bypassed it. It doesn’t anymore.
- fbiwanted: `api.fbi.gov` started 403’ing cloud-hosted clients. (Probably an Akamai rule against CF Worker IPs.)
- nba: BallDontLie now requires an API key, and our schema didn’t surface that requirement for two of the four tools.
- alphavantage: Their free tier dropped to 25 requests per day per key, and agents kept passing the literal string `"demo"` as the key (because the schema description said it was an API key but didn’t say whose).
For reddit and fbiwanted: same `{ok: false, reason, hint}` soft-fail pattern. The agent gets back something it can introspect and react to.

For nba: make `_apiKey` required on every tool’s schema. Previously two of the four tools didn’t require it, so agents would call them without a key, get a 401, and have no idea why.

For alphavantage: refuse the literal `"demo"` key explicitly, and append a hint pointing at our Finnhub pack on every error path. Finnhub has 60 calls per minute on the free tier and no daily cap — a much better default for actual agent traffic.
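The `"demo"`-key refusal might look like the sketch below. The function name and error wording are illustrative, not the pack's actual code:

```js
// Hypothetical guard: reject the Alpha Vantage placeholder key up front,
// with a message that tells the agent what to do instead of a bare 401.
function requireRealAlphaVantageKey(key) {
  if (typeof key !== 'string' || !key.trim() || key.trim().toLowerCase() === 'demo') {
    throw new Error(
      'A real Alpha Vantage API key is required; "demo" only works for ' +
      "Alpha Vantage's own sample queries, and free keys are capped at " +
      '25 requests/day. For agent traffic, consider the finnhub pack.'
    );
  }
  return key.trim();
}
```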
The lesson: every pack that wraps a third-party API is silently rotting. The signal is in your logs, but only if you read them.
## What about the recipes?
A pattern that kept showing up: agents were calling the wrong pack for what they wanted.
- They asked the dictionary for “Lenz” — but Wikipedia would have answered cleanly.
- They asked NOAA for Tokyo’s weather — but the `weather` pack covers Tokyo.
- They asked edgar for current stock prices — but `alphavantage` (or now Finnhub) has those.
This isn’t a bug in any single pack. It’s a routing problem. The agent didn’t know which tool to pick.
The traditional fix is “better tool descriptions.” But agents already read descriptions and still get it wrong, because the decision often depends on which source is right for this specific case — and that’s a recipe, not a description.
So I wrote four:
- Look up a concept — dictionary vs Wikipedia vs OpenAlex vs concept lookup, with a decision tree
- Weather for any location — NOAA for US, Open-Meteo for global
- Company financials lookup — when to use edgar vs sec-xbrl vs alphavantage vs the `compare_entities` meta-tool
- Postal code → place lookup — Zippopotam coverage + Open-Meteo geocoder fallback for unsupported countries
Each one takes the “if you remember nothing else, remember this routing decision” form. Whether agents (or rather, the humans wiring up the agents) read them remains to be seen — but at least the answer is now written down in one place.
## The numbers, after
24 hours after deploying the fixes:
| Metric | Before | After |
|---|---|---|
| Calls | 1,579 (7d) | 101 (24h) |
| Errors | 436 | 7 |
| Error rate | 27.6% | 6.9% |
And every single one of the remaining 7 errors is legitimate:
- 6× `Linear OAuth token required` — the agent hasn’t connected its Linear OAuth yet. Correct rejection.
- 1× `Repository not found: deep-researcher/linear` — the agent passed a bad GitHub repo name. Correct rejection.
There are no code bugs left in the error stream. Every line is now either an upstream issue or an actionable user/config message. That’s the bar.
## What I’d tell you to do
If you’re building MCP servers or any service that LLM agents call:
- Log every call. Status, latency, error message, identifier. The schema doesn’t matter; the data does.
- Read the logs. Not as an analytics exercise — as a debugging one. Sort by error count. Group by error message. Look at the top 20.
- Validate inputs at every public boundary. With a message that names the bad argument and shows a valid example. This is 90% of the win.
- Cache aggressively. Especially for any upstream that has rate limits, and especially for data that doesn’t change minute-to-minute. The right TTL is almost always longer than your gut wants.
- Soft-fail with structure. A 200 with `{found: false, hint: "..."}` lets the agent reason about the next step. A 500 just looks broken.
- Treat “agent called the wrong tool” as a documentation bug, not a tool bug. Write decision trees, not just descriptions.
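Step 2 above fits in a few lines. A sketch assuming log rows shaped like the `usage_logs` fields described earlier; `topErrors` is illustrative, not an actual Pipeworx query:

```js
// Illustrative triage: group error rows by pack + message and return the
// worst offenders, i.e. the "sort by error count, group by message" step.
function topErrors(rows, limit = 20) {
  const counts = new Map();
  for (const row of rows) {
    if (!row.error) continue; // successful calls don't count
    const key = `${row.pack}: ${row.error}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}
```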
That’s it. No clever architecture, no new framework. Just reading the logs and writing the fixes the logs tell you to write.
You can try the result at gateway.pipeworx.io/mcp. It’s free, no signup required. Connect any MCP client and 295 packs are there. If you find a bug — or hit a 500, or get an “agent called the wrong tool” experience — file it via the `pipeworx_feedback` tool. I’ll read it. It’s in the logs.