USAspending Federal Contracts for AI Agents — Mirrored, Indexed, Sub-Second

The US federal government awards roughly $700 billion in contracts every year, all of it disclosed at USAspending.gov. The data is open, the search interface is mature, and the API is well-documented — in theory it’s a perfect data source for AI agents researching federal vendors, contract opportunities, or who-won-what for any given technology stack.

In practice, integrating USAspending into an agent stack is harder than it should be. The live api.usaspending.gov endpoint silently hangs requests from Cloudflare Worker egress IPs (and probably others), and the API returns enormous payloads when filter parameters don’t match the expected shape, so a single agent call can stall the worker until it is killed. Pipeworx hit both problems while running the usaspending pack against the live API, and after the third user bug report we rebuilt the stack underneath it.

What we built

Every month, a GitHub Actions runner downloads the USAspending All-agencies contracts snapshot — a ~180 MB ZIP, ~1.5 GB CSV, ~1.5 million contract transactions — transforms it (filter to recent fiscal years, project 299 columns down to the 30 that matter for search), and bulk-loads it into a Supabase table behind the gateway. The MCP pack’s usa_award_search tool now hits that table instead of the live API.
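The transform step can be sketched roughly as follows. This is a minimal illustration rather than the actual pipeline: the short keep-list, the ISO-formatted action_date column, and the helper names are assumptions made for the example, and the real job keeps about 30 columns.

```python
import csv
import io
from datetime import date

# Hypothetical keep-list; the real job keeps about 30 of the 299 columns.
KEEP_COLUMNS = [
    "transaction_unique_key",
    "action_date",
    "recipient_name",
    "transaction_description",
]

def fiscal_year(d: date) -> int:
    """US federal fiscal year: FY2020 runs 2019-10-01 through 2020-09-30."""
    return d.year + 1 if d.month >= 10 else d.year

def transform(rows, min_fiscal_year=2020, keep=KEEP_COLUMNS):
    """Yield only recent rows, projected down to the kept columns."""
    for row in rows:
        # Assumes action_date is an ISO date string (YYYY-MM-DD).
        action = date.fromisoformat(row["action_date"])
        if fiscal_year(action) >= min_fiscal_year:
            yield {col: row.get(col, "") for col in keep}

# Tiny made-up CSV standing in for the ~1.5 GB snapshot file.
SAMPLE = """transaction_unique_key,action_date,recipient_name,transaction_description,unused_column
K1,2019-09-30,ACME CORP,OLD AWARD,x
K2,2019-10-01,ACME CORP,NEW AWARD,y
"""

kept = list(transform(csv.DictReader(io.StringIO(SAMPLE))))
```

Streaming rows through a generator keeps memory flat, which matters when the input is a multi-gigabyte CSV on a CI runner.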

Results:

Sub-second response for keyword queries that previously timed out at 30s
Full-text search via Postgres GIN index on transaction_description || prime_award_base_transaction_description || recipient_name
Filter by NAICS code, set-aside type, agency, date range — all standard SQL filters with proper indexes
No Cloudflare egress problems: the data lives in our own Supabase table, not behind a third-party API that may or may not respond
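To make the query shape concrete, here is a hedged sketch of how a search like this could be assembled as a parameterized Postgres query. The table name, the naics_code, set_aside, and action_date column names, and the function itself are assumptions for illustration. Only the three text columns come from the list above, and a GIN expression index is only used if the query expression matches the indexed expression exactly.

```python
# Assumed table name; the three text columns are the ones named above.
TABLE = "usaspending_contracts"
SEARCH_DOC = (
    "coalesce(transaction_description, '') || ' ' || "
    "coalesce(prime_award_base_transaction_description, '') || ' ' || "
    "coalesce(recipient_name, '')"
)

def build_award_search(keywords, naics_code=None, set_aside=None,
                       start_date=None, end_date=None, limit=5):
    """Return (sql, params) for a parameterized Postgres query."""
    where = [
        f"to_tsvector('english', {SEARCH_DOC}) @@ plainto_tsquery('english', %s)"
    ]
    params = [" ".join(keywords)]
    # Optional filters on assumed column names; each would need its own index.
    optional = [
        ("naics_code = %s", naics_code),
        ("set_aside = %s", set_aside),
        ("action_date >= %s", start_date),
        ("action_date <= %s", end_date),
    ]
    for clause, value in optional:
        if value is not None:
            where.append(clause)
            params.append(value)
    params.append(limit)
    sql = f"SELECT * FROM {TABLE} WHERE " + " AND ".join(where) + " LIMIT %s"
    return sql, params

sql, params = build_award_search(["cybersecurity"], naics_code="541512")
```

Passing values as parameters instead of interpolating them keeps agent-supplied keywords out of the SQL text.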

What an agent can ask

usa_award_search(
  keywords: ["cybersecurity"],
  limit: 5
)
→ [
    { recipient: "MAVERIS LLC", amount: 331158125.88,
      awarding_agency: "Department of Veterans Affairs",
      description: "VA CYBERSECURITY OPERATIONS CENTER NEXT GENERATION II..." },
    { recipient: "NTT DATA SERVICES FEDERAL GOVERNMENT, LLC", amount: 78781831.20,
      awarding_agency: "Department of Justice",
      description: "TITLE: CYBERSECURITY OPERATIONS - HACS REQUESTOR..." },
    ...
  ]

Typical query patterns:

“What contracts has $VENDOR won?” — usa_award_search(keywords: ["Lockheed Martin"])
“What does the government spend on $TOPIC?” — keyword search with start_date / end_date
“Who are the top recipients in NAICS $CODE?” — filter by NAICS, aggregate by recipient
“Recent awards under the SBA 8(a) set-aside” — filter by set_aside, sort by date
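The "aggregate by recipient" pattern can be done on the agent side once rows come back. The sketch below assumes rows shaped like the example result above (recipient and amount fields) and uses made-up vendors and amounts. The helper name is hypothetical and not part of the pack.

```python
from collections import defaultdict

def top_recipients(awards, n=3):
    """Sum award amounts per recipient and return the n largest totals."""
    totals = defaultdict(float)
    for award in awards:
        totals[award["recipient"]] += award["amount"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up rows in the same shape as the example result.
rows = [
    {"recipient": "VENDOR A", "amount": 100.0},
    {"recipient": "VENDOR B", "amount": 250.0},
    {"recipient": "VENDOR A", "amount": 200.0},
]
ranked = top_recipients(rows, n=2)
```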

For higher-level analysis the govcon-intel compound pack glues usaspending together with samgov (live federal contract opportunities) and sbir (small business R&D grants) into single-call workflows — govcon_opportunity_scan, govcon_contractor_profile, govcon_agency_landscape.

Coverage and freshness

The current mirror covers fiscal year 2020 onward — ~1.4 million transactions and growing. Each month’s snapshot includes both new transactions and modifications to existing awards (the upsert is keyed on transaction_unique_key, so amendments and option exercises update the existing row rather than creating duplicates). The refresh cron runs on the 15th of each month, ~9 days after USAspending publishes (giving them margin for any QA republishing of their monthly file).
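The upsert behavior is easy to see in miniature. The sketch below uses an in-memory SQLite database as a stand-in for the Supabase Postgres table (both support INSERT ... ON CONFLICT ... DO UPDATE). The table layout and the sample rows are invented for the example; only the transaction_unique_key conflict key comes from the description above.

```python
import sqlite3

# In-memory SQLite as a stand-in for the Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contracts ("
    " transaction_unique_key TEXT PRIMARY KEY,"
    " recipient_name TEXT,"
    " amount REAL)"
)

UPSERT = (
    "INSERT INTO contracts (transaction_unique_key, recipient_name, amount)"
    " VALUES (?, ?, ?)"
    " ON CONFLICT(transaction_unique_key) DO UPDATE SET"
    " recipient_name = excluded.recipient_name,"
    " amount = excluded.amount"
)

def load_snapshot(rows):
    """Insert new transactions and overwrite rows whose key already exists."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)

# First month: two new transactions.
load_snapshot([("K1", "ACME CORP", 100.0), ("K2", "OTHER LLC", 50.0)])
# Next month: K1 is amended, K3 is new.
load_snapshot([("K1", "ACME CORP", 175.0), ("K3", "THIRD INC", 10.0)])

count = conn.execute("SELECT COUNT(*) FROM contracts").fetchone()[0]
k1_amount = conn.execute(
    "SELECT amount FROM contracts WHERE transaction_unique_key = 'K1'"
).fetchone()[0]
```

After the second load the table holds three rows, not four, and K1 carries the amended amount.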

Coverage limits worth knowing:

USPS is missing. The US Postal Service is an independent agency that contracts through its own systems and doesn’t report through USAspending. We can’t surface what isn’t there.
Some agencies file thinly. Defense and Health & Human Services dominate the volume; smaller agencies appear less frequently in the data.
Modifications, not always original awards. USAspending’s current-month “All” snapshot is technically a delta capturing recent activity, so an award’s original transaction may be absent if it predates the window. Deep historical backfill (pre-FY2020) requires their per-fiscal-year archive files, which we may add in a future pass.

When to use the live API anyway

The mirror handles usa_award_search (the most-called USAspending tool by a wide margin). Other tools in the pack — usa_spending_by_agency, usa_spending_by_category, usa_recipient_profile, usa_spending_trends — still hit the live API. These are lower-volume queries and many of them aggregate across the full historical dataset that the monthly snapshot doesn’t cover. If you hit a timeout there, the pack returns a structured error in under 8 seconds (no more 30-second hangs) and your agent can degrade gracefully.
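On the caller side, "degrade gracefully" can look something like the sketch below. It is an illustration of the pattern, not the pack's implementation: the wrapper, the error field names, and the fake upstream are all hypothetical, and only the 8-second budget comes from the text above.

```python
import concurrent.futures
import time

def call_with_deadline(fn, timeout_s=8.0):
    """Run fn in a worker thread; return a structured error if it overruns."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return {"ok": True, "data": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        return {"ok": False, "error": "upstream_timeout", "timeout_s": timeout_s}
    finally:
        # Do not block on the stuck call; the worker thread finishes on its own.
        pool.shutdown(wait=False)

def fake_slow_upstream():
    """Stand-in for a live API call that hangs."""
    time.sleep(0.3)
    return "late"

result = call_with_deadline(fake_slow_upstream, timeout_s=0.05)
if not result["ok"]:
    # The agent can note the gap and continue with what the mirror provides.
    fallback_note = "live aggregate unavailable; continuing with mirror search only"
```

Returning a small structured object instead of raising lets an agent branch on the error field without parsing exception text.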

The broader pattern

The usaspending pack is the first of what will probably be several mirrors for a “data source where the live API is unreliable but the data itself is gold.” SEC EDGAR’s 8-K filings got similar treatment in the sec-events pack: they’re already in EDGAR’s Atom feed, but we pre-classify them by severity at ingest time so agents skip the procedural noise. The pattern: own the data infrastructure where the upstream is flaky, and surface the result through clean MCP tools so the agent never sees the flakiness.