MCP (Model Context Protocol) is the right primitive for letting agents reach beyond their context window — but the production reality is rougher than the demos. We hypothesised that real-world MCP failures cluster into a small number of predictable categories, and that a focused pre-flight check would catch most of them. This report tracks every incident across our six weeks of production MCP usage, root-causes them, and back-tests a pre-flight check against the same incident set.
Eleven MCP servers were live across our stack during the audit window:
3× internal (content pipeline, member sync, vault publishing)
4× third-party (Linear, Notion, Supabase, GitHub)
2× experimental (custom Vibeclubs format adapter, Arcanea canon lookup)
2× community (anime-legends-grimoire, suno-prompt-architect)
Every incident was logged with: timestamp, server, symptom, agent action that triggered it, recovery time, root cause, remediation. A pre-flight check was then designed against the root-cause distribution and back-tested by simulating each incident's preconditions.
Six-week window: 2026-03-08 through 2026-04-19. Workload: ~14,000 MCP tool calls across the 11 servers. Sentry alerting wired on all server-side surfaces. Incident threshold: any failure that took more than 5 minutes to recover from or required a code change.
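The incident threshold can be written down as a predicate so that qualification is not a judgment call. This is a minimal sketch: the `IncidentRecord` shape mirrors the logged fields listed above, but its field names are illustrative assumptions, not the actual log schema.

```typescript
// Hypothetical record shape mirroring the logged fields; names are illustrative.
interface IncidentRecord {
  timestamp: string;        // ISO 8601
  server: string;
  symptom: string;
  triggeringAction: string; // agent action that triggered the failure
  recoveryMinutes: number;
  rootCause: string;
  remediation: string;
  requiredCodeChange: boolean;
}

const RECOVERY_THRESHOLD_MINUTES = 5;

// A failure qualifies as an incident if recovery took more than five minutes
// or if fixing it required a code change.
function qualifiesAsIncident(record: IncidentRecord): boolean {
  return record.recoveryMinutes > RECOVERY_THRESHOLD_MINUTES || record.requiredCodeChange;
}
```

Keeping the threshold in one named constant makes it easy to replay the same log under a stricter or looser definition.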
Five incidents qualified. Categorised by root cause:
#  Incident                                  Server                  Category                  Recovery
1  Notion page-id format change              notion                  Schema drift              18 min
2  Linear webhook signature header rename    linear                  Schema drift              34 min
3  Supabase service role key rotation        supabase                Auth-token rotation gap   6 hr
4  GitHub MCP rate-limit cascade             github                  Transport back-pressure   51 min
5  Anime-grimoire schema migration           anime-legends-grimoire  Schema drift              12 min
Cluster summary:
Schema drift              3 of 5 (60%)   server changed, agent kept calling the old shape
Auth-token rotation gap   1 of 5 (20%)   secret rotated, server unaware
Transport back-pressure   1 of 5 (20%)   upstream rate limit cascaded
The hypothesis held: every incident fell into one of the three predicted categories. The largest cluster (60%) is silent schema drift — the upstream server changed a field name, payload shape, or tool signature, and the agent kept calling the old shape until something failed loudly enough to trigger an alert. The most expensive (Supabase key rotation, 6h recovery) was a single missed step in a runbook.
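The cluster shares follow directly from the per-incident categories. A small sketch of that arithmetic, using the five categories from the incident table; the function name `clusterShares` is an illustrative choice.

```typescript
// Root-cause categories of the five qualifying incidents, in table order.
const incidentCategories: string[] = [
  "Schema drift",
  "Schema drift",
  "Auth-token rotation gap",
  "Transport back-pressure",
  "Schema drift",
];

// Count incidents per category, then express each count as a rounded
// percentage of the total.
function clusterShares(categories: string[]): Record<string, { count: number; percent: number }> {
  const counts: Record<string, number> = {};
  for (const category of categories) {
    counts[category] = (counts[category] ?? 0) + 1;
  }
  const shares: Record<string, { count: number; percent: number }> = {};
  for (const category of Object.keys(counts)) {
    const count = counts[category];
    shares[category] = { count, percent: Math.round((count / categories.length) * 100) };
  }
  return shares;
}

const shares = clusterShares(incidentCategories);
for (const category of Object.keys(shares)) {
  const { count, percent } = shares[category];
  console.log(`${category}: ${count} of ${incidentCategories.length} (${percent}%)`);
}
```

With only five incidents, each one moves a share by 20 percentage points, so the distribution is indicative rather than precise.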
The pre-flight check we back-tested:
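// Run before each session that uses an MCP server.
// Throws SchemaDriftError on drift; the auth probe throws if credentials are stale.
async function preflight(server: McpServer): Promise<void> {
  // 1. Schema-drift check: fingerprint the tool list and compare to the last known value.
  //    Each schema is serialised explicitly; interpolating the schema object directly
  //    would yield "[object Object]" and hide every schema change.
  const tools = await server.listTools()
  const fingerprint = hash(
    tools
      .map(t => `${t.name}:${JSON.stringify(t.inputSchema)}`)
      .sort() // tool order should not affect the fingerprint
      .join('|')
  )
  if (fingerprint !== await loadLastFingerprint(server.name)) {
    throw new SchemaDriftError(server.name, fingerprint)
  }

  // 2. Auth-token freshness: re-validate if the last successful check is older than 24h.
  //    A missing timestamp counts as stale.
  const lastValidated = (await loadLastValidated(server.name)) ?? 0
  if (Date.now() - lastValidated > 24 * 3600 * 1000) {
    await server.callTool('whoami', {}) // cheap auth probe; assumes the server exposes such a tool
    await saveLastValidated(server.name, Date.now())
  }

  // 3. Back-pressure check: inspect the upstream rate-limit headers from the last call.
  const lastHeaders = await loadLastResponseHeaders(server.name)
  const remaining = Number.parseInt(lastHeaders?.['x-ratelimit-remaining'] ?? '', 10)
  if (!Number.isNaN(remaining) && remaining < 10) {
    await sleep(60_000) // back off
  }
}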
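A fingerprint is only useful if it changes when a schema changes and stays the same when nothing meaningful changed. In JavaScript, interpolating a schema object straight into a string yields `[object Object]`, and plain `JSON.stringify` is sensitive to key order. The sketch below is one possible way to build a stable fingerprint with Node's built-in `crypto` module; `ToolDescriptor`, `canonicalJson`, and `fingerprintTools` are illustrative names, not part of any MCP SDK.

```typescript
import { createHash } from "node:crypto";

// Minimal tool shape for illustration; real tool entries carry more fields.
interface ToolDescriptor {
  name: string;
  inputSchema: unknown;
}

// Serialise a JSON-like value with object keys sorted, so that key order
// does not affect the output.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalJson).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`)
      .join(",");
    return `{${body}}`;
  }
  const serialised = JSON.stringify(value);
  return serialised === undefined ? "null" : serialised;
}

// SHA-256 over the name-sorted tool list, one canonical line per tool.
function fingerprintTools(tools: ToolDescriptor[]): string {
  const sorted = [...tools].sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
  const hash = createHash("sha256");
  for (const tool of sorted) {
    hash.update(canonicalJson({ name: tool.name, inputSchema: tool.inputSchema }) + "\n");
  }
  return hash.digest("hex");
}
```

Sorting both the tools and the schema keys trades a little CPU for fewer false alarms: a server that merely reorders its tool list no longer trips the drift check, while a renamed field still does.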
Back-test result: the pre-flight catches 4 of 5 incidents (80%) before they surface as failures. The one it misses (incident 2, the Linear webhook signature header rename) is detected only when the webhook fires, because the pre-flight doesn't see inbound payloads. We added a webhook-side guard separately.
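A webhook-side guard for this class of failure could look roughly like the sketch below. It is not the guard we deployed: the header name passed as `expectedHeader` and the hex-encoded HMAC-SHA256 scheme are assumptions for illustration, so check the provider's documentation for the real ones. The point is to tell "the signature header is missing" (a likely rename, so schema drift) apart from "the signature is wrong".

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

type GuardResult =
  | { ok: true }
  | { ok: false; reason: "missing-signature-header" | "bad-signature"; seenHeaders?: string[] };

// Hypothetical guard: `expectedHeader` and the hex HMAC-SHA256 scheme are
// assumptions, not a specific provider's contract.
function guardWebhook(
  headers: Record<string, string | undefined>,
  rawBody: string,
  secret: string,
  expectedHeader: string,
): GuardResult {
  // HTTP header names are case-insensitive, so normalise before lookup.
  const lower: Record<string, string> = {};
  for (const key of Object.keys(headers)) {
    const value = headers[key];
    if (value !== undefined) lower[key.toLowerCase()] = value;
  }

  const provided = lower[expectedHeader.toLowerCase()];
  if (provided === undefined) {
    // Likely contract drift: report which headers did arrive so a rename is obvious.
    return { ok: false, reason: "missing-signature-header", seenHeaders: Object.keys(lower).sort() };
  }

  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const given = Buffer.from(provided, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) {
    return { ok: false, reason: "bad-signature" };
  }
  return { ok: true };
}
```

Alerting on `missing-signature-header` separately, with `seenHeaders` attached, turns a rename from a debugging session into a one-line diff.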
Roughly twenty lines of code plus one small state store. In the back-test, that was enough to catch 4 of 5 incidents.
The lesson is operational, not architectural. MCP servers are stable enough to run production workloads. What's unstable is the contract between agent and server, because nothing forces it to stay stable. Add the pre-flight to your session start. Persist the fingerprint and the last-validated timestamp. Wire one cheap auth probe per server per day. On our incident set, that would have caught 4 of 5 incidents for about an hour of plumbing.
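The state the pre-flight needs is small: one fingerprint and one timestamp per server. A minimal sketch of such a store follows, assuming an in-memory record that the caller serialises to wherever it keeps state; `PreflightStore`, `needsAuthProbe`, and `compareFingerprint` are illustrative names.

```typescript
interface PreflightState {
  fingerprint?: string;
  lastValidatedMs?: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// In-memory store keyed by server name; serialize() and deserialize() let the
// caller persist it to a file or key-value store of their choice.
class PreflightStore {
  private state: Record<string, PreflightState> = {};

  get(server: string): PreflightState {
    return this.state[server] ?? {};
  }

  update(server: string, patch: PreflightState): void {
    this.state[server] = { ...this.get(server), ...patch };
  }

  serialize(): string {
    return JSON.stringify(this.state);
  }

  static deserialize(json: string): PreflightStore {
    const store = new PreflightStore();
    store.state = JSON.parse(json) as Record<string, PreflightState>;
    return store;
  }
}

// A server that was never validated counts as stale.
function needsAuthProbe(state: PreflightState, nowMs: number, maxAgeMs: number = DAY_MS): boolean {
  return state.lastValidatedMs === undefined || nowMs - state.lastValidatedMs > maxAgeMs;
}

// Distinguish a first run (nothing stored yet) from real drift.
function compareFingerprint(state: PreflightState, current: string): "first-run" | "unchanged" | "drift" {
  if (state.fingerprint === undefined) return "first-run";
  return state.fingerprint === current ? "unchanged" : "drift";
}
```

Treating the first run as its own outcome avoids a spurious drift error the first time a server is checked; the caller can record the fingerprint and continue.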
The deeper lesson: production agentic systems fail in mundane, recoverable ways. The discourse keeps catastrophising AI safety while the actual production failures are the same failures every distributed system has always had — schema drift, secret rotation, rate limits. Treat them with the same discipline you'd treat a microservice: schema versioning, secret rotation runbooks, rate-limit observability. Boring works.
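Rate-limit observability can start with the headers the upstream already sends. The sketch below assumes GitHub-style header names (`x-ratelimit-remaining`, and `x-ratelimit-reset` as UTC epoch seconds); other upstreams may use different names or units, and `backoffMs` with its default thresholds is an illustrative choice. It waits until the reset time when one is known, and falls back to a fixed delay otherwise.

```typescript
// Decide how long to wait before the next call, based on the rate-limit
// headers of the previous response. Header names are assumed to follow the
// GitHub REST API convention; adjust them for other upstreams.
function backoffMs(
  headers: Record<string, string | undefined>,
  nowMs: number,
  minRemaining: number = 10,
  fallbackMs: number = 60_000,
): number {
  const remaining = Number.parseInt(headers["x-ratelimit-remaining"] ?? "", 10);
  // No usable header, or enough budget left: do not wait.
  if (Number.isNaN(remaining) || remaining >= minRemaining) return 0;

  const resetSeconds = Number.parseInt(headers["x-ratelimit-reset"] ?? "", 10);
  // Budget is low but the reset time is unknown: use the fixed fallback.
  if (Number.isNaN(resetSeconds)) return fallbackMs;

  // Wait until the window resets; never return a negative delay.
  return Math.max(0, resetSeconds * 1000 - nowMs);
}
```

Logging the returned delay alongside the server name gives a cheap time series of how close each upstream runs to its limit.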
Next week's report measures the economic side: cost-per-recovery-minute under different alerting regimes. Hypothesis: a $40/mo Sentry plan saves >$400/mo in unrecovered downtime for a sole operator running 10+ MCP servers. Method preview: replay the six-week incident set under three alerting configurations.
The incident log (server names redacted where third-party), root-cause analyses, and the pre-flight script are published to the public Vaults of Benevolence under mcp-production-failures-2026-04. Built on SIP.