The agent-orchestration discourse moved faster than the receipts. Solo operators see frameworks promising "swarms," "hierarchical mesh," and "consensus," and infer that more agents always mean better outputs. That is true for some workloads and false for most. We hypothesised that the rule is sharper than that: multi-agent only beats single-agent when the cost of one wrong step is high enough to justify a verifier loop. For everything else, a strong single agent with a clear prompt is the better choice.
We ran the same 80-task workload across three orchestration patterns for four weeks (20 tasks per week, randomised assignment). The workload mixed three task types: long-form drafting (20 tasks), data-extraction-then-action (30 tasks), and irreversible-publish (30 tasks where the output gets shipped to a public surface). Per-task cost, end-to-end latency, and output quality (5-criterion rubric, scored by three blind reviewers) were tracked.
Three patterns:
- Single Agent. Claude 4.7 Sonnet, single call, strong system prompt, retry on parse error.
- Pipeline. Claude 4.7 Sonnet decomposes task → 3-7 sub-tasks → executes serially → aggregates.
- Verifier Loop. Claude 4.7 Sonnet drafts → Claude 4.7 Opus verifies against a checklist → Sonnet revises → Opus approves or rejects (max two cycles, then escalates).
The Pure Swarm pattern (3 parallel agents producing variants, then a judge picks one) was tested in week 1 and dropped — see Results.
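The Verifier Loop's control flow can be sketched as below. This is a minimal sketch, not the harness used in the experiment: `draft`, `verify`, and `revise` are hypothetical callables standing in for the Sonnet and Opus calls, and "max two cycles" is read here as at most two revisions before escalating.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    approved: bool
    issues: List[str]


@dataclass
class LoopResult:
    status: str   # "approved" or "escalated"
    output: str
    cycles: int   # number of revisions performed


def verifier_loop(
    task: str,
    draft: Callable[[str], str],
    verify: Callable[[str, str], Verdict],
    revise: Callable[[str, str, List[str]], str],
    max_cycles: int = 2,  # assumption: "max two cycles" means two revisions
) -> LoopResult:
    """Draft, verify against a checklist, revise, and escalate if still rejected."""
    output = draft(task)
    verdict = verify(task, output)
    cycles = 0
    while not verdict.approved and cycles < max_cycles:
        output = revise(task, output, verdict.issues)
        cycles += 1
        verdict = verify(task, output)
    status = "approved" if verdict.approved else "escalated"
    return LoopResult(status, output, cycles)


# Toy stand-ins (hypothetical) so the sketch runs without any model calls.
def toy_draft(task: str) -> str:
    return f"draft of {task}"


def toy_verify(task: str, text: str) -> Verdict:
    ok = "cited" in text
    return Verdict(ok, [] if ok else ["missing citation"])


def toy_revise(task: str, text: str, issues: List[str]) -> str:
    return text + " (cited)"


result = verifier_loop("release note", toy_draft, toy_verify, toy_revise)
print(result.status, result.cycles)
```

Keeping escalation as an explicit status, rather than raising an exception, lets the caller route rejected drafts to a human without treating them as failures of the loop itself.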
| Parameter | Value |
|---|---|
| Workload | 80 tasks (4 weeks × 20) |
| Task types | Drafting (20) · Extract+act (30) · Irreversible publish (30) |
| Models | Claude 4.7 Sonnet (worker), Claude 4.7 Opus (verifier) |
| Rubric | 5 criteria × 1-5 scale, 3 blind reviewers |
| Cost tracking | Per-task USD, including retries |
| Replication | Full prompt set + task corpus in the audit kit linked at end |
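The write-up does not say how the 15 rubric scores per task (5 criteria × 3 reviewers) collapse into one composite number. One plausible reading, assumed here and not confirmed by the source, is an unweighted mean per reviewer followed by a mean across reviewers. The function name and the sample scores are illustrative only.

```python
from statistics import mean
from typing import List


def composite_quality(scores: List[List[int]]) -> float:
    """Average a task's rubric scores.

    `scores` holds one list per reviewer, each with five criterion
    scores on the 1-5 scale. Assumption: all criteria and reviewers
    carry equal weight.
    """
    for reviewer in scores:
        if len(reviewer) != 5 or any(not 1 <= s <= 5 for s in reviewer):
            raise ValueError("each reviewer must give five scores in 1..5")
    return float(mean(mean(reviewer) for reviewer in scores))


# Made-up scores from three reviewers for a single task.
example = [
    [4, 4, 5, 3, 4],
    [5, 4, 4, 4, 4],
    [4, 3, 4, 4, 5],
]
print(round(composite_quality(example), 2))
```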
Composite quality (1–5 scale) and cost (USD per task) across all 80 tasks (Pure Swarm was run in week 1 only):
| Pattern | Quality (1–5) | Cost (USD/task) | Latency | Notes |
|---|---|---|---|---|
| Single Agent | 3.41 | $0.04 | 8s | Baseline |
| Pipeline | 4.16 | $0.06 | 32s | +22% quality, 1.4× cost |
| Verifier Loop | 4.71 | $0.13 | 47s | +38% quality, 3.2× cost |
| Pure Swarm | 3.33 | $0.18 | 24s | Worse than Single, 4.5× cost |
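The percentages and multiples in the Notes column can be rechecked from the table itself. The sketch below uses only the rounded figures printed above, so small differences are expected: the Pipeline cost multiple, for example, comes out at 1.5× from $0.06 / $0.04, where the table reports 1.4×, presumably from unrounded ledger values. The function and variable names are illustrative.

```python
from typing import Dict, Tuple

# (composite quality, cost in USD per task), copied from the results table.
RESULTS: Dict[str, Tuple[float, float]] = {
    "Single Agent": (3.41, 0.04),
    "Pipeline": (4.16, 0.06),
    "Verifier Loop": (4.71, 0.13),
    "Pure Swarm": (3.33, 0.18),
}


def relative_to_baseline(
    results: Dict[str, Tuple[float, float]],
    baseline: str = "Single Agent",
) -> Dict[str, Tuple[float, float]]:
    """Return (quality lift in percent, cost multiple) versus the baseline."""
    base_quality, base_cost = results[baseline]
    out = {}
    for name, (quality, cost) in results.items():
        lift_pct = (quality / base_quality - 1) * 100
        cost_multiple = cost / base_cost
        out[name] = (lift_pct, cost_multiple)
    return out


for name, (lift, multiple) in relative_to_baseline(RESULTS).items():
    print(f"{name:<14} {lift:+6.1f}% quality  {multiple:.2f}x cost")
```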
Cut by task type, the picture sharpens dramatically:
| Task type | Single | Pipeline | Verifier | Notes |
|---|---|---|---|---|
| Drafting | 3.81 | 4.34 | 4.39 | Pipeline ≈ Verifier; Verifier wastes money |
| Extract+act | 3.42 | 4.27 | 4.61 | Pipeline good; Verifier slightly better |
| Irreversible | 3.05 | 3.84 | 4.94 | Verifier dominant; Single fails 9/30 tasks |
The Pure Swarm result deserves a note. Three parallel Sonnet variants with a judge picking the winner cost 4.5× as much as a single Sonnet call and scored lower, because the variants were too correlated (same model, same prompt) and the judge introduced its own bias. We dropped the pattern after week 1 and re-allocated the 20 tasks across Pipeline and Verifier Loop. Swarms work when the agents have meaningfully different capabilities (different models, different tool access, different specialisations). Three copies of the same agent in parallel are not a swarm; they are a tax.
Use the cost of being wrong as the routing rule:
- Reversible, low-stakes drafting? Single agent, full stop. Spend the money saved on more iterations.
- Multi-format expansion (one input → many outputs)? Pipeline. The 22% overall quality lift comes from forcing structure that a single call collapses. The cost penalty is acceptable because the downstream outputs are cheap to produce.
- Irreversible work (publish, send, charge, sign)? Verifier loop. The quality lift on the irreversible-publish category (3.05 to 4.94, roughly 62%, against 38% across all tasks) was the difference between zero shipping mistakes in four weeks and the nine errors that would have made it to production under Single Agent. That is worth 3.2× the cost on tasks where one mistake costs more than a hundred verifier passes.
- Avoid the swarm reflex. N copies of the same model rarely outperform one good model. If you need swarm behaviour, route to different models with different tools.
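The "cost of being wrong" rule has a simple break-even form: a safer pattern pays for itself once the expected loss it avoids exceeds its extra per-task cost. The sketch below plugs in figures from the results, with two assumptions labeled in the code: the all-task average costs stand in for irreversible-publish costs, which are not reported separately, and the verifier's error rate is taken as 0/30, which is a small sample.

```python
def break_even_mistake_cost(
    cost_cheap: float,
    cost_safe: float,
    error_rate_cheap: float,
    error_rate_safe: float,
) -> float:
    """Cost of a single mistake (USD) above which the safer pattern is cheaper overall.

    Expected cost per task = run cost + error rate * cost of one mistake.
    Setting the two expected costs equal and solving gives the value returned.
    """
    extra_run_cost = cost_safe - cost_cheap
    errors_avoided = error_rate_cheap - error_rate_safe
    if errors_avoided <= 0:
        raise ValueError("the safer pattern must have a lower error rate")
    return extra_run_cost / errors_avoided


threshold = break_even_mistake_cost(
    cost_cheap=0.04,           # Single Agent, all-task average (assumed to apply here)
    cost_safe=0.13,            # Verifier Loop, all-task average (assumed to apply here)
    error_rate_cheap=9 / 30,   # Single Agent failed 9 of 30 irreversible tasks
    error_rate_safe=0 / 30,    # zero shipping mistakes under the Verifier Loop
)
print(f"break-even cost of one mistake: ${threshold:.2f}")
```

With these inputs the break-even lands at roughly $0.30 per mistake, so under these assumptions any irreversible task whose failure costs more than that favours the Verifier Loop.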
The general lesson: orchestration is expense management, not cleverness. The right pattern is the cheapest one that meets the quality bar for the task's reversibility class.
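"The cheapest pattern that meets the quality bar" can be written as a small lookup. The sketch below uses the per-task-type quality scores and the all-task average costs from the results (per-type costs are not reported, so the averages are a stand-in). The quality bars are hypothetical values chosen for illustration, not thresholds from the experiment.

```python
from typing import Dict, Optional

# Composite quality per task type, from the by-task-type results.
QUALITY: Dict[str, Dict[str, float]] = {
    "drafting": {"single": 3.81, "pipeline": 4.34, "verifier": 4.39},
    "extract_act": {"single": 3.42, "pipeline": 4.27, "verifier": 4.61},
    "irreversible": {"single": 3.05, "pipeline": 3.84, "verifier": 4.94},
}

# All-task average cost in USD per task (assumed to hold for every task type).
COST: Dict[str, float] = {"single": 0.04, "pipeline": 0.06, "verifier": 0.13}

# Hypothetical quality bars per reversibility class; tune these to your own stakes.
QUALITY_BAR: Dict[str, float] = {
    "drafting": 3.8,
    "extract_act": 4.2,
    "irreversible": 4.8,
}


def cheapest_pattern(task_type: str) -> Optional[str]:
    """Return the lowest-cost pattern whose measured quality meets the bar."""
    bar = QUALITY_BAR[task_type]
    candidates = [p for p, q in QUALITY[task_type].items() if q >= bar]
    if not candidates:
        return None  # nothing clears the bar: escalate to a human
    return min(candidates, key=lambda p: COST[p])


for task_type in QUALITY:
    print(task_type, "->", cheapest_pattern(task_type))
```

With these illustrative bars the lookup reproduces the routing above: single agent for drafting, pipeline for extract-and-act, verifier loop for irreversible work. Raising the drafting bar above 3.81 would tip that class to the pipeline.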
Next week tests a fourth pattern: Verifier with tool access, in which the Opus verifier can call a fact-checker tool (web search + citation extraction) before approving. Hypothesis: this lifts irreversible-publish quality from 4.94 to ≥4.97 at minimal extra cost, because tool calls fire only on borderline drafts. Method preview: same 80-task corpus, swap verifier config.
Task corpus, raw scores, cost ledger, and the three system prompts are published to the public Vaults of Benevolence under agent-orchestration-patterns-2026-04. Built on SIP. Replications welcome: file a PR with your own numbers.