Three agent orchestration patterns we ran for four weeks — costs, outcomes, and which one earned its keep

22 Apr 2026 · 12 min · agents · orchestration · benchmark · cost

Takeaway

Verifier Loop wins on irreversible work (publish, send, charge): +38% quality over Single Agent across the full workload, and about +62% on the irreversible-publish tasks themselves (4.94 vs 3.05), at ~3× cost. Pipeline wins on multi-format expansion (one essay → eight outputs): +22% quality at 1.4× cost. Pure Swarm loses to a single Sonnet on every metric except ego.

Hypothesis

The agent-orchestration discourse moved faster than the receipts. Solo operators see frameworks promising "swarms," "hierarchical mesh," and "consensus," and infer that more agents always mean better outputs. That is true for some workloads and false for most. We hypothesised a sharper rule: multi-agent beats single-agent only when the cost of one wrong step is high enough to justify a verifier loop. For everything else, a strong single agent with a clear prompt does better.

Method

We ran the same 80-task workload across three orchestration patterns for four weeks (20 tasks per week, randomised assignment). The workload mixed three task types: long-form drafting (20 tasks), data-extraction-then-action (30 tasks), and irreversible-publish (30 tasks where the output gets shipped to a public surface). For every task we tracked cost, end-to-end latency, and output quality (5-criterion rubric, scored by three blind reviewers).
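
How the three reviewers' rubric scores were combined into one composite is not spelled out here. One plausible reading is a plain mean over all reviewers and criteria; the sketch below assumes exactly that. The function name and the sample scores are illustrative and not taken from the task corpus.

```python
def composite_score(reviews: list[list[int]]) -> float:
    """Average one task's rubric scores over every reviewer and criterion.

    `reviews` holds one list per reviewer, each with five 1-5 scores.
    Assumption: every criterion and every reviewer carries equal weight.
    """
    for scores in reviews:
        if len(scores) != 5 or any(not 1 <= s <= 5 for s in scores):
            raise ValueError("each reviewer must give five scores in 1..5")
    flat = [s for scores in reviews for s in scores]
    return sum(flat) / len(flat)


# Made-up scores for a single task from three reviewers.
example = [[4, 5, 4, 3, 4], [5, 4, 4, 4, 4], [4, 4, 3, 4, 5]]
print(round(composite_score(example), 2))  # 4.07
```

A weighted rubric, or a median across reviewers to damp one outlier, would be equally easy to swap in; the audit kit is the place to confirm which was used.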

Three patterns:

Single Agent. Claude 4.7 Sonnet, single call, strong system prompt, retry on parse error.
Pipeline. Claude 4.7 Sonnet decomposes task → 3-7 sub-tasks → executes serially → aggregates.
Verifier Loop. Claude 4.7 Sonnet drafts → Claude 4.7 Opus verifies against a checklist → Sonnet revises → Opus approves or rejects (max two cycles, then escalates).

The Pure Swarm pattern (3 parallel agents producing variants, then a judge picks one) was tested in week 1 and dropped — see Results.
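
The Verifier Loop control flow described above can be sketched in a few lines. This is a minimal sketch, not the harness used in the runs: the model calls are passed in as plain callables (`draft`, `verify`, `revise` are hypothetical names), and it assumes a "cycle" means one revise-then-re-verify round.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    approved: bool
    feedback: str = ""


def verifier_loop(
    task: str,
    draft: Callable[[str], str],
    verify: Callable[[str, str], Verdict],
    revise: Callable[[str, str, str], str],
    max_cycles: int = 2,
) -> tuple[str, str]:
    """Return (status, output); status is "approved" or "escalated"."""
    output = draft(task)
    verdict = verify(task, output)
    cycles = 0
    # Each cycle: the worker revises using the verifier's feedback,
    # then the verifier checks the revision again.
    while not verdict.approved and cycles < max_cycles:
        output = revise(task, output, verdict.feedback)
        verdict = verify(task, output)
        cycles += 1
    return ("approved" if verdict.approved else "escalated"), output


# Stub callables standing in for the worker and verifier models.
def fake_draft(task: str) -> str:
    return f"{task}: v1"


def fake_revise(task: str, output: str, feedback: str) -> str:
    version = int(output.rsplit("v", 1)[1]) + 1
    return f"{task}: v{version}"


def fake_verify(task: str, output: str) -> Verdict:
    ok = output.endswith("v2")
    return Verdict(ok, "" if ok else "missing checklist item")


print(verifier_loop("post", fake_draft, fake_verify, fake_revise))
# ('approved', 'post: v2')
```

Keeping the model calls behind callables makes the loop testable without any API access, and the hard cap on cycles is what bounds the cost of a draft the verifier never accepts.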

Setup

Workload         80 tasks (4 weeks × 20)
Task types       Drafting (20) · Extract+act (30) · Irreversible publish (30)
Models           Claude 4.7 Sonnet (worker), Claude 4.7 Opus (verifier)
Rubric           5 criteria × 1-5 scale, 3 blind reviewers
Cost tracking    Per-task USD inc. retries
Replication      Full prompt set + task corpus in the audit kit linked at end

Results

Composite quality (1–5 scale), cost (USD per task), and latency across all 80 tasks. The Pure Swarm row covers week 1 only, since the pattern was dropped after that:

Pattern           Quality   Cost     Latency   Notes
Single Agent      3.41      $0.04    8s        Baseline
Pipeline          4.16      $0.06    32s       +22% quality, 1.4× cost
Verifier Loop     4.71      $0.13    47s       +38% quality, 3.2× cost
Pure Swarm        3.33      $0.18    24s       Worse than Single, 4.5× cost

Cut by task type, the picture sharpens dramatically:

                 Single  Pipeline  Verifier
Drafting         3.81    4.34      4.39       (Pipeline ≈ Verifier — Verifier wastes money)
Extract+act      3.42    4.27      4.61       (Pipeline good; Verifier slightly better)
Irreversible     3.05    3.84      4.94       (Verifier dominant; Single fails 9/30 tasks)
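
To make the per-type gaps easier to compare, the relative lift over Single Agent can be worked out directly from the table above. The snippet below only re-derives percentages from the published means; variable and function names are illustrative.

```python
# Mean rubric scores by task type, copied from the table above.
scores = {
    "Drafting":     {"Single": 3.81, "Pipeline": 4.34, "Verifier": 4.39},
    "Extract+act":  {"Single": 3.42, "Pipeline": 4.27, "Verifier": 4.61},
    "Irreversible": {"Single": 3.05, "Pipeline": 3.84, "Verifier": 4.94},
}


def lift_pct(score: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (score / baseline - 1) * 100


lifts = {
    task_type: {
        pattern: round(lift_pct(score, row["Single"]))
        for pattern, score in row.items()
        if pattern != "Single"
    }
    for task_type, row in scores.items()
}

for task_type, row in lifts.items():
    print(f"{task_type:<13} {row}")
# Drafting      {'Pipeline': 14, 'Verifier': 15}
# Extract+act   {'Pipeline': 25, 'Verifier': 35}
# Irreversible  {'Pipeline': 26, 'Verifier': 62}
```

On drafting, Verifier buys one extra point of lift over Pipeline; on irreversible work the gap is 36 points, which is where the extra spend pays for itself.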

The Pure Swarm result deserves a sentence. Three parallel Sonnet variants, with a judge picking the winner, cost 4.5× as much as a single Sonnet and scored lower, because the variants were too correlated (same model, same prompt) and the judge introduced its own bias. We dropped the pattern after week 1 and re-allocated the 20 tasks across Pipeline + Verifier. Swarms work when the agents have meaningfully different capabilities (different models, different tool access, different specialisations). Three of the same agent in parallel is not a swarm; it is a tax.

Takeaway

Use the cost of being wrong as the routing rule:

Reversible, low-stakes drafting? Single agent, full stop. Spend the money saved on more iterations.
Multi-format expansion (one input → many outputs)? Pipeline. The 22% quality lift comes from forcing structure that a single call collapses. Cost penalty is acceptable because outputs are downstream-cheap to produce.
Irreversible work (publish, send, charge, sign)? Verifier loop. The quality lift on the irreversible-publish category (4.94 vs 3.05 for Single Agent, about 62%; the 38% figure is the all-task average) was the difference between zero shipping mistakes in four weeks and nine errors that would have made it to production under Single Agent. Worth 3.2× cost on tasks where one mistake costs more than a hundred verifier passes.
Avoid the swarm reflex. N copies of the same model rarely outperform one good model. If you need swarm behaviour, route to different models with different tools.
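
The "cost of being wrong" rule can be turned into a rough break-even check. The sketch below is back-of-envelope arithmetic under two assumptions: the all-task average costs ($0.04 and $0.13) stand in for irreversible-publish costs, which were not reported separately, and the failure rates are the 9/30 Single Agent failures versus zero shipped mistakes under Verifier Loop. The function name is illustrative.

```python
def break_even_mistake_cost(extra_cost_per_task: float, failure_rate_drop: float) -> float:
    """Mistake cost (USD) above which the pricier pattern pays for itself."""
    if failure_rate_drop <= 0:
        raise ValueError("the pricier pattern must reduce the failure rate")
    return extra_cost_per_task / failure_rate_drop


extra = 0.13 - 0.04      # Verifier Loop minus Single Agent, USD per task
drop = 9 / 30 - 0 / 30   # failure rate removed on irreversible-publish tasks

print(round(break_even_mistake_cost(extra, drop), 2))  # 0.3
```

Under those assumptions, a shipped mistake only has to cost more than about thirty cents before the verifier is the cheaper option, so the "hundred verifier passes" threshold is a conservative bar rather than a tight one.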

The general lesson: orchestration is expense management, not cleverness. The right pattern is the cheapest one that meets the quality bar for the task's reversibility class.
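
Read literally, that lesson is a small selection problem: among the patterns that clear the quality bar for a task type, pick the cheapest. The sketch below uses the per-type quality means from the Results tables and, as an assumption, the all-task average cost for each pattern, because per-type costs were not reported. The quality bars in the usage lines are hypothetical.

```python
from typing import Optional

# All-task average cost per pattern (USD), from the first Results table.
COST = {"Single": 0.04, "Pipeline": 0.06, "Verifier": 0.13}

# Mean quality per task type, from the second Results table.
QUALITY = {
    "Drafting":     {"Single": 3.81, "Pipeline": 4.34, "Verifier": 4.39},
    "Extract+act":  {"Single": 3.42, "Pipeline": 4.27, "Verifier": 4.61},
    "Irreversible": {"Single": 3.05, "Pipeline": 3.84, "Verifier": 4.94},
}


def cheapest_meeting_bar(task_type: str, quality_bar: float) -> Optional[str]:
    """Cheapest pattern whose mean quality meets the bar, or None if none does."""
    candidates = [p for p, q in QUALITY[task_type].items() if q >= quality_bar]
    if not candidates:
        return None
    return min(candidates, key=lambda p: COST[p])


print(cheapest_meeting_bar("Drafting", 3.8))       # Single
print(cheapest_meeting_bar("Extract+act", 4.2))    # Pipeline
print(cheapest_meeting_bar("Irreversible", 4.5))   # Verifier
print(cheapest_meeting_bar("Irreversible", 4.95))  # None
```

The last call returning `None` is the useful case: when no pattern meets the bar, the right move is a human in the loop, not a fancier orchestration.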

Next week tests a fourth pattern, Verifier with tool access: the Opus verifier can call a fact-checker tool (web search + citation extraction) before approving. Hypothesis: this lifts irreversible-publish quality from 4.94 to ≥4.97 at minimal extra cost (tool calls only on borderline drafts). Method preview: same 80-task corpus, swap verifier config.

Provenance

Task corpus, raw scores, cost ledger, and the three system prompts are published to the public Vaults of Benevolence under agent-orchestration-patterns-2026-04. Built on SIP. Replications welcome: file a PR with your own numbers.
