Most long-form benchmarks score "quality" without specifying the work the writer is actually doing. For voice-bound long-form work — essays, newsletters, framework deep-dives where the who is as load-bearing as the what — the model that best preserves a creator's existing voice on the first draft wins, because every revision pass costs minutes that compound across a year of weekly publishing. We hypothesised that Claude 4.7 would win this category by a margin large enough to dictate which model gets the source-essay slot in a creator's workflow.
We built a brief-bank of fifty long-form prompts drawn from real creator workflows (28 from this audit set, 22 from public archives with permission). Each brief specified topic + audience + voice anchor (a 250-word sample of the writer's published work) and required ≥1,500 words.
For every brief we generated three drafts: one from Claude 4.7, one from Gemini 3 Pro, one from GPT-5.1. Generations were performed at temperature 0.7, identical system prompts, identical context. No model knew which writer was being mimicked beyond the voice anchor.
Five third-party readers — none of whom knew the model used or saw the original writer — scored each draft against six criteria: voice fit, argument coherence, factual density, structural variety, edit-readiness, and absence of generic AI tells. Scoring was blind on a 1–5 scale per criterion, averaged across raters.
Models: Claude 4.7 / Gemini 3 Pro / GPT-5.1
Generation temp: 0.7
System prompt: identical (voice-anchored, no model-specific cues)
Briefs: 50, each with 250-word voice anchor
Raters: 5, blind to model
Rubric: 6 criteria × 1-5 scale
Total ratings: 50 briefs × 3 models × 5 raters × 6 criteria = 4,500 datapoints
Full rubric, raw scores, and the brief bank are available in the audit kit linked at the end of this post.
Mean scores across all 50 briefs (1–5 scale, higher is better):
Criterion Claude 4.7 Gemini 3 Pro GPT-5.1
Voice fit 4.32 3.51 3.78
Argument coherence 4.18 4.05 4.22
Factual density 3.74 4.41 3.89
Structural variety 3.89 3.82 4.31
Edit-readiness 4.21 3.74 3.65
Absence of AI tells 4.06 3.32 3.55
Composite 4.07 3.81 3.90
Claude wins voice-fit by a 23% margin and edit-readiness by 13%. Gemini wins factual density by 18%. GPT wins structural variety by 11%.
Variance also matters: Claude's voice-fit standard deviation across briefs was 0.41, vs Gemini 0.78 and GPT 0.69. In plain language: Claude is more consistent at preserving voice, which is the property a weekly publishing cadence actually needs.
A different cut: when the voice anchor was decision-first and compressed (the patterns this site uses), Claude's voice-fit margin widened to 32%. When the voice anchor was conversational and meandering, the margin narrowed to 9% and structural variety became the deciding criterion (GPT often won).
If your work is voice-bound and you publish weekly, route source-essay generation to Claude 4.7. The 23% voice-fit margin compounds to roughly 90 minutes per week saved on revision, or about 78 hours per year — three full working weeks recovered without changing anything else in the stack.
If your work is research-bound (technical explainers, comparative analyses, market briefs), route the first draft to Gemini 3 Pro and pass the result through Claude for voice polishing. Two-pass is more expensive but produces the best composite outputs in our set.
If your work is structural-bound (long argument essays where the architecture matters more than the prose), GPT-5.1 produces the most surprising structural choices, but at the cost of voice fit — budget revision time accordingly.
The general lesson: model choice is a workflow decision, not a benchmark score. Choose the model that wins the criterion you actually care about, and let the other criteria be solved by another model in the chain.
Next week's report runs the same instrument on agent orchestration: same brief bank, but generations are produced by multi-step agentic pipelines (one tool-using model + one writer model). Hypothesis: composite quality lifts another 0.4–0.6 points but at 4× cost. Method preview: same five raters, same rubric, additional cost-per-output column.
Brief bank, raw scores, rater notes, and replication kit published to the public Vaults of Benevolence under model-comparison-long-form-2026-04. Built on SIP. Replications welcome.