Harness Run · so_extraction 20260512T191113Z  ·  3,570 runs  ·  Generated 2026-05-12 19:32 UTC

Models worked. The data didn't.

A 3,570-run sweep across 10 models and 7 few-shot configurations on three datasets. One of the datasets is broken — and that's the story.

Buried lede
37.9%
Field-match rate on the downloaded dataset, versus about 90% on the two synthetic sets. Every model fails on it, which points to a data problem, not a model problem.
Best overall
sonnet-4-6
87.6% avg field-match at 4.37s avg. Tops the leaderboard at every few-shot count except 0, where opus-4-6 leads by 0.5 points.
Best speed/quality
openai:4.1
83.5% at 1.6s — 3× faster than the next-best quality tier. Newer GPT-5.x models in this run are worse than 4.1.
§ 01

The dataset gap

The headline 81% field-match number across the run is misleading. Performance is bimodal: ~90% on synthetic data, ~38% on real. Until the downloaded dataset is fixed or excluded, every other metric should be read split by dataset.

Investigate first. All ten models, across providers, collapse on downloaded. This pattern points to a schema mismatch, golden-reference error, or input format problem, not a model deficiency. If downloaded scored like the two synthetic sets, the headline number would rise by about 9 points (from roughly 81% to roughly 90%) without changing a single model.
acme_foods
90.2%
Field match
1,470 runs · 2.44 avg mismatched fields
nova_exports
89.9%
Field match
1,470 runs · 2.31 avg mismatched fields
downloaded
37.9%
Field match — investigate
630 runs · 14.43 avg mismatched fields
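The size of the gap can be checked with a little arithmetic. The sketch below uses only the run counts and field-match rates from the three cards, and it assumes the headline figure is a run-weighted mean of the per-dataset rates. The report does not state how the headline is pooled, so pooling by field count could shift the result slightly.

```python
# Per-dataset (runs, field-match %) as reported on the dataset cards.
datasets = {
    "acme_foods": (1470, 90.2),
    "nova_exports": (1470, 89.9),
    "downloaded": (630, 37.9),
}


def weighted_field_match(cards):
    """Run-weighted mean field-match rate (assumed pooling method)."""
    total_runs = sum(runs for runs, _ in cards.values())
    return sum(runs * rate for runs, rate in cards.values()) / total_runs


headline = weighted_field_match(datasets)
synthetic_only = weighted_field_match(
    {name: card for name, card in datasets.items() if name != "downloaded"}
)

print(f"all three datasets: {headline:.2f}%")
print(f"synthetic only:     {synthetic_only:.2f}%")
print(f"difference:         {synthetic_only - headline:.2f} points")
```

Under that assumption the blended figure lands near 81% and the synthetic-only figure near 90%, which is why every other metric is best read per dataset.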
§ 02

Quality vs. latency

Each point is one model, averaged across all few-shot counts. Upper-left is the Pareto frontier. Anthropic models cluster in the top band; openai:4.1 is alone on the speed axis; Gemini and 5-mini are slow and low-quality.

Model frontier
Bubble size = mismatched fields per run · Closer to top-left is better
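For readers who want to recompute the frontier from the leaderboard data, here is a minimal sketch of a Pareto filter over (latency, field match). The `ModelPoint` type and `pareto_frontier` function are illustrative names, not part of the harness, and the example uses only the two models whose averages are quoted in this report.

```python
from typing import NamedTuple


class ModelPoint(NamedTuple):
    name: str
    avg_seconds: float  # lower is better
    field_match: float  # higher is better


def pareto_frontier(points):
    """Return the points that no other point dominates.

    A point is dominated if another point is at least as fast and at
    least as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.avg_seconds <= p.avg_seconds
            and q.field_match >= p.field_match
            and (q.avg_seconds < p.avg_seconds or q.field_match > p.field_match)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.avg_seconds)


# Only the two models whose averages are quoted in this report.
points = [
    ModelPoint("sonnet-4-6", 4.37, 87.6),
    ModelPoint("openai:4.1", 1.6, 83.5),
]
for p in pareto_frontier(points):
    print(p.name, p.avg_seconds, p.field_match)
```

Both quoted models survive the filter: one is faster, the other more accurate, so neither dominates the other.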
§ 03

Model leaderboard

All 70 model × few-shot combinations. Compare rows within a single few-shot count to compare like with like. Top quartile marked in green, bottom quartile in red.

[Leaderboard table, 70 rows, not reproduced here. Columns: Model · FS · Runs · Avg s · Mismatch · Field match]
§ 04

Few-shot examples don't move the needle

Averaged across all models, varying the few-shot count from 0 to 6 changes field-match by less than half a percentage point. This is a knob you can stop tuning.

FS 0 · 80.85%
FS 1 · 80.99%
FS 2 · 81.08%
FS 3 · 81.25%
FS 4 · 80.90%
FS 5 · 81.25%
FS 6 · 81.11%

Spread across all seven configurations: 40 basis points. Most likely within statistical noise at n=510 runs per few-shot count.
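A rough yardstick for "within noise" is the standard error of a proportion. The sketch below is a simplification, not the harness's own statistics: it treats each of the 510 runs as a single pass/fail trial at the mean rate, which gives an upper bound on the run-level standard error of a mean bounded between 0 and 1. It also ignores any pairing between configurations that reuse the same chats, so a proper analysis would use per-run data.

```python
import math

# Field-match % by few-shot count, as reported above.
fs_field_match = {0: 80.85, 1: 80.99, 2: 81.08, 3: 81.25, 4: 80.90, 5: 81.25, 6: 81.11}
runs_per_fs_count = 510

rates = list(fs_field_match.values())
spread = max(rates) - min(rates)          # percentage points
mean_rate = sum(rates) / len(rates) / 100  # as a fraction

# Binomial standard error, converted back to percentage points.
std_error = math.sqrt(mean_rate * (1 - mean_rate) / runs_per_fs_count) * 100

print(f"spread:         {spread:.2f} points")
print(f"standard error: {std_error:.2f} points")
```

With these inputs the spread is well under one standard error, which is consistent with few-shot count having no detectable effect at this sample size.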

§ 05

What to check next

In rough priority order. The first item moves the headline number more than every model swap combined.

01
Audit the downloaded dataset.
Compare a handful of agent outputs to their goldens. If the schema or field names differ, the 38% number reflects evaluation logic, not extraction quality.
02
Split the leaderboard by dataset.
Current averages blend a broken dataset with two working ones. The "best model" may differ on real data vs. synthetic. Recompute per-dataset to find out.
03
Add cost per run.
Latency is a proxy, but cost is the real metric. openai:5-mini at 30s/run is the slow end and openai:4.1 at 1.6s is the speed winner, but neither figure settles cost until per-run pricing is added.
04
Report stdev or CI.
At n=51 runs per model × few-shot cell (357 per model overall), the gap between sonnet-4-6 (87.6%) and opus-4-6 (87.4%) may not be significant. Showing variance prevents overinterpretation.
05
Drop FS counts to {0, 3} for future sweeps.
Few-shot has no measurable effect. 7 counts × 10 models × 51 chats is 3.5× more compute than a two-count sweep needs. Dropping to {0, 3} saves ~71% of run time.
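The audit in item 01 can start with a per-field diff of one output against its golden. The sketch below assumes both are flat JSON objects. The function name `diff_against_golden` and the sample field names are hypothetical, invented only to show what a field-name mismatch looks like.

```python
def diff_against_golden(output: dict, golden: dict) -> dict:
    """Classify each discrepancy between one agent output and its golden."""
    missing = sorted(golden.keys() - output.keys())     # in golden, absent from output
    unexpected = sorted(output.keys() - golden.keys())  # in output, absent from golden
    changed = {
        key: (output[key], golden[key])
        for key in sorted(golden.keys() & output.keys())
        if output[key] != golden[key]
    }
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


# Hypothetical records, invented purely to illustrate the three categories.
golden = {"customer_name": "Example Co", "order_date": "2026-01-05", "quantity": 12}
output = {"customer": "Example Co", "order_date": "2026-01-05", "quantity": "12"}

report = diff_against_golden(output, golden)
print(report)
```

If most discrepancies on downloaded fall under `missing` and `unexpected`, that suggests a schema or field-name mismatch. If they fall under `changed` with values that differ only in type or formatting, the comparison logic is the more likely culprit.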
Run configuration (JSON)
{
  "agent": "so_extraction",
  "models": [
    "sonnet-4-6", "sonnet-4-5", "opus-4-5", "opus-4-6",
    "openai:4.1", "openai:5.2", "openai:5-mini", "openai:5.4",
    "gemini:gemini-2.5-pro", "gemini:gemini-2.5-flash"
  ],
  "datasets": ["downloaded", "acme_foods", "nova_exports"],
  "runs_per_chat": 1,
  "max_workers": 25,
  "few_shot_mode": "walk",
  "few_shot_pool_size": 68,
  "few_shot_variants": "fs0 through fs6 (0–6 examples)",
  "few_shot_seed": 42,
  "allow_self_fewshot": false,
  "results_dir": "results/20260512T191113Z"
}
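The run counts in this report can be reconciled with the configuration above. The sketch below assumes every chat is run once for each model × few-shot combination (consistent with `runs_per_chat` of 1). The per-dataset chat counts are derived from the reported run totals, not stated anywhere in the report.

```python
models = 10
few_shot_counts = 7  # fs0 through fs6
runs_per_chat = 1
runs_per_dataset = {"downloaded": 630, "acme_foods": 1470, "nova_exports": 1470}

cells = models * few_shot_counts  # model × few-shot combinations

# Derived, not reported: chats per dataset implied by the run totals.
chats_per_dataset = {
    name: runs // (cells * runs_per_chat) for name, runs in runs_per_dataset.items()
}
runs_per_cell = sum(chats_per_dataset.values()) * runs_per_chat
total_runs = cells * runs_per_cell

print(chats_per_dataset)
print(f"{cells} cells × {runs_per_cell} runs per cell = {total_runs} runs")
```

Under that assumption the totals imply 70 cells of 51 runs each, which matches both the 3,570-run total and the n=51 per cell quoted elsewhere.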
How to read these numbers

SUCCESS RATE — Percentage of runs that completed without an agent or HTTP error. 100% across this run means the harness is stable; it says nothing about output quality.

AVG ELAPSED (s) — Wall clock per run, averaged across the cell. Useful as a latency proxy; combine with provider pricing for cost.

AVG MISMATCH/EXPECTED — Average number of fields in the agent's JSON output that differ from the golden reference. Lower is better.

FIELD MATCH — Percentage of fields across all runs in a cell that match the golden. Higher is better. The headline quality metric.

FS COUNT — Number of few-shot examples prepended to the prompt (0 through 6).
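As a concrete reading of the two quality metrics, the sketch below computes them for one cell. It is an interpretation of the definitions above, not the harness's actual code. It assumes flat JSON objects, exact equality per field, that fields present only in the output are ignored, and that field match is pooled over all fields in the cell. The function names and sample records are hypothetical.

```python
_MISSING = object()  # sentinel so a missing key never equals a golden value of None


def count_fields(output: dict, golden: dict):
    """Return (mismatched, expected) field counts for one run."""
    mismatched = sum(
        1 for key, value in golden.items() if output.get(key, _MISSING) != value
    )
    return mismatched, len(golden)


def summarize_cell(runs):
    """Summarize a cell given a list of (output, golden) pairs."""
    counts = [count_fields(output, golden) for output, golden in runs]
    total_mismatched = sum(m for m, _ in counts)
    total_expected = sum(e for _, e in counts)
    return {
        "avg_mismatch": total_mismatched / len(counts),
        "field_match_pct": 100 * (1 - total_mismatched / total_expected),
    }


# Hypothetical runs: (agent output, golden reference).
runs = [
    ({"sku": "A1", "qty": 2, "unit": "kg", "price": 0}, {"sku": "A1", "qty": 2, "unit": "kg", "price": 4}),
    ({"sku": "B7", "qty": 5}, {"sku": "B7", "qty": 5}),
]
summary = summarize_cell(runs)
print(summary)
```

Pooling over fields weights long goldens more heavily than short ones. In this toy cell the pooled rate is 83.3%, while averaging the two per-run rates (75% and 100%) would give 87.5%, so it matters which definition a report uses.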