Harness Run 20260512T191113Z

Buried lede

37.9%

Field-match rate on the downloaded dataset — versus 90% on the two synthetic sets. Every model fails on it. This is a data problem, not a model problem.

Best overall

sonnet-4-6

87.6% avg field-match at 4.37s avg. Tops the leaderboard at every few-shot count except 0, where opus-4-6 leads by 0.5 points.

Best speed/quality

openai:4.1

83.5% at 1.6s — 3× faster than the next-best quality tier. Newer GPT-5.x models in this run are worse than 4.1.

§ 01

The dataset gap

The headline 81% field-match number across the run is misleading. Performance is bimodal: ~90% on the two synthetic datasets (acme_foods, nova_exports) and ~38% on the real, downloaded one. Until the downloaded dataset is fixed or excluded, every other metric should be read split by dataset.

Investigate first. All ten models, across providers, collapse on downloaded. This pattern points to a schema mismatch, golden-reference error, or input-format problem, not a model deficiency. If downloaded scored like the synthetic sets, the headline number would rise by roughly 9 points (from ~81% to ~90%) without changing a single model.

acme_foods

90.2%

Field match

1,470 runs · 2.44 mismatch

nova_exports

89.9%

Field match

1,470 runs · 2.31 mismatch

downloaded

37.9%

Field match — investigate

630 runs · 14.43 mismatch
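
The blended headline can be reproduced from the three dataset cards. A minimal sketch, assuming the headline is a run-weighted average of the per-dataset field-match rates (the harness may weight by field count instead); the function and variable names are illustrative, not taken from the harness.

```python
# Per-dataset figures copied from the cards: (runs, field-match %).
datasets = {
    "acme_foods": (1470, 90.2),
    "nova_exports": (1470, 89.9),
    "downloaded": (630, 37.9),
}

def weighted_match(names):
    """Run-weighted average field-match over the named datasets."""
    total_runs = sum(datasets[n][0] for n in names)
    return sum(datasets[n][0] * datasets[n][1] for n in names) / total_runs

blended = weighted_match(datasets)  # all three datasets
synthetic_only = weighted_match(["acme_foods", "nova_exports"])

print(f"blended: {blended:.2f}%")
print(f"synthetic only: {synthetic_only:.2f}%")
print(f"gap: {synthetic_only - blended:.2f} points")
```

Under that assumption the blend lands near 80.85%, in line with the headline, and the two synthetic sets alone average about 90%.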

§ 02

Quality vs. latency

Each point is one model, averaged across all few-shot counts. Upper-left is the Pareto frontier. Anthropic models cluster in the top band; openai:4.1 is alone on the speed axis; Gemini and 5-mini are slow and low-quality.

Model frontier

Bubble size = mismatched fields per run · Closer to top-left is better
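
For readers without the chart, the frontier can be recomputed from per-model averages. A sketch of a two-axis Pareto filter, assuming lower latency and higher field-match are both preferred; only the two points quoted in the summary cards are real, and the third is a labeled placeholder.

```python
from typing import NamedTuple

class Point(NamedTuple):
    model: str
    latency_s: float    # lower is better
    field_match: float  # higher is better

def pareto_frontier(points):
    """Return the points that no other point dominates.

    A point is dominated when another point is at least as fast and at
    least as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.latency_s <= p.latency_s
            and q.field_match >= p.field_match
            and (q.latency_s < p.latency_s or q.field_match > p.field_match)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.latency_s)

points = [
    Point("sonnet-4-6", 4.37, 87.6),         # from the summary cards
    Point("openai:4.1", 1.6, 83.5),          # from the summary cards
    Point("hypothetical-slow", 30.0, 70.0),  # placeholder, NOT a measured result
]

for p in pareto_frontier(points):
    print(p.model, p.latency_s, p.field_match)
```

Both quoted models survive the filter because neither beats the other on both axes; the placeholder is dropped because it is slower and less accurate than either.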

§ 03

Model leaderboard

All 70 model × few-shot combinations. Sort any column. Filter to a single few-shot count to compare apples to apples. Top quartile marked in green, bottom quartile in red.

Model	FS	Runs	Avg s	Mismatch	Field match

§ 04

Few-shot examples don't move the needle

Averaged across all models, varying the few-shot count from 0 to 6 changes field-match by less than half a percentage point. This is a knob you can stop tuning.

FS 0

80.85%

FS 1

80.99%

FS 2

81.08%

FS 3

81.25%

FS 4

80.90%

FS 5

81.25%

FS 6

81.11%

Spread across all seven configurations: 40 basis points (0.40 percentage points). Most likely within statistical noise at n=510 runs per few-shot count (10 models × 51 chats).
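
A rough way to put a scale on "noise" here. This sketch treats each of the 510 runs as a single independent pass/fail trial, which is an assumption: the harness scores individual fields, and the same chats are reused across few-shot counts, so a paired comparison would be more sensitive. Read the result as an order-of-magnitude check, not a significance test.

```python
import math

# Field-match % by few-shot count, copied from the cards in this section.
fs_match = {0: 80.85, 1: 80.99, 2: 81.08, 3: 81.25, 4: 80.90, 5: 81.25, 6: 81.11}

# Observed spread between the best and worst few-shot count.
spread = max(fs_match.values()) - min(fs_match.values())

# Binomial standard error in percentage points, with n = 510 runs.
n_runs = 510
p = sum(fs_match.values()) / len(fs_match) / 100
std_err = 100 * math.sqrt(p * (1 - p) / n_runs)

print(f"spread: {spread:.2f} points")
print(f"approx. standard error: {std_err:.2f} points")
```

The spread is several times smaller than this standard error, which is consistent with the "within noise" reading.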

§ 05

What to check next

In rough priority order. The first item moves the headline number more than every model swap combined.

Audit the downloaded dataset.
Compare a handful of agent outputs to their goldens. If the schema or field names differ, the 38% number reflects evaluation logic, not extraction quality.

Split the leaderboard by dataset.
Current averages blend a broken dataset with two working ones. The "best model" may differ on real data vs. synthetic. Recompute per-dataset to find out.

Add cost per run.
Latency is a proxy, but cost is the real metric. openai:5-mini at 30s/run is slow and, if token usage tracks latency, likely expensive; openai:4.1 at 1.6s is the speed winner and the probable cost winner, pending pricing data.

Report stdev or CI.
At n=51 runs per model × few-shot cell, the gap between sonnet-4-6 (87.6%) and opus-4-6 (87.4%) may not be significant. Showing variance prevents overinterpretation.

Drop FS counts to {0, 3} for future sweeps.
Few-shot has no measurable effect. 7 counts × 10 models × 51 chats is 3,570 runs; keeping only 2 counts needs 1,020, so the full sweep is 3.5× more compute than needed. Dropping to {0, 3} saves ~71% of run time.
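
For the first item, a key-level diff of one output against its golden is usually enough to tell a schema mismatch from a genuine extraction error. A minimal sketch, assuming flat JSON objects; the helper `diff_fields` and the field names in the sample records are invented for illustration and are not taken from the harness or the dataset.

```python
def diff_fields(output: dict, golden: dict) -> dict:
    """Compare one agent output to its golden reference, key by key."""
    out_keys, gold_keys = set(output), set(golden)
    shared = out_keys & gold_keys
    return {
        "missing_in_output": sorted(gold_keys - out_keys),
        "unexpected_in_output": sorted(out_keys - gold_keys),
        "value_mismatch": sorted(k for k in shared if output[k] != golden[k]),
    }

# Hypothetical records: same values, but one key is spelled differently.
golden = {"customer_name": "Acme", "order_date": "2026-05-01", "total": 120.0}
output = {"customerName": "Acme", "order_date": "2026-05-01", "total": 120.0}

report = diff_fields(output, golden)
print(report)
```

If most failures show up as paired missing and unexpected keys rather than value mismatches, the low score points at the evaluation schema, not at the models.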

Run configuration (JSON)

{
  "agent": "so_extraction",
  "models": [
    "sonnet-4-6", "sonnet-4-5", "opus-4-5", "opus-4-6",
    "openai:4.1", "openai:5.2", "openai:5-mini", "openai:5.4",
    "gemini:gemini-2.5-pro", "gemini:gemini-2.5-flash"
  ],
  "datasets": ["downloaded", "acme_foods", "nova_exports"],
  "runs_per_chat": 1,
  "max_workers": 25,
  "few_shot_mode": "walk",
  "few_shot_pool_size": 68,
  "few_shot_variants": "fs0 through fs6 (0–6 examples)",
  "few_shot_seed": 42,
  "allow_self_fewshot": false,
  "results_dir": "results/20260512T191113Z"
}
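
The config does not list how many chats each dataset contains, but the counts can be backed out from the run totals on the dataset cards. A small sketch, assuming every model × few-shot cell ran every chat exactly once, which is what "runs_per_chat": 1 suggests:

```python
models = 10        # entries in "models"
fs_variants = 7    # fs0 through fs6
runs_per_chat = 1  # "runs_per_chat"

# Run totals per dataset, copied from the dataset cards.
runs = {"acme_foods": 1470, "nova_exports": 1470, "downloaded": 630}

cells = models * fs_variants  # model x few-shot combinations
chats = {name: total // (cells * runs_per_chat) for name, total in runs.items()}

print(cells)                # model x few-shot cells
print(chats)                # implied chats per dataset
print(sum(chats.values()))  # implied chats per cell
print(sum(runs.values()))   # total runs in the sweep
```

This gives 21, 21, and 9 chats, so 51 per cell and 3,570 runs overall, which agrees with the n=51 quoted in the report.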

How to read these numbers

SUCCESS RATE — Percentage of runs that completed without an agent or HTTP error. 100% across this run means the harness is stable; it says nothing about output quality.

AVG ELAPSED (s) — Wall clock per run, averaged across the cell. Useful as a latency proxy; combine with provider pricing for cost.

AVG MISMATCH/EXPECTED — Average number of fields in the agent's JSON output that differ from the golden reference. Lower is better.

FIELD MATCH — Percentage of fields across all runs in a cell that match the golden. Higher is better. The headline quality metric.

FS COUNT — Number of few-shot examples prepended to the prompt (0 through 6).
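
To make the two quality metrics concrete, here is one way they could be computed for a cell. This is a sketch under stated assumptions, not the harness's scoring code: it assumes flat JSON objects, exact equality per field, and that a field absent from the output counts as a mismatch. The name `score_cell` and the sample fields are hypothetical.

```python
_MISSING = object()  # sentinel so an absent field never equals a golden value

def score_cell(pairs):
    """Score a cell from (output, golden) dict pairs, one pair per run."""
    expected = 0
    mismatched = 0
    for output, golden in pairs:
        for field, want in golden.items():
            expected += 1
            if output.get(field, _MISSING) != want:
                mismatched += 1
    return {
        "avg_mismatch": mismatched / len(pairs),
        "field_match_pct": 100 * (expected - mismatched) / expected,
    }

# Two hypothetical runs with four golden fields each; one field is wrong.
golden = {"a": 1, "b": 2, "c": 3, "d": 4}
pairs = [
    ({"a": 1, "b": 2, "c": 3, "d": 4}, golden),
    ({"a": 1, "b": 2, "c": 0, "d": 4}, golden),
]

scores = score_cell(pairs)
print(scores)
```

Note that field match pools fields across all runs in the cell, so runs with more golden fields carry more weight than they would in a per-run average.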

Models worked. The data didn't.
