A 3,570-run sweep across 10 models and 7 few-shot configurations on three datasets. One of the datasets is broken — and that's the story.
The headline 81% field-match number across the run is misleading. Performance is bimodal: ~90% on synthetic data, ~38% on real. Until the downloaded dataset is fixed or excluded, every other metric should be read split by dataset.
downloaded. This pattern points to a schema mismatch, a golden-reference error, or an input-format problem rather than a model deficiency. Fixing it would likely raise the headline number by 10+ percentage points without changing a single model.
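To make the "read every metric split by dataset" advice concrete, here is a minimal sketch of a per-dataset field-match breakdown next to the pooled number. The record layout `(dataset, fields_matched, fields_expected)` and the counts are assumptions made up for illustration; they are not taken from the sweep's result files.

```python
from collections import defaultdict

# Hypothetical per-run records: (dataset, fields_matched, fields_expected).
# The counts are invented for illustration, not taken from the sweep.
runs = [
    ("acme_foods", 18, 20),
    ("nova_exports", 9, 10),
    ("downloaded", 3, 8),
    ("downloaded", 3, 8),
]

def field_match_by_dataset(records):
    """Return {dataset: percent of expected fields that matched}."""
    matched = defaultdict(int)
    expected = defaultdict(int)
    for dataset, ok, total in records:
        matched[dataset] += ok
        expected[dataset] += total
    return {name: 100.0 * matched[name] / expected[name] for name in expected}

def pooled_field_match(records):
    """Return the single pooled percentage across all datasets."""
    ok = sum(r[1] for r in records)
    total = sum(r[2] for r in records)
    return 100.0 * ok / total

by_dataset = field_match_by_dataset(runs)
pooled = pooled_field_match(runs)
```

With these toy numbers the two synthetic datasets sit at 90% and `downloaded` at 37.5%, while the pooled figure lands in between and describes none of them.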
Each point is one model, averaged across all few-shot counts. Upper-left is the Pareto frontier. Anthropic models cluster in the top band; openai:4.1 is alone on the speed axis; Gemini and 5-mini are slow and low-quality.
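As a rough sketch of what "Pareto frontier" means for this chart, the snippet below keeps every point that no other point beats on both axes (lower average seconds, higher field match). The model names and values are placeholders, not results from this run.

```python
def pareto_frontier(points):
    """Return names of points not dominated on (lower seconds, higher match)."""
    frontier = []
    for name, secs, match in points:
        dominated = any(
            other_secs <= secs
            and other_match >= match
            and (other_secs < secs or other_match > match)
            for _, other_secs, other_match in points
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Placeholder (name, avg seconds, field match %) tuples for illustration only.
points = [
    ("model_a", 1.6, 80.0),   # fastest, so nothing can dominate it
    ("model_b", 5.0, 85.0),   # most accurate
    ("model_c", 6.0, 82.0),   # slower and less accurate than model_b
    ("model_d", 30.0, 70.0),  # slower and less accurate than all others
]

frontier = pareto_frontier(points)
```

Here `frontier` contains `model_a` and `model_b`; the other two are dominated and sit below and to the right of the frontier.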
All 70 model × few-shot combinations. Sort any column. Filter to a single few-shot count to compare apples to apples. Top quartile marked in green, bottom quartile in red.
| Model | FS count | Runs | Avg elapsed (s) | Avg mismatch | Field match (%) |
|---|---|---|---|---|---|
Averaged across all models, varying the few-shot count from 0 to 6 changes field-match by less than half a percentage point. This is a knob you can stop tuning.
Spread across all seven configurations: 0.4 percentage points (40 basis points). That is most likely within statistical noise at n=510 runs per few-shot configuration.
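A back-of-the-envelope check on the noise claim, under a deliberately crude assumption: treat each of the 510 runs as one independent pass/fail trial at the 81% headline rate. Field match is really pooled over fields, and runs are clustered by dataset and model, so this is only an order-of-magnitude sketch.

```python
import math

def binomial_standard_error_pp(p, n):
    """Standard error of a proportion p over n independent trials, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# Assumption: 510 independent trials at the 81% headline rate.
se_pp = binomial_standard_error_pp(0.81, 510)

# Observed spread across few-shot counts, in percentage points.
spread_pp = 0.4
within_one_se = spread_pp < se_pp
```

Under that assumption the standard error is roughly 1.7 percentage points, several times the observed spread. If the same chats are reused across few-shot counts, a paired comparison would be a sharper test than this unpaired estimate.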
In rough priority order. The first item moves the headline number more than every model swap combined.
downloaded dataset.
openai:5-mini at 30s/run is both slow and expensive; openai:4.1 at 1.6s is the speed/cost winner.
{
"agent": "so_extraction",
"models": [
"sonnet-4-6", "sonnet-4-5", "opus-4-5", "opus-4-6",
"openai:4.1", "openai:5.2", "openai:5-mini", "openai:5.4",
"gemini:gemini-2.5-pro", "gemini:gemini-2.5-flash"
],
"datasets": ["downloaded", "acme_foods", "nova_exports"],
"runs_per_chat": 1,
"max_workers": 25,
"few_shot_mode": "walk",
"few_shot_pool_size": 68,
"few_shot_variants": "fs0 through fs6 (0–6 examples)",
"few_shot_seed": 42,
"allow_self_fewshot": false,
"results_dir": "results/20260512T191113Z"
}
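One way to sanity-check the run counts against this config is to expand it into its grid of cells. The sketch below copies the model and dataset lists from the config; the `few_shot_counts` key is a hypothetical stand-in for "fs0 through fs6", and the even split of runs across cells is an assumption.

```python
from itertools import product

config = {
    "models": [
        "sonnet-4-6", "sonnet-4-5", "opus-4-5", "opus-4-6",
        "openai:4.1", "openai:5.2", "openai:5-mini", "openai:5.4",
        "gemini:gemini-2.5-pro", "gemini:gemini-2.5-flash",
    ],
    "datasets": ["downloaded", "acme_foods", "nova_exports"],
    # Hypothetical key: fs0 through fs6 expressed as integers.
    "few_shot_counts": list(range(7)),
}

# Every (model, few-shot count) pair, then every pair crossed with a dataset.
model_fs_pairs = list(product(config["models"], config["few_shot_counts"]))
cells = list(product(config["models"], config["few_shot_counts"], config["datasets"]))

total_runs = 3570
runs_per_pair = total_runs / len(model_fs_pairs)  # average per model x few-shot
runs_per_cell = total_runs / len(cells)           # average per model x few-shot x dataset
```

This gives 70 model × few-shot pairs at 51 runs each and 210 full cells at 17 runs each on average; the datasets need not be the same size, so individual cells may differ.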
SUCCESS RATE — Percentage of runs that completed without an agent or HTTP error. 100% across this run means the harness is stable; it says nothing about output quality.
AVG ELAPSED (s) — Wall clock per run, averaged across the cell. Useful as a latency proxy; combine with provider pricing for cost.
AVG MISMATCH/EXPECTED — Average number of fields in the agent's JSON output that differ from the golden reference. Lower is better.
FIELD MATCH — Percentage of fields across all runs in a cell that match the golden. Higher is better. The headline quality metric.
FS COUNT — Number of few-shot examples prepended to the prompt (0 through 6).
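The two quality metrics above can be sketched as follows. This assumes flat JSON objects, exact equality per field, and that extra fields in the output are ignored; the harness may compare differently. The field names are hypothetical.

```python
_MISSING = object()  # sentinel so a missing key never equals a golden None

def compare_fields(output, golden):
    """Return (mismatched, expected) field counts for one run."""
    mismatched = sum(
        1 for key, want in golden.items()
        if output.get(key, _MISSING) != want
    )
    return mismatched, len(golden)

def field_match_percent(per_run_counts):
    """Pool (mismatched, expected) pairs into one field-match percentage."""
    mismatched = sum(m for m, _ in per_run_counts)
    expected = sum(e for _, e in per_run_counts)
    return 100.0 * (expected - mismatched) / expected

# Hypothetical golden reference and agent output for a single run.
golden = {"po_number": "A-1", "quantity": 12, "currency": "USD", "ship_to": "Lyon"}
output = {"po_number": "A-1", "quantity": 10, "currency": "USD"}

counts = compare_fields(output, golden)  # wrong quantity, missing ship_to
match = field_match_percent([counts])
```

In this example two of the four expected fields differ, so the run contributes a mismatch count of 2 and a field match of 50%.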