Benchmarks

Image quality control: council vs solo models

Which AI models catch real photo defects — and which flag too many clean photos? First baseline measurement, June 2026.

Pilot · first baseline · mediaqc-v3-2026-06-10 · n=300 · 2026-06-10

What did we find?

We sent 300 images (160 with a real defect, 140 clean) to six AI vision models and a council of five models working together. The council caught 87.5% of defects. The best single model caught 66.9%. That gap — 20.6 percentage points — is the main finding. A council of models misses far fewer defects than any model working alone.

Council recall: 87.5% (defects correctly flagged)

Council advantage: +20.6pp (vs best solo model)

Best solo model: 66.9% (maximum recall with one model)

Recall = the share of real defects the model found. 87.5% recall means the model caught 87.5 out of every 100 defect images.

False-alarm rate = the share of clean photos that were incorrectly flagged as defective. A lower number is better.

Category recall = the model not only flagged the image, but flagged it for the right category of defect (e.g. 'anatomy' rather than 'lighting'). This is a stricter test.
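The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code used for this report: the `Result` record layout and the `score` function are hypothetical, and the denominator used for category recall (all defect images) is an assumption, since the report does not state it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Result:
    """One model verdict on one image (hypothetical record layout)."""
    true_category: Optional[str]       # None for a clean control image
    flagged: bool                      # did the model flag the image at all?
    predicted_category: Optional[str]  # defect class named by the model, if any


def score(results: Sequence[Result]) -> dict:
    """Compute the three metrics; expects at least one defect and one control."""
    defects = [r for r in results if r.true_category is not None]
    controls = [r for r in results if r.true_category is None]
    caught = [r for r in defects if r.flagged]
    right_class = [r for r in caught if r.predicted_category == r.true_category]
    false_alarms = [r for r in controls if r.flagged]
    return {
        "recall": len(caught) / len(defects),
        "false_alarm_rate": len(false_alarms) / len(controls),
        # Assumption: category recall is taken over all defect images.
        "category_recall": len(right_class) / len(defects),
    }


# Toy data: four defect images and two clean controls.
toy = [
    Result("anatomy", True, "anatomy"),   # caught, right class
    Result("anatomy", True, "physics"),   # caught, wrong class
    Result("texture", False, None),       # missed
    Result("physics", True, "physics"),   # caught, right class
    Result(None, True, "texture"),        # false alarm
    Result(None, False, None),            # correctly passed
]
print(score(toy))
```

Category recall can never exceed recall under this definition, because an image has to be flagged before its class can be right.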

Results per model

Model	Type	Recall	False-alarm rate	Category recall	Median latency	Avg cost/image
Council	Council	87.5%	17.1%	78.8%	1.7 s	0.267 c
Council (grounded)	Council	70.6%	10.0%	57.7%	2.2 s	0.448 c
claude-fable-5	Solo	66.9%	7.1%	60.3%	7.5 s	3.421 c
gpt-4o	Solo	66.9%	15.7%	59.6%	2.3 s	0.437 c
gemini-2.5-pro	Solo	60.6%	3.6%	48.7%	11.8 s	1.431 c
gemini-2.5-flash	Solo	36.9%	7.9%	34.6%	5.2 s	0.238 c
gpt-4o-mini	Solo	34.4%	16.4%	30.1%	3.4 s	0.366 c
Mistral-Small-3.2-24B-Instruct-2506	Solo	9.4%	12.1%	9.0%	3.3 s	0.017 c

Council: five models vote together. Latency is the judge step only; actual proposal latency adds time per model.

Council (grounded): the same council with an image-grounded judge (A/B arm). False alarms drop but recall also drops, so the flag stays off.

Judge grounding A/B: useful for false alarms, costly for recall

Adding image grounding to the judge reduced false alarms from 17.1% to 10.0%, a drop of 7.1 percentage points. But it also cut recall by 16.9 percentage points, from 87.5% to 70.6% (p < 0.001). The false-alarm improvement has p ≈ 0.08 at n=140, which is directionally encouraging but not statistically significant. Given the recall cost, the grounded-judge flag stays off until the recall loss can be avoided.

Council latency note

The latency shown for council rows is the judge step only (run over stored proposer responses). A live council call also waits for the slowest of the five panel models. Expected end-to-end latency for a live council: roughly the slowest solo model plus the judge step.
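As a rough illustration of that estimate, the sketch below adds the judge step to the slowest panel latency. The function name is hypothetical, and so is the panel membership: the report does not say which five of the six solo models sit on the panel. Combining medians this way is also only an approximation, since the median of the slowest response is not the same as the slowest median.

```python
from typing import Iterable


def live_council_latency(panel_latencies_s: Iterable[float],
                         judge_latency_s: float) -> float:
    """Rough end-to-end estimate: panel models run in parallel, judge runs after."""
    latencies = list(panel_latencies_s)
    if not latencies:
        raise ValueError("panel must contain at least one model")
    return max(latencies) + judge_latency_s


# Median solo latencies from the results table. The choice of these five
# models as the panel is a guess made for illustration only.
panel = {
    "claude-fable-5": 7.5,
    "gpt-4o": 2.3,
    "gemini-2.5-pro": 11.8,
    "gemini-2.5-flash": 5.2,
    "gpt-4o-mini": 3.4,
}
estimate = live_council_latency(panel.values(), judge_latency_s=1.7)
print(f"estimated live council latency: {estimate:.1f} s")
```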

Technical details

Dataset composition

300 images total. 160 defect images: 130 LOKI human-annotated AI-generated images with ground-truth defect labels + 30 synthetic defects. 140 control images: 120 real-world photographs (no AI generation artifacts) + 20 additional controls. All images normalised to JPEG q90, maximum 1024px on the long edge.
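The size limit in the normalisation step can be sketched as a small helper. This only computes target dimensions; the actual resampling and JPEG encoding would be done by an image library. The function name is hypothetical, and leaving smaller images untouched (no upscaling) is an assumption based on the word "maximum".

```python
from typing import Tuple


def fit_long_edge(width: int, height: int, max_edge: int = 1024) -> Tuple[int, int]:
    """Return (width, height) scaled so the long edge is at most max_edge.

    The aspect ratio is preserved. Images already within the limit are
    returned unchanged (assumption: no upscaling).
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    # Guard against a very thin image rounding down to a zero-pixel side.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_long_edge(4000, 3000))  # landscape image above the limit
print(fit_long_edge(1000, 2048))  # portrait image above the limit
print(fit_long_edge(800, 600))    # already within the limit
```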

Rubric and defect classes

Rubric version v2. Defect classes: anatomy (limbs, fingers, faces), physics (lighting, shadows, reflections), texture (surfaces, materials), background (incoherent elements), other. A model must name the correct class to count toward category recall.

Same-proposer replay design

Solo models ran independently on each image. The council arms ran the judge over the stored solo findings — no repeated API calls for images already scored. This controls for prompt variation and isolates the effect of council voting from individual model quality.
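The replay idea can be sketched as follows: solo findings are stored once, and any judge can then be run over them without calling a model again. Everything here is hypothetical, including the storage layout and the function names. In particular, the threshold vote in `at_least` is only a stand-in so the example runs; the judge in this benchmark is a separate model step, not a vote count.

```python
from typing import Callable, Dict, List

# stored[image_id][model_name] is True if that solo model flagged the image.
Stored = Dict[str, Dict[str, bool]]
Judge = Callable[[List[bool]], bool]


def replay_council(stored: Stored, panel: List[str], judge: Judge) -> Dict[str, bool]:
    """Run a judge over stored solo findings; no model is called again."""
    return {
        image_id: judge([findings[model] for model in panel])
        for image_id, findings in stored.items()
    }


def at_least(k: int) -> Judge:
    """Stand-in judge: flag the image when k or more panel models flagged it."""
    return lambda votes: sum(votes) >= k


# Toy stored findings for three images and a three-model panel.
stored = {
    "img-001": {"a": True, "b": False, "c": False},
    "img-002": {"a": False, "b": False, "c": False},
    "img-003": {"a": True, "b": True, "c": False},
}
panel = ["a", "b", "c"]

# Two judge variants are compared on exactly the same proposals.
print(replay_council(stored, panel, at_least(1)))
print(replay_council(stored, panel, at_least(2)))
```

Because both arms read the same stored proposals, any difference between them comes from the judge alone, which is the point of the design.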

Statistical notes

Judge grounding false-alarm rate improvement: from 17.1% to 10.0%, p ≈ 0.08 (Fisher exact, n=140 controls), directional but not significant. Council recall vs best solo: p < 0.001 (chi-squared, n=160 defects). All confidence intervals are based on the normal approximation; n is small enough that results are a starting point, not a product guarantee.
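For readers who want to experiment with tests of this kind, here is a standard-library sketch of a two-sided Fisher exact test, a Pearson chi-squared test for a 2x2 table, and a normal-approximation interval. The counts are an assumption: they are back-calculated from the rounded percentages (24 and 14 false alarms out of 140, 140 and 107 caught defects out of 160). The exact variant used for the reported p-values (one-sided or two-sided, paired or unpaired, with or without continuity correction) is not stated, so the output may not match the reported values exactly.

```python
from math import comb, erfc, sqrt
from typing import Tuple


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared p-value for a 2x2 table (1 df, no correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(stat / 2))  # survival function of chi-squared with 1 df


def normal_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval for a proportion (95% by default)."""
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return p - half, p + half


# Assumed counts, reconstructed from the rounded percentages in this report.
false_alarms = (24, 116, 14, 126)   # council vs grounded council, 140 controls
recall = (140, 20, 107, 53)         # council vs best solo, 160 defects

print("false alarms, Fisher exact:", fisher_exact_two_sided(*false_alarms))
print("recall, chi-squared:", chi2_2x2(*recall))
print("council recall, 95% interval:", normal_ci(140, 160))
```

Both comparisons here treat the two arms as independent samples. Since every arm scored the same images, a paired test would be a reasonable alternative to consider.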

This is a first baseline measurement, not a continuous benchmark or product guarantee. Dataset: mediaqc-v3-2026-06-10. Measurement date: 2026-06-10.
