Benchmarks

Image quality control: council vs solo models

Which AI models catch real photo defects — and which flag too many clean photos? First baseline measurement, June 2026.

Pilot · first baseline · mediaqc-v3-2026-06-10 · n=300 · 2026-06-10

What did we find?

We sent 300 images (160 with a real defect, 140 clean) to six AI vision models and to a council of five models working together. The council caught 87.5% of defects. The best single model caught 66.9%. That gap of 20.6 percentage points is the main finding: in this test, the council missed far fewer defects than any model working alone.

Council recall

87.5%

defects correctly flagged

Council advantage

+20.6pp

vs best solo model

Best solo model

66.9%

maximum recall with one model

Recall = the share of real defects the model found. 87.5% recall means the model caught 87.5 out of every 100 defect images.
False-alarm rate = the share of clean photos that were incorrectly flagged as defective. A lower number is better.
Category recall = the model not only flagged the image, but flagged it for the right category of defect (e.g. 'anatomy' rather than 'lighting'). This is a stricter test.
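
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the record layout (true class, flagged class) and the function names are assumptions, and category recall is divided by all defect images here, which the report does not confirm.

```python
from typing import Dict, Optional, Sequence, Tuple

# Each record is (true_class, flagged_class).
# true_class is None for a clean photo; flagged_class is None if the model
# did not flag the image. This layout is assumed for illustration only.
Record = Tuple[Optional[str], Optional[str]]


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when there is nothing to count."""
    return numerator / denominator if denominator else float("nan")


def qc_metrics(records: Sequence[Record]) -> Dict[str, float]:
    defects = [r for r in records if r[0] is not None]
    clean = [r for r in records if r[0] is None]

    # Recall: the defect image was flagged, for any reason.
    caught = sum(1 for _, flagged in defects if flagged is not None)
    # Category recall: flagged AND the named class matches the ground truth.
    right_class = sum(1 for true, flagged in defects if flagged == true)
    # False alarm: a clean photo was flagged anyway.
    false_alarms = sum(1 for _, flagged in clean if flagged is not None)

    return {
        "recall": ratio(caught, len(defects)),
        "category_recall": ratio(right_class, len(defects)),
        "false_alarm_rate": ratio(false_alarms, len(clean)),
    }


example = [
    ("anatomy", "anatomy"),   # caught, right category
    ("anatomy", "lighting"),  # caught, wrong category
    ("texture", None),        # missed
    ("physics", "physics"),   # caught, right category
    (None, None),             # clean, correctly left alone
    (None, "texture"),        # clean, false alarm
]
metrics = qc_metrics(example)
print(metrics)
```

In the toy example, three of four defects are flagged but only two for the right class, which is why category recall can never exceed recall.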

Results per model

| Model | Type | Recall | False-alarm rate | Category recall | Median latency | Avg cost/image |
|---|---|---|---|---|---|---|
| Council | Council | 87.5% | 17.1% | 78.8% | 1.7 s | 0.267 c |
| Council (grounded) | Council | 70.6% | 10.0% | 57.7% | 2.2 s | 0.448 c |
| claude-fable-5 | Solo | 66.9% | 7.1% | 60.3% | 7.5 s | 3.421 c |
| gpt-4o | Solo | 66.9% | 15.7% | 59.6% | 2.3 s | 0.437 c |
| gemini-2.5-pro | Solo | 60.6% | 3.6% | 48.7% | 11.8 s | 1.431 c |
| gemini-2.5-flash | Solo | 36.9% | 7.9% | 34.6% | 5.2 s | 0.238 c |
| gpt-4o-mini | Solo | 34.4% | 16.4% | 30.1% | 3.4 s | 0.366 c |
| Mistral-Small-3.2-24B-Instruct-2506 | Solo | 9.4% | 12.1% | 9.0% | 3.3 s | 0.017 c |

Council: five models vote together. The latency shown covers the judge step only; the proposal step adds time per model.
Council (grounded): the same council with an image-grounded judge (A/B arm). False alarms drop, but recall also drops, so the flag stays off.

Judge grounding A/B: useful for false alarms, costly for recall

Adding image grounding to the judge reduced the false-alarm rate from 17.1% to 10.0%, but it also cut recall by 16.9 percentage points, from 87.5% to 70.6% (p < 0.001). The false-alarm reduction has p ≈ 0.08 on the n=140 clean images, which points in the right direction but is not yet statistically significant. Given the recall cost, the grounded-judge flag stays off until a solution is found.

Council latency note

The latency shown for council rows is the judge step only (run over stored proposer responses). A live council call also waits for the slowest of the five panel models. Expected end-to-end latency for a live council: roughly the slowest solo model plus the judge step.
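
The estimate above amounts to "slowest panel model plus judge step". A rough sketch of that arithmetic follows. The panel below is hypothetical: the report does not say which five of the six solo models sit on the council, so the names are placeholders and the values are borrowed from the solo median latencies in the results table. Adding medians is only an approximation of a real end-to-end median.

```python
from typing import Iterable


def live_council_latency(panel_latencies_s: Iterable[float], judge_latency_s: float) -> float:
    """Panel models are assumed to run in parallel, so the wait before the
    judge can start is set by the slowest of them."""
    return max(panel_latencies_s) + judge_latency_s


# Hypothetical panel membership (placeholder names, table medians as values).
panel = {
    "model_a": 7.5,
    "model_b": 2.3,
    "model_c": 11.8,
    "model_d": 5.2,
    "model_e": 3.4,
}
estimate = live_council_latency(panel.values(), judge_latency_s=1.7)
print(f"Estimated live council latency: {estimate:.1f} s")
```

If the slowest model were left off the panel, the estimate would fall accordingly, which is why panel membership matters more for latency than the judge step does.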

Technical details

Dataset composition

300 images total. 160 defect images: 130 LOKI human-annotated AI-generated images with ground-truth defect labels + 30 synthetic defects. 140 control images: 120 real-world photographs (no AI generation artifacts) + 20 additional controls. All images normalised to JPEG q90, maximum 1024px on the long edge.
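
To make the size rule concrete, here is a small sketch of how target dimensions could be derived for the "maximum 1024px on the long edge" step. The function name, the rounding, and the choice not to upscale smaller images are assumptions. The actual resampling and JPEG q90 encoding would be done by an imaging library and are not shown.

```python
from typing import Tuple


def normalised_size(width: int, height: int, max_edge: int = 1024) -> Tuple[int, int]:
    """Scale (width, height) so the long edge is at most max_edge,
    preserving the aspect ratio."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        # Assumption: images already within the limit are left at their size.
        return width, height
    scale = max_edge / long_edge
    # Round to whole pixels and never let an edge collapse to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(normalised_size(4000, 3000))
print(normalised_size(800, 600))
print(normalised_size(1000, 2048))
```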

Rubric and defect classes

Rubric version v2. Defect classes: anatomy (limbs, fingers, faces), physics (lighting, shadows, reflections), texture (surfaces, materials), background (incoherent elements), other. A model must name the correct class to count toward category recall.

Same-proposer replay design

Solo models ran independently on each image. The council arms ran the judge over the stored solo findings — no repeated API calls for images already scored. This controls for prompt variation and isolates the effect of council voting from individual model quality.
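
The replay idea can be sketched as follows. The stored findings, image ids, and model names are invented for illustration, and the simple majority vote is only a stand-in: the report does not describe how the real judge combines proposals.

```python
from collections import Counter
from typing import Callable, Dict, Optional

# Stored solo findings: image id -> {model name -> flagged class, or None}.
# Every id, name, and value here is made up for illustration.
Proposals = Dict[str, Optional[str]]

stored: Dict[str, Proposals] = {
    "img_1": {"m1": "anatomy", "m2": "anatomy", "m3": None, "m4": "physics", "m5": None},
    "img_2": {"m1": None, "m2": None, "m3": "texture", "m4": None, "m5": None},
}


def majority_judge(proposals: Proposals) -> Optional[str]:
    """Stand-in judge: flag only if more than half of the proposers flagged,
    then report the most frequently named class."""
    flags = [cls for cls in proposals.values() if cls is not None]
    if len(flags) * 2 <= len(proposals):
        return None
    return Counter(flags).most_common(1)[0][0]


def replay(findings: Dict[str, Proposals],
           judge: Callable[[Proposals], Optional[str]]) -> Dict[str, Optional[str]]:
    """Run a judge over stored findings without calling any proposer again."""
    return {image_id: judge(props) for image_id, props in findings.items()}


verdicts = replay(stored, majority_judge)
print(verdicts)
```

Because the proposer outputs are fixed, swapping in a different judge function and replaying gives an A/B comparison in which only the judge changes.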

Statistical notes

Judge grounding false-alarm rate (FPR) improvement: from 17.1% to 10.0%, p ≈ 0.08 (Fisher exact, n=140 controls); directional but not significant. Council recall vs best solo: p < 0.001 (chi-squared, n=160 defects). All confidence intervals are based on the normal approximation; n is small enough that results are a starting point, not a product guarantee.
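
As a rough cross-check of these figures, the sketch below runs a pooled two-proportion z-test and a normal-approximation (Wald) interval using only the standard library. Several assumptions apply. The counts are back-calculated from the rounded percentages in the results table. The z-test is not the Fisher exact test named above, so its p-value for the false-alarm comparison need not match an exact test; for a 2x2 table it is equivalent to a chi-squared test without continuity correction. It also treats the two arms as independent samples, although both arms scored the same images.

```python
import math
from typing import Tuple


def two_proportion_p(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


def wald_interval(x: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation 95% interval for a proportion x / n."""
    p = x / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


# Assumed counts: 17.1% and 10.0% of 140 clean images; 87.5% and 66.9% of 160 defects.
p_false_alarm = two_proportion_p(24, 140, 14, 140)  # council vs grounded council
p_recall = two_proportion_p(140, 160, 107, 160)     # council vs best solo
low, high = wald_interval(140, 160)                 # council recall

print(f"false-alarm difference: p = {p_false_alarm:.3f}")
print(f"recall difference:      p = {p_recall:.2g}")
print(f"council recall 95% CI:  {low:.3f} to {high:.3f}")
```

A paired test such as McNemar's would fit the same-image design more closely, but it needs per-image outcomes, which the summary table does not provide.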

This is a first baseline measurement (Dutch: beginmeting), not a continuous benchmark or product guarantee. Dataset: mediaqc-v3-2026-06-10. Measurement date: 2026-06-10.