Benchmarks
Image quality control: council vs solo models
Which AI models catch real photo defects — and which flag too many clean photos? First baseline measurement, June 2026.
What did we find?
We sent 300 images (160 with a real defect, 140 clean) to six AI vision models and a council of five models working together. The council caught 87.5% of defects. The best single model caught 66.9%. That gap — 20.6 percentage points — is the main finding. A council of models misses far fewer defects than any model working alone.
Council recall: 87.5% (defects correctly flagged)
Council advantage: +20.6pp (vs best solo model)
Best solo model: 66.9% (maximum recall with one model)
Results per model
| Model | Recall | False-alarm rate | Category recall | Median latency | Avg cost/image |
|---|---|---|---|---|---|
| Council | 87.5% | 17.1% | 78.8% | 1.7 s | 0.267 c |
| Council (grounded) | 70.6% | 10.0% | 57.7% | 2.2 s | 0.448 c |
| claude-fable-5 (solo) | 66.9% | 7.1% | 60.3% | 7.5 s | 3.421 c |
| gpt-4o (solo) | 66.9% | 15.7% | 59.6% | 2.3 s | 0.437 c |
| gemini-2.5-pro (solo) | 60.6% | 3.6% | 48.7% | 11.8 s | 1.431 c |
| gemini-2.5-flash (solo) | 36.9% | 7.9% | 34.6% | 5.2 s | 0.238 c |
| gpt-4o-mini (solo) | 34.4% | 16.4% | 30.1% | 3.4 s | 0.366 c |
| Mistral-Small-3.2-24B-Instruct-2506 (solo) | 9.4% | 12.1% | 9.0% | 3.3 s | 0.017 c |

Council: five models vote together. Latency is for the judge step only; actual proposal latency adds time per model.
Council (grounded): same council with an image-grounded judge (A/B arm). False positives drop but recall also drops, so the flag stays off.
Judge grounding A/B: useful for false alarms, costly for recall
Adding image grounding to the judge reduced the false-alarm rate from 17.1% to 10.0%, a real improvement. It also cut recall by 16.9 percentage points, from 87.5% to 70.6% (p < 0.001). The false-alarm improvement has p ≈ 0.08 at n=140, which points in the right direction but is not statistically significant at the conventional 0.05 level. Given the recall cost, the grounded-judge flag stays off until the recall loss can be recovered.
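As a rough cross-check of the A/B comparison, the sketch below runs a pooled two-proportion z-test on both metrics. This is a swapped-in test, not the Fisher exact or chi-squared procedure the report names, and it treats the two arms as independent samples. The counts are an assumption: they are back-calculated from the reported percentages (24 and 14 false alarms out of 140 controls, 140 and 113 caught out of 160 defects).

```python
import math


def two_proportion_p_value(hits_a: int, n_a: int, hits_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    rate_a = hits_a / n_a
    rate_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# ASSUMPTION: counts reconstructed from the rounded percentages in the report.
N_CONTROLS = 140
N_DEFECTS = 160
FALSE_ALARMS = {"plain": 24, "grounded": 14}
CAUGHT = {"plain": 140, "grounded": 113}

p_false_alarm = two_proportion_p_value(
    FALSE_ALARMS["plain"], N_CONTROLS, FALSE_ALARMS["grounded"], N_CONTROLS
)
p_recall = two_proportion_p_value(
    CAUGHT["plain"], N_DEFECTS, CAUGHT["grounded"], N_DEFECTS
)
print(f"false-alarm p = {p_false_alarm:.3f}, recall p = {p_recall:.5f}")
```

Because both arms score the same images, a paired test over per-image outcomes may be the better fit; those outcomes are not available here, so this sketch only uses the aggregate counts.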
Council latency note
The latency shown for council rows is the judge step only (run over stored proposer responses). A live council call also waits for the slowest of the five panel models. Expected end-to-end latency for a live council: roughly the slowest solo model plus the judge step.
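The estimate described here can be sketched as a small helper: take the slowest panel member and add the judge step. The panel membership below is hypothetical, since the report does not say which five models sit on the council, and using median latencies gives only a rough figure rather than a true end-to-end median.

```python
def estimate_live_latency(panel_latencies_s: dict[str, float], judge_latency_s: float) -> float:
    """Rough live-council latency: slowest panel model plus the judge step."""
    if not panel_latencies_s:
        raise ValueError("panel must contain at least one model")
    return max(panel_latencies_s.values()) + judge_latency_s


# HYPOTHETICAL panel: five solo rows from the results table, chosen for illustration only.
HYPOTHETICAL_PANEL = {
    "claude-fable-5": 7.5,
    "gpt-4o": 2.3,
    "gemini-2.5-pro": 11.8,
    "gemini-2.5-flash": 5.2,
    "gpt-4o-mini": 3.4,
}
JUDGE_LATENCY_S = 1.7  # council judge step, from the results table

estimate = estimate_live_latency(HYPOTHETICAL_PANEL, JUDGE_LATENCY_S)
print(f"estimated live council latency: {estimate:.1f} s")
```

If the slowest benchmarked model is not on the panel, the estimate drops accordingly, which is why the panel is passed in rather than hard-coded.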
Technical details
Dataset composition
300 images total. 160 defect images: 130 LOKI human-annotated AI-generated images with ground-truth defect labels + 30 synthetic defects. 140 control images: 120 real-world photographs (no AI generation artifacts) + 20 additional controls. All images normalised to JPEG q90, maximum 1024px on the long edge.
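The normalisation step could look roughly like the sketch below. It assumes the third-party Pillow library, assumes images are only ever downscaled (never upscaled), and uses Pillow's default resampling filter; the function and constant names are illustrative and not taken from the benchmark code.

```python
from pathlib import Path

MAX_LONG_EDGE = 1024  # pixels, from the dataset description
JPEG_QUALITY = 90     # "JPEG q90"


def target_size(width: int, height: int, max_long_edge: int = MAX_LONG_EDGE) -> tuple[int, int]:
    """Return the output size, shrinking only when the long edge exceeds the cap."""
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    return max(1, round(width * scale)), max(1, round(height * scale))


def normalise_image(src: Path, dst: Path) -> None:
    """Re-encode one image as an RGB JPEG at quality 90, capped at 1024px."""
    from PIL import Image  # Pillow; imported lazily so target_size has no dependency

    with Image.open(src) as img:
        rgb = img.convert("RGB")  # JPEG has no alpha channel
        size = target_size(*rgb.size)
        if size != rgb.size:
            rgb = rgb.resize(size)
        rgb.save(dst, format="JPEG", quality=JPEG_QUALITY)
```

For example, `target_size(2048, 1536)` yields `(1024, 768)`, while an 800x600 image is left at its original size.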
Rubric and defect classes
Rubric version v2. Defect classes: anatomy (limbs, fingers, faces), physics (lighting, shadows, reflections), texture (surfaces, materials), background (incoherent elements), other. A model must name the correct class to count toward category recall.
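To make the three table metrics concrete, here is a minimal scoring sketch on toy data. The record layout is hypothetical, and the sketch assumes category recall is computed over all defect images; the report does not state the exact denominator it uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scored:
    true_class: Optional[str]             # None marks a clean control image
    flagged: bool                         # did the model report a defect?
    predicted_class: Optional[str] = None


def summarise(results: list[Scored]) -> dict[str, float]:
    defects = [r for r in results if r.true_class is not None]
    controls = [r for r in results if r.true_class is None]
    caught = sum(r.flagged for r in defects)
    # Category recall additionally requires the correct defect class to be named.
    caught_right_class = sum(
        r.flagged and r.predicted_class == r.true_class for r in defects
    )
    false_alarms = sum(r.flagged for r in controls)
    return {
        "recall": caught / len(defects),
        "category_recall": caught_right_class / len(defects),
        "false_alarm_rate": false_alarms / len(controls),
    }


# Toy data for illustration only, not benchmark results.
toy = [
    Scored("anatomy", True, "anatomy"),
    Scored("physics", True, "texture"),   # caught, but wrong class
    Scored("texture", False),             # missed
    Scored("background", True, "background"),
    Scored(None, True, "other"),          # false alarm on a clean image
    Scored(None, False),
]
metrics = summarise(toy)
print(metrics)
```

On this toy set a defect flagged under the wrong class still counts toward recall but not toward category recall, which is why category recall can never exceed recall.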
Same-proposer replay design
Solo models ran independently on each image. The council arms ran the judge over the stored solo findings — no repeated API calls for images already scored. This controls for prompt variation and isolates the effect of council voting from individual model quality.
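The replay idea can be illustrated with a short sketch: proposer outputs are stored once, and different judges are run over the same stored findings, so any difference in verdicts comes from the judge alone. Everything here is hypothetical, including the model names, the storage layout, and the vote-threshold judge, which stands in for the real judge whose logic the report does not describe.

```python
from typing import Callable

# image_id -> {model_name: did that model flag a defect?}
StoredFindings = dict[str, dict[str, bool]]
Judge = Callable[[list[bool]], bool]


def threshold_judge(min_votes: int) -> Judge:
    """Stand-in judge: flag the image when at least min_votes proposers flagged it."""
    def judge(votes: list[bool]) -> bool:
        return sum(votes) >= min_votes
    return judge


def replay_council(stored: StoredFindings, panel: list[str], judge: Judge) -> dict[str, bool]:
    """Run a judge over stored proposer findings without calling any model again."""
    verdicts = {}
    for image_id, per_model in stored.items():
        votes = [per_model[model] for model in panel]
        verdicts[image_id] = judge(votes)
    return verdicts


# Toy stored findings for illustration only.
stored = {
    "img_001": {"model_a": True, "model_b": False, "model_c": False},
    "img_002": {"model_a": True, "model_b": True, "model_c": False},
    "img_003": {"model_a": False, "model_b": False, "model_c": False},
}
panel = ["model_a", "model_b", "model_c"]

lenient = replay_council(stored, panel, threshold_judge(1))
strict = replay_council(stored, panel, threshold_judge(2))
print(lenient, strict)
```

Both runs read identical proposer outputs, so the two arms differ only in the judge, which mirrors how the grounded and ungrounded council arms are compared.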
Statistical notes
Judge grounding FPR improvement: from 17.1% to 10.0%, p ≈ 0.08 (Fisher exact, n=140 controls) — directional but not significant. Council recall vs best solo: p < 0.001 (chi-squared, n=160 defects). All confidence intervals based on normal approximation; n is small enough that results are a starting point, not a product guarantee.
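A normal-approximation (Wald) interval of the kind mentioned above can be sketched as follows. The count of 140 caught defects out of 160 is an assumption back-calculated from the reported 87.5% council recall, and the 95% level (z = 1.96) is also assumed, since the report does not state which level it uses.

```python
import math


def wald_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for a proportion, clipped to [0, 1]."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# ASSUMPTION: 140 of 160 defects caught, reconstructed from the 87.5% recall figure.
low, high = wald_interval(140, 160)
print(f"council recall 95% CI: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly ten percentage points, which is consistent with the caution that results at this sample size are a starting point.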