Live Evidence
Why one model is not enough
Real data from every council run we process — updated every 15 minutes. No simulations, no cherry-picked examples.
Blind-spot coverage
A blind spot is a real vulnerability or error that one model silently misses while another model in the same council catches it. The chart below shows which models most often provide the unique catch — the finding no other model in the panel flagged.
| Rank | Model | Unique-catch rate |
|---|---|---|
| 1 | Gpt 4o Mini | 55.2% |
| 2 | Qwen3.7 Max | 49.3% |
| 3 | Claude Sonnet 4 6 | 27.7% |
| 4 | Llama 4 Maverick | 14.3% |
| 5 | Gemini 2.5 Flash | 13.3% |
| 6 | Claude Opus 4 8 | 10.8% |
| 7 | Gemini 2.5 Pro | 7.9% |
| 8 | Deepseek V4 Pro | 7.7% |
Ranked by unique-catch rate. Only models with sufficient data are shown. Rates are percentages of a model's own events — not relative to other models.
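To make the definition concrete, here is a minimal sketch of how a unique-catch rate could be computed. It assumes that a "unique catch" is a finding flagged by exactly one model in a run, and that a model's "own events" are all findings that model flagged; the page does not spell out the exact denominator, so treat that as an assumption. The function name, model names, and finding IDs are hypothetical.

```python
from collections import Counter

def unique_catch_rates(runs):
    """runs: list of dicts mapping model name -> set of finding IDs it flagged in that run.

    Returns {model: unique catches / all findings flagged by that model}.
    The denominator is an assumption, not the page's documented formula.
    """
    unique = Counter()
    total = Counter()
    for findings_by_model in runs:
        # Count how many panel members flagged each finding in this run.
        flagged_by = Counter()
        for findings in findings_by_model.values():
            flagged_by.update(findings)
        for model, findings in findings_by_model.items():
            total[model] += len(findings)
            # A catch is unique when no other model in the panel flagged it.
            unique[model] += sum(1 for f in findings if flagged_by[f] == 1)
    return {m: unique[m] / total[m] for m in total if total[m] > 0}

# Hypothetical two-run example.
runs = [
    {"model_a": {"sqli", "idor"}, "model_b": {"sqli"}},
    {"model_a": {"xss", "toctou"}, "model_b": {"xss"}},
]
rates = unique_catch_rates(runs)
print(rates)
```

Because each rate is normalised by the model's own findings, the rates of different models need not sum to 100%.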
Quality scores
Average quality score (0–100) and ok-rate are computed across all judge evaluations where the model acted as a proposer. Ok-rate = fraction of verdicts rated fully correct.
| Model | Avg quality (0–100) | Ok-rate |
|---|---|---|
| Claude Opus 4 8 | 98.3 | 100.0% |
| Gemini 3.1 Flash Lite | 98.2 | 95.5% |
| Claude Opus 4 6 | 98.2 | 98.9% |
| Gpt 5 Chat Latest | 98.1 | 97.8% |
| Claude Opus 4 5 20251101 | 97.6 | 98.9% |
| Claude Opus 4 7 | 96.8 | 97.8% |
| Gpt 4.1 2025 04 14 | 96.6 | 96.6% |
| Gpt 4.1 | 96.5 | 94.5% |
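The two columns can be reproduced with a short aggregation. The sketch below assumes each judge evaluation is available as a (model, score, is_ok) record, where is_ok marks a verdict rated fully correct; that record layout and all names in the example are hypothetical.

```python
from statistics import mean

def proposer_stats(verdicts):
    """verdicts: iterable of (model, score_0_to_100, is_ok) tuples, one per judge evaluation."""
    by_model = {}
    for model, score, is_ok in verdicts:
        by_model.setdefault(model, []).append((score, is_ok))
    return {
        model: {
            # Mean of all judge scores for this proposer.
            "avg_quality": round(mean(s for s, _ in rows), 1),
            # Percentage of verdicts rated fully correct.
            "ok_rate": round(100 * sum(ok for _, ok in rows) / len(rows), 1),
        }
        for model, rows in by_model.items()
    }

# Hypothetical evaluations.
verdicts = [
    ("m1", 100, True),
    ("m1", 95, True),
    ("m1", 90, False),
    ("m2", 80, True),
]
stats = proposer_stats(verdicts)
print(stats)
```

A model can have a high average score and still a lower ok-rate, because one partially correct verdict lowers the ok-rate far more than it lowers the mean.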
Reliability
Noise rate = fraction of model responses the council classifier marks as off-topic or low-signal. Error rate = fraction of API calls that returned an error. Both are averages across all qualifying models.
| Metric | Value | Definition |
|---|---|---|
| Avg noise rate | 2.19% | Share of responses flagged as noise by the council classifier. |
| Avg API error rate | 0.44% | Share of model calls that returned an error. |
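"Averages across all qualifying models" most plausibly means an unweighted mean of per-model rates, which can differ from a rate pooled over all calls. The sketch below illustrates that difference under this assumption; the function names and the counts are hypothetical, not taken from the live data.

```python
def macro_average_rate(counts):
    """counts: {model: (flagged, total)}. Unweighted mean of per-model rates, in percent."""
    rates = [flagged / total for flagged, total in counts.values() if total > 0]
    if not rates:
        raise ValueError("no qualifying models")
    return 100 * sum(rates) / len(rates)

def pooled_rate(counts):
    """Rate over all calls combined, in percent; high-volume models weigh more."""
    flagged = sum(f for f, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return 100 * flagged / total

# Hypothetical counts: model_b handles ten times the traffic of model_a.
counts = {"model_a": (1, 100), "model_b": (30, 1000)}
macro = macro_average_rate(counts)
pooled = pooled_rate(counts)
print(f"macro: {macro:.2f}%  pooled: {pooled:.2f}%")
```

With a per-model average, a low-traffic model counts as much as a high-traffic one, so the figure describes the typical model rather than the typical call.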
Security-review benchmark (INT-1929)
Pre-registered blind test · 12 seeded vulnerabilities + 4 clean controls · blind scorer: independent model not in council · cost: €0.43
We seeded a realistic code review task with 12 real vulnerability classes and 4 clean controls. Each arm ran independently. The blind scorer did not know which arm produced which output.
| Arm | Recall (of 12) | False-positive flags (on 4 clean controls) |
|---|---|---|
| GPT-4o (single) | 7 / 12 | 1 |
| Gemini 2.5 Flash (single) | 11 / 12 | 5 |
| Claude Haiku 4.5 (single) | 12 / 12 | 5 |
| Council — consensus | 12 / 12 | 7 |
GPT-4o silently reported "no security issues found" on 5 of 12 real vulnerabilities — the timing side-channel, the IDOR, the missing-authorization check, the predictable reset token, and the TOCTOU race. These are the context and logic bugs, not the textbook ones. The council caught all five.
Single-model recall varied from 58% (GPT-4o) to 100% (Claude Haiku) on the same tasks. You do not know in advance which model is strongest for the bug in front of you. The council delivers top-of-panel recall without that gamble.
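The percentages follow directly from the table: recall is the number of seeded vulnerabilities found divided by 12. A small sketch of that arithmetic, using the counts from the table above (the variable and function names are illustrative):

```python
SEEDED = 12  # seeded vulnerabilities in the benchmark

# (arm, vulnerabilities found, false-positive count), copied from the results table
arms = [
    ("GPT-4o (single)", 7, 1),
    ("Gemini 2.5 Flash (single)", 11, 5),
    ("Claude Haiku 4.5 (single)", 12, 5),
    ("Council consensus", 12, 7),
]

def recall_pct(found, seeded=SEEDED):
    """Recall as a whole-number percentage."""
    return round(100 * found / seeded)

for name, found, fps in arms:
    print(f"{name}: recall {recall_pct(found)}%, "
          f"missed {SEEDED - found}, false positives {fps}")
```

The "missed" figure for GPT-4o is 12 minus 7, which matches the five vulnerabilities listed earlier.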
The council did not beat the best single model on recall — it tied it (12/12). This benchmark shows reliability and variance-elimination, not "finds more bugs than any model alive." We report this honestly.
Higher recall costs some precision. False positives on clean code: GPT-4o scored 1 (conservative but missed 5 real bugs), while the council scored 7. A human reviews the extra flags — that triage is the cost of not missing the timing side-channel.
Growing signal
We are also collecting feedback from agents and human reviewers, and that dataset is still growing. We will publish ratings and agreement statistics once it is large enough to be meaningful.
Live data fetched at Jun 24, 2026, 5:27 PM