Live Evidence

Why one model is not enough

Real data from every council run we process — updated every 15 minutes. No simulations, no cherry-picked examples.

Blind-spot coverage

A blind spot is a real vulnerability or error that one model silently misses while another model in the same council catches it. The chart below shows which models most often provide the unique catch — the finding no other model in the panel flagged.

Rank · Model · Unique-catch rate

  • 1 · Gpt 4o Mini · 55.2%
  • 2 · Qwen3.7 Max · 49.3%
  • 3 · Claude Sonnet 4 6 · 27.7%
  • 4 · Llama 4 Maverick · 14.3%
  • 5 · Gemini 2.5 Flash · 13.3%
  • 6 · Claude Opus 4 8 · 10.8%
  • 7 · Gemini 2.5 Pro · 7.9%
  • 8 · Deepseek V4 Pro · 7.7%

Ranked by unique-catch rate. Only models with sufficient data are shown. Rates are percentages of a model's own events — not relative to other models.
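As a rough illustration of how a unique-catch rate could be computed, here is a minimal Python sketch. It assumes each finding is stored as the set of models that flagged it, and that a model's "own events" means the findings that model flagged; the page does not spell out either detail, so both are assumptions, and the model names and data are hypothetical.

```python
from collections import Counter

def unique_catch_rates(findings):
    """Return {model: unique catches / findings the model flagged}.

    `findings` is an iterable of sets; each set holds the models that
    flagged one finding (an assumed data shape, not the production schema).
    """
    flagged = Counter()  # findings each model flagged (assumed "own events")
    unique = Counter()   # findings where the model was the only one to flag
    for models in findings:
        for model in models:
            flagged[model] += 1
        if len(models) == 1:
            (only,) = models
            unique[only] += 1
    return {model: unique[model] / flagged[model] for model in flagged}

# Hypothetical council findings for three placeholder models.
findings = [{"a"}, {"a", "b"}, {"b"}, {"b", "c"}, {"a"}, {"c", "a"}]
rates = unique_catch_rates(findings)
for model in sorted(rates):
    print(f"{model}: {rates[model]:.1%}")
```

Because each rate is divided by that model's own flagged findings, the rates need not sum to 100% across models.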

Quality scores

Average quality score (0–100) and ok-rate are computed across all judge evaluations where the model acted as a proposer. Ok-rate = fraction of verdicts rated fully correct.

Model · Avg quality (0–100) · Ok-rate

Claude Opus 4 8 · 98.3 · 100.0%
Gemini 3.1 Flash Lite · 98.2 · 95.5%
Claude Opus 4 6 · 98.2 · 98.9%
Gpt 5 Chat Latest · 98.1 · 97.8%
Claude Opus 4 5 20251101 · 97.6 · 98.9%
Claude Opus 4 7 · 96.8 · 97.8%
Gpt 4.1 2025 04 14 · 96.6 · 96.6%
Gpt 4.1 · 96.5 · 94.5%
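The two columns can be derived from per-verdict records along these lines. This is a sketch under assumptions: the `Verdict` record, its field names, and the `"ok"` label for a fully correct verdict are hypothetical, not the actual judge schema.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    proposer: str  # model that produced the answer being judged
    score: float   # judge's quality score, 0-100
    label: str     # "ok" is assumed to mean "rated fully correct"

def quality_summary(verdicts):
    """Return {model: (average score, ok-rate in percent)}."""
    by_model = {}
    for v in verdicts:
        by_model.setdefault(v.proposer, []).append(v)
    summary = {}
    for model, vs in by_model.items():
        avg = sum(v.score for v in vs) / len(vs)
        ok_rate = sum(v.label == "ok" for v in vs) / len(vs)
        summary[model] = (round(avg, 1), round(100 * ok_rate, 1))
    return summary

# Hypothetical verdicts for two placeholder models.
verdicts = [
    Verdict("model-x", 100, "ok"),
    Verdict("model-x", 95, "ok"),
    Verdict("model-x", 90, "partial"),
    Verdict("model-y", 80, "ok"),
]
summary = quality_summary(verdicts)
print(summary)
```

A model can pair a high average score with a lower ok-rate when some verdicts score well without being rated fully correct.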

Reliability

Noise rate = fraction of model responses the council classifier marks as off-topic or low-signal. Error rate = fraction of API calls that returned an error. Both are averages across all qualifying models.

Avg noise rate

2.19%

Share of responses flagged as noise by the council classifier.

Avg API error rate

0.44%

Share of model calls that returned an error.
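Since both figures are averages across models rather than pooled totals, each model contributes equally regardless of call volume. A minimal sketch of that averaging follows; the `min_total` threshold standing in for "qualifying" and the sample counts are assumptions for illustration.

```python
def average_rate(per_model_counts, min_total=1):
    """Mean of per-model rates.

    `per_model_counts` maps model -> (hits, total), where hits are noise
    flags or API errors. Models with fewer than `min_total` observations
    are skipped (a hypothetical stand-in for "qualifying").
    """
    rates = [
        hits / total
        for hits, total in per_model_counts.values()
        if total >= min_total
    ]
    return sum(rates) / len(rates) if rates else None

# Hypothetical noise counts: (responses flagged as noise, total responses).
noise_counts = {"model-x": (2, 100), "model-y": (1, 50), "model-z": (0, 3)}
avg_noise = average_rate(noise_counts, min_total=20)
print(f"{avg_noise:.2%}")
```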

Security-review benchmark (INT-1929)

Pre-registered blind test · 12 seeded vulnerabilities + 4 clean controls · blind scorer: independent model not in council · cost: €0.43

We seeded a realistic code review task with 12 real vulnerability classes and 4 clean controls. Each arm ran independently. The blind scorer did not know which arm produced which output.

Arm · Recall (of 12) · False positives (on 4 clean controls)

GPT-4o (single) · 7 / 12 · 1
Gemini 2.5 Flash (single) · 11 / 12 · 5
Claude Haiku 4.5 (single) · 12 / 12 · 5
Council (consensus) · 12 / 12 · 7

Key finding

GPT-4o silently reported "no security issues found" on 5 of 12 real vulnerabilities — the timing side-channel, the IDOR, the missing-authorization check, the predictable reset token, and the TOCTOU race. These are the context and logic bugs, not the textbook ones. The council caught all five.

Variance eliminated

Single-model recall ranged from 58% (GPT-4o, 7/12) to 100% (Claude Haiku 4.5, 12/12) on the same tasks. You do not know in advance which model is strongest for the bug in front of you. The council delivers top-of-panel recall without that gamble.
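The recall percentages follow directly from the benchmark counts. A short Python sketch of the arithmetic, using the recall figures reported for each arm (the dictionary layout and variable names are just for illustration):

```python
SEEDED = 12  # seeded vulnerabilities in the benchmark

# Vulnerabilities found per arm, as reported in the benchmark table.
found = {
    "GPT-4o (single)": 7,
    "Gemini 2.5 Flash (single)": 11,
    "Claude Haiku 4.5 (single)": 12,
    "Council consensus": 12,
}

recall = {arm: n / SEEDED for arm, n in found.items()}
singles = [r for arm, r in recall.items() if "(single)" in arm]
spread = max(singles) - min(singles)  # gap between best and worst single model

for arm, r in recall.items():
    print(f"{arm}: {r:.0%}")
print(f"single-model spread: {spread:.1%}")
```

The spread of 5/12 (about 42 percentage points) is the gamble a single-model setup carries on this task set.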

Honest ceiling

The council did not beat the best single model on recall — it tied it (12/12). This benchmark shows reliability and variance-elimination, not "finds more bugs than any model alive." We report this honestly.

Precision trade-off

Higher recall costs some precision. False positives on clean code: GPT-4o produced 1 (conservative, but it missed 5 real bugs), Gemini 2.5 Flash and Claude Haiku 4.5 produced 5 each, and the council produced 7. A human reviews the extra flags; that triage is the cost of not missing the timing side-channel.
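To put a number on the trade-off, precision can be estimated from the same counts. The benchmark does not report precision itself, so this sketch rests on an assumption: every recalled vulnerability counts as one true positive and every false positive as one flag on clean code.

```python
# (true positives, false positives) per arm, taken from the benchmark counts.
# Treating each count as an individual flag is an assumption of this sketch.
arms = {
    "GPT-4o (single)": (7, 1),
    "Gemini 2.5 Flash (single)": (11, 5),
    "Claude Haiku 4.5 (single)": (12, 5),
    "Council consensus": (12, 7),
}

def precision(tp, fp):
    """Fraction of raised flags that were real vulnerabilities."""
    return tp / (tp + fp) if (tp + fp) else 0.0

precisions = {arm: precision(tp, fp) for arm, (tp, fp) in arms.items()}
for arm, p in precisions.items():
    print(f"{arm}: precision {p:.3f}")
```

Under that assumption, GPT-4o has the highest precision (7 of 8 flags were real) and the council the lowest (12 of 19), which is the reverse of the recall ordering.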

Growing signal

We are also collecting feedback from agents and human reviewers, and that dataset is still growing. We will publish ratings and agreement statistics once it is large enough to be meaningful.

Live data fetched at Jun 24, 2026, 5:27 PM