Multi-model consensus · neutral judge

Surface the error one model misses.

One prompt fans out to top models in parallel. A neutral judge from a different lab flags where they disagree — and reconciles them into a single, defensible answer. EU-hosted, fully traceable.
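The fan-out-then-judge flow described above can be sketched roughly as follows. Everything here is a hypothetical stand-in: `ask` returns canned answers instead of calling a provider, the model names are placeholders, and the toy `judge` uses a simple majority where the real judge is a separate model.

```python
import asyncio
from collections import Counter

# Hypothetical stand-in for a real provider call; returns a canned answer.
async def ask(model: str, prompt: str) -> str:
    canned = {"model-a": "42", "model-b": "42", "model-c": "41"}
    await asyncio.sleep(0)  # yield control, as a real network call would
    return canned[model]

def judge(answers: dict[str, str]) -> dict:
    """Toy judge (assumption): pick the most common answer, flag dissenters."""
    winner, _ = Counter(answers.values()).most_common(1)[0]
    dissent = sorted(m for m, a in answers.items() if a != winner)
    return {"answer": winner, "disagreed": dissent}

async def consensus(prompt: str, models: list[str]) -> dict:
    # Fan the same prompt out to every model concurrently.
    replies = await asyncio.gather(*(ask(m, prompt) for m in models))
    return judge(dict(zip(models, replies)))

result = asyncio.run(consensus("What is 6 x 7?", ["model-a", "model-b", "model-c"]))
print(result)
```

The point of the sketch is the shape of the flow: the proposers run in parallel, and disagreement is surfaced alongside the reconciled answer rather than hidden.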

Reduce the errors one model would miss.

131 models tracked · 13,593 benchmark runs · 6 languages
New · early access

5 AI models inspect your image — before your audience does.

Image consensus: a council of five vision models catches anatomy, physics and lighting flaws in AI images that a single model misses.

91% defects caught · 0 false positives on real photos · ~71% max with one model alone
Join the waitlist

More about image consensus →
Pilot 2026-06 · LOKI-35 + real control photos · not a product guarantee.

DEFECT · AI-generated
CLEAN · real photo
Council: gemini-2.5-pro · gpt-4o · fable-5 · gemini-flash · gpt-4o-mini

3 of 5 saw it. One model alone would have missed it — hence a council.
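A minimal sketch of how a five-model council vote like this could be tallied. The quorum of 3 is an assumption (the page does not state the threshold), and which models flagged the image is invented for illustration.

```python
def council_verdict(votes: dict[str, bool], quorum: int = 3) -> str:
    """Return DEFECT when at least `quorum` models flag the image (assumed rule)."""
    flagged = sum(votes.values())
    return "DEFECT" if flagged >= quorum else "CLEAN"

# Hypothetical votes: True means the model reported a flaw.
votes = {
    "gemini-2.5-pro": True,
    "gpt-4o": True,
    "fable-5": True,
    "gemini-flash": False,
    "gpt-4o-mini": False,
}

print(council_verdict(votes))                         # council of five
print(council_verdict({"gemini-flash": False}, 1))    # one model on its own
```

With these example votes the council reports a defect, while a single model that happened to miss the flaw would have reported the image as clean.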

Live rankings

Top models this week

Full leaderboard →

Sample data

Top models — Scientific Reasoning

  • 01 Mistral Large 3 · 780ms
  • 02 Claude Sonnet 4.6 · 920ms
  • 03 Llama 3.3 405B · 1.18s
  • 04 Gemini 2.5 Pro · 1.42s
  • 05 GPT-5o · 1.64s
  • 06 Claude Opus 4.7 · 1.82s

Sample · methodology pending

How we test →

Judge verdicts

3,735 evaluations across 63 models — counts only, no customer prompts

⚖️ Most endorsed: Claude Opus 4.6 (99% accurate)

Claude Fable 5 — intelligence test

Independent, judge-scored results across our task categories — from real test runs, refreshed continuously.

Read the full Fable 5 analysis
93/100 overall score · 20 judge-scored runs

Score by task category

Multilingual: 100
Reasoning: 99
Coding: 99
Creative: 97
Factual: 70

Median response time

Multilingual: 9.1s
Reasoning: 9.5s
Coding: 11.1s
Creative: 5.7s
Factual: 7.0s

Each answer is scored 0–100 by an independent judge model on accuracy, completeness, reasoning and format. Lower factual scores reflect our deliberately hard knowledge probes.
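The page does not say how the overall score is derived from the category scores. As a quick check, and only as an assumption about the method, an unweighted mean of the five category scores shown above reproduces the published 93:

```python
from statistics import mean

# Category scores as published on this page.
category_scores = {
    "Multilingual": 100,
    "Reasoning": 99,
    "Coding": 99,
    "Creative": 97,
    "Factual": 70,
}

# Assumption: overall = unweighted mean of the categories, rounded.
overall = round(mean(category_scores.values()))
print(overall)  # 93
```

The match may be a coincidence of rounding; a weighted scheme could give the same figure.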

Release notes
Blind-spot detection

See where the models split.

Across our weekly intelligence tests, a neutral judge scores every model. These are the questions where the models disagreed most — the blind spots a single model would have hidden. Anonymised; no customer prompts are ever shown.

63 models scored · 1 distinct judge · 3,735 judged runs
Model                                  agreed · judge flagged
Gemini 2.5 Flash                       16 · 60
Gemini 2.5 Pro                         18 · 55
Gemini Pro Latest                      26 · 49
Gemini 3.1 Pro Preview Custom Tools    29 · 47
Gemini 3.1 Pro Preview                 30 · 46
Gemini 3.5 Flash                       4 · 5
Pricing

No fee on single calls. You only pay the fee on consensus.

Ask one model and you pay just its tokens plus a small tier margin — no platform fee. The per-call fee applies only to multi-model consensus checks. 100 consensus checks free every month, no card needed; bundles from €10/month for 500 calls. Every token itemised, nothing hidden.

plan      price     calls/mo   token use
Free      €0/mo     100        provider +5%
Starter   €10/mo    500        provider +4%
Studio    €25/mo    2,000      provider +3%
Scale     €50/mo    5,000      provider +2%

Founders prices, locked through 2027 · PAYG also available · "token margin" = the small % we add on the model provider's own token price (shown as "token use" above), lower on higher tiers

Single-model call
What you pay: tokens + margin
Details: No call-fee — only consensus checks carry the per-call fee. You pay the model provider's token price plus your tier margin (+2–5%). Example: a small model on ~4k tokens ≈ €0.001.
Consensus call
What you pay: call-fee + tokens + margin
Details: The call-fee varies per package (PAYG founders: 2c/proposer + 3c/judge, a 3+1 council = 9c; bundles: counts against your monthly quota; over quota: 1.5c/call). On top: the model provider's tokens + your tier margin.
Bring your own key (BYOK)
What you pay: call-fee only
Details: On consensus you pay only the per-package call-fee — your own key bills the provider directly, so no token cost and no margin from us. A single-model BYOK call costs nothing.
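The three pricing cases above can be worked through in a short sketch. The fee constants are the PAYG founders figures quoted on this page; the function names and the €0.01 provider token cost in the usage lines are hypothetical, chosen only to make the arithmetic visible.

```python
from decimal import Decimal

PROPOSER_FEE = Decimal("0.02")  # PAYG founders: 2c per proposer
JUDGE_FEE = Decimal("0.03")     # PAYG founders: 3c per judge

def payg_call_fee(proposers: int, judges: int = 1) -> Decimal:
    """Per-call fee for one consensus check on PAYG."""
    return proposers * PROPOSER_FEE + judges * JUDGE_FEE

def with_margin(provider_token_cost: Decimal, margin_pct: int) -> Decimal:
    """Provider token price plus the tier margin (2 to 5 percent)."""
    return provider_token_cost * Decimal(100 + margin_pct) / Decimal(100)

def consensus_cost(proposers: int, provider_token_cost: Decimal,
                   margin_pct: int, byok: bool = False) -> Decimal:
    fee = payg_call_fee(proposers)
    if byok:
        return fee  # your own key bills the provider directly
    return fee + with_margin(provider_token_cost, margin_pct)

print(payg_call_fee(3))                               # 3+1 council
print(consensus_cost(3, Decimal("0.01"), 5))          # fee + tokens + 5% margin
print(consensus_cost(3, Decimal("0.01"), 5, True))    # BYOK: fee only
```

`Decimal` is used instead of floats so that cent-level sums stay exact.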

No per-seat fee. No single-call fee, ever. Every consensus receipt is itemised per model, per token, in and out.

Every cent, itemised

illustrative example
model                 in      out     cost
──────────────────────────────────────────────────
claude-haiku-4.5      812     540     €0.0041
gpt-4o                812     610     €0.0072
gemini-2.5-flash      812     498     €0.0029
judge (gpt-4o)        240             €0.0038
──────────────────────────────────────────────────
orchestration                         included
total                                 €0.0180

Accurate to the last token · your real receipt contains your exact counts
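As a sanity check on the illustrative receipt above, the per-model costs can be summed and compared with the printed total. This is only a sketch of how an itemised receipt adds up; the list structure is an assumption, not the real receipt format.

```python
from decimal import Decimal

# Cost column of the illustrative receipt, copied as printed.
receipt = [
    ("claude-haiku-4.5", Decimal("0.0041")),
    ("gpt-4o", Decimal("0.0072")),
    ("gemini-2.5-flash", Decimal("0.0029")),
    ("judge (gpt-4o)", Decimal("0.0038")),
]

total = sum((cost for _, cost in receipt), Decimal("0"))
print(f"total €{total}")
```

The four line items add up to the €0.0180 shown on the total line, with orchestration contributing nothing extra.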

Estimate your cost

500 calls / month (slider range: 100 to 5k)
€10.00 estimated / month
Bundle price · overage at 1.5c/call above quota
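A rough sketch of how such an estimate could be computed from the bundle table. The selection rule is an assumption: pick the smallest bundle whose quota covers the calls, and charge 1.5c per call only beyond the largest bundle. Token costs and margins are left out.

```python
from decimal import Decimal

# (name, monthly consensus-call quota, price in EUR), from the pricing section.
BUNDLES = [
    ("Free", 100, Decimal("0")),
    ("Starter", 500, Decimal("10")),
    ("Studio", 2000, Decimal("25")),
    ("Scale", 5000, Decimal("50")),
]
OVERAGE = Decimal("0.015")  # 1.5c per call above quota

def estimate_monthly(calls: int) -> tuple[str, Decimal]:
    """Return (bundle name, estimated monthly fee) under the assumed rule."""
    for name, quota, price in BUNDLES:
        if calls <= quota:
            return name, price
    name, quota, price = BUNDLES[-1]
    return name, price + (calls - quota) * OVERAGE

print(estimate_monthly(500))
print(estimate_monthly(6000))
```

For 500 calls this lands on the Starter bundle at €10, matching the figure shown by the estimator; the real calculator may weigh overage against upgrading differently.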

How we test

Real prompts, real latency, real scores. A three-tier framework keeps cost under control without compromising transparency.

Tier A

Full coverage

Speed + intelligence test daily across all four languages.

Tier B

Speed only

Latency and uptime sampled four times per day.

Tier C

Health ping

Up/down check every fifteen minutes.
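The three tiers could be written down as a small schedule table, sketched below. The dictionary layout and check names are hypothetical, and the six-hour spacing for Tier B assumes the four daily samples are evenly spread, which the page does not state.

```python
from datetime import timedelta

# Hypothetical schedule config mirroring the three tiers described above.
TIERS = {
    "A": {"checks": ("speed", "intelligence"), "every": timedelta(days=1)},
    "B": {"checks": ("latency", "uptime"), "every": timedelta(hours=6)},
    "C": {"checks": ("health",), "every": timedelta(minutes=15)},
}

def runs_per_day(tier: str) -> int:
    """How many times a tier fires in 24 hours."""
    return timedelta(days=1) // TIERS[tier]["every"]

for tier in TIERS:
    print(tier, runs_per_day(tier))
```

The spread makes the cost trade-off concrete: the cheap health ping runs 96 times a day, while the expensive full test runs once.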

Live · 130+ models available

Try any model — right here

Pick a model, type a prompt, see the answer stream. No sign-up, no wallet, no context-switching.

Open the live tester