Benchmarks
Methodology
How Tokonomix measures AI model performance. No vendor influence. No sponsored results. Transparent methodology, open data.
Speed
How fast does the model respond? We measure time-to-last-token for a fixed-length output prompt.
Intelligence
How accurate and capable is the model? A judge LLM rates answers across 6 categories on a 0–100 scale.
Health
Is the API available? We check every 6 hours and track error rates and availability windows.
Speed Benchmark
Prompt: A fixed instruction targeting approximately 500 tokens of output. The same prompt is used for every model in every run cycle.
Runs: 3 sequential calls per test cycle. We measure end-to-end latency (first byte to last byte), not TTFT.
Metrics: P50 (median) and P95 (tail) across the 3 runs. P50 is the headline number; P95 reveals consistency. With only 3 samples per cycle, P95 sits at or near the slowest run.
Measurement location: EU — Amsterdam (AMS). All results are EU-latency. US or Asia results would differ.
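As a rough illustration of how the two speed metrics could be derived from one test cycle, here is a minimal sketch. The percentile method is not stated in this methodology; the nearest-rank definition and the sample latencies below are assumptions made for the example.

```python
import math
import statistics

def nearest_rank_percentile(samples, pct):
    """Nearest-rank percentile (an assumed method, not confirmed by the methodology)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize_cycle(latencies_s):
    """Summarise one test cycle of end-to-end latencies, in seconds."""
    return {
        "p50": statistics.median(latencies_s),
        "p95": nearest_rank_percentile(latencies_s, 95),
    }

# Hypothetical latencies for one cycle of 3 sequential calls.
cycle = [4.2, 3.9, 6.1]
summary = summarize_cycle(cycle)
print(summary)
```

Under nearest-rank, the P95 of three samples is simply the slowest of the three, which is why it is best read as a consistency signal rather than a true tail estimate.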
Speed tiers:
Intelligence Benchmark
Status: Live since May 2026. 13,593 scored runs across 6 categories and 4 providers. New runs every 6 hours alongside the speed and health checks.
Judge model: Claude Sonnet 4.5 acts as an impartial judge. The evaluated model's name is never included in the judge prompt — only the raw response text is scored (blind review).
Scoring: Each prompt receives a single 0–100 quality score from the judge, plus a classification (correct / partial / incorrect). The judge evaluates factual accuracy, completeness, reasoning quality, and format adherence as a combined rubric. Category averages are shown on model pages.
Six prompt categories:
Overall quality score: Unweighted average of all scored runs for a model across all categories.
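The difference between per-category averages and the overall score can be sketched as follows. The category names and scores are hypothetical; only the "unweighted average of all scored runs" rule comes from the text above.

```python
from collections import defaultdict
from statistics import mean

def category_averages(runs):
    """Average judge score per category; runs are (category, score) pairs."""
    by_category = defaultdict(list)
    for category, score in runs:
        by_category[category].append(score)
    return {category: mean(scores) for category, scores in by_category.items()}

def overall_score(runs):
    """Unweighted average over every scored run, regardless of category."""
    return mean(score for _, score in runs)

# Hypothetical runs: three in one category, one in another.
runs = [("reasoning", 80), ("reasoning", 60), ("reasoning", 70), ("coding", 90)]
print(category_averages(runs))
print(overall_score(runs))
```

Because the overall score averages runs rather than categories, a category with more scored runs carries more weight. In this example the overall score is 75, while the mean of the two category averages would be 80.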
What counts vs. what you watch
The arena shows a live race with health bars and strikes — but the screen and the ranking are two different layers. The visual is there to watch; the ranking is decided by an independent judge panel. This table makes the distinction explicit, so nothing on screen is mistaken for a result.
| On screen | Source | Counts toward ranking? |
|---|---|---|
| Health bars / lead / damage / strikes | Deterministic visual derivation (v8.1-tokonomix) | No — cosmetic |
| Live race leader during a round | Single fast per-turn judge (gpt-4o-mini, 0–10) | No — indicative |
| Round winner | Cross-family panel majority vote (0–100) | Yes |
| Leaderboard position | TrueSkill skill estimate (μ) | Yes |
| Jury upvotes (▲) | Panel vote when a judge scores a model ≥60 | Shown, but does not affect ranking |
| Judge agreement % | How often a judge's pick matched the panel winner | Panel agreement — not a correctness measure |
| Savings (€) | Rounds where a cheaper council beat a costlier model | Best-case — wins only |
| Blind spots caught | Omissions confirmed by ≥2 panel judges | Confirmed only — rolling out |
A fourth method: the arena
Static benchmarks measure a model against a fixed bar. The arena measures models against each other, on realistic customer-service scenarios, judged by a panel of rival models. It produces something a single score cannot: a relative ranking with an uncertainty margin.
Why this complements the static benchmarks (it does not replace them):
- Static tests give absolute quality per category; the arena gives head-to-head strength and a cost-versus-quality trade-off on realistic tasks.
- The arena captures things a 0–100 score misses: consistency across multiple turns, how a model handles follow-ups, and — with councils — whether collaboration actually pays off.
- The on-screen race is a way to watch the contest unfold. The result is always set by the panel, never by the health bars.
How a round scores: from per-turn to panel
Scoring happens in two stages. During the match a single fast referee keeps a running tally; at the end an independent panel of judges votes on the winner.
Stage 1 — live, per turn: One fast, deliberately cheap judge (gpt-4o-mini) scores every answer on a 0–10 scale in a single call. This feeds only the live race lanes — it is indicative, not decisive.
Stage 2 — end of round, the panel: A panel of 3–5 judges from different model families independently votes on the winner on a 0–100 scale. The majority wins; ties break on highest average panel score, then deterministically on lowest model id.
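The Stage 2 vote and its tie-breaks could look roughly like the sketch below. It assumes each judge scores every contestant on 0–100 and that a judge's vote goes to the contestant it scored highest; that mapping from scores to votes, the function name, and the sample scores are assumptions, not details confirmed by this methodology.

```python
from collections import Counter
from statistics import mean

def panel_winner(scores_by_judge):
    """scores_by_judge: one dict {model_id: score 0-100} per panel judge."""
    # Assumption: each judge votes for its top-scored contestant
    # (lowest id if the judge scored two contestants equally).
    votes = Counter(
        min(scores, key=lambda m: (-scores[m], m)) for scores in scores_by_judge
    )
    top = max(votes.values())
    tied = [m for m, n in votes.items() if n == top]
    avg = {m: mean(s[m] for s in scores_by_judge) for m in tied}
    # Tie-breaks from the text: highest average panel score, then lowest model id.
    return min(tied, key=lambda m: (-avg[m], m))

# Hypothetical 4-judge panel, contestants referred to by index only.
panel = [
    {1: 80, 2: 70},
    {1: 75, 2: 60},
    {1: 50, 2: 90},
    {1: 65, 2: 70},
]
print(panel_winner(panel))
```

Here the votes split 2 to 2, so the result falls to the first tie-break: contestant 2 has the higher average panel score (72.5 against 67.5) and takes the round.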
Blind by index: Model names are stripped from the panel prompt — contestants are referred to by number/index only, so the panel cannot favour a familiar brand.
Fixed thresholds: A model earns an upvote (▲) when a judge scores it ≥60. A turn is marked "decisive" when the winner's margin reaches ≥30% of the score scale. These fixed values define the tallies you see.
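The two fixed thresholds can be expressed as a small sketch. The constant and function names are hypothetical, and measuring the margin against the runner-up is an assumption; the values 60 and 30% come from the text above.

```python
UPVOTE_MIN = 60        # judge score (0-100) that earns an upvote
DECISIVE_PERCENT = 30  # winner's margin, as a percentage of the score scale

def count_upvotes(judge_scores):
    """Number of judges whose score for a model reaches the upvote threshold."""
    return sum(1 for score in judge_scores if score >= UPVOTE_MIN)

def is_decisive(winner_score, runner_up_score, scale_max):
    """True when the margin reaches 30% of the scale (assumed: margin over the runner-up)."""
    margin = winner_score - runner_up_score
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return margin * 100 >= DECISIVE_PERCENT * scale_max

print(count_upvotes([72, 60, 59]))
print(is_decisive(8, 5, scale_max=10))
print(is_decisive(8, 6, scale_max=10))
```

Expressing the decisive margin as a share of the scale means the same rule works on both the 0–10 per-turn scale (a margin of 3 or more) and the 0–100 panel scale (a margin of 30 or more).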
TrueSkill: what μ and σ mean
Each model has an estimated skill level μ (mu) and an uncertainty σ (sigma). A new model starts at μ=25, σ=8.333 — high uncertainty. Every match moves μ toward the model's true strength and shrinks σ. Two models with the same μ but different σ are not equal: the one with low σ is proven, the other is still a guess.
The constants we actually use: Starting rating μ=25, σ=8.333; skill-variance BETA=4.167; per-match drift TAU=0.0833. These are fixed in code and identical for every model.
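To show how these constants interact, here is a minimal sketch of the standard closed-form TrueSkill update for a two-player match with a clear winner. It is an illustration only: the methodology does not say which implementation is used or how draws are modelled, so the draw margin is assumed to be zero and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

MU0, SIGMA0 = 25.0, 8.333   # starting rating, from the text
BETA, TAU = 4.167, 0.0833   # skill variance and per-match drift, from the text
_N = NormalDist()

def update_1v1(winner, loser):
    """Return updated (mu, sigma) pairs after a win; assumes no draw margin."""
    (mu_w, sig_w), (mu_l, sig_l) = winner, loser
    # Per-match drift: inflate each variance by TAU^2 before the update.
    var_w, var_l = sig_w**2 + TAU**2, sig_l**2 + TAU**2
    c = sqrt(2 * BETA**2 + var_w + var_l)
    t = (mu_w - mu_l) / c
    v = _N.pdf(t) / _N.cdf(t)  # how far the means move
    w = v * (v + t)            # how much the variances shrink
    new_w = (mu_w + var_w / c * v, sqrt(var_w * (1 - var_w / c**2 * w)))
    new_l = (mu_l - var_l / c * v, sqrt(var_l * (1 - var_l / c**2 * w)))
    return new_w, new_l

# Two brand-new models play their first match.
new_winner, new_loser = update_1v1((MU0, SIGMA0), (MU0, SIGMA0))
print(new_winner, new_loser)
```

Under these assumptions, a first match between two new models moves each μ by roughly 4 points in opposite directions and shrinks both σ values from 8.333 to about 7.2, which illustrates why early rankings are volatile.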
How we currently sort — disclosed honestly: The leaderboard sorts on raw μ (estimated strength). A stricter "proven" ranking would sort on the conservative μ − 3σ. Because this is early data — most models have only a few games — σ is still large, so the top of the board can still shift. We show the estimate and tell you it is an estimate rather than hiding behind a single number.
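The two sort orders described above can disagree, as this small sketch shows. The model ids, ratings, and game counts are hypothetical.

```python
# Hypothetical board entries: one model with a single lucky win, one well-tested model.
models = [
    {"id": "model-a", "mu": 29.2, "sigma": 7.2, "games": 1},
    {"id": "model-b", "mu": 28.0, "sigma": 2.5, "games": 40},
]

def conservative(model):
    """Conservative skill estimate: mu minus three sigma."""
    return model["mu"] - 3 * model["sigma"]

by_raw_mu = sorted(models, key=lambda m: m["mu"], reverse=True)
by_conservative = sorted(models, key=conservative, reverse=True)

print([m["id"] for m in by_raw_mu])
print([m["id"] for m in by_conservative])
```

Sorting on raw μ puts model-a first; sorting on μ − 3σ puts model-b first, because model-a's large σ pulls its conservative estimate down to about 7.6 against 20.5 for model-b.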
Council vs. frontier: does collaboration pay?
A round can pit a cheap council of smaller models against a single costly frontier model. In a council, each turn's answer is the consensus synthesis of its members. This lets the arena answer a question a single score cannot: can a cheap council beat an expensive frontier model — and if so, by how much?
How savings are derived: When a council both wins a round and costs less than the frontier model it beat, we show the difference as savings. A council win is keyed to the group, never to an individual member's board, so a group result never inflates a single model's ranking.
Best-case caveat: Savings accumulate only from rounds the council won. Councils that lost (and so spent money for nothing) are not subtracted. The figure is therefore a best-case saving in the rounds where the council won — not a net result.
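A minimal sketch of the savings rule and its best-case nature follows. The record fields and the cost figures (in euro cents) are hypothetical; the rule itself, counting only rounds where the council won and was cheaper, follows the text above.

```python
def best_case_savings(rounds):
    """Sum savings over rounds the council won while costing less than the frontier model."""
    total = 0
    for r in rounds:
        if r["winner"] == "council" and r["council_cost"] < r["frontier_cost"]:
            total += r["frontier_cost"] - r["council_cost"]
    return total

def spend_on_lost_rounds(rounds):
    """Council spend in rounds it lost; this is NOT subtracted from the savings figure."""
    return sum(r["council_cost"] for r in rounds if r["winner"] == "frontier")

# Hypothetical rounds, costs in euro cents.
rounds = [
    {"winner": "council", "council_cost": 40, "frontier_cost": 100},   # counts: saves 60
    {"winner": "frontier", "council_cost": 40, "frontier_cost": 100},  # lost: ignored
    {"winner": "council", "council_cost": 120, "frontier_cost": 100},  # won but costlier: ignored
    {"winner": "draw", "council_cost": 40, "frontier_cost": 100},      # draw: no savings
]
print(best_case_savings(rounds))
print(spend_on_lost_rounds(rounds))
```

The displayed figure here would be 60 cents, even though the council also spent 40 cents on a round it lost. That unsubtracted spend is exactly what makes the number a best case rather than a net result.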
Two independent reputations
A model is measured two separate ways, and the two can disagree without either being wrong — they measure different things.
Arena reputation (relative): TrueSkill from head-to-head game wins. It ranks a model against its rivals on realistic scenarios.
Neutral-judge reputation (absolute): How often a model is rated correct / partial / incorrect in the recurring intelligence test, against a fixed rubric rather than against an opponent.
A model can lose games yet hold a high correctness reputation, or win games while scoring only partial on absolute accuracy. We keep them separate on purpose.
Blind spots
A blind spot is an important point one contestant misses while ≥2 others cover it — so it is demonstrably important, not a fringe detail.
Confirmed by the panel: A blind spot is only counted when ≥2 panel judges independently agree on the same omission. One judge proposes the aspect list and a miss-matrix; the other judges fill the same pinned aspects, and a miss is confirmed only when at least two matrices agree on that cell.
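The confirmation step can be sketched as counting how many judges marked the same cell of the miss-matrix. The data layout (one set of contestant and aspect pairs per judge) and the example aspects are assumptions made for illustration; the sketch covers only the confirmation rule, not how the aspect list is proposed.

```python
from collections import Counter

MIN_AGREEING_JUDGES = 2  # from the text: at least two matrices must agree

def confirmed_misses(miss_matrices):
    """miss_matrices: one set of (contestant_index, aspect) cells per judge."""
    counts = Counter(cell for matrix in miss_matrices for cell in set(matrix))
    return {cell for cell, n in counts.items() if n >= MIN_AGREEING_JUDGES}

# Hypothetical matrices from three judges over the same pinned aspects.
judge_a = {(2, "refund policy"), (1, "delivery date")}
judge_b = {(2, "refund policy")}
judge_c = set()

print(confirmed_misses([judge_a, judge_b, judge_c]))
```

Only the omission that two judges independently marked is confirmed. The "delivery date" miss was flagged by a single judge, so it is not counted.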
Status: This detection is live and rolling out across rounds. We are not publishing a count yet — we would rather show no number than a number that is not yet backed by enough data.
Constants & thresholds
Every tally on the arena pages follows from a small set of fixed choices. We list them here so the numbers are auditable.
Honest disclosures
Things a careful reader would want spelled out — limits, known biases, and choices that shape the numbers.
Early data, volatile rankings: The arena is young. Most models have only a few games, so a single win or loss can move μ a lot and rankings are still volatile. We show game counts and uncertainty rather than implying the order is settled.
Raw-μ sorting: The board sorts on raw μ, not the conservative μ − 3σ. With high uncertainty this means a model with one lucky win can sit above a more proven one. We treat the current order as "estimated, not yet proven."
Judge agreement is not correctness: The judge-agreement figure measures how often a judge's pick matched the panel winner — but the winner is the majority of those same judges. It measures conformity to the panel, not whether the panel was right. A correct-but-contrarian judge scores low here.
Savings are best-case: Savings count only rounds the council won and was cheaper; lost councils are not subtracted. Read it as a best-case figure in the winning rounds, not a net saving.
Single-judge self-preference in the intelligence test: The recurring intelligence test runs on one primary judge (Claude Sonnet 4.5), which can also judge Claude-family models — self-preference is a known LLM bias. A secondary cross-check judge exists to calibrate this, and the arena dampens it further with a cross-family panel; the single-judge intelligence test does not have that panel.
Contestant ↔ judge family overlap: A model family can appear both as a contestant and in the judging panel of the same round. Blind-by-index review and the cross-family panel reduce the effect, but overlap can occur and we disclose it rather than claim strict family exclusion.
Two scales, one board: The live per-turn judge uses 0–10 and the end-of-round panel uses 0–100. We normalise everything to the same scale before it reaches the board, so the two numbers you may see during a round are not mixed in the ranking.
How ties are handled: A round with no clear winner counts as a draw — not a loss for everyone, which would distort win rates — and credits no savings.
Versioned, deterministic derivation: The on-screen visual derivation is pure, deterministic, and carries a version tag (v8.1-tokonomix) precisely so a later logic change never silently rewrites past rounds. Material methodology changes are logged in the changelog below.
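To make the judge-agreement caveat above concrete, here is a minimal sketch of how such a figure could be computed. The function name and the sample picks are hypothetical.

```python
def agreement_rate(judge_picks, panel_winners):
    """Share of rounds in which one judge's pick matched the panel winner."""
    matches = sum(1 for pick, winner in zip(judge_picks, panel_winners) if pick == winner)
    return matches / len(panel_winners)

# Hypothetical picks by one judge across four rounds, and the panel outcome per round.
judge_picks = ["A", "B", "A", "A"]
panel_winners = ["A", "A", "A", "B"]
print(agreement_rate(judge_picks, panel_winners))
```

Nothing in this calculation refers to a ground truth. A judge who was right in the two rounds where it disagreed with the panel would still show 50% agreement.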
Image quality control: vision-QC pilot
In June 2026 we ran the first baseline measurement of AI image quality control: which AI vision models reliably catch real photo defects and which flag too many clean photos? Six solo vision models and two council configurations were tested on 300 images (160 defect, 140 control). The council of five models reached 87.5% recall vs 66.9% for the best single model — a gap of 20.6 percentage points. False-alarm rate for the council (ungrounded) was 17.1%. All images were normalised (JPEG q90, max 1024px). Ground-truth labels were human-annotated (LOKI dataset) or synthetically generated. See the full results at /benchmarks/vision-qc.
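As a quick consistency check on the figures above, recall and false-alarm rate can be recomputed from image counts. The raw counts below (140, 107, and 24) are assumptions back-computed from the published percentages, not values taken from the published data.

```python
DEFECT_IMAGES, CONTROL_IMAGES = 160, 140  # from the text

def pct(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100 * part / whole

# Assumed counts, chosen to reproduce the published percentages.
council_recall = round(pct(140, DEFECT_IMAGES), 1)      # defects caught by the council
best_solo_recall = round(pct(107, DEFECT_IMAGES), 1)    # defects caught by the best single model
council_false_alarm = round(pct(24, CONTROL_IMAGES), 1) # clean photos flagged by the council
gap_pp = round(council_recall - best_solo_recall, 1)

print(council_recall, best_solo_recall, council_false_alarm, gap_pp)
```

Recall is measured over the 160 defect images and the false-alarm rate over the 140 control images, so the two percentages have different denominators and should not be added or compared directly.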
Health Check
Frequency: Every 6 hours (06:00, 12:00, 18:00, 00:00 UTC).
Method: A minimal echo-style prompt is sent. We track HTTP status, error message (if any), and response time.
Error tracking: error_count per run is recorded. Sustained high error rates are surfaced on the leaderboard.
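A health-check record and the resulting error rate might be modelled as in the sketch below. The field names and the rule for what counts as a healthy check (a 2xx status with no error message) are assumptions, not the confirmed implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HealthCheck:
    http_status: int
    error_message: Optional[str]
    response_ms: float

def is_healthy(check):
    """Assumed rule: a 2xx status and no error message."""
    return 200 <= check.http_status < 300 and check.error_message is None

def error_rate(checks):
    """Share of checks that failed."""
    failed = sum(1 for c in checks if not is_healthy(c))
    return failed / len(checks)

# Hypothetical results for one day of four checks.
day = [
    HealthCheck(200, None, 412.0),
    HealthCheck(200, None, 398.5),
    HealthCheck(503, "Service Unavailable", 1020.0),
    HealthCheck(200, None, 405.2),
]
print(error_rate(day))
```

With four checks a day, one failed check already shows up as a 25% daily error rate, so a sustained pattern over several days is more informative than any single run.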
Run Schedule
All times UTC. Intelligence benchmarks run every 6 hours alongside speed and health checks. Data freshness is always displayed next to each benchmark result.
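For readers consuming the data programmatically, the next scheduled run can be derived from the 6-hour UTC schedule. This is a small sketch; the function name is hypothetical and it assumes runs start exactly on the hour.

```python
from datetime import datetime, timedelta, timezone

RUN_HOURS_UTC = (0, 6, 12, 18)  # from the published schedule

def next_run(now):
    """Return the first scheduled run strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    base = now.replace(minute=0, second=0, microsecond=0)
    for hour in RUN_HOURS_UTC:
        candidate = base.replace(hour=hour)
        if candidate > now:
            return candidate
    # Past 18:00 UTC: the next run is midnight on the following day.
    return (base + timedelta(days=1)).replace(hour=0)

example = next_run(datetime(2026, 6, 1, 13, 30, tzinfo=timezone.utc))
print(example.isoformat())
```

A result older than 6 hours plus the run duration suggests a missed cycle, which is worth checking against the freshness indicator shown next to each benchmark.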
FAQ
Are you affiliated with any AI provider?
Why EU latency only?
How do you handle API cost?
Can I download the raw data?
Is the judge-LLM fair to all models?
Methodology owner
This methodology is maintained and signed by Mes Kalkan. Material changes are logged below. Data corrections route through the methodology owner and are published within 24 hours of a verified report.
Methodology changelog
- Initial methodology published. Signed by Mes Kalkan.
Data API
All benchmark data is available for free. No key required for read-only access.