Benchmarks

Methodology

How Tokonomix measures AI model performance. No vendor influence. No sponsored results. Transparent methodology, open data.

Mes Kalkan, Founder, Tokonomix · Published April 30, 2026 · Last reviewed May 28, 2026

⚡

Speed

How fast does the model respond? We measure time to last token on a fixed prompt that targets a set output length.

🧠

Intelligence

How accurate and capable is the model? A judge LLM rates answers across 6 categories on a 0–100 scale.

💚

Health

Is the API available? We check every 6 hours and track error rates and availability windows.

⚡

Speed Benchmark

Prompt: A fixed instruction targeting approximately 500 tokens of output. The same prompt is used for every model in every run cycle.

Runs: 3 sequential calls per test cycle. We measure end-to-end latency (first byte to last byte), not time to first token (TTFT).

Metrics: P50 (median) and P95 (tail) across the 3 runs. P50 is the headline number; P95 reveals consistency.
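To make the P50 and P95 figures concrete, here is a minimal sketch of a linearly interpolated percentile over the three runs of one cycle. The interpolation rule and the sample latencies are assumptions for illustration; the methodology does not state which percentile definition is used. With only three samples, P95 lands close to the slowest run, which is why it reads as a consistency signal.

```python
def percentile(samples, p):
    """Linearly interpolated percentile of samples, with p in 0..100.

    Assumption: linear interpolation between the two nearest ranks.
    """
    if not samples:
        raise ValueError("need at least one sample")
    xs = sorted(samples)
    rank = (len(xs) - 1) * p / 100      # fractional index into the sorted list
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)


# Hypothetical end-to-end latencies (ms) for one 3-run cycle.
latencies_ms = [412.0, 455.0, 630.0]
p50 = percentile(latencies_ms, 50)   # the middle run
p95 = percentile(latencies_ms, 95)   # close to the slowest run
print(f"P50={p50:.1f} ms, P95={p95:.1f} ms")
```

In this example the P95 sits 90% of the way from the middle run to the slowest one, so a single slow call dominates it.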

Measurement location: EU — Amsterdam (AMS). All results are EU-latency. US or Asia results would differ.

Speed tiers:

Tier	Latency	Typical use
Speed S	< 200 ms	Near-real-time
Speed A	< 500 ms	Interactive
Speed B	< 1000 ms	Acceptable
Speed C	> 1000 ms	Batch suitable
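The tier bounds can be read as a simple threshold ladder. The sketch below assumes the tier is derived from the P50 latency and that a value of exactly 1000 ms falls into Speed C; neither detail is stated in the tier list, and the function name is hypothetical.

```python
def speed_tier(p50_ms: float) -> str:
    """Map a latency in milliseconds to a speed tier.

    Assumptions: the input is the P50 latency, and exactly 1000 ms
    is treated as tier C.
    """
    if p50_ms < 0:
        raise ValueError("latency cannot be negative")
    if p50_ms < 200:
        return "S"   # near-real-time
    if p50_ms < 500:
        return "A"   # interactive
    if p50_ms < 1000:
        return "B"   # acceptable
    return "C"       # batch suitable


for ms in (150, 200, 999.9, 1000):
    print(ms, speed_tier(ms))
```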

🧠

Intelligence Benchmark

Status: Live since May 2026. 16,357 scored runs across 6 categories and 4 providers. New runs every 6 hours alongside the speed and health checks.

Judge model: Claude Sonnet 4.5 acts as an impartial judge. The evaluated model's name is never included in the judge prompt — only the raw response text is scored (blind review).

Scoring: Each prompt receives a single 0–100 quality score from the judge, plus a classification (correct / partial / incorrect). The judge evaluates factual accuracy, completeness, reasoning quality, and format adherence as a combined rubric. Category averages are shown on model pages.

Six prompt categories:

Category	What it covers
Reasoning	Multi-step logical deduction and math
Coding	Code generation, debugging, review
Factual	Accuracy of factual claims
Multilingual	Translation and cross-language accuracy
Creative	Open-ended creative output
Healthcare (Zorg)	Dutch healthcare domain knowledge

Overall quality score: Unweighted average of all scored runs for a model across all categories.
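The difference between category averages and the overall score is easiest to see with numbers. The sketch below uses made-up runs and hypothetical function names. It shows that an unweighted average over all runs gives more weight to categories with more runs, so it can differ from the mean of the category averages.

```python
from collections import defaultdict


def category_averages(runs):
    """runs: list of (category, score) pairs with scores on a 0-100 scale."""
    buckets = defaultdict(list)
    for category, score in runs:
        buckets[category].append(score)
    return {c: sum(v) / len(v) for c, v in buckets.items()}


def overall_score(runs):
    """Unweighted average over every scored run, regardless of category."""
    if not runs:
        raise ValueError("no scored runs")
    return sum(score for _, score in runs) / len(runs)


# Hypothetical data: three Coding runs, one Factual run.
runs = [("Coding", 90), ("Coding", 80), ("Coding", 70), ("Factual", 40)]
print(category_averages(runs))   # per-category view
print(overall_score(runs))       # run-level view
```

Here the category averages are 80 and 40, whose mean would be 60, while the run-level overall score is 70 because Coding contributes three of the four runs.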

🏁

What counts vs. what you watch

The arena shows a live race with health bars and strikes — but the screen and the ranking are two different layers. The visual is there to watch; the ranking is decided by an independent judge panel. This table makes the distinction explicit, so nothing on screen is mistaken for a result.

On screen	Source	Counts toward ranking?
Health bars / lead / damage / strikes	Deterministic visual derivation (v8.1-tokonomix)	No — cosmetic
Live race leader during a round	Single fast per-turn judge (gpt-4o-mini, 0–10)	No — indicative
Round winner	Cross-family panel majority vote (0–100)	Yes
Leaderboard position	TrueSkill skill estimate (μ)	Yes
Jury upvotes (▲)	Panel vote when a judge scores a model ≥60	Shown, not ranking
Judge agreement %	How often a judge's pick matched the panel winner	Panel agreement — not a correctness measure
Savings (€)	Rounds where a cheaper council beat a costlier model	Best-case — wins only
Blind spots caught	Omissions confirmed by ≥2 panel judges	Confirmed only — rolling out

⚔️

A fourth method: the arena

Static benchmarks measure a model against a fixed bar. The arena measures models against each other, on realistic customer-service scenarios, judged by a panel of rival models. It produces something a single score cannot: a relative ranking with an uncertainty margin.

Why this complements the static benchmarks (it does not replace them):

Static tests give absolute quality per category; the arena gives head-to-head strength and a cost-versus-quality trade-off on realistic tasks.
The arena captures things a 0–100 score misses: consistency across multiple turns, how a model handles follow-ups, and — with councils — whether collaboration actually pays off.
The on-screen race is a way to watch the contest unfold. The result is always set by the panel, never by the health bars.

⚖️

How a round scores: from per-turn to panel

Scoring happens in two stages. During the match a single fast referee keeps a running tally; at the end an independent panel of judges votes on the winner.

Stage 1 — live, per turn: One fast, deliberately cheap judge (gpt-4o-mini) scores every answer on a 0–10 scale in a single call. This feeds only the live race lanes — it is indicative, not decisive.

Stage 2 — end of round, the panel: A panel of 3–5 judges from different model families independently votes on the winner on a 0–100 scale. The majority wins; ties break on highest average panel score, then deterministically on lowest model id.
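The Stage 2 rule can be sketched as a vote count followed by the two tie-breaks named above. This is an illustration under assumptions, not the arena's code: each judge is assumed to vote for the contestant it scored highest, "majority" is treated as "most votes", and the data layout and function name are hypothetical.

```python
from collections import Counter


def round_winner(panel_scores):
    """panel_scores: one dict per judge, mapping model id -> score (0-100).

    Returns the winning model id.
    """
    votes = Counter()
    for scores in panel_scores:
        # Assumption: a judge votes for its top-scored contestant
        # (lowest id if the judge scored two contestants equally).
        pick = min(scores, key=lambda m: (-scores[m], m))
        votes[pick] += 1

    contestants = sorted({m for scores in panel_scores for m in scores})

    def average(model_id):
        vals = [s[model_id] for s in panel_scores if model_id in s]
        return sum(vals) / len(vals)

    # Most votes first, then highest average panel score, then lowest model id.
    return min(contestants, key=lambda m: (-votes[m], -average(m), m))


panel = [{1: 80, 2: 70}, {1: 60, 2: 90}, {1: 75, 2: 65}, {1: 50, 2: 70}]
print(round_winner(panel))  # votes split 2-2, so the average score decides
```

Contestants are keyed by index only, which mirrors the blind-by-index setup: nothing in the input identifies a brand.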

Blind by index: Model names are stripped from the panel prompt — contestants are referred to by number/index only, so the panel cannot favour a familiar brand.

Fixed thresholds: A model earns an upvote (▲) when a judge scores it ≥60. A turn is marked "decisive" when the winner's margin reaches ≥30% of the score scale. These fixed values define the tallies you see.
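Both thresholds reduce to one-line checks. The sketch below assumes the margin is measured against the runner-up and that the score scale starts at 0, so "30% of the scale" is 3 points on 0-10 and 30 points on 0-100. The names are hypothetical.

```python
UPVOTE_MIN_SCORE = 60          # on the 0-100 panel scale
DECISIVE_MARGIN_PERCENT = 30   # percent of the score scale


def upvotes(panel_scores_for_model):
    """Count judges whose score for this model reaches the upvote threshold."""
    return sum(1 for s in panel_scores_for_model if s >= UPVOTE_MIN_SCORE)


def is_decisive(winner_score, runner_up_score, scale_max):
    """True when the winner's margin is at least 30% of the scale.

    Integer arithmetic avoids floating-point edge cases at the boundary.
    """
    margin = winner_score - runner_up_score
    return margin * 100 >= DECISIVE_MARGIN_PERCENT * scale_max


print(upvotes([59, 60, 85]))      # two of three judges upvote
print(is_decisive(8, 5, 10))      # 3-point margin on a 0-10 scale
print(is_decisive(90, 61, 100))   # 29-point margin on a 0-100 scale
```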

📈

TrueSkill: what μ and σ mean

Each model has an estimated skill level μ (mu) and an uncertainty σ (sigma). A new model starts at μ=25, σ=8.333 — high uncertainty. Every match moves μ toward the model's true strength and shrinks σ. Two models with the same μ but different σ are not equal: the one with low σ is proven, the other is still a guess.

The constants we actually use: Starting rating μ=25, σ=8.333; skill-variance BETA=4.167; per-match drift TAU=0.0833. These are fixed in code and identical for every model.
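To show how these constants interact, here is a minimal sketch of the standard two-player TrueSkill update for a win or loss. It assumes a draw margin of zero and does not cover draws, and it is not taken from the arena's code; only the four constants come from the text above.

```python
import math

MU0, SIGMA0 = 25.0, 8.333      # starting rating
BETA, TAU = 4.167, 0.0833      # skill variance and per-match drift


def _pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def rate_1vs1(winner, loser):
    """winner, loser: (mu, sigma) tuples. Returns the two updated tuples.

    Assumption: no draw margin, so only win/loss outcomes are handled.
    """
    (mu_w, s_w), (mu_l, s_l) = winner, loser
    var_w = s_w ** 2 + TAU ** 2          # drift widens uncertainty before the match
    var_l = s_l ** 2 + TAU ** 2
    c = math.sqrt(2 * BETA ** 2 + var_w + var_l)
    t = (mu_w - mu_l) / c
    v = _pdf(t) / _cdf(t)                # size of the mean shift
    w = v * (v + t)                      # size of the variance shrink
    new_w = (mu_w + var_w / c * v, math.sqrt(var_w * (1 - var_w / c ** 2 * w)))
    new_l = (mu_l - var_l / c * v, math.sqrt(var_l * (1 - var_l / c ** 2 * w)))
    return new_w, new_l


new_winner, new_loser = rate_1vs1((MU0, SIGMA0), (MU0, SIGMA0))
print(new_winner, new_loser)
```

For two brand-new models, one match moves the winner to roughly μ≈29.2 and the loser to roughly μ≈20.8, while both σ values drop to about 7.19. That large first step is why early rankings are volatile.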

How we currently sort — disclosed honestly: The leaderboard sorts on raw μ (estimated strength). A stricter "proven" ranking would sort on the conservative μ − 3σ. Because this is early data — most models have only a few games — σ is still large, so the top of the board can still shift. We show the estimate and tell you it is an estimate rather than hiding behind a single number.
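The two sort orders can disagree, as this small sketch shows. The model ids and ratings are invented for illustration.

```python
def conservative(model):
    """Conservative skill estimate: mu minus three sigma."""
    return model["mu"] - 3 * model["sigma"]


def rank_by_raw_mu(models):
    return sorted(models, key=lambda m: m["mu"], reverse=True)


def rank_by_conservative(models):
    return sorted(models, key=conservative, reverse=True)


# Hypothetical ratings: model-a has few games, model-b has many.
models = [
    {"id": "model-a", "mu": 30.0, "sigma": 7.0},
    {"id": "model-b", "mu": 28.0, "sigma": 2.0},
]
print([m["id"] for m in rank_by_raw_mu(models)])        # model-a leads on raw mu
print([m["id"] for m in rank_by_conservative(models)])  # model-b leads on mu - 3*sigma
```

On raw μ the less-tested model leads; on μ − 3σ its estimate drops to 9 while the well-tested model keeps 22, so the order flips.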

🤝

Council vs. frontier: does collaboration pay?

A round can pit a cheap council of smaller models against a single costly frontier model. In a council, each turn's answer is the consensus synthesis of its members. This lets the arena answer a question a single score cannot: can a cheap council beat an expensive frontier model — and if so, by how much?

How savings are derived: When a council both wins a round and costs less than the frontier model it beat, we show the difference as savings. A council win is keyed to the group, never to an individual member's board, so a group result never inflates a single model's ranking.

Best-case caveat: Savings accumulate only from rounds the council won. Councils that lost (and so spent money for nothing) are not subtracted. The figure is therefore a best-case saving in the rounds where the council won — not a net result.
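The savings rule and its best-case nature can be sketched in a few lines. The record layout, field names, and costs below are hypothetical; only the rule itself (count the difference when the council wins and is cheaper, ignore everything else) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Round:
    outcome: str              # "council", "frontier", or "draw"
    council_cost_eur: float
    frontier_cost_eur: float


def best_case_savings(rounds):
    """Sum the cost difference over rounds the council won while being cheaper.

    Lost rounds and draws add nothing and subtract nothing.
    """
    total = 0.0
    for r in rounds:
        if r.outcome == "council" and r.council_cost_eur < r.frontier_cost_eur:
            total += r.frontier_cost_eur - r.council_cost_eur
    return total


rounds = [
    Round("council", 0.02, 0.10),   # counted: won and cheaper
    Round("frontier", 0.03, 0.10),  # ignored: council lost, spend not subtracted
    Round("draw", 0.02, 0.10),      # ignored: a draw credits no savings
    Round("council", 0.12, 0.10),   # ignored: won but was not cheaper
]
print(f"{best_case_savings(rounds):.2f} EUR")
```

Only the first round contributes. The money the council spent in the lost round never appears in the total, which is exactly why the figure is best-case and not net.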

🪪

Two independent reputations

A model is measured two separate ways, and the two can disagree without either being wrong — they measure different things.

Arena reputation (relative): TrueSkill from head-to-head game wins. It ranks a model against its rivals on realistic scenarios.

Neutral-judge reputation (absolute): How often a model is rated correct / partial / wrong in the recurring intelligence test, against a fixed rubric rather than against an opponent.

A model can lose games yet hold a high correctness reputation, or win games while scoring only partial on absolute accuracy. We keep them separate on purpose.

🔍

Blind spots

A blind spot is an important point one contestant misses while ≥2 others cover it — so it is demonstrably important, not a fringe detail.

Confirmed by the panel: A blind spot is only counted when ≥2 panel judges independently agree on the same omission. One judge proposes the aspect list and a miss-matrix; the other judges fill the same pinned aspects, and a miss is confirmed only when at least two matrices agree on that cell.

Status: This detection is live and rolling out across rounds. We are not publishing a count yet — we would rather show no number than a number that is not yet backed by enough data.
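The confirmation rule described above can be sketched as a count over miss-matrices. This is an illustration under assumptions: each judge's matrix is modelled as a dict of (contestant, aspect) cells marked True when missed, a contestant counts as covering an aspect when its miss is not confirmed, and all names are hypothetical.

```python
def confirmed_blind_spots(matrices, contestants, aspects,
                          min_judges=2, min_covering=2):
    """Return (contestant, aspect) pairs that qualify as confirmed blind spots.

    matrices: one dict per judge, mapping (contestant, aspect) -> True if missed.
    """
    if len(contestants) < 3:
        return []  # below 3 contestants, ">=2 others covered it" cannot hold

    def confirmed_miss(contestant, aspect):
        agreeing = sum(1 for m in matrices if m.get((contestant, aspect), False))
        return agreeing >= min_judges

    result = []
    for aspect in aspects:
        missed = [c for c in contestants if confirmed_miss(c, aspect)]
        covered = [c for c in contestants if c not in missed]
        if len(covered) >= min_covering:
            result.extend((c, aspect) for c in missed)
    return result


contestants = [1, 2, 3]
aspects = ["refund policy", "delivery time"]
judge_a = {(1, "refund policy"): True}
judge_b = {(1, "refund policy"): True, (2, "delivery time"): True}
judge_c = {}
print(confirmed_blind_spots([judge_a, judge_b, judge_c], contestants, aspects))
```

Contestant 1's omission is confirmed because two matrices agree on that cell. Contestant 2's is flagged by one judge only, so it is not counted.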

Constants & thresholds

Every tally on the arena pages follows from a small set of fixed choices. We list them here so the numbers are auditable.

Upvote (▲): A judge score of ≥60 on the 0–100 panel scale.

Decisive turn: A winning margin of ≥30% of the score scale.

Minimum contestants for blind spots: At least 3 contestants; below that, "≥2 others covered it" cannot be meaningful.

TrueSkill parameters: BETA=4.167, TAU=0.0833; starting rating μ=25, σ=8.333.

Ties: An exact tie counts as a draw (no loss for anyone) and credits no savings.

Honest disclosures

Things a careful reader would want spelled out — limits, known biases, and choices that shape the numbers.

Early data, volatile rankings: The arena is young. Most models have only a few games, so a single win or loss can move μ a lot and rankings are still volatile. We show game counts and uncertainty rather than implying the order is settled.

Raw-μ sorting: The board sorts on raw μ, not the conservative μ − 3σ. With high uncertainty this means a model with one lucky win can sit above a more proven one. We treat the current order as "estimated, not yet proven."

Judge agreement is not correctness: The judge-agreement figure measures how often a judge's pick matched the panel winner — but the winner is the majority of those same judges. It measures conformity to the panel, not whether the panel was right. A correct-but-contrarian judge scores low here.

Savings are best-case: Savings count only rounds the council won and was cheaper; lost councils are not subtracted. Read it as a best-case figure in the winning rounds, not a net saving.

Single-judge self-preference in the intelligence test: The recurring intelligence test runs on one primary judge (Claude Sonnet 4.5), which can also judge Claude-family models — self-preference is a known LLM bias. A secondary cross-check judge exists to calibrate this, and the arena dampens it further with a cross-family panel; the single-judge intelligence test does not have that panel.

Contestant ↔ judge family overlap: A model family can appear both as a contestant and in the judging panel of the same round. Blind-by-index review and the cross-family panel reduce the effect, but overlap can occur and we disclose it rather than claim strict family exclusion.

Two scales, one board: The live per-turn judge uses 0–10 and the end-of-round panel uses 0–100. We normalise everything to the same scale before it reaches the board, so the two numbers you may see during a round are not mixed in the ranking.

How ties are handled: A round with no clear winner counts as a draw — not a loss for everyone, which would distort win rates — and credits no savings.

Versioned, deterministic derivation: The on-screen visual derivation is pure, deterministic, and carries a version tag (v8.1-tokonomix) precisely so a later logic change never silently rewrites past rounds. Material methodology changes are logged in the changelog below.

Image quality control: vision-QC pilot

In June 2026 we ran the first baseline measurement of AI image quality control: which AI vision models reliably catch real photo defects and which flag too many clean photos? Six solo vision models and two council configurations were tested on 300 images (160 defect, 140 control). The council of five models reached 87.5% recall vs 66.9% for the best single model — a gap of 20.6 percentage points. False-alarm rate for the council (ungrounded) was 17.1%. All images were normalised (JPEG q90, max 1024px). Ground-truth labels were human-annotated (LOKI dataset) or synthetically generated. See the full results at /benchmarks/vision-qc.
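For readers who want to check the arithmetic, the sketch below shows how recall and false-alarm rate relate to the image counts. The raw hit counts are not published here: the values used (140, 107, and 24) are back-calculated from the stated percentages and the 160/140 split, so treat them as assumptions and not as data from the pilot.

```python
DEFECT_IMAGES = 160
CONTROL_IMAGES = 140


def recall(defects_caught, defect_total=DEFECT_IMAGES):
    """Share of defect images that were flagged."""
    return defects_caught / defect_total


def false_alarm_rate(clean_flagged, control_total=CONTROL_IMAGES):
    """Share of clean control images that were flagged anyway."""
    return clean_flagged / control_total


# Assumed counts, inferred from the published percentages (not source data).
council_recall = recall(140)          # council of five models
best_single_recall = recall(107)      # best single model
council_false_alarms = false_alarm_rate(24)

print(f"council recall      {council_recall:.1%}")
print(f"best single recall  {best_single_recall:.1%}")
print(f"gap (points)        {(council_recall - best_single_recall) * 100:.1f}")
print(f"council false alarm {council_false_alarms:.1%}")
```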

💚

Health Check

Frequency: Every 6 hours (06:00, 12:00, 18:00, 00:00 UTC).

Method: A minimal echo-style prompt is sent. We track HTTP status, error message (if any), and response time.

Error tracking: error_count per run is recorded. Sustained high error rates are surfaced on the leaderboard.

Run Schedule

Time	Checks
06:00 UTC	Speed + Health
12:00 UTC	Speed + Health
18:00 UTC	Speed + Health
00:00 UTC	Speed + Health

All times UTC. Intelligence benchmarks run every 6 hours alongside speed and health checks. Data freshness is always displayed next to each benchmark result.
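If you want to estimate when fresh data will appear, the schedule above can be turned into a small helper. The function name is hypothetical, and the sketch assumes runs start exactly on the listed hours; actual start times may lag.

```python
from datetime import datetime, timedelta, timezone

RUN_HOURS_UTC = (0, 6, 12, 18)


def next_run(now):
    """Return the next scheduled run strictly after now.

    now must be a timezone-aware datetime; it is converted to UTC first.
    """
    now = now.astimezone(timezone.utc)
    for hour in RUN_HOURS_UTC:
        candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if candidate > now:
            return candidate
    # Past 18:00 UTC: the next run is midnight on the following day.
    tomorrow = now + timedelta(days=1)
    return tomorrow.replace(hour=0, minute=0, second=0, microsecond=0)


print(next_run(datetime(2026, 5, 28, 13, 30, tzinfo=timezone.utc)))
```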

FAQ

Are you affiliated with any AI provider?

No. Tokonomix is operated by InterIP Networks, an independent infrastructure company. We have no commercial relationships with any AI provider and receive no sponsored placements.

Why EU latency only?

We operate from Amsterdam and measure real-world latency for EU users. Many providers have multiple regions — latency from US or Asia would differ significantly. We will add region-switching in a future update.

How do you handle API cost?

We run a fixed prompt budget per cycle. Flagship models (GPT-5, Claude Opus) are tested less frequently due to cost. The run frequency is visible next to each model.

Can I download the raw data?

Yes — see the Dataset page for JSON export and schema documentation. The full dataset is available at /api/md/{lang}/dataset.

Is the judge-LLM fair to all models?

We use Claude Sonnet 4.5 as judge, with model names stripped from the evaluation prompt. Judge bias, including self-preference toward Claude-family models, is a known concern; we plan to add human baselines (Q3 2026) to calibrate the judge.

Methodology owner

This methodology is maintained and signed by Mes Kalkan. Material changes are logged below. Data corrections route through the methodology owner and are published within 24 hours of a verified report.

Methodology changelog

2026-04-30 — Initial methodology published. Signed by Mes Kalkan.

Data API

All benchmark data is available for free. No key required for read-only access.

GET /api/md/en/dataset	Full dataset as JSON
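A minimal sketch of reading the dataset endpoint with the Python standard library follows. The host is not given on this page, so `base_url` is a placeholder you must supply, and the function names are hypothetical. The response schema is not assumed; the JSON is returned as parsed.

```python
import json
from urllib.request import urlopen


def dataset_url(base_url: str, lang: str = "en") -> str:
    """Build the dataset URL from the documented path /api/md/{lang}/dataset."""
    return f"{base_url.rstrip('/')}/api/md/{lang}/dataset"


def fetch_dataset(base_url: str, lang: str = "en", timeout: float = 10.0):
    """Fetch and parse the full dataset. No API key is sent (read-only access)."""
    with urlopen(dataset_url(base_url, lang), timeout=timeout) as response:
        return json.load(response)


# Placeholder host for illustration only; replace with the real site origin.
print(dataset_url("https://example.org", "en"))
```

Calling `fetch_dataset` with the real origin returns whatever JSON structure the endpoint serves; consult the Dataset page for the schema before relying on specific fields.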