Methodology

How Tokonomix measures AI model performance. No vendor influence. No sponsored results. Transparent methodology, open data.

Speed

How fast does the model respond? We measure time-to-last-token for a prompt that targets a fixed-length output.

🧠 Intelligence

How accurate and capable is the model? A judge LLM rates answers across 6 categories on a 0–100 scale.

💚 Health

Is the API available? We check every 6 hours and track error rates and availability windows.

Speed Benchmark

Prompt: A fixed instruction targeting approximately 500 tokens of output. The same prompt is used for every model in every run cycle.

Runs: 3 sequential calls per test cycle. We measure end-to-end latency (first byte to last byte), not time-to-first-token (TTFT).

Metrics: P50 (median) and P95 (tail) across the 3 runs. P50 is the headline number; P95 reveals consistency.

Measurement location: Amsterdam, EU (AMS). All results reflect EU latency; results measured from the US or Asia would differ.
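To make the procedure concrete, here is a minimal sketch of such a run in Python. The endpoint URL, payload shape, and model name are placeholders, not the actual Tokonomix harness; timing covers the full request until the last byte of the body is received.

```python
import time
import statistics
import requests  # third-party HTTP client: pip install requests

API_URL = "https://api.example.com/v1/completions"       # placeholder endpoint
PAYLOAD = {"model": "example-model", "prompt": "..."}    # fixed prompt, ~500-token output

def timed_call() -> float:
    """One end-to-end call in milliseconds. The response body is fully
    downloaded before post() returns, so the last token is included."""
    start = time.perf_counter()
    resp = requests.post(API_URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000

latencies = sorted(timed_call() for _ in range(3))  # 3 sequential runs per cycle
p50 = statistics.median(latencies)  # headline metric
p95 = latencies[-1]                 # with only 3 samples, P95 is effectively the slowest run
print(f"P50 {p50:.0f} ms / P95 {p95:.0f} ms")
```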

Speed tiers:

Speed S: < 200 ms (near-real-time)
Speed A: < 500 ms (interactive)
Speed B: < 1000 ms (acceptable)
Speed C: ≥ 1000 ms (batch suitable)
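Assigning a tier from a measured P50 is then a simple threshold check. The sketch below uses the cutoffs from the list above; the function name is ours, for illustration only.

```python
def speed_tier(p50_ms: float) -> str:
    """Map a P50 latency in milliseconds to a Tokonomix speed tier."""
    if p50_ms < 200:
        return "S"   # near-real-time
    if p50_ms < 500:
        return "A"   # interactive
    if p50_ms < 1000:
        return "B"   # acceptable
    return "C"       # batch suitable

assert speed_tier(140) == "S"
assert speed_tier(850) == "B"
```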
🧠 Intelligence Benchmark

Judge model: Claude Sonnet 4.6 acts as an impartial judge. Model names are never included in the judge prompt — only the response text is evaluated (blind review).
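As an illustration of blind review, a judge prompt could be assembled like this. The wording and helper name are hypothetical, not the actual judge prompt; the point is that only the response text crosses over, never the model name.

```python
def build_judge_prompt(category: str, question: str, answer: str) -> str:
    """Blind review: the judge sees the response text, never the model name."""
    return (
        f"Rate the following answer for {category} on a 0-100 scale.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )
```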

Six scoring categories (0–100 each):

Reasoning: multi-step logical deduction
Coding: code generation & debugging
Creativity: open-ended creative output
Factual: accuracy of factual claims
Instruction-following: constraint adherence
Safety: refusal of harmful prompts

Overall quality score: Weighted average of the six categories. Weights: Reasoning 25%, Coding 25%, Factual 20%, Instruction-following 15%, Creativity 10%, Safety 5%.
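Under those weights, the overall score works out as in this minimal sketch:

```python
WEIGHTS = {
    "reasoning": 0.25,
    "coding": 0.25,
    "factual": 0.20,
    "instruction_following": 0.15,
    "creativity": 0.10,
    "safety": 0.05,
}

def overall_quality(scores: dict[str, float]) -> float:
    """Weighted average of the six category scores (each 0-100)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)

# Example: 80 in every category except Safety at 100 gives 81.0 overall.
example = {"reasoning": 80, "coding": 80, "factual": 80,
           "instruction_following": 80, "creativity": 80, "safety": 100}
print(overall_quality(example))  # 81.0
```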

Intelligence benchmarks are currently in development — expected Q3 2026.

💚 Health Check

Frequency: Every 6 hours (06:00, 12:00, 18:00, 00:00 UTC).

Method: A minimal echo-style prompt is sent. We track HTTP status, error message (if any), and response time.

Error tracking: an error_count is recorded for every run. Sustained high error rates are surfaced on the leaderboard.
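A minimal version of such a probe could look like the following; the endpoint and payload are placeholders, not the actual monitoring code.

```python
import time
import requests  # pip install requests

def health_probe(api_url: str, payload: dict) -> dict:
    """Send a minimal echo-style prompt; record status, error, and latency."""
    start = time.perf_counter()
    try:
        resp = requests.post(api_url, json=payload, timeout=30)
        return {
            "http_status": resp.status_code,
            "error": None if resp.ok else resp.text[:200],
            "response_ms": (time.perf_counter() - start) * 1000,
        }
    except requests.RequestException as exc:  # timeouts, DNS failures, resets
        return {
            "http_status": None,
            "error": str(exc),
            "response_ms": (time.perf_counter() - start) * 1000,
        }
```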

Run Schedule

06:00 UTC: Speed + Health
12:00 UTC: Speed + Health
18:00 UTC: Speed + Health
00:00 UTC: Speed + Health

All times UTC. Intelligence benchmarks run once per day (06:00 UTC) when active. Data freshness is always displayed next to each benchmark result.
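For illustration, the schedule (including the planned daily intelligence run) can be expressed as a small dispatch function; the names are hypothetical, not the actual scheduler.

```python
def jobs_for(hour_utc: int) -> list[str]:
    """Which benchmarks run at a given UTC hour, per the schedule above."""
    jobs = []
    if hour_utc in (0, 6, 12, 18):
        jobs += ["speed", "health"]
    if hour_utc == 6:
        jobs.append("intelligence")  # once per day, when active
    return jobs

assert jobs_for(6) == ["speed", "health", "intelligence"]
assert jobs_for(9) == []
```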

FAQ

Are you affiliated with any AI provider?
No. Tokonomix is operated by InterIP Networks, an independent infrastructure company. We have no commercial relationships with any AI provider and receive no sponsored placements.

Why EU latency only?
We operate from Amsterdam and measure real-world latency for EU users. Many providers have multiple regions; latency from the US or Asia would differ significantly. We will add region switching in a future update.

How do you handle API cost?
We run a fixed prompt budget per cycle. Flagship models (GPT-5, Claude Opus) are tested less frequently due to cost. The run frequency is visible next to each model.

Can I download the raw data?
Yes. See the Dataset page for JSON export and schema documentation. The ?format=md query param on any page returns Markdown for LLM crawlers.

Is the judge LLM fair to all models?
We use Claude Sonnet 4.6 as the judge, with model names stripped from the evaluation prompt. Cross-family bias is a known concern; we plan to add human baselines (Q3 2026) to calibrate the judge.

Data API

All benchmark data is available for free. No key required for read-only access.

GET /api/md/en/dataset: full dataset as JSON
GET /<page>?format=md: any page as Markdown (for LLM crawlers)
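For example, both endpoints can be fetched from Python as below. The host is a placeholder and the /methodology path is an assumption for illustration; only the two paths above come from this page.

```python
import requests  # pip install requests

BASE = "https://tokonomix.example"  # placeholder host; substitute the real site

# Full dataset as JSON (no API key required for read-only access)
dataset = requests.get(f"{BASE}/api/md/en/dataset", timeout=30).json()

# Any page rendered as Markdown, e.g. a methodology page (path assumed)
md = requests.get(f"{BASE}/methodology", params={"format": "md"}, timeout=30).text
```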