Methodology

How Tokonomix measures AI model performance. No vendor influence. No sponsored results. Transparent methodology, open data.

Speed

How fast does the model respond? We measure time-to-last-token for a prompt that targets a fixed-length output.

🧠 Intelligence

How accurate and capable is the model? A judge LLM rates answers across 6 categories on a 0–100 scale.

💚 Health

Is the API available? We check every 6 hours and track error rates and availability windows.

Speed Benchmark

Prompt: A fixed instruction targeting approximately 500 tokens of output. The same prompt is used for every model in every run cycle.

Runs: 3 sequential calls per test cycle. We measure end-to-end latency (first byte to last byte), not time-to-first-token (TTFT).

Metrics: P50 (median) and P95 (tail) across the 3 runs. P50 is the headline number; P95 reveals consistency.

Measurement location: Amsterdam, EU (AMS). All results reflect EU latency; results measured from the US or Asia would differ.
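To make the procedure concrete, here is a minimal sketch of such a run in Python. The endpoint URL, payload shape, and model name are placeholders, not the actual Tokonomix harness; timing covers the full request until the last byte of the body is received.

```python
import time
import statistics
import requests  # third-party HTTP client: pip install requests

API_URL = "https://api.example.com/v1/completions"       # placeholder endpoint
PAYLOAD = {"model": "example-model", "prompt": "..."}    # fixed prompt, ~500-token output

def timed_call() -> float:
    """One end-to-end call in milliseconds. The response body is fully
    downloaded before post() returns, so the last token is included."""
    start = time.perf_counter()
    resp = requests.post(API_URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000

latencies = sorted(timed_call() for _ in range(3))  # 3 sequential runs per cycle
p50 = statistics.median(latencies)  # headline metric
p95 = latencies[-1]                 # with only 3 samples, P95 is effectively the slowest run
print(f"P50 {p50:.0f} ms / P95 {p95:.0f} ms")
```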

Speed tiers:

Speed S: < 200 ms (near-real-time)
Speed A: < 500 ms (interactive)
Speed B: < 1000 ms (acceptable)
Speed C: ≥ 1000 ms (batch suitable)
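Assigning a tier from a measured P50 is then a simple threshold check. The sketch below uses the cutoffs from the list above; the function name is ours, for illustration only.

```python
def speed_tier(p50_ms: float) -> str:
    """Map a P50 latency in milliseconds to a Tokonomix speed tier."""
    if p50_ms < 200:
        return "S"   # near-real-time
    if p50_ms < 500:
        return "A"   # interactive
    if p50_ms < 1000:
        return "B"   # acceptable
    return "C"       # batch suitable

assert speed_tier(140) == "S"
assert speed_tier(850) == "B"
```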
🧠 Intelligence Benchmark

Judge model: Claude Sonnet 4.6 acts as an impartial judge. Model names are never included in the judge prompt — only the response text is evaluated (blind review).
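As an illustration of blind review, a judge prompt could be assembled like this. The wording and helper name are hypothetical, not the actual judge prompt; the point is that only the response text crosses over, never the model name.

```python
def build_judge_prompt(category: str, question: str, answer: str) -> str:
    """Blind review: the judge sees the response text, never the model name."""
    return (
        f"Rate the following answer for {category} on a 0-100 scale.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer."
    )
```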

Six scoring categories (0–100 each):

Reasoning: multi-step logical deduction
Coding: code generation & debugging
Creativity: open-ended creative output
Factual: accuracy of factual claims
Instruction-following: constraint adherence
Safety: refusal of harmful prompts

Overall quality score: Weighted average of the six categories. Weights: Reasoning 25%, Coding 25%, Factual 20%, Instruction-following 15%, Creativity 10%, Safety 5%.
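Under those weights, the overall score works out as in this minimal sketch:

```python
WEIGHTS = {
    "reasoning": 0.25,
    "coding": 0.25,
    "factual": 0.20,
    "instruction_following": 0.15,
    "creativity": 0.10,
    "safety": 0.05,
}

def overall_quality(scores: dict[str, float]) -> float:
    """Weighted average of the six category scores (each 0-100)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)

# Example: 80 in every category except Safety at 100 gives 81.0 overall.
example = {"reasoning": 80, "coding": 80, "factual": 80,
           "instruction_following": 80, "creativity": 80, "safety": 100}
print(overall_quality(example))  # 81.0
```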

Intelligence benchmarks are currently in development — expected Q3 2026.

💚 Health Check

Frequency: Every 6 hours (06:00, 12:00, 18:00, 00:00 UTC).

Method: A minimal echo-style prompt is sent. We track HTTP status, error message (if any), and response time.

Error tracking: an error_count is recorded for every run. Sustained high error rates are surfaced on the leaderboard.
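A minimal version of such a probe could look like the following; the endpoint and payload are placeholders, not the actual monitoring code.

```python
import time
import requests  # pip install requests

def health_probe(api_url: str, payload: dict) -> dict:
    """Send a minimal echo-style prompt; record status, error, and latency."""
    start = time.perf_counter()
    try:
        resp = requests.post(api_url, json=payload, timeout=30)
        return {
            "http_status": resp.status_code,
            "error": None if resp.ok else resp.text[:200],
            "response_ms": (time.perf_counter() - start) * 1000,
        }
    except requests.RequestException as exc:  # timeouts, DNS failures, resets
        return {
            "http_status": None,
            "error": str(exc),
            "response_ms": (time.perf_counter() - start) * 1000,
        }
```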

Run Schedule

06:00 UTC: Speed + Health
12:00 UTC: Speed + Health
18:00 UTC: Speed + Health
00:00 UTC: Speed + Health

All times UTC. Intelligence benchmarks run once per day (06:00 UTC) when active. Data freshness is always displayed next to each benchmark result.
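For illustration, the schedule (including the planned daily intelligence run) can be expressed as a small dispatch function; the names are hypothetical, not the actual scheduler.

```python
def jobs_for(hour_utc: int) -> list[str]:
    """Which benchmarks run at a given UTC hour, per the schedule above."""
    jobs = []
    if hour_utc in (0, 6, 12, 18):
        jobs += ["speed", "health"]
    if hour_utc == 6:
        jobs.append("intelligence")  # once per day, when active
    return jobs

assert jobs_for(6) == ["speed", "health", "intelligence"]
assert jobs_for(9) == []
```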

FAQ

Are you affiliated with any AI provider?
No. Tokonomix is operated by InterIP Networks, an independent infrastructure company. We have no commercial relationships with any AI provider and receive no sponsored placements.

Why EU latency only?
We operate from Amsterdam and measure real-world latency for EU users. Many providers have multiple regions; latency from the US or Asia would differ significantly. We will add region switching in a future update.

How do you handle API cost?
We run a fixed prompt budget per cycle. Flagship models (GPT-5, Claude Opus) are tested less frequently due to cost. The run frequency is visible next to each model.

Can I download the raw data?
Yes. See the Dataset page for JSON export and schema documentation. The ?format=md query param on any page returns Markdown for LLM crawlers.

Is the judge LLM fair to all models?
We use Claude Sonnet 4.6 as the judge, with model names stripped from the evaluation prompt. Cross-family bias is a known concern; we plan to add human baselines (Q3 2026) to calibrate the judge.

Data API

All benchmark data is available for free. No key required for read-only access.

GET /api/md/en/dataset: full dataset as JSON
GET /<page>?format=md: any page as Markdown (for LLM crawlers)
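For example, both endpoints can be fetched from Python as below. The host is a placeholder and the /methodology path is an assumption for illustration; only the two paths above come from this page.

```python
import requests  # pip install requests

BASE = "https://tokonomix.example"  # placeholder host; substitute the real site

# Full dataset as JSON (no API key required for read-only access)
dataset = requests.get(f"{BASE}/api/md/en/dataset", timeout=30).json()

# Any page rendered as Markdown, e.g. a methodology page (path assumed)
md = requests.get(f"{BASE}/methodology", params={"format": "md"}, timeout=30).text
```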