What 23,000 Benchmark Runs Across 220 Models Taught Us About the AI Frontier

Picking an AI model has never felt harder. The market adds new releases faster than most teams can evaluate them, pricing varies by orders of magnitude, and "best in class" claims from vendors are almost always cherry-picked. So we stopped reading press releases and started measuring.

Over about six weeks, from April 30 to June 15, 2026, Tokonomix ran 23,373 benchmark runs across 203 distinct models. Those models were drawn from our catalog of 220 tracked models, of which 131 were active at the time of measurement, and they span seven providers: Anthropic, OpenAI, Google, OVH (EU-hosted), OpenRouter, DeepSeek, and Mistral. Every model was scored 0–100 across six capability categories: coding, reasoning, factual accuracy, creative writing, multilingual performance, and healthcare. No single company's benchmark, no curated demo prompts: production-grade, continuously updated measurement.

Here is what the data actually shows.

The Top Is Compressed — More Than You Think

The most striking finding is how little separates the frontier leaders. The top ten models by overall score (a mean across the six categories) sit in a band that spans less than a point and a half:

| Model | Overall Score |
|---|---|
| gemini-3.1-flash-lite | 99.4 |
| gemini-flash-lite-latest | 99.2 |
| claude-opus-4-5 | 99.1 |
| claude-opus-4-7 | 98.9 |
| gpt-5-chat-latest | 98.8 |
| claude-opus-4-8 | 98.7 |
| claude-opus-4-6 | 98.6 |
| gpt-4.1 | 98.0 |
| gpt-4.1-mini | 98.0 |
| gpt-4.1-nano | 98.0 |

Overall = the mean across the six categories, as measured through 15 June 2026. Our public leaderboard updates continuously as new runs land, so the live rankings will differ slightly from this snapshot — that is the point: the frontier moves week to week.

The gap from rank one to rank ten is 1.4 points on a 100-point scale. That compression has a practical consequence: any "Model X is 20% smarter than Model Y" claim you read in a vendor's blog post is almost certainly measuring something narrow and specific, not aggregate capability. At the frontier, aggregate capability has converged.
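The 1.4-point figure can be checked directly from the table. The snippet below is a minimal sketch that copies the ten overall scores listed above and computes the spread; the variable names are illustrative and not part of any Tokonomix tooling.

```python
# Overall scores copied from the top-ten table in this article.
top_ten = {
    "gemini-3.1-flash-lite": 99.4,
    "gemini-flash-lite-latest": 99.2,
    "claude-opus-4-5": 99.1,
    "claude-opus-4-7": 98.9,
    "gpt-5-chat-latest": 98.8,
    "claude-opus-4-8": 98.7,
    "claude-opus-4-6": 98.6,
    "gpt-4.1": 98.0,
    "gpt-4.1-mini": 98.0,
    "gpt-4.1-nano": 98.0,
}

ranked = sorted(top_ten.values(), reverse=True)

# Gap between rank one and rank ten, rounded to the table's precision.
spread = round(ranked[0] - ranked[-1], 1)

# The same gap expressed as a percentage of the top score.
relative_gap_pct = round(100 * spread / ranked[0], 1)

print(spread)            # 1.4
print(relative_gap_pct)  # 1.4
```

On a 100-point scale the absolute and relative gaps are nearly the same number, which is why a claimed 20% difference cannot be describing this aggregate.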

This does not mean all models are equal — it means the aggregate score is the wrong instrument for choosing between them. You have to go deeper.

Coding and Reasoning Are Saturating

When you break the six categories out, two of them — coding and reasoning — now show ceiling effects at the frontier. Many top models hit the 100 ceiling on both dimensions, which means those categories no longer discriminate between the best options. If you are choosing a model purely for software development or logical problem-solving, you are choosing between models that are all essentially maxed out on the dimensions we can currently measure.

The categories that do separate models at the frontier are factual accuracy, multilingual performance, and healthcare. These are harder to saturate because they require broad knowledge coverage, cultural nuance, and domain precision rather than the rule-following that coding and reasoning tasks tend to reward. If your use case lives in any of these three areas, the selection decision becomes much more meaningful — and more data-dependent.
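One way to make "no longer discriminates" concrete is to look at the range of scores within each category across the models being compared. The sketch below assumes per-category scores are available as a dictionary per model. The function name, the threshold, and the sample numbers are all hypothetical placeholders, not measurements from this dataset.

```python
def discriminating_categories(scores_by_model, min_range=1.0):
    """Return {category: range} for categories whose max-min range
    across models is at least min_range points."""
    categories = next(iter(scores_by_model.values())).keys()
    result = {}
    for category in categories:
        values = [scores[category] for scores in scores_by_model.values()]
        score_range = max(values) - min(values)
        if score_range >= min_range:
            result[category] = score_range
    return result


# HYPOTHETICAL placeholder scores for illustration only.
example = {
    "model_a": {"coding": 100.0, "reasoning": 100.0, "factual": 97.0},
    "model_b": {"coding": 100.0, "reasoning": 100.0, "factual": 94.5},
}

print(discriminating_categories(example))  # {'factual': 2.5}
```

With both placeholder models at the ceiling on coding and reasoning, only the factual category survives the filter, which mirrors the pattern described above.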

Cost: You Can Buy ~98% of the Frontier for Pennies

The single result that most surprised us: the overall leader is a "flash-lite" tier model.

gemini-3.1-flash-lite tops the rankings at 99.4 overall — ahead of the largest flagship models from any provider. Capability no longer requires the biggest, most expensive tier. That is not a fluke of our scoring methodology; it shows up consistently across the six weeks of measurement.

More broadly, the cost-efficient frontier looks like this:

gpt-4.1-nano: 10 cents per million input tokens, 40 cents per million output tokens — overall score 98.0. That is within two points of the top-ranked model at a price most flagship models cannot match.
gpt-oss-120b (hosted on OVH in the EU): 8 cents per million input tokens, 40 cents per million output tokens — overall score 97.5.
Mistral-Small-3.2-24B (OVH, EU): 9 cents per million input tokens, 28 cents per million output tokens — overall score 93.7.

The practical implication: for the majority of production workloads, you can reach roughly 98% of the frontier's measured quality at a small fraction of flagship pricing. The remaining 1–2 points on the aggregate score may matter for specific high-stakes tasks, but for general-purpose use, the economics have shifted dramatically toward the efficient tier.
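To see how these prices and scores combine, here is a rough sketch that computes each model's share of the top overall score (99.4) and its cost for one workload. The prices and scores are the ones quoted above; the workload of 50 million input tokens and 10 million output tokens is an assumed example, and the function names are illustrative.

```python
TOP_SCORE = 99.4  # overall score of the top-ranked model in this snapshot

# name: (input cents per million tokens, output cents per million tokens, overall score)
models = {
    "gpt-4.1-nano": (10, 40, 98.0),
    "gpt-oss-120b": (8, 40, 97.5),
    "Mistral-Small-3.2-24B": (9, 28, 93.7),
}


def workload_cost_cents(input_cents, output_cents, input_mtok, output_mtok):
    """Cost in cents for a workload measured in millions of tokens."""
    return input_cents * input_mtok + output_cents * output_mtok


def share_of_top(score, top=TOP_SCORE):
    """Score as a percentage of the top overall score, to one decimal."""
    return round(100 * score / top, 1)


# Assumed workload: 50M input tokens, 10M output tokens.
for name, (inp, out, score) in models.items():
    cost = workload_cost_cents(inp, out, 50, 10)
    print(f"{name}: {cost} cents, {share_of_top(score)}% of top score")
# gpt-4.1-nano: 900 cents, 98.6% of top score
# gpt-oss-120b: 800 cents, 98.1% of top score
# Mistral-Small-3.2-24B: 730 cents, 94.3% of top score
```

Changing the input-to-output ratio shifts the ordering, since the three models differ more on output pricing than on input pricing.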

Speed Is Its Own Axis

Latency does not track quality. This sounds obvious, but the data makes it concrete.

The fastest median responders in our dataset are models that rarely come up in conversations about flagships:

voxtral-small-24b: ~157 ms median (p50) response time
nemotron-super-49b: ~200 ms
hermes-3-llama-3.1-70b: ~227 ms
llama-4-scout: ~248 ms

At the other end:

gemma-4-26b: ~22,950 ms median
gemma-4-31b: ~21,940 ms
gpt-4-turbo: ~10,550 ms

At the median, the slowest models in our measurement are more than 140 times slower than the fastest. For a user-facing application where response time is a product quality signal, that gap separates a tool people reach for from one they abandon.
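The ratio follows from the medians listed above. This short sketch uses only those approximate p50 figures; the dictionary names are illustrative.

```python
# Approximate median (p50) response times in milliseconds, from the lists above.
fastest = {
    "voxtral-small-24b": 157,
    "nemotron-super-49b": 200,
    "hermes-3-llama-3.1-70b": 227,
    "llama-4-scout": 248,
}
slowest = {
    "gemma-4-26b": 22950,
    "gemma-4-31b": 21940,
    "gpt-4-turbo": 10550,
}

# Ratio of the slowest median to the fastest median.
ratio = max(slowest.values()) / min(fastest.values())
print(f"{ratio:.0f}x")  # 146x
```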

The implication for selection: treat quality score and latency as separate axes, because one does not predict the other. Some high-scoring models are slow. Some fast models score well on quality. You must evaluate both axes together for your use case, since a background summarization pipeline has different requirements than a real-time coding assistant.

Sovereignty Without Sacrifice: EU-Hosted Models Are Now Near-Frontier

For teams operating under GDPR or other data-residency requirements, EU hosting has historically meant accepting a significant quality discount. That is no longer true.

Among models hosted on OVH infrastructure in France, the following score above 90 overall:

gpt-oss-120b: 97.5
Qwen2.5-VL-72B: 94.3
Mistral-Small-3.2-24B: 93.7
Meta-Llama-3.3-70B: 92.7
Llama-3.1-8B: 91.2

A score of 97.5 from a model with EU data residency, at 8 cents per million input tokens, changes the compliance calculus for a lot of organizations. Six months ago that combination did not exist at this quality level. Now it does.

So Which Model Should You Use?

The honest answer is that "best model" is the wrong question.

The data shows a frontier where the top ten models are separated by 1.4 points and where a flash-lite model leads the overall ranking. In that environment, optimizing for the single highest aggregate score will lead you to pay for differences you cannot measure in production. The right question is: best model for this task, at this cost, at this latency budget, under these data-residency constraints.

That reframing changes how you evaluate:

High-volume text processing where cost dominates: gpt-4.1-nano or gpt-oss-120b give you near-frontier quality in the 8–10 cents per million input token range.
Real-time user-facing features where latency dominates: the sub-250 ms models are the starting point; filter from there by quality on your specific task category.
Factual, multilingual, or healthcare workloads where quality differences are still meaningful: this is exactly where side-by-side category-level scoring matters most, because coding and reasoning scores no longer discriminate at the frontier.
EU data residency required: the OVH-hosted tier now offers 90+ overall scores with full data residency — factor it in from the start rather than treating sovereignty as a fallback.
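The checklist above amounts to filtering on hard constraints first and ranking on cost second. The sketch below shows one hypothetical way to express that. The `Candidate` class and `shortlist` function are illustrative names, the scores and input prices are the ones quoted in this article, and marking gpt-4.1-nano as not EU-hosted is an assumption made only because it is not in the OVH list. Median latency is left as `None` where this article does not report it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    overall: float
    input_cents: float               # cents per million input tokens
    eu_hosted: bool
    p50_ms: Optional[float] = None   # median latency; None if not reported here


CANDIDATES = [
    # Assumption: treated as not EU-hosted because it is not in the OVH list.
    Candidate("gpt-4.1-nano", 98.0, 10, eu_hosted=False),
    Candidate("gpt-oss-120b", 97.5, 8, eu_hosted=True),
    Candidate("Mistral-Small-3.2-24B", 93.7, 9, eu_hosted=True),
]


def shortlist(candidates, min_overall=0.0, require_eu=False, max_p50_ms=None):
    """Keep candidates that meet every constraint, cheapest input price first.
    A candidate with unknown latency fails any latency constraint."""
    kept = []
    for c in candidates:
        if c.overall < min_overall:
            continue
        if require_eu and not c.eu_hosted:
            continue
        if max_p50_ms is not None and (c.p50_ms is None or c.p50_ms > max_p50_ms):
            continue
        kept.append(c)
    return sorted(kept, key=lambda c: c.input_cents)


names = [c.name for c in shortlist(CANDIDATES, min_overall=95, require_eu=True)]
print(names)  # ['gpt-oss-120b']
```

In practice the `overall` field would be swapped for the category score that matches the workload, since the aggregate hides the differences that matter at the frontier.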

The common thread is that none of these decisions can be made from a single aggregate ranking or a vendor's benchmark page. They require measuring your task against the models you are actually considering, with your prompts, at your usage scale.

If you want to test this yourself, you can run the same multi-model consensus evaluation on your own prompts at /live-test/consensus. It runs your query across multiple models simultaneously and surfaces agreement, disagreement, and category-level performance — so you can see where models converge and where they diverge on exactly the kind of question you are trying to answer.

The frontier is more crowded, more affordable, and more geographically distributed than it was a year ago. The teams that navigate it well will be the ones that measure rather than assume.