Tier B: Production

Runs in: France · Made in: China

Qwen2.5-VL-72B-Instruct


Tokonomix editorial team · Reviewed by Mes Kalkan · Published 27 May 2026 · Last reviewed 30 July 2026

Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (the median) and P95 (the 95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median) · P95 latency · 101 runs

Section 02

Quality scores

Evaluation results from judge-model assessments across a range of task categories. Scores reflect coherence, accuracy, and instruction following.

Creative

Factual

100

Multilingual

100

Reasoning

Section 03

Price history

Direct provider rates per million tokens, plus a cost estimate for a typical conversation.

💰

API rates: Qwen2.5-VL-72B-Instruct

$0.9100 per 1M input tokens

$0.9100 per 1M output tokens

≈ $0.0007 per typical conversation (800 tokens)
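The per-conversation estimate follows directly from the per-million rates. A minimal sketch of that arithmetic is shown below; the function name and the 600/200 split between input and output tokens are assumptions for illustration, since the page only states a total of 800 tokens.

```python
def conversation_cost(input_tokens, output_tokens,
                      input_rate_per_m=0.91, output_rate_per_m=0.91):
    """Return the USD cost of one conversation from per-1M-token rates."""
    total = input_tokens * input_rate_per_m + output_tokens * output_rate_per_m
    return total / 1_000_000


# Hypothetical split: the page gives only the 800-token total.
cost = conversation_cost(input_tokens=600, output_tokens=200)
print(f"${cost:.4f}")  # $0.0007
```

Because the input and output rates are identical here, any split of the 800 tokens gives the same result.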

Chart: input vs. output price (per 1M tokens). Input $0.9100, output $0.9100.

Chart: pricing over time (input and output per 1M tokens; the step line marks price changes). Input and output both held stable at $0.9100 per 1M tokens across the plotted range, 2026-06-14 to 2026-07-26. Synced weekly.

Section 04

Tokens per second

Throughput in tokens per second, derived from the measured P50 latency. Higher values are better; fluctuations reflect server load at the provider.

Throughput (tokens/s): 1538 / avg 1404

Estimated as an assumed 200 output tokens divided by the P50 latency. The absolute figure depends on that assumption; the trend is what matters.
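The throughput estimate can be reproduced with a few lines. This is a sketch under the page's stated assumption of 200 output tokens per response; the function name is hypothetical, and the 130 ms input is the P50 latency reported in the latest automated test on this page.

```python
ASSUMED_OUTPUT_TOKENS = 200  # assumption stated on the page, not a measured value


def estimated_throughput(p50_latency_ms, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Estimate tokens per second from a P50 latency given in milliseconds."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return output_tokens / (p50_latency_ms / 1000.0)


print(round(estimated_throughput(130)))  # 1538
```

Doubling the assumed token count doubles the estimate, which is why the trend is more informative than the absolute number.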

Section 05

Capabilities

vision · ownedBy: Qwen

Section 06

Availability

How often this model responds when we call it, measured over real API requests and live tests in the last 30 days. This is separate from quality: these figures show only whether the model responds, not how good the answer is.

Last 7 days

—

Last 30 days

100.0%

n=36

Median response time

4,186 ms

n=36

Based on 426 measurements in the last 30 days.

Technical details

Only real API calls and live test requests are counted; internal probes and benchmark runs are excluded.

Calls made with a user's own API key (BYOK) are excluded: those errors are key-specific and do not indicate a model outage.

Failed calls are NOT counted in quality scores; quality is measured on successful responses. Availability and quality are independent signals.

Median response time (p50) is computed over successful calls with a recorded duration. Outliers pull the median less than they pull the mean.

Total calls (30d)

OK responses (30d)

Total calls (7d)

OK responses (7d)
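The filtering rules above can be sketched as follows. This is an illustrative reconstruction, not Tokonomix's actual pipeline: the `Call` record, its field names, and the source labels are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Call:
    source: str                   # hypothetical labels: "api", "live_test", "probe", "benchmark"
    byok: bool                    # True if the caller used their own API key
    ok: bool                      # True if the model returned a response
    duration_ms: Optional[float]  # None when no duration was recorded


COUNTED_SOURCES = {"api", "live_test"}  # probes and benchmark runs are excluded


def availability(calls):
    """Return (availability percentage, median response time in ms), or None where undefined."""
    counted = [c for c in calls if c.source in COUNTED_SOURCES and not c.byok]
    if not counted:
        return None, None
    ok_calls = [c for c in counted if c.ok]
    pct = 100.0 * len(ok_calls) / len(counted)
    # Median only over successful calls that have a recorded duration.
    durations = [c.duration_ms for c in ok_calls if c.duration_ms is not None]
    return pct, (median(durations) if durations else None)


calls = [
    Call("api", False, True, 4000.0),
    Call("live_test", False, True, 4400.0),
    Call("api", False, False, None),        # counted failure
    Call("api", True, False, None),         # BYOK: excluded
    Call("benchmark", False, True, 100.0),  # benchmark run: excluded
    Call("api", False, True, None),         # counted, but no duration recorded
]
pct, p50 = availability(calls)
print(pct, p50)  # 75.0 4200.0
```

Returning `None` for an empty window mirrors how the page shows no value when a period has no counted calls.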

Section 07

Tokonomix benchmark verdicts

⚖️

Endorsed by 2 judges

Independent LLM judges evaluated this model on our weekly intelligence tests

cohere/command-a: 100/100 · 1 run

1 correct · 0 partial · 0 wrong · 100% accuracy

claude-sonnet-4-5: 95/100 · 47 runs

43 correct · 4 partial · 0 wrong · 91% accuracy

● 2026-07-26

Quality drops 10 points with slower response times and factual weakness

Qwen2.5-VL-72B-Instruct experienced a notable decline in this benchmark window, with overall quality falling from 98.8 to 88.8 points while latency increased by 37 percent to 8.9 seconds at the median. The model demonstrates exceptional strength in multilingual and reasoning tasks, both scoring perfect 100s, and maintains outstanding creative capabilities at 98 points. However, factual accuracy emerged as a significant weakness, scoring just 57 points and representing a substantial gap in the model's performance profile. The previous window included coding benchmarks where the model scored 98, but this category was not tested in the current window, making direct comparison incomplete. The latency increase from 6.5 to 8.9 seconds suggests either infrastructure changes or increased processing complexity. Despite the quality decline, the model retains strong capabilities in three of four tested categories. Users should be aware of the factual accuracy limitations when deploying this model for knowledge-intensive applications, while the model remains well-suited for creative, multilingual, and reasoning-heavy workloads where it continues to excel.
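The headline figures in this summary can be cross-checked with simple arithmetic. The sketch below assumes the overall quality score is an unweighted mean of the four category scores; the page does not state how it aggregates them, although that assumption is consistent with the reported 88.8.

```python
from statistics import mean

# Category scores quoted in the summary for the current window.
category_scores = {"multilingual": 100, "reasoning": 100, "creative": 98, "factual": 57}

# Assumption: overall quality is the unweighted mean of the category scores.
overall = mean(list(category_scores.values()))

# Median latency in seconds for the previous and current windows.
previous_latency_s, current_latency_s = 6.5, 8.9
latency_change_pct = (current_latency_s - previous_latency_s) / previous_latency_s * 100

print(f"overall quality: {overall:.2f}")               # overall quality: 88.75
print(f"latency change: {latency_change_pct:.0f}%")    # latency change: 37%
```

The unrounded mean of 88.75 sits just below the reported 88.8, and the drop from 98.8 matches the 10-point decline in the headline.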

Quality

88.8

Latency p50

8,876 ms

Test runs

✗ Quality dropped 10 points
✗ Latency increased 37%
✗ Factual accuracy only 57
✓ Perfect multilingual and reasoning scores

Latest automated test

30 Jul 2026 · 08:04 UTC · Speed test

P50 latency

130 ms

P95 latency

1,115 ms

Errors

0 / 6 runs

Last reviewed by the Tokonomix team · 30 July 2026