Tier B — Production

Runs in: France · Made in: China

Qwen2.5-VL-72B-Instruct

Tokonomix Editorial Team · Reviewed by Mes Kalkan · Published May 27, 2026 · Last reviewed July 30, 2026

Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median) · P95 latency · 102 runs
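As a rough illustration of what P50 and P95 mean, the sketch below computes both from a list of latency samples. The sample values are hypothetical, and the nearest-rank method for P95 is an assumption; the page does not say which percentile method the benchmark uses.

```python
import math
import statistics


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds, not measured data.
latencies_ms = [150, 160, 170, 174, 180, 190, 200, 220, 900, 2800]

p50 = statistics.median(latencies_ms)            # middle of the distribution
p95 = percentile_nearest_rank(latencies_ms, 95)  # what the slowest calls look like

print(f"P50 = {p50} ms, P95 = {p95} ms")
```

With these samples the median stays near the typical call while P95 lands on the slow tail, which is why the two numbers are reported together.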

Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Creative

Factual

100

Multilingual

100

Reasoning

Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — Qwen2.5-VL-72B-Instruct

$0.9100 per 1M input tokens

$0.9100 per 1M output tokens

≈ $0.0007 per typical conversation (800 tokens)
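The typical-conversation figure can be reproduced from the listed rates. This is a minimal sketch: the 600/200 split between input and output tokens is an assumption for illustration (the page only gives the 800-token total), and because both rates are equal the split does not change the result.

```python
PRICE_PER_MILLION_USD = 0.91  # listed rate, identical for input and output


def conversation_cost(input_tokens, output_tokens,
                      input_rate=PRICE_PER_MILLION_USD,
                      output_rate=PRICE_PER_MILLION_USD):
    """Cost in USD for one conversation at per-million-token rates."""
    total = input_tokens * input_rate + output_tokens * output_rate
    return total / 1_000_000


# Hypothetical split of the 800-token conversation.
cost = conversation_cost(input_tokens=600, output_tokens=200)
print(f"${cost:.4f} per conversation")  # prints $0.0007 per conversation
```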

Input vs output price (per 1M tokens)

per 1M input tokens: $0.9100

per 1M output tokens: $0.9100

Pricing over time

Input & output per 1M tokens · step-line = price changes

Input: $0.9100 per 1M tokens (stable)

Output: $0.9100 per 1M tokens (stable)

Chart range: 2026-06-14 to 2026-07-26 (axis ticks at 2026-06-14, 2026-07-05 and 2026-07-26)

⟳ synced weekly

Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s): 1149 / avg 1401

Estimated as 200 assumed output tokens ÷ P50 latency. The absolute number depends on this assumption; the trend is what matters.
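The sketch below reads the estimate as 200 assumed output tokens divided by the P50 latency in seconds. Pairing it with the 174 ms P50 from the last automated speed test is an assumption, but that pairing does reproduce the 1149 tokens/s figure shown above.

```python
ASSUMED_OUTPUT_TOKENS = 200  # fixed assumption stated on the page


def estimated_throughput(p50_latency_ms, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Tokens per second implied by a P50 latency in milliseconds."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return output_tokens / (p50_latency_ms / 1000)


tps = estimated_throughput(174)
print(f"{tps:.0f} tokens/s")  # prints 1149 tokens/s
```

Because the token count is a constant, the estimate is just a rescaled inverse of latency, so a rise in P50 latency shows up as a proportional drop in throughput.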

Section 05

Capabilities

vision · ownedBy: Qwen

Section 06

Availability

How often this model answers when we call it — measured across real API requests and live tests over the last 30 days. This is separate from quality: these numbers only tell you whether the model responds, not how good the answer is.

Last 7 days

—

Last 30 days

100.0%

n=36

Median response time

4,186 ms

n=36

Based on 426 measurements over the last 30 days.

Technical details

Only live API calls and live-test requests count — internal probes and benchmark runs are excluded.

Calls with a custom API key (BYOK) are excluded: those failures are key-specific, not a sign of model downtime.

Failed calls are NOT included in quality scores — quality is measured on successful responses only. Availability and quality are independent signals.

Median response time (p50) across successful calls with a recorded duration. Outliers (very slow or very fast calls) pull the median less than the average.
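The counting rules above can be sketched as a small filter plus two aggregates. The `Call` record, its field names and the sample calls are hypothetical; they illustrate the stated rules and are not Tokonomix's actual schema or data.

```python
import statistics
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    ok: bool                    # did the model answer?
    duration_ms: Optional[int]  # None when no duration was recorded
    byok: bool = False          # made with a custom API key
    internal: bool = False      # internal probe or benchmark run


def availability_stats(calls):
    # BYOK calls and internal probes/benchmarks are excluded entirely.
    counted = [c for c in calls if not c.byok and not c.internal]
    if not counted:
        raise ValueError("no countable calls")
    ok_calls = [c for c in counted if c.ok]
    # Response time uses successful calls with a recorded duration only.
    durations = [c.duration_ms for c in ok_calls if c.duration_ms is not None]
    return {
        "availability_pct": 100 * len(ok_calls) / len(counted),
        "median_ms": statistics.median(durations) if durations else None,
        "mean_ms": statistics.mean(durations) if durations else None,
    }


# Hypothetical calls, not measured data.
calls = [
    Call(True, 3900), Call(True, 4100), Call(True, 4186), Call(True, 4300),
    Call(True, 60000),               # one very slow outlier
    Call(True, None),                # answered, but no duration recorded
    Call(False, None),               # genuine failure, counts against availability
    Call(False, None, byok=True),    # excluded: key-specific failure
    Call(True, 100, internal=True),  # excluded: internal probe
]
stats = availability_stats(calls)
print(stats)
```

In this sample the single 60-second outlier drags the mean above 15 seconds while the median stays at 4,186 ms, which is the reason for reporting the median.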

Total calls (30d)

OK responses (30d)

Total calls (7d)

OK responses (7d)

Section 07

Tokonomix benchmark verdicts

Endorsed by 2 judges

Independent LLM judges evaluated this model on our weekly intelligence tests

cohere/command-a: 100/100 · 1 run

1 correct · 0 partial · 0 wrong · 100% accuracy

claude-sonnet-4-5: 95/100 · 47 runs

43 correct · 4 partial · 0 wrong · 91% accuracy
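The accuracy percentages are consistent with counting only fully correct answers over all runs. That reading is an assumption inferred from the numbers (43 of 47 is about 91%), not a documented formula, and the sketch does not attempt to reproduce the separate 95/100 judge score.

```python
def accuracy_pct(correct, partial, wrong):
    """Share of fully correct verdicts, as a percentage of all runs."""
    total = correct + partial + wrong
    if total == 0:
        raise ValueError("no runs to score")
    return 100 * correct / total


command_a = accuracy_pct(correct=1, partial=0, wrong=0)
sonnet = accuracy_pct(correct=43, partial=4, wrong=0)
print(f"{command_a:.0f}% and {sonnet:.0f}%")  # prints 100% and 91%
```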

● 2026-07-26

Quality drops 10 points with slower response times and factual weakness

Qwen2.5-VL-72B-Instruct experienced a notable decline in this benchmark window, with overall quality falling from 98.8 to 88.8 points while median latency increased by 37 percent, from 6.5 to 8.9 seconds. The model demonstrates exceptional strength in multilingual and reasoning tasks, both scoring perfect 100s, and maintains outstanding creative capabilities at 98 points. However, factual accuracy emerged as a significant weakness, scoring just 57 points and representing a substantial gap in the model's performance profile. The previous window included coding benchmarks where the model scored 98, but this category was not tested in the current window, making direct comparison incomplete. The latency increase suggests either infrastructure changes or increased processing complexity. Despite the quality decline, the model retains strong capabilities in three of four tested categories. Users should be aware of the factual accuracy limitations when deploying this model for knowledge-intensive applications, while the model remains well-suited for creative, multilingual, and reasoning-heavy workloads where it continues to excel.
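The headline numbers in this verdict can be checked with a few lines of arithmetic. Treating the overall quality score as an unweighted mean of the four category scores is an assumption; it happens to match the reported 88.8, but the page does not state the weighting.

```python
import statistics

# Category scores quoted in the verdict text.
category_scores = {
    "creative": 98,
    "factual": 57,
    "multilingual": 100,
    "reasoning": 100,
}

# Assumption: overall quality is an unweighted mean of the categories.
overall_quality = statistics.mean(category_scores.values())


def percent_change(previous, current):
    """Relative change from previous to current, in percent."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return 100 * (current - previous) / previous


latency_change = percent_change(previous=6.5, current=8.9)  # seconds, P50
quality_drop = round(98.8 - 88.8, 1)                        # points

print(f"quality {overall_quality}, latency +{latency_change:.0f}%, drop {quality_drop}")
```

The mean comes out at 88.75, which the page shows rounded to 88.8, and the latency change rounds to the reported 37 percent.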

Quality

88.8

Latency p50

8,876 ms

Test runs

✗ Quality dropped 10 points

✗ Latency increased 37%

✗ Factual accuracy only 57

✓ Perfect multilingual and reasoning scores

Last automated test

Jul 30, 2026 · 14:04 UTC · Speed benchmark

P50 latency

174 ms

P95 latency

2,843 ms

Errors

0 / 6 runs

Last reviewed by Tokonomix Team·July 30, 2026