Tier B — Production

Runs in: France · Made in: China

Qwen3-32B

Tokonomix Editorial Team · Reviewed by Mes Kalkan · Published May 27, 2026 · Last reviewed July 30, 2026

Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

Chart: P50 latency (median) and P95 latency · 101 runs
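As an illustration of how P50 and P95 can be read off a set of latency samples, here is a minimal sketch using the nearest-rank method. The page does not say which percentile method Tokonomix uses, and the sample values are hypothetical, so treat both as assumptions.

```python
import math


def percentile_nearest_rank(samples_ms, pct):
    """Return the pct-th percentile of samples_ms using the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds (not taken from the benchmark data).
latencies_ms = [500, 450, 620, 400, 480, 475]

p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
print(f"P50 = {p50} ms, P95 = {p95} ms")
```

With only a handful of samples, P95 is simply the slowest call, which is why a larger run count makes the tail figure more meaningful.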

Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Chart: scores by category (Creative, Factual, Multilingual, Reasoning)

Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — Qwen3-32B

$0.0800 per 1M input tokens

$0.2300 per 1M output tokens

≈ <$0.0001 per typical conversation (800 tokens)
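The per-conversation estimate follows from the two per-million rates above. The sketch below shows the arithmetic; the page does not state how the 800 tokens split between input and output, so the 600/200 split is an assumption chosen for illustration.

```python
INPUT_USD_PER_MILLION = 0.08   # rate listed above
OUTPUT_USD_PER_MILLION = 0.23  # rate listed above


def conversation_cost_usd(input_tokens, output_tokens):
    """Cost in USD of one conversation at the listed per-million-token rates."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


# Assumed split of an 800-token conversation: 600 input, 200 output.
cost = conversation_cost_usd(600, 200)
print(f"${cost:.6f}")
```

At these rates an 800-token conversation stays below $0.0001 only when fewer than 240 of its tokens are output; an all-output conversation would cost about $0.00018.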

Chart: input vs output price per 1M tokens (input $0.0800, output $0.2300)

Pricing over time

Chart: input and output price per 1M tokens, drawn as a step line with price changes marked. Input: $0.0800 / 1M (stable). Output: $0.2300 / 1M (stable). Dates shown: 2026-06-14, 2026-07-12, 2026-07-26. Synced weekly.

Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s): 421 · avg 420

Estimated as 200 output tokens ÷ P50 latency. The absolute number depends on the assumed 200-token output length; the trend is what matters.
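A minimal sketch of this estimate, assuming it divides a fixed 200-token output by the P50 latency in seconds (division, not multiplication, is what yields a tokens-per-second figure). The function name is hypothetical.

```python
ASSUMED_OUTPUT_TOKENS = 200  # the output length assumed by the estimate


def tokens_per_second(p50_latency_ms, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Estimate throughput from a P50 latency given in milliseconds."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return output_tokens / (p50_latency_ms / 1000)


# The 475 ms P50 from the latest speed benchmark on this page.
print(round(tokens_per_second(475)))
```

Under that assumption the 475 ms P50 reproduces the 421 tokens/s shown above. Doubling the assumed output length would double the figure, which is why only the trend is comparable over time.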

Section 05

Capabilities

Owned by: Qwen

Section 06

Availability

How often this model answers when we call it — measured across real API requests and live tests over the last 30 days. This is separate from quality: these numbers only tell you whether the model responds, not how good the answer is.

Last 7 days: no data

Last 30 days: 100.0% (n=33)

Median response time: 145,961 ms (n=33)

Based on 413 measurements over the last 30 days.

Technical details

Only live API calls and live-test requests count — internal probes and benchmark runs are excluded.

Calls with a custom API key (BYOK) are excluded: those failures are key-specific, not a sign of model downtime.

Failed calls are NOT included in quality scores — quality is measured on successful responses only. Availability and quality are independent signals.

Median response time (p50) across successful calls with a recorded duration. Outliers (very slow or very fast calls) pull the median less than the average.
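The rules above can be sketched as a small filter-then-aggregate step. The record fields (`source`, `byok`, `ok`, `duration_ms`) and the sample calls are hypothetical; the page does not describe how Tokonomix stores its measurements.

```python
from statistics import median

COUNTED_SOURCES = {"api", "live_test"}  # internal probes and benchmark runs are excluded


def availability_stats(calls):
    """Return (availability in percent, median duration in ms of successful calls)."""
    counted = [
        c for c in calls
        if c["source"] in COUNTED_SOURCES and not c["byok"]  # BYOK calls are excluded
    ]
    ok_calls = [c for c in counted if c["ok"]]
    availability = 100 * len(ok_calls) / len(counted) if counted else None
    durations = [c["duration_ms"] for c in ok_calls if c["duration_ms"] is not None]
    return availability, (median(durations) if durations else None)


# Hypothetical call log, for illustration only.
calls = [
    {"source": "api", "byok": False, "ok": True, "duration_ms": 1000},
    {"source": "api", "byok": False, "ok": True, "duration_ms": 3000},
    {"source": "live_test", "byok": False, "ok": True, "duration_ms": 2000},
    {"source": "api", "byok": False, "ok": False, "duration_ms": None},
    {"source": "benchmark", "byok": False, "ok": True, "duration_ms": 50},
    {"source": "api", "byok": True, "ok": False, "duration_ms": None},
]

print(availability_stats(calls))
```

Here the benchmark run and the BYOK call are dropped before counting, so one failure out of four counted calls determines the availability figure.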

Total calls (30d)

OK responses (30d)

Total calls (7d)

OK responses (7d)

Section 07

Tokonomix benchmark verdicts

Endorsed by 2 judges

Independent LLM judges evaluated this model on our weekly intelligence tests

cohere/command-a: 95/100 · 1 run

1 correct · 0 partial · 0 wrong · 100% accuracy

claude-sonnet-4-5: 84/100 · 47 runs

34 correct · 9 partial · 4 wrong · 72% accuracy
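The accuracy percentages are consistent with counting only fully correct verdicts against all runs. The sketch below assumes that definition; how the separate /100 judge score is derived is not stated on the page and is not modelled here.

```python
def accuracy_percent(correct, partial, wrong):
    """Share of runs judged fully correct, in percent (partial counts as not correct)."""
    total = correct + partial + wrong
    if total == 0:
        raise ValueError("no runs to score")
    return 100 * correct / total


print(round(accuracy_percent(34, 9, 4)))  # claude-sonnet-4-5 verdicts
print(round(accuracy_percent(1, 0, 0)))   # cohere/command-a verdicts
```

A single-run result such as the 100% above carries far less weight than the 47-run figure.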

● 2026-07-26

Qwen3-32B shows 34% latency gain but factual score plummets to 35

The current benchmark window shows a mixed picture for Qwen3-32B deployed on OVH AI Endpoints. Latency improved substantially: p50 dropped from 24,595 ms to 16,206 ms, a 34% reduction. The overall quality score declined slightly, from 73.4 to 72.3.

The most concerning result is factual performance, which scores just 35. The previous window did not measure the factual category (it measured coding, which scored 94), so there is no direct baseline for comparison, but a score this low points to a serious weakness in knowledge accuracy and reliability.

On the positive side, multilingual capability strengthened from 86 to 95, and reasoning stands at 83. Creative writing rebounded from 40 to 76, reversing the sharp decline noted in the previous period.

The model's strengths now lie in multilingual tasks and creative generation, while factual accuracy lags. Users who need precise factual responses should exercise caution; those focused on creative or multilingual applications may find the current configuration more suitable. The latency improvement makes the service more responsive overall, but the factual gap remains a critical weakness for general-purpose deployments.
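The 34% figure is the relative drop in p50 latency. The sketch below reproduces it and, for contrast, computes the equivalent gain in speed (requests per unit time), which is a different and larger number. The function names are hypothetical.

```python
def percent_latency_reduction(before_ms, after_ms):
    """Relative drop in latency, in percent."""
    return (before_ms - after_ms) / before_ms * 100


def percent_speed_gain(before_ms, after_ms):
    """Relative gain in speed (the inverse of latency), in percent."""
    return (before_ms / after_ms - 1) * 100


before_ms, after_ms = 24595, 16206  # p50 values quoted in the verdict

print(round(percent_latency_reduction(before_ms, after_ms)))
print(round(percent_speed_gain(before_ms, after_ms)))
```

Keeping the two definitions apart avoids overstating or understating the change when comparing windows.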

Quality: 72.3

Latency p50: 16,206 ms

Test runs

✓ Latency improved 34%

✗ Factual score dropped to 35

✓ Multilingual performance up to 95

✓ Creative rebounds from 40 to 76

Last automated test

Jul 30, 2026 · 08:04 UTC · Speed benchmark

P50 latency: 475 ms

P95 latency: 620 ms

Errors: 0 / 6 runs

Last reviewed by Tokonomix Team · July 30, 2026