Tier B — Production

Runs in: France · Made in: France

Mistral-Small-3.2-24B-Instruct-2506

Tokonomix Editorial Team · Reviewed by Mes Kalkan · Published May 27, 2026 · Last reviewed July 31, 2026

Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median) · P95 latency · 105 runs
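As an illustration of how P50 and P95 can be read off a set of latency samples, here is a minimal sketch using the nearest-rank method. The percentile method Tokonomix actually uses is not stated on this page, and the sample values and the `percentile` helper below are hypothetical.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in milliseconds (not taken from the benchmark data).
latencies_ms = [100, 110, 120, 130, 140, 200]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"P50 = {p50} ms, P95 = {p95} ms")
```

With few runs, P95 is effectively the slowest sample, so it is best read as a rough indicator of peak-load behaviour rather than a precise figure.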

Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Chart categories: Creative · Factual · Multilingual · Reasoning (chart label: 100)

Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — Mistral-Small-3.2-24B-Instruct-2506

$0.0900 per 1M input tokens

$0.2800 per 1M output tokens

≈ $0.0001 per typical conversation (800 tokens)
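The per-conversation estimate can be reproduced roughly from the listed rates. The page does not say how the 800 tokens divide between input and output, so the 600/200 split below is an assumption chosen only for illustration, and `conversation_cost` is a hypothetical helper.

```python
INPUT_RATE_USD_PER_M = 0.09   # listed input rate, USD per 1M tokens
OUTPUT_RATE_USD_PER_M = 0.28  # listed output rate, USD per 1M tokens

def conversation_cost(input_tokens, output_tokens):
    """Cost in USD of one conversation at the listed per-million-token rates."""
    return (
        input_tokens * INPUT_RATE_USD_PER_M
        + output_tokens * OUTPUT_RATE_USD_PER_M
    ) / 1_000_000

# Assumed split of an 800-token conversation: 600 input + 200 output.
cost = conversation_cost(600, 200)
print(f"${cost:.4f}")
```

Because output tokens cost about three times as much as input tokens, a conversation with a larger share of output will cost somewhat more than this estimate.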

Input vs output price (per 1M tokens)

per 1M input tokens: $0.0900

per 1M output tokens: $0.2800

Pricing over time

Input & output per 1M tokens · step-line = price changes

Input: $0.0900 / 1M (stable)

Output: $0.2800 / 1M (stable)

Dates shown: 2026-06-14 · 2026-07-05 · 2026-07-26

⟳ synced weekly

Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s): 1550 / avg 1463

Estimated as 200 assumed output tokens ÷ P50 latency. The absolute number depends on this assumption; the trend is what matters.
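The throughput estimate can be sketched as a fixed assumed output length divided by the P50 latency in seconds. The `estimate_throughput` helper is hypothetical. The 129 ms input is the P50 latency reported for the last automated speed benchmark on this page.

```python
ASSUMED_OUTPUT_TOKENS = 200  # the page's stated assumption

def estimate_throughput(p50_latency_ms, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Estimated tokens per second: assumed output tokens / P50 latency in seconds."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return output_tokens / (p50_latency_ms / 1000)

tokens_per_second = estimate_throughput(129)
print(round(tokens_per_second))
```

The result scales linearly with the assumed output length, which is why the absolute value is less informative than its movement over time.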

Section 05

Capabilities

ownedBy: mistralai

Section 06

Availability

How often this model answers when we call it — measured across real API requests and live tests over the last 30 days. This is separate from quality: these numbers only tell you whether the model responds, not how good the answer is.

Last 7 days

100.0%

n=1,231

Last 30 days

100.0%

n=6,099

Median response time

1,915ms

n=6,099

Based on 6,479 measurements over the last 30 days.

Technical details

Only live API calls and live-test requests count — internal probes and benchmark runs are excluded.

Calls with a custom API key (BYOK) are excluded: those failures are key-specific, not a sign of model downtime.

Failed calls are NOT included in quality scores — quality is measured on successful responses only. Availability and quality are independent signals.

Median response time (p50) across successful calls with a recorded duration. Outliers (very slow or very fast calls) pull the median less than the average.
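To illustrate why the median is reported rather than the average, the sketch below compares both on a small set of durations containing one very slow call. The values are hypothetical and are not taken from the measurements on this page.

```python
import statistics

# Hypothetical call durations in milliseconds; the last one is an outlier.
durations_ms = [1800, 1900, 2000, 2100, 30000]

median_ms = statistics.median(durations_ms)
mean_ms = statistics.mean(durations_ms)

print(f"median = {median_ms} ms, mean = {mean_ms} ms")
```

A single slow call moves the mean far above what a typical request experiences, while the median stays at the middle value.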

Total calls (30d)

6,099

OK responses (30d)

6,099

Total calls (7d)

1,231

OK responses (7d)

1,231
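The availability percentages follow from the call counts above as OK responses divided by total calls. A minimal sketch, assuming one-decimal rounding as displayed on the page (`availability_pct` is a hypothetical helper):

```python
def availability_pct(ok_responses, total_calls):
    """Share of calls that received a response, as a percentage with one decimal."""
    if total_calls <= 0:
        raise ValueError("total_calls must be positive")
    return round(100 * ok_responses / total_calls, 1)

print(availability_pct(6099, 6099))  # 30-day window
print(availability_pct(1231, 1231))  # 7-day window
```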

Image quality control pilot (2026-06-10)

Recall

9.4%

n=300

False-alarm rate

12.1%

n=300

Section 07

Tokonomix benchmark verdicts

⚖️

Endorsed by 2 judges

Independent LLM judges evaluated this model on our weekly intelligence tests

cohere/command-a: 100/100 · 1 run

1 correct · 0 partial · 0 wrong · 100% accuracy

claude-sonnet-4-5: 93/100 · 48 runs

43 correct · 5 partial · 0 wrong · 90% accuracy
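The accuracy figures are consistent with counting only fully correct verdicts over all runs, with partial verdicts earning no credit. That reading is an assumption inferred from the numbers shown, not a documented formula, and `judge_accuracy_pct` is a hypothetical helper.

```python
def judge_accuracy_pct(correct, partial, wrong):
    """Percentage of runs judged fully correct, rounded to a whole number."""
    total = correct + partial + wrong
    if total == 0:
        raise ValueError("no runs recorded")
    return round(100 * correct / total)

print(judge_accuracy_pct(1, 0, 0))   # cohere/command-a verdict counts
print(judge_accuracy_pct(43, 5, 0))  # claude-sonnet-4-5 verdict counts
```

The separate 0 to 100 judge score (for example 93/100) is not derivable from these counts alone, so it is left out of the sketch.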

● 2026-07-26

Quality drops 10.2 points to 84.6 amid 38% latency increase

Mistral-Small-3.2-24B-Instruct-2506 degraded notably in this benchmark window: overall quality fell from 94.8 to 84.6 points, and median latency rose 38% to 6,559 ms.

Multilingual performance held at 100 points, consistent with previous windows, and creative output was roughly stable, moving from 85 to 87 points. The category mix also changed: coding was not evaluated in this window, while newly reported reasoning scores came in strong at 95 points. The weakest result is factual accuracy, which scored only 57 points, well below the model's other categories.

Slower responses combined with lower quality scores suggest possible infrastructure or configuration issues at the OVH AI Endpoints GRA deployment. Users should expect longer wait times and exercise caution with factual queries, though the model continues to excel at multilingual tasks and shows strong reasoning. Upcoming benchmark windows should show whether this is a temporary regression or a sustained shift in model behavior.

Quality

84.6

Latency p50

6,559 ms

Test runs

✗ Quality dropped 10.2 points

✗ Latency increased 38%

✗ Factual score only 57

✓ Multilingual remains perfect 100

Last automated test

Jul 31, 2026 · 08:03 UTC · Speed benchmark

P50 latency

129 ms

P95 latency

182 ms

Errors

0 / 6 runs

Last reviewed by Tokonomix Team · July 31, 2026