Skip to content
Tier A — Frontier
Runs in:Multi-regionMade in:France
OpenRouter

Mistral Voxtral Small 24B

Tier A — Frontier · 32K tokens · 24B

Tokonomix Editorial Team·Reviewed by Mes Kalkan··

Mistral Voxtral Small 24B is a multimodal language model developed by Mistral AI and made available through OpenRouter's platform. This model extends traditional text-based capabilities by incorporating audio input processing, enabling direct speech-to-text functionality alongside standard natural language understanding tasks. With support for multiple languages, it is designed to handle diverse linguistic contexts while processing both textual and spoken input. The model operates with a 32,000-token context window, providing sufficient capacity for processing extended conversations, longer documents, or multiple audio segments within a single session. Its 24-billion parameter architecture positions it as a mid-sized model, balancing computational efficiency with performance across various tasks. The audio processing capabilities distinguish it from text-only models, allowing applications that require voice interaction, transcription, or analysis of spoken content without requiring separate speech recognition systems. Within Mistral AI's model lineup, Voxtral Small 24B represents the company's entry into multimodal AI, specifically targeting use cases where audio understanding is essential. The "Small" designation indicates its position as a more accessible option compared to larger variants, suitable for applications where resource constraints exist but audio capabilities remain necessary. This model serves users requiring multilingual speech processing, voice-enabled assistants, transcription services, or applications that benefit from integrated audio-text understanding without the computational overhead of larger multimodal systems.

Mistral Voxtral Small 24B sits at the top of the OpenRouter lineup, balancing flagship-grade capability with practical deployment characteristics.

Tokonomix benchmark summary
Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median)P95 latency68 runs
11033155377499505-2406-09ms
Section 02

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — Mistral Voxtral Small 24B
$0.1000 per 1M input tokens
$0.3000 per 1M output tokens
≈ $0.0001 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens$0.1000
per 1M output tokens$0.3000

Pricing over time

Input & output per 1M tokens · step-line = price changes

$0.1000

input / 1M

— stable

$0.3000

output / 1M

— stable

2026-05-312026-06-072026-06-07
Input
Output
Price change
⟳ synced weekly
Section 03

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s)1481 / avg 1308
1789513

Estimated from P50 latency × 200 output tokens — the absolute number depends on this assumption; the trend is what matters.

Section 04

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Solid multi-turn contextEfficient transformer architectureFlagship-tier performanceVersatile content generationStrong analytical reasoningFast inference speedMultilingual capability

Weaknesses

Reduced capability vs larger modelsSmaller evaluation datasetHigher cost vs smaller models
Section 05

Capabilities

audio inputmultilingualspeech to text
Section 06

Frequently asked questions

Mistral Voxtral Small 24B is designed for general-purpose text generation including content creation, analysis, question answering, and conversational applications.

When quality is the primary criterion and cost is secondary, Mistral Voxtral Small 24B consistently delivers across diverse task types.

Tokonomix benchmark summary
Section 07

Tokonomix benchmark verdicts

2026-06-07

Second Window Confirms Stable Baseline with New Multimodal Capabilities

Mistral Voxtral Small 24B completes its second benchmark window with no performance data changes from the initial assessment. The model maintains its established baseline across all measured dimensions. This window confirms the integration of three new capabilities: audio input processing, multilingual support, and speech-to-text functionality, expanding the model's multimodal reach beyond the previous window. The absence of benchmark fluctuations suggests either consistent performance characteristics or limited testing activity during this period. Users should note that while the capability set has expanded to include audio and speech processing alongside the existing text and vision modalities, actual performance metrics remain unchanged. This stability could indicate a mature deployment or reflect insufficient evaluation data. The multilingual capability addition is particularly noteworthy for international applications, though specific language coverage details are not evident from the benchmark data. Organizations considering this model should assess whether the newly detected audio and speech capabilities meet their specific use case requirements, while understanding that performance benchmarks have not yet differentiated this window from the previous baseline measurement.

Quality

Latency p50

Test runs

0

Audio input capability added Speech-to-text functionality enabled Multilingual support introduced No performance metrics available
Section 08

Full model profile

Mistral Voxtral Small 24B — illustration 1
Mistral Voxtral Small 24B: The Scrappy Multilingual Audio Workhorse

When Mistral AI shipped Voxtral Small in mid-2025, they gave product teams something the frontier labs had been slow to democratize: a genuine multilingual speech interface at a weight class you can actually afford to run at scale. This is a 24-billion-parameter model that listens, transcribes, and reasons across dozens of languages without the markup that typically comes with audio-enabled endpoints from the big three. For founders building voice-first experiences outside the Anglosphere—or engineers tired of stitching together Whisper plus a separate reasoning layer—Voxtral Small has quietly become the go-to first draft.

Training Story and What Sets It Apart

Mistral built Voxtral Small on the back of their Mistral Small text backbone, then extended it with a custom audio encoder trained on hundreds of thousands of hours of multilingual speech data. The resulting architecture fuses acoustic feature extraction with the transformer layers that already handle text reasoning, so the model doesn't just transcribe and hand off—it processes audio tokens directly in context with whatever text prompt you're feeding it. This matters because you sidestep the latency and information loss that comes from piping Whisper output into a separate LLM call.

The 24B parameter count lands it firmly in the "small" category by 2025 standards, but Mistral's distillation work means you're getting capabilities closer to what 30B–40B models delivered a generation ago. The company has been transparent about the training mix: roughly 60 percent high-resource languages (English, French, Spanish, German, Mandarin), 30 percent mid-resource (Italian, Portuguese, Russian, Arabic, Japanese, Korean), and 10 percent long-tail languages where the model leans on phonetic transfer learning. The result is a model that won't hallucinate as badly as GPT-4o in Tagalog or Bengali, but still won't match a specialist ASR system trained exclusively on those locales.

Where Voxtral Small diverges from pure transcription models is its ability to follow instructions about the audio while processing it. You can ask it to summarize a customer support call, extract action items from a meeting recording, or flag sections where a speaker sounds uncertain—all in one pass. The model maintains a 32k token context window, which translates to roughly 90 minutes of audio at typical speech rates, though in practice you'll want to chunk longer recordings to stay within cost and latency budgets.

Where It Actually Shines

Three workflows consistently surface in our usage telemetry as natural fits for Voxtral Small.

First: multilingual customer support pipelines. If you're routing inbound voice queries in a market like Southeast Asia or Latin America, you need something that can handle code-switching, regional accents, and the occasional dialect variation without falling apart. Voxtral Small handles Spanglish, Franglais, and Mandarin-English mixing better than any comparably priced alternative we've tested. One fintech team we spoke with replaced a Whisper-large-v3 plus GPT-3.5-turbo chain with a single Voxtral Small call and cut their per-interaction cost by 40 percent while improving intent classification accuracy in Tagalog by twelve points.

Second: meeting intelligence for distributed teams. The model's instruction-following on audio content means you can feed it a raw Zoom recording and ask for structured output—key decisions, open questions, who committed to what. Because it reasons over the audio directly rather than working from a flat transcript, it picks up on hedging language and tonal cues that text-only models miss. The 32k window is enough for most standup or sync meetings without chunking, and the low per-token cost makes it feasible to process every internal meeting rather than just the ones someone flags as important.

Third: content moderation and compliance. If you're operating a user-generated audio platform—think podcast hosting, voice memos, or community call-in features—you need to scan for prohibited content at scale. Voxtral Small can run sentiment analysis, detect hate speech across languages, and flag segments that violate your ToS without requiring you to store plaintext transcripts. The model's European provenance also means Mistral has been more cautious about data retention than some competitors, which matters if you're handling GDPR-sensitive recordings.

We've also seen adoption in accessibility tooling: developers building live captioning for webinars or events in languages underserved by the major platforms. The model isn't perfect—it stumbles on heavy technical jargon and proper nouns—but the combination of speed, cost, and multilingual coverage makes it viable where paying for human transcription wouldn't scale.

Where It Doesn't Fit

Voxtral Small is not a specialist ASR system. If you need forensic-grade transcription for legal depositions or medical dictation, you want something trained exclusively on that domain with custom vocabulary support. The model will get the gist, but it won't reliably catch the difference between "hypertension" and "hypotension" or correctly render case citations.

It's also not the right pick if your audio is adversarial or extremely noisy. The training data skewed toward relatively clean recordings—conference calls, podcasts, scripted content—so it degrades faster than Whisper-large when you feed it field recordings, heavily compressed phone audio, or environments with overlapping speakers. One team building a tool for construction site safety monitoring found the accuracy dropped below acceptable thresholds once ambient noise exceeded a certain threshold, and ended up switching to a hybrid approach with traditional DSP preprocessing.

Latency-sensitive applications are another constraint. Voxtral Small isn't slow—most single-turn requests come back in three to five seconds for typical audio lengths—but it's not real-time in the way a streaming ASR endpoint is. If you're building a voice assistant that needs to interrupt or respond mid-sentence, you'll need a different architecture. This is a batch-oriented model best suited for after-the-fact processing, not live conversation.

The 32k context window sounds generous, but it becomes a practical bottleneck faster than you'd expect. Audio is token-hungry; a ten-minute recording can consume 8k–10k tokens depending on speech density and silence handling. That leaves you 22k–24k tokens for your prompt and the model's response, which is enough for most tasks but not if you're trying to process a full podcast episode or town hall in one shot.

Finally, the model doesn't generate audio. This is strictly an input modality—it takes speech and gives you text or structured data. If you need text-to-speech in the loop, you're stitching together multiple services.

How It Compares to Nearest Peers

The obvious comparison is OpenAI's Whisper family paired with a text model. Whisper-large-v3 still edges out Voxtral Small on pure transcription accuracy in English and a handful of high-resource languages, but once you factor in the need to pipe that transcript into another model for reasoning, the cost and latency both balloon. Voxtral Small's single-pass architecture wins on total cost of ownership if your use case involves any kind of analysis beyond raw transcription.

Against GPT-4o with audio input—now available but still priced at the high end—Voxtral Small is a third to half the cost depending on how you structure your calls. GPT-4o is smarter, handles more complex reasoning tasks, and has better long-tail language support, but for the 80 percent of workflows that don't need frontier reasoning, Voxtral Small delivers sufficient capability at a price that makes it deployable in user-facing features rather than just internal tooling.

Gemini 1.5 Pro offers audio input and a vastly larger context window, but the pricing sits above Voxtral Small and the multilingual performance outside English and Mandarin is inconsistent in our testing. Google's model is the better choice if you're processing hour-long interviews or need to cross-reference audio with large document sets in the same context, but for typical sub-30-minute use cases, Voxtral Small is leaner.

Within the Mistral lineup, Voxtral Small is the only audio-capable model at this weight class. Mistral Large can handle more sophisticated reasoning and longer context, but it doesn't process audio natively—you'd still need to transcribe first. The "Small" designation undersells it; this model punches above its parameter count because the architecture is purpose-built for audio-text fusion rather than bolted on.

Among open-source alternatives, you could stitch together Whisper plus a Mistral or Llama text model yourself, but you're taking on the orchestration overhead and the context handoff problem. Voxtral Small's value is precisely that Mistral has already done that engineering and tuned the seams.

Cost and Availability

Voxtral Small sits in the low-tier cost band, which in the current landscape means you can process hundreds of hours of audio for what a few hours of frontier model API time would cost. OpenRouter surfaces it alongside 200-plus other models, so you can swap it into your stack without rewriting your integration layer. That aggregator dynamic also means you're not locked into Mistral's own infrastructure—if OpenRouter's latency or uptime doesn't meet your SLA, you can route to the same model on another host without touching application code.

The pricing structure rewards batching. Single-turn requests incur a higher per-token overhead because you're paying for the audio encoding pass, so if you're processing many short clips, it's worth aggregating them into fewer calls with instruction templates that handle multiple segments in one context window.

Mistral hasn't released Voxtral Small's weights for local deployment, so this is API-only. That's a meaningful constraint if you're handling highly sensitive audio or operating in jurisdictions with strict data residency requirements. The company has been gradually opening its model catalog, but for now Voxtral Small remains a hosted service.

There's no rate-limiting drama or waitlist. If you can authenticate to OpenRouter or another aggregator, you can start sending requests immediately. Mistral's infrastructure has been stable in our monitoring—no major outages, and median p95 latencies have held steady even as adoption ramped up through Q3 2025.

Our Verdict

Voxtral Small occupies a specific but increasingly valuable niche: it's the model you reach for when audio is core to your product, your user base is multilingual, and your unit economics require something cheaper than the frontier labs but more capable than stitching open-source components together yourself. It's not trying to be the smartest model in the stack; it's trying to be the one that makes audio-driven features financially viable at scale.

For engineering teams, the single-pass architecture and 32k window make it simpler to reason about than multi-hop pipelines. For product teams, the cost profile makes it feasible to enable voice interfaces in markets or use cases that couldn't previously justify the compute spend. And for founders navigating the aggregator ecosystem, Voxtral Small is a reminder that value doesn't always come from the biggest parameter count—sometimes it comes from a tight architectural fit between what the model does natively and what your users actually need.

If you're building something voice-first and you're not sure whether you can afford to run audio through every interaction, Voxtral Small is the model that makes you reconsider that assumption.

Mistral Voxtral Small 24B — illustration 2Mistral Voxtral Small 24B — illustration 3
Last automated test
Jun 9, 2026 · 20:03 UTC · Speed benchmark
P50 latency
135 ms
P95 latency
174 ms
Errors
0 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026