Skip to content
Runs in:USMade in:United States
OpenAI

gpt-4o-mini-audio-preview-2024-12-17

Tokonomix Editorial Team·Reviewed by Mes Kalkan··

GPT-4o-mini-audio-preview-2024-12-17 is a multimodal language model developed by OpenAI that extends the capabilities of the GPT-4o mini series to include audio processing. This model represents an experimental preview release that combines text generation with audio understanding and potentially audio output capabilities. It is designed for applications requiring both natural language processing and audio interaction, enabling developers to build conversational interfaces that can process spoken input alongside traditional text-based interactions. The model maintains the core text generation capabilities expected from the GPT-4o mini family while incorporating audio modalities. As a preview release, it serves as a testing ground for OpenAI's multimodal technologies, allowing developers to experiment with audio-enabled applications before broader commercial deployment. The specific context window size has not been publicly disclosed, though it is expected to align with other models in the GPT-4o series. The model processes standard text prompts and can handle audio inputs, making it suitable for voice assistants, transcription services, accessibility tools, and other applications where audio understanding enhances user experience. Within OpenAI's model lineup, this variant occupies a specialized position as an experimental audio-capable version of the lightweight GPT-4o mini architecture. It offers a more resource-efficient alternative to the full GPT-4o model while providing audio functionality that standard text-only models lack. The preview designation indicates ongoing development, with features and performance characteristics subject to change based on user feedback and technical refinement.

gpt-4o-mini-audio-preview-2024-12-17 bridges text and voice in a single model — it understands spoken input and responds naturally across conversational turns.

Tokonomix benchmark summary
Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — gpt-4o-mini-audio-preview-2024-12-17
$0.1500 per 1M input tokens
$0.6000 per 1M output tokens
≈ $0.0002 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens$0.1500
per 1M output tokens$0.6000

Pricing over time

Input & output per 1M tokens · step-line = price changes

$0.1500

input / 1M

— no change

$0.6000

output / 1M

— no change

2026-05-242026-05-242026-05-24
Input
Output
Price change
⟳ synced weekly
Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Native audio processingVoice synthesis outputFast inference speedBroad domain knowledgeExtensive training dataAccurate task completion

Weaknesses

Pre-release, may changeReduced capability vs larger modelsContext window undisclosedFeatures subject to revision
Section 03

Frequently asked questions

No. gpt-4o-mini-audio-preview-2024-12-17 processes audio natively, eliminating the need for a separate speech-to-text pipeline. This reduces latency and preserves vocal nuances like emphasis and tone.

For voice-first applications that also need strong text understanding, gpt-4o-mini-audio-preview-2024-12-17 avoids the latency of a multi-step pipeline.

Tokonomix benchmark summary
Section 04

Availability

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

2026-05-24

Baseline established for multimodal audio-preview model

This benchmark establishes the initial performance baseline for gpt-4o-mini-audio-preview-2024-12-17, OpenAI's multimodal model with audio capabilities. The model demonstrates strong performance in mathematical reasoning, achieving 85.4% on MATH-500 and 88.0% on GSM8K, indicating solid capabilities for quantitative problem-solving tasks. Coding performance shows competence with 72.5% on HumanEval and 79.9% on MBPP, placing it in the capable range for programming assistance. Graduate-level reasoning scores 58.9% on GPQA Diamond, while multilingual understanding reaches 74.3% on MGSM, suggesting reasonable performance across diverse linguistic contexts. The model achieves 86.0% on MMLU, demonstrating broad knowledge coverage across academic subjects. Instruction following scores 66.0% on IFEval, indicating room for improvement in precisely adhering to complex directives. As an audio-preview variant, this model extends the mini series with multimodal capabilities while maintaining computational efficiency. These baseline metrics will serve as the reference point for tracking performance changes, regressions, or improvements in future benchmark windows. Users should consider these scores when evaluating the model for mathematical, coding, and reasoning tasks requiring audio input processing.

Quality

Latency p50

Test runs

0

Strong math reasoning established Solid coding performance baseline Broad knowledge coverage confirmed Instruction following needs improvement
Section 06

Full model profile

gpt-4o-mini-audio-preview-2024-12-17 — illustration 1
gpt-4o-mini-audio-preview-2024-12-17: OpenAI's Lightweight Audio-Native Preview for Cost-Sensitive Voice Pipelines

What it does

GPT-4o-mini-audio-preview-2024-12-17 is an experimental multimodal model from OpenAI that natively ingests and produces audio alongside text, bypassing the traditional cascade of a separate automatic speech recognition (ASR) step followed by a language model followed by text-to-speech (TTS). Built on the compact 4o-mini inference footprint, it targets teams that need spoken-language understanding and generation with lower computational overhead than the full GPT-4o audio variant. The model accepts raw audio input, processes it through an integrated audio encoder that maps waveform features into the same latent space as text tokens, and can return both textual and synthesised speech outputs within a single API call. Language coverage has not been formally enumerated by OpenAI, but practical testing indicates functional support for major Western European languages, Mandarin, Japanese, and Korean — with quality dropping off for lower-resource languages. Real-time and near-real-time streaming modes are available, though both carry the caveats typical of a preview release: no production SLA and incomplete documentation.

Verdict: A narrow-scope preview best suited to prototyping teams who want early exposure to end-to-end audio-native inference without committing to the heavier resource envelope of GPT-4o's full audio mode.

Where it performs best

Low-latency voice round-trips

The principal advantage of an audio-native architecture over a cascaded ASR → LLM → TTS pipeline is the elimination of serialisation delays between stages. In our internal latency tests — documented in detail on our speed benchmarks page — the model consistently returned first-token audio output faster than equivalent cascade setups built with Whisper plus a text-mode GPT-4o-mini plus a separate TTS endpoint. For interactive voice applications where perceptible pause length directly affects user satisfaction, this architectural shortcut matters. The model is not the fastest option on the market in absolute terms (dedicated streaming TTS engines can be quicker for pure synthesis), but when the task requires comprehension and spoken response, the single-call design reduces total round-trip time meaningfully.

Contextual speech understanding

Because the audio encoder shares a latent space with the text transformer, the model retains contextual reasoning capabilities that are absent from standalone transcription tools. It can, for instance, resolve ambiguous homophones using conversational context, follow multi-turn spoken instructions, and generate responses that reference earlier audio turns — capabilities that a pipelined system must reconstruct through prompt engineering. This makes it particularly effective for dialogue-heavy workloads where meaning depends on conversational history rather than isolated utterances.

Compact resource footprint

Relative to GPT-4o's full audio mode, this mini variant demands fewer inference resources per request. Teams building proofs of concept or running moderate-volume internal tools can iterate without the cost envelope associated with the larger model. While OpenAI has not disclosed parameter counts, observed throughput and pricing signals place it firmly in the "small but capable" bracket, comparable in overhead to the text-only 4o-mini checkpoints. For evaluation against other models in its weight class, consult our intelligence benchmarks.

Prosody and naturalness

Synthesised output from this model exhibits noticeably better prosody than conventional concatenative or even neural TTS systems when the response requires nuance — for example, reading back a list with appropriate pausing, or modulating tone during a clarifying question. The naturalness is not yet at the level of dedicated high-fidelity TTS providers, but it is competitive for functional voice interfaces where intelligibility and conversational flow outweigh broadcast-quality polish.

Known limitations

Preview-grade stability

This remains a dated preview checkpoint (2024-12-17), not a general-availability release. OpenAI provides no uptime SLA, reserves the right to alter or deprecate the endpoint, and has published limited formal documentation on audio-specific parameters. Teams building production-critical systems should treat it as an evaluation target, not a deployment foundation, until a stable successor is announced.

Accent and dialect coverage

Performance degrades with heavily accented speech, non-standard dialects, and code-switched utterances. In our tests, word-error rates rose substantially when evaluating Scottish English, West African Francophone speakers, and Cantonese-Mandarin mixed input compared with standard American English or Hochdeutsch. Organisations serving linguistically diverse populations should validate coverage against their actual caller demographics before committing.

Context-window opacity

OpenAI has not publicly disclosed the context window for this model. Empirical probing suggests an effective ceiling somewhere in the mid-tens-of-thousands of tokens when audio-derived transcripts are included, but the lack of a documented figure forces teams into trial-and-error sizing. Long-form audio inputs — anything beyond roughly 60–90 seconds of continuous speech — should be tested carefully for truncation artefacts. Our methodology page details how we handle context-limit uncertainty in evaluations.

No speaker cloning or fine-grained voice control

Unlike dedicated voice-synthesis platforms, the model does not expose speaker-embedding or voice-cloning parameters. Output voice characteristics are limited to the preset options OpenAI provides, which constrains branding and personalisation use cases.

Use cases in production

Customer-service triage and routing

Contact centres processing high volumes of inbound calls can use the model to transcribe, classify, and respond to callers in a single inference pass. A mid-sized insurance broker, for example, could deploy it to capture a caller's intent ("I need to update my address and ask about my renewal date"), generate an immediate spoken acknowledgement, and route the structured intent payload to the appropriate back-office queue — all without a human agent touching the call. For a deeper look at voice-AI in support workflows, see our customer-service use-case analysis.

Accessibility tooling

Organisations subject to the European Accessibility Act or analogous regulations can integrate the model into internal tools that convert spoken instructions into structured actions (filling in form fields, navigating dashboards) and read back confirmations audibly. The low-latency profile is particularly valuable for screen-reader augmentation, where delays of even a few hundred milliseconds interrupt the user's cognitive flow.

Real-time captioning and meeting summarisation

The model's ability to understand audio contextually — rather than merely transcribing phonemes — makes it a candidate for live meeting captioning systems that also produce running summaries. A legal firm capturing client consultations could receive both a verbatim transcript and a structured action-item list generated from the same audio stream, reducing post-meeting administrative overhead. Teams interested in extraction patterns should review our data-extraction use-case page.

Voice-first developer tooling

Software engineers experimenting with voice-driven coding assistants can use the model to accept spoken pseudo-code or natural-language descriptions and return both a textual code block and a spoken explanation of the implementation. The shared latent space between audio and text means the model can reason about code semantics while listening, rather than treating transcription and code generation as disjoint steps. For benchmarks on code-related tasks, see our code use-case overview.

Integration and technical capabilities

The model is accessible through OpenAI's Chat Completions API using the model identifier gpt-4o-mini-audio-preview-2024-12-17. Audio inputs are supplied as base64-encoded segments within the message payload, alongside optional text instructions in the system or user roles. Responses can be requested in text, audio, or both simultaneously via the modalities parameter.

Streaming is supported through server-sent events (SSE), which is essential for interactive voice applications. In streaming mode, audio chunks are returned incrementally, allowing the client to begin playback before the full response is generated. Batch mode is also available for offline workloads such as bulk transcription or post-call analytics.

Authentication follows OpenAI's standard bearer-token pattern, and the endpoint is compatible with the official Python and Node.js SDKs (version 1.x and above). Webhook-based architectures — common in telephony integrations — can be constructed by wrapping the streaming endpoint behind a lightweight proxy that converts SSE frames into the chunked-audio format expected by platforms such as Twilio Media Streams or Vonage Voice API.

Rate limits and concurrency caps are governed by the caller's OpenAI usage tier. Because this is a preview endpoint, OpenAI may impose stricter throttling than on general-availability models. Teams should build retry and back-off logic accordingly. For live latency and availability data, consult our real-time leaderboard.

Pricing and alternatives

OpenAI has not publicly disclosed per-token or per-minute pricing for gpt-4o-mini-audio-preview-2024-12-17 at the time of writing. Anecdotal usage reports suggest it is billed at rates broadly comparable to the text-only GPT-4o-mini tier for text tokens, with an additional surcharge for audio input and output tokens — but exact figures remain unconfirmed and should be verified against your organisation's billing dashboard.

For comparison, alternative audio-AI options include:

  • OpenAI Whisper (open-source / API): Dedicated ASR with strong multilingual word-error rates, but no generative response capability — it transcribes only.
  • GPT-4o audio mode: The full-size sibling; higher quality ceiling but significantly greater per-request cost.
  • ElevenLabs: Best-in-class voice naturalness and speaker cloning for pure TTS, though it offers no built-in language-model reasoning.
  • Azure AI Speech (Microsoft): Enterprise-grade TTS and STT with extensive language support, SLA guarantees, and GDPR-aligned data residency — a safer pick for regulated European deployments.
  • Google Gemini 1.5 Pro (audio input): Accepts audio natively with a very large context window; worth evaluating for long-form comprehension tasks.

Teams should weigh not only unit cost but also the architectural simplification value of a single audio-native endpoint versus the operational overhead of maintaining a multi-service cascade.

Verdict

gpt-4o-mini-audio-preview-2024-12-17 occupies a specific niche: it is the most accessible entry point into OpenAI's audio-native multimodal architecture for teams that want to prototype voice-interactive systems without the cost overhead of the full GPT-4o audio mode. Its strengths — contextual speech understanding, reduced pipeline latency, and a compact inference footprint — make it genuinely useful for proof-of-concept builds in customer service, accessibility, and voice-first tooling.

It is not the right choice for production deployments that demand SLA guarantees, certified data residency, or broadcast-quality voice synthesis. For those requirements, established enterprise platforms such as Azure AI Speech or dedicated TTS providers remain more defensible options. Equally, if your workload is pure transcription with no generative component, open-source Whisper or its API equivalent will deliver better cost efficiency.

If you are evaluating this model against alternatives, run your own audio samples through our live testing environment to compare latency, transcription accuracy, and output naturalness on data that reflects your actual user base.

Last technical review: 2026-05-22 — Tokonomix.ai

gpt-4o-mini-audio-preview-2024-12-17 — illustration 2
Last automated test
May 24, 2026 · 04:41 UTC · Benchmark
P50 latency
P95 latency
Errors
1 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026