Skip to content
Runs in:USMade in:United States
OpenAI

gpt-4o-audio-preview-2024-12-17

Tokonomix Editorial Team·Reviewed by Mes Kalkan··

GPT-4o-audio-preview-2024-12-17 is a multimodal language model developed by OpenAI that extends the capabilities of the GPT-4o series to include native audio processing. This model can accept and generate both text and audio inputs and outputs, enabling applications that require voice interaction, audio understanding, or speech synthesis. As a preview release from December 2024, it represents OpenAI's ongoing development of models that can process multiple modalities within a unified architecture rather than through separate, pipelined systems. The model is designed for applications requiring real-time voice interaction, audio content analysis, or scenarios where audio context provides important information beyond text alone. Its technical architecture builds on the GPT-4o foundation, which integrates vision, text, and audio processing in a single model rather than combining separate specialized models. The specific context window size has not been publicly documented by OpenAI at the time of this preview release. Within OpenAI's model lineup, GPT-4o-audio-preview sits alongside other GPT-4o variants as an experimental offering that allows developers early access to audio capabilities before they are integrated into the main production models. As a preview model, it may have different performance characteristics, limitations, or availability compared to OpenAI's stable production releases. The model supports standard text generation tasks while adding audio modality support, making it suitable for developers exploring voice-enabled applications or audio-centric use cases.

gpt-4o-audio-preview-2024-12-17 bridges text and voice in a single model — it understands spoken input and responds naturally across conversational turns.

Tokonomix benchmark summary
Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — gpt-4o-audio-preview-2024-12-17
$2.50 per 1M input tokens
$10.00 per 1M output tokens
≈ $0.0035 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens$2.50
per 1M output tokens$10.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$2.50

input / 1M

— no change

$10.00

output / 1M

— no change

2026-05-242026-05-242026-05-24
Input
Output
Price change
⟳ synced weekly
Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Native audio processingVoice synthesis outputBroad domain knowledgeExtensive training dataAccurate task completionAPI-first integration

Weaknesses

Pre-release, may changeContext window undisclosedFeatures subject to revision
Section 03

Frequently asked questions

No. gpt-4o-audio-preview-2024-12-17 processes audio natively, eliminating the need for a separate speech-to-text pipeline. This reduces latency and preserves vocal nuances like emphasis and tone.

For voice-first applications that also need strong text understanding, gpt-4o-audio-preview-2024-12-17 avoids the latency of a multi-step pipeline.

Tokonomix benchmark summary
Section 04

Availability

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

2026-05-24

Strong multimodal baseline with high creative writing capabilities

The GPT-4o audio preview model establishes a competitive baseline across standard benchmarks. It achieves 87.2% on MMLU, positioning it among top-tier language models, and demonstrates particularly strong creative writing performance with an 86.5% rating on creative writing tasks. The model shows solid mathematical reasoning at 83.9% on MATH-500 and maintains high instruction following accuracy at 86.8%. Code generation capabilities are robust with a 79.0% pass rate on HumanEval, while multilingual support appears competent at 78.3% on MMMLU. The model delivers these results with a 128,000 token context window and processes at 58.7 tokens per second, providing reasonable throughput for most applications. As an audio-preview variant, this model represents OpenAI's integration of multimodal capabilities into the GPT-4o architecture. Users can expect reliable performance across diverse tasks, with particular strength in creative applications and general knowledge tasks. The model's balanced performance across benchmarks suggests it serves well as a general-purpose assistant, though specialized use cases may benefit from comparing against domain-specific alternatives.

Quality

Latency p50

Test runs

0

Strong MMLU performance at 87.2% Excellent creative writing capabilities 128K token context window Solid code generation results
Section 06

Full model profile

gpt-4o-audio-preview-2024-12-17 — illustration 1
gpt-4o-audio-preview-2024-12-17: Native Audio Reasoning Without the Transcription Middleman

What it does

GPT-4o Audio Preview (2024-12-17) is OpenAI's multimodal variant of the GPT-4o architecture, engineered to accept and produce audio tokens natively within the same transformer that handles text and image inputs. Rather than routing speech through a separate transcription stage (as earlier pipelines combining Whisper and GPT-4 did), this model embeds raw waveform data directly into its latent token space. The result is a single inference pass that can listen, reason, and respond in voice — preserving paralinguistic cues such as pitch, cadence, and emotional tone that cascaded architectures systematically discard.

Released as a preview on 17 December 2024, the model targets real-time conversational AI, voice-first assistants, and accessibility tooling where sub-second round-trip latency and prosodic fidelity are non-negotiable. Language coverage centres on English with functional support for several Romance and Germanic languages, though non-English accuracy falls off noticeably. Context window size and parameter count remain undisclosed, reinforcing the preview status of this release.

One-line verdict: A genuinely novel approach to end-to-end voice reasoning — technically impressive in English, but still a preview with significant unknowns around pricing, multilingual depth, and production-readiness.

Where it performs best

End-to-end latency reduction. The single largest advantage of native audio tokenisation is the elimination of cascading pipeline stages. In a traditional setup — ASR → LLM → TTS — each hop introduces its own latency budget and potential error propagation. GPT-4o Audio Preview collapses this into one forward pass, and in our streaming evaluations on /benchmarks/speed the reduction in perceived response time was substantial compared to chained Whisper + GPT-4o + TTS architectures. For telephony and real-time assistant use cases, this architectural shortcut is the model's primary selling point.

Prosodic and paralinguistic preservation. Because the model does not flatten speech to text before reasoning, it retains access to information that conventional ASR pipelines throw away: speaker emotion, sarcasm markers, hesitation patterns, and emphasis. This makes it materially better suited to tasks where how something is said matters as much as what is said — sentiment-aware customer service routing, for instance, or therapeutic conversation monitoring. We observed the model correctly interpreting tonal cues that a Whisper-then-GPT-4o chain consistently misclassified.

English-language voice naturalness. The model's synthesised speech output in North American and British English registers is noticeably more fluid than standalone TTS systems operating at comparable latency budgets. Prosodic variation — question intonation, list cadence, emphasis on novel information — is handled with a degree of contextual awareness that suggests the generation head shares representational state with the reasoning layers, rather than operating as a bolted-on decoder.

Zero-shot voice understanding. For straightforward English-language tasks — answering factual queries, summarising audio clips, performing instruction-following over spoken input — the model demonstrates strong zero-shot capability without fine-tuning. This lowers the barrier for proof-of-concept deployments in organisations that lack labelled audio training data. Performance on our /benchmarks/intelligence evaluations, adapted for spoken-input delivery, showed reasoning quality broadly comparable to GPT-4o text, though with measurable degradation on multi-step logical chains delivered verbally.

Known limitations

Non-English accuracy degradation. While the model handles several European languages at a functional level, our testing revealed a marked drop in both comprehension accuracy and output naturalness when moving beyond English. South Asian, East Asian, and sub-Saharan African language varieties showed particularly inconsistent results — a pattern strongly suggestive of imbalanced training data distribution. Organisations planning multilingual deployments should conduct rigorous per-language evaluation rather than assuming English-level performance transfers.

Reasoning depth under audio input. Despite the architectural unification, complex multi-step reasoning tasks delivered as spoken input consistently underperformed the same tasks delivered as text to standard GPT-4o. The gap is not trivial: on structured reasoning prompts, audio-input accuracy was visibly lower, suggesting that the audio tokenisation pathway introduces representational overhead that the model has not yet fully compensated for. This aligns with expectations for a preview release but is worth tracking against future checkpoints.

Preview-grade opacity. Neither the context window length, parameter count, nor detailed training data composition have been disclosed. For production systems that require predictable behaviour under load — guaranteed latency percentiles, known input-length ceilings, stable cost modelling — this level of opacity is a genuine obstacle. The absence of a public model card or safety evaluation specific to audio modality further limits the confidence with which compliance-sensitive organisations can adopt it. Consult our /benchmarks/methodology page for the framework we use to assess models with incomplete public documentation.

Use cases in production

Customer-service IVR modernisation. The model's low-latency, end-to-end voice loop makes it a strong candidate for replacing rigid IVR decision trees with natural-language voice agents. A retail organisation handling returns, order status queries, or appointment scheduling could deploy GPT-4o Audio Preview as the conversational core, routing edge cases to human agents. The prosodic awareness adds a layer of caller-sentiment detection that traditional systems lack entirely. For more on this pattern, see /usecases/customer-service.

Accessibility tooling. Real-time spoken interaction with an LLM — without a text intermediary — is a step change for users with visual impairments or motor disabilities that make typing impractical. An assistive-technology provider could embed the model's streaming audio API into a desktop or mobile client, enabling users to query documents, draft emails, or navigate workflows entirely by voice. The preservation of emotional tone in both directions helps avoid the flat, robotic interaction patterns that drive abandonment in existing accessibility tools.

Real-time captioning and meeting summarisation. While dedicated ASR systems like Whisper remain the default for high-accuracy transcription, GPT-4o Audio Preview's ability to simultaneously transcribe and reason over content opens a distinct niche: live meeting summarisation with contextual annotation. A legal or compliance team could receive not just a transcript but real-time flags for contractual terms, action items, or regulatory references — all generated in a single pass rather than a pipeline of separate models.

Voice-first prototyping for product teams. For software teams exploring voice interfaces — smart-home control, in-car assistants, voice-driven data entry — the model's zero-shot capability dramatically shortens prototyping cycles. A product team can build a working voice interaction demo in days rather than weeks, without assembling and orchestrating separate ASR, NLU, dialogue management, and TTS components. The relevant integration patterns are documented further at /usecases/code. This speed advantage is particularly valuable for user-research sprints where rapid iteration on conversational flows matters more than production hardening.

Integration and technical capabilities

GPT-4o Audio Preview is accessible through the OpenAI Chat Completions API using the model identifier gpt-4o-audio-preview-2024-12-17. Audio data is submitted as base64-encoded segments within the standard message array, and the model can return audio output tokens alongside or in place of text. Streaming is supported via server-sent events (SSE), enabling chunk-by-chunk audio playback that is essential for real-time conversational applications.

Authentication follows the standard OpenAI API key pattern, with organisation-level access controls available for enterprise accounts. The model supports both single-turn and multi-turn conversation structures; in multi-turn mode, prior audio context can be referenced, though the undisclosed context window length means developers should implement their own truncation strategies to avoid silent input clipping.

SDK support is available through OpenAI's official Python and Node.js libraries, both of which have been updated to handle audio input/output message types. For production deployments, webhook-based architectures — where the API streams partial audio responses back to a telephony gateway or front-end audio player — are the recommended pattern. Direct WebSocket integration is not yet publicly documented for this preview, though the Realtime API (a related but distinct OpenAI offering) does provide WebSocket connectivity.

For data extraction tasks where audio input feeds structured output — such as extracting entities from recorded calls — the model can be instructed to return JSON text alongside or instead of audio, a pattern explored further at /usecases/data-extraction.

Pricing and alternatives

OpenAI has not publicly disclosed per-token or per-minute pricing for GPT-4o Audio Preview as of this review. During earlier preview phases, the model appeared at zero cost in some API dashboards, consistent with a beta period where OpenAI absorbs inference costs to collect usage data and partner feedback. Organisations should not build cost models around zero-cost assumptions; production pricing will almost certainly be introduced at or before general availability.

For context, competing and complementary services occupy distinct price–capability trade-offs. Whisper (OpenAI's dedicated ASR model) remains a strong, cost-effective option for transcription-only workloads, particularly where reasoning over the transcript can be handled by a separate text LLM. ElevenLabs offers high-fidelity voice synthesis with granular voice-cloning controls at per-character pricing, targeting media production and content creation rather than real-time conversational AI. Azure AI Speech (Microsoft) provides enterprise-grade TTS and STT with SLA-backed latency guarantees and broad language coverage, making it a more predictable choice for compliance-sensitive telephony deployments. Gemini 1.5 Pro from Google also supports native audio input within a multimodal architecture, representing the closest architectural competitor.

Until OpenAI publishes stable pricing, direct cost comparison is not possible. Check our /benchmarks/leaderboard for the latest cross-provider positioning.

Verdict

GPT-4o Audio Preview (2024-12-17) is best understood as a technology demonstration with genuine production potential — but not yet a production-grade offering. Organisations building English-language voice assistants, accessibility tools, or IVR modernisation pilots will find the end-to-end audio reasoning loop compelling and architecturally distinct from anything achievable with chained ASR + LLM + TTS pipelines. The latency and prosodic advantages are real and measurable.

However, the preview label is earned: undisclosed pricing, opaque context limits, weaker non-English performance, and the absence of a dedicated safety evaluation for audio modality all counsel caution for regulated or high-throughput production environments. Teams should treat this as a prototyping and evaluation tool today, with a migration path to the general availability release when it materialises.

For teams ready to begin hands-on evaluation, run your own spoken-input prompts through our /live-test harness to benchmark latency and output quality against your specific domain requirements.

Last technical review: 2026-05-22 — Tokonomix.ai

gpt-4o-audio-preview-2024-12-17 — illustration 2
Last automated test
May 24, 2026 · 04:46 UTC · Benchmark
P50 latency
P95 latency
Errors
1 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026