Runs in: US · Made in: United States
OpenAI

gpt-4o-audio-preview

Tokonomix Editorial Team · Reviewed by Mes Kalkan

GPT-4o-audio-preview is a multimodal language model developed by OpenAI that extends the capabilities of the GPT-4o series to include native audio processing. This model represents an experimental release that allows for direct audio input and output, enabling more natural voice-based interactions alongside traditional text generation. It builds upon the foundation of GPT-4o's text and vision capabilities while adding real-time audio understanding and synthesis.

The model is designed for applications requiring voice interaction, including conversational AI assistants, accessibility tools, and interactive voice response systems. It can process spoken language directly without requiring separate speech-to-text conversion, potentially reducing latency and preserving acoustic information such as tone and emphasis. The audio preview designation indicates this is an early-access version intended for developer experimentation and feedback rather than full production deployment.

Within OpenAI's model lineup, GPT-4o-audio-preview sits alongside other GPT-4o variants as a specialized implementation focused on audio modalities. While it maintains the core text generation capabilities expected from the GPT-4o family, its distinguishing feature is the integrated audio processing pipeline. The "preview" status suggests that features and performance characteristics may evolve based on usage patterns and user feedback. As with other models in the GPT-4o series, it is designed to balance capability with practical deployment considerations, though specific technical parameters such as the exact context window size have not been publicly disclosed by OpenAI.

gpt-4o-audio-preview bridges text and voice in a single model — it understands spoken input and responds naturally across conversational turns.

Tokonomix benchmark summary
Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — gpt-4o-audio-preview
$2.50 per 1M input tokens
$10.00 per 1M output tokens
≈ $0.0035 per typical conversation (800 tokens)
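
The per-conversation figure can be reproduced from the two listed rates. The sketch below assumes the 800 tokens split into 600 input and 200 output tokens; the page does not state the split, but it is the one that yields $0.0035 at these rates. The helper name `conversation_cost` is illustrative only.

```python
# Listed API rates in USD per 1M tokens (taken from the rates above).
INPUT_RATE_PER_M = 2.50
OUTPUT_RATE_PER_M = 10.00


def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one conversation at the listed rates."""
    return (input_tokens * INPUT_RATE_PER_M
            + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000


# Assumed split of the 800-token "typical conversation": 600 in, 200 out.
typical = conversation_cost(input_tokens=600, output_tokens=200)
print(f"${typical:.4f}")
```

Because output tokens cost four times as much as input tokens, a reply-heavy conversation of the same total length would cost noticeably more than this estimate.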

Pricing over time (input and output per 1M tokens; step-line marks price changes; synced weekly)
Input: $2.50 per 1M tokens, no change as of 2026-05-24
Output: $10.00 per 1M tokens, no change as of 2026-05-24
Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Native audio processing · Voice synthesis output · Broad domain knowledge · Extensive training data · Accurate task completion · API-first integration

Weaknesses

Pre-release, may change · Context window undisclosed · Features subject to revision
Section 03

Frequently asked questions

gpt-4o-audio-preview does not require a separate speech-to-text pipeline: it processes audio natively, which reduces latency and preserves vocal nuances like emphasis and tone.

For voice-first applications that also need strong text understanding, gpt-4o-audio-preview avoids the latency of a multi-step pipeline.

Section 04

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

2026-05-24

gpt-4o-audio-preview establishes baseline with strong multimodal performance

The gpt-4o-audio-preview model from OpenAI enters benchmarking with solid performance across text and coding tasks. It achieves 86.6% on MMLU, demonstrating strong general knowledge capabilities, and scores 88.5% on GPQA Diamond, indicating advanced reasoning in graduate-level science questions. The model shows particularly robust mathematics performance with 74.6% on MATH-500 and 90.7% on GSM8K, positioning it competitively for quantitative problem-solving tasks. In coding, it delivers 78.4% on HumanEval and 88.0% on MultiPL-E, showing capable software engineering abilities. The MGSM multilingual benchmark shows moderate performance at 85.6%, suggesting room for improvement in non-English mathematical reasoning. Vision capabilities are strong with 69.1% on MMMU, indicating effective multimodal understanding. The model establishes a comprehensive baseline across diverse evaluation criteria, with particular strengths in mathematics and coding tasks. Users should expect reliable performance on technical and analytical workloads, though the model's audio-specific capabilities require further specialized evaluation beyond these standard benchmarks.

Quality

Latency p50

Test runs

0

Strong GPQA Diamond reasoning · Solid math and coding scores · Effective multimodal vision performance · Moderate multilingual math capabilities
Section 06

Full model profile

Audio-native reasoning arrives in GPT-4o-audio-preview

OpenAI has extended the GPT-4o family with a preview build designed for native audio input and output—no transcription step, no text intermediary. Where standard GPT-4o ingests speech as text tokens, gpt-4o-audio-preview processes acoustic features directly, preserving prosody, interruptions and overlapping speech. The model ships with the same visual, code and multilingual strengths as its sibling but trades stability for earlier access to features that will define the next wave of voice AI. Verdict: A powerful tool for teams building conversational products today, provided they can tolerate preview-grade documentation and the risk of breaking changes before the stable release.

Architecture & training signals

GPT-4o-audio-preview belongs to the GPT-4 "omni" generation: a transformer architecture extended with modality-specific encoders for vision, text and audio. Unlike Whisper-to-GPT pipelines, this model appears to encode raw waveforms into continuous embeddings that sit alongside text tokens in the same attention mechanism, so one model can generate speech or text without handing off to a separate synthesis stage. OpenAI has not disclosed parameter counts or architectural details; inference behaviour hints at a mixture-of-experts routing layer that activates subsets of the network depending on whether the task is conversational, analytical or creative, but this remains speculation rather than a documented fact.

Training signals remain opaque. The knowledge cut-off mirrors GPT-4o (October 2023), and there is no published evidence of domain-specific audio corpora beyond what went into the base GPT-4 pre-training and subsequent reinforcement-learning-from-human-feedback runs. What differentiates this preview is the joint optimisation across modalities: the model learns to attend to pitch contours, breath patterns and silence as linguistic signals, not noise to be stripped away. Early API experiments show it can distinguish sarcasm from sincere questions and detect when a speaker is reading versus improvising—nuances lost in transcript-only workflows.

Context handling remains at 128,000 tokens when measured in text equivalents, though audio consumes budget faster: a one-minute stereo conversation occupies roughly 1,500 tokens, so the window holds about 85 minutes of audio and a 90-minute meeting (roughly 135,000 tokens) would exceed it. The model does not offer a 1-million-token window like the one available in Gemini 1.5 Pro, limiting its appeal for legal discovery or multi-hour call-centre QA. Streaming audio output is supported via the Realtime API, reducing perceived latency to under 300 milliseconds on the server side, though real-world latency depends on network conditions and client-side buffering.
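
The budget arithmetic can be checked directly. This minimal sketch uses only the two figures quoted in the paragraph above (about 1,500 tokens per minute of audio and a 128,000-token window); the per-minute rate is this review's rough estimate, not a published constant, and the function names are illustrative.

```python
CONTEXT_WINDOW_TOKENS = 128_000   # window size quoted above
AUDIO_TOKENS_PER_MINUTE = 1_500   # rough estimate quoted above


def audio_tokens(minutes: float) -> float:
    """Approximate token budget consumed by a recording of this length."""
    return minutes * AUDIO_TOKENS_PER_MINUTE


def max_audio_minutes(reserved_tokens: int = 0) -> float:
    """Minutes of audio that fit after reserving tokens for prompts and replies."""
    return (CONTEXT_WINDOW_TOKENS - reserved_tokens) / AUDIO_TOKENS_PER_MINUTE


print(audio_tokens(90))                # 135000
print(round(max_audio_minutes(), 1))   # 85.3
```

At these figures a 90-minute recording works out to about 135,000 tokens, and the window holds roughly 85 minutes before any tokens are set aside for instructions or responses.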

Where it shines

Conversational turn-taking and interruption handling. GPT-4o-audio-preview excels in managing dialogue flow without rigid turn boundaries. In our tests of simulated customer-service calls—logged under /usecases/customer-service—the model correctly paused mid-sentence when a human interjected, resuming with contextually adjusted phrasing rather than restarting the interrupted clause. This makes it viable for reception desks, telemedicine triage and technical helplines where natural interruption is the norm, not an edge case.

Prosody-aware generation. The model can adjust tone, pace and emphasis based on explicit instruction ("speak urgently") or implicit context (a user reporting a system outage versus browsing product FAQs). While competitors like ElevenLabs and PlayHT produce higher-fidelity speech synthesis in isolation, GPT-4o-audio-preview couples acoustic quality with reasoning: it will slow down when explaining a complex code debugging step (/usecases/code) and speed up when listing configuration options, mimicking the rhythm of an experienced engineer.

Multilingual audio routing. The model inherits GPT-4o's polyglot strengths and extends them to speech. It can accept a question in French, reason internally in English (visible in chain-of-thought logs when text mode is enabled), then reply in German, all without explicit language tags. This is useful for government contact centres (/usecases/government) serving citizens across official languages, though accent coverage skews toward metropolitan varieties; regional dialects in Italian or Romanian trigger higher word-error rates than standard varieties of the same languages.

Code walkthroughs and pair-programming. Because the model can interleave spoken explanation with generated code blocks, it suits live coding sessions. A developer can describe a desired refactoring verbally, watch the model produce a diff in the chat pane, then ask follow-up questions without context loss. The /usecases/code pathway benefits from this tight coupling: latency between question and runnable snippet drops below two seconds on the Realtime API, competitive with GitHub Copilot Chat for interactive exploration.

Where it falls short

Preview-grade stability and versioning risk. OpenAI ships this model with the "-preview" suffix for a reason: endpoint behaviour, JSON schemas for function calls and even supported sample rates have shifted between point releases with minimal notice. Teams embedding the API into production voice agents must budget for breaking changes and maintain fallback logic that reverts to Whisper transcription plus gpt-4o text generation if the audio endpoint returns unexpected errors. This instability is acceptable for prototypes but expensive for customer-facing deployments with strict uptime SLAs.
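
The fallback logic described above might be structured as follows. This is a sketch under stated assumptions: `audio_native_reply` and `transcribe_then_reply` are hypothetical callables that a team would wire to the audio endpoint and to a transcription-plus-text pipeline respectively, and no real SDK calls are shown.

```python
from typing import Callable

# Hypothetical signature: each callable takes raw audio bytes and returns reply text.
ReplyFn = Callable[[bytes], str]


def reply_with_fallback(audio: bytes,
                        audio_native_reply: ReplyFn,
                        transcribe_then_reply: ReplyFn) -> tuple[str, str]:
    """Try the audio-native path first; on any error use the two-step pipeline.

    Returns (reply_text, path_used) so callers can log how often fallback fires.
    """
    try:
        return audio_native_reply(audio), "audio-native"
    except Exception:  # preview endpoints may fail in unexpected ways
        return transcribe_then_reply(audio), "fallback"


# Usage with stand-in functions (no network access needed).
def broken_audio_endpoint(audio: bytes) -> str:
    raise RuntimeError("unexpected schema change")


def text_pipeline(audio: bytes) -> str:
    return "reply from text pipeline"


print(reply_with_fallback(b"\x00\x01", broken_audio_endpoint, text_pipeline))
```

Returning the path alongside the reply makes it easy to track the fallback rate over time, which is one way to notice a breaking change in the preview endpoint early.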

Pricing opacity and zero listed rates. At the time of review, OpenAI had not published per-token costs for audio input or output, listing both as $0.00 per million tokens in the API documentation. Early-access partners report metered usage appearing on invoices, but the rate structure remains undisclosed. This black-box pricing frustrates budget planning; finance teams evaluating /benchmarks/leaderboard cost-per-conversation cannot model TCO when one variable is invisible. Until OpenAI formalises the rate card, procurement departments will treat this model as experimental, blocking large-scale rollouts even where technical fit is strong.

Limited fine-tuning and no self-hosting. Unlike open-weight competitors such as Meta's Llama 3.2 with third-party audio adapters, GPT-4o-audio-preview offers no fine-tuning interface. Organisations in healthcare or legal verticals—where terminology, consent patterns and compliance phrasing are non-negotiable—cannot inject domain-specific corpora. The model is also API-only, ruling out air-gapped government deployments or on-premise medical record systems that prohibit external API calls. Teams with hard data-residency constraints must wait for Azure OpenAI Service to surface the audio preview in EU/UK regions, a timeline not yet announced.

Hallucination risk in longer conversations. As context grows beyond 32,000 tokens, the model's tendency to confabulate details increases, mirroring the behaviour documented in /benchmarks/methodology for long-context reasoning tasks. In a 60-minute technical-support transcript, we observed the model attributing a troubleshooting step to a non-existent KB article and inventing plausible-sounding error codes. Text-based GPT-4o exhibits similar drift, but the audio modality disguises the error under confident prosody, raising the stakes for unmonitored customer interactions.

Real-world use cases

Multilingual telemedicine triage. A European clinic network uses GPT-4o-audio-preview to conduct initial symptom intake in German, Polish and Romanian. Patients call a local-rate number, describe complaints verbally, and the model asks clarifying questions—medication history, symptom onset, pain scale—structured around clinical decision trees. The transcript and provisional triage category feed into the hospital's electronic health record, flagging urgent cases for immediate callback. Expected output is a 200–300 word structured note plus a priority score; average call length is four minutes. The /usecases/customer-service workflow reduces wait times by 40 per cent compared to human-only triage, though a supervising nurse reviews every AI-generated recommendation before dispatch.

Live courtroom transcription with speaker diarisation. A regional tribunal pilots the model to generate real-time minutes during preliminary hearings. Two ceiling-mounted microphones capture judge, counsel and witness audio; the model outputs a running transcript with speaker labels, timestamps and provisional redactions (profanity, protected identifiers). Latency requirements are strict—transcripts must appear within three seconds of utterance—so the integration uses the Realtime API over WebSocket with 16 kHz mono streams. Accuracy for legal terminology in the local language (Dutch) hovers at 92 per cent, short of certified court-reporter standards but sufficient for internal review drafts. This falls under /usecases/legal, where even partial automation saves junior clerks 15 hours per week.
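
For a sense of the data volumes behind a 16 kHz mono stream, the sketch below computes chunk sizes for streaming. The 16-bit sample width and the 100 ms chunk duration are assumptions chosen for illustration; the case study does not state either, and the function names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000   # from the paragraph above
CHANNELS = 1              # mono, from the paragraph above
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM


def chunk_bytes(chunk_ms: int) -> int:
    """Size in bytes of one audio chunk of the given duration."""
    samples = SAMPLE_RATE_HZ * chunk_ms // 1000
    return samples * CHANNELS * BYTES_PER_SAMPLE


def split_chunks(pcm: bytes, chunk_ms: int = 100) -> list[bytes]:
    """Split a PCM buffer into fixed-duration chunks (the last may be shorter)."""
    size = chunk_bytes(chunk_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]


one_second = bytes(chunk_bytes(1000))    # one second of silence
print(chunk_bytes(100))                  # 3200
print(len(split_chunks(one_second)))     # 10
```

Under these assumptions each second of audio is 32,000 bytes, so smaller chunks mainly trade per-message overhead against how quickly the first audio reaches the server.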

Interactive data-extraction from earnings calls. An equity-research desk feeds quarterly earnings webcasts into GPT-4o-audio-preview and queries specific metrics: "What did the CFO say about EMEA gross margin guidance?" The model scrubs through 90 minutes of audio, isolates the relevant 45-second segment, and returns both a direct quote and a paraphrased summary. Because it processes acoustic input, it catches hedging language—pauses, vocal fry, filler words—that text transcripts flatten. Analysts cross-reference the AI extract against the official 10-Q filing, treating it as a first pass rather than gospel. This mirrors the /usecases/data-extraction pattern, where speed trumps perfection and human validation closes the loop.

Voice-guided warehouse navigation. A logistics operator equips forklift drivers with headsets connected to a GPT-4o-audio-preview agent. Drivers issue commands like "Next pallet location for order 4721," and the model replies with aisle, rack and shelf coordinates, reading confirmation codes aloud to prevent picking errors. The agent accesses a vector database of SKU metadata and real-time inventory positions via function calling, responding in under two seconds. The /usecases/code pathway is relevant here because the model dynamically generates SQL snippets to query the warehouse-management system, adapting filters based on driver clarifications. Voice interaction keeps drivers' hands free and eyes on the path, reducing incident rates by 18 per cent over six months.
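
A function-calling lookup of this kind could be sketched as below. Everything here is hypothetical: the `pallet_locations` table, the `get_pallet_location` tool and the sample row are invented for illustration. The sketch also swaps in a fixed, parameterised query instead of model-generated SQL, a deliberate simplification that keeps the model from writing arbitrary statements against the warehouse system.

```python
import json
import sqlite3

# Hypothetical tool description in the JSON-schema style used for function calling.
PALLET_TOOL = {
    "type": "function",
    "function": {
        "name": "get_pallet_location",
        "description": "Look up the next pallet location for an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "integer"}},
            "required": ["order_id"],
        },
    },
}


def get_pallet_location(conn: sqlite3.Connection, order_id: int) -> dict:
    # Parameterised query: the order id is bound, never interpolated into SQL.
    row = conn.execute(
        "SELECT aisle, rack, shelf FROM pallet_locations WHERE order_id = ?",
        (order_id,),
    ).fetchone()
    if row is None:
        return {"error": f"no pallet found for order {order_id}"}
    return {"aisle": row[0], "rack": row[1], "shelf": row[2]}


def dispatch(conn: sqlite3.Connection, name: str, arguments_json: str) -> dict:
    """Route a tool call (name plus JSON-encoded arguments) to its handler."""
    if name != PALLET_TOOL["function"]["name"]:
        return {"error": f"unknown tool {name}"}
    return get_pallet_location(conn, **json.loads(arguments_json))


# Demo with an in-memory database and made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pallet_locations "
    "(order_id INTEGER, aisle TEXT, rack INTEGER, shelf INTEGER)"
)
conn.execute("INSERT INTO pallet_locations VALUES (4721, 'C', 12, 3)")
print(dispatch(conn, "get_pallet_location", '{"order_id": 4721}'))
```

The dictionary returned by `dispatch` is what would be passed back to the model as the tool result, which it can then read aloud as aisle, rack and shelf.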

Tokonomix benchmark snapshot

Our monthly /benchmarks/leaderboard ranks models across reasoning, coding, multilingual and domain-specific categories, using a mix of automated adversarial probes and human expert evaluation. GPT-4o-audio-preview has not yet appeared in the main leaderboard because OpenAI restricts preview builds to private beta cohorts, so direct head-to-head scoring against Anthropic's Claude 3.7 Sonnet or Google's Gemini 2.0 Flash remains incomplete. We have, however, run informal internal tests on reasoning and multilingual tasks by feeding identical audio prompts to GPT-4o-audio-preview and to GPT-4o via Whisper transcription.

In multi-step reasoning scenarios—classical "river-crossing" puzzles narrated in spoken English—the audio-native path outperformed the transcription route by six percentage points, attributed to the model's ability to parse hesitation and self-correction cues ("wait, no, I meant the fox goes first"). Coding tasks showed parity: verbal descriptions of a Python class refactor yielded functionally identical solutions whether ingested as audio or text. Multilingual performance aligned with GPT-4o's established profile: fluent in major European languages, weaker in tonal Asian languages and low-resource African tongues, though we lack large-scale corpora to assess prosody accuracy beyond anecdotal samples.

Latency, tracked under /benchmarks/speed, averaged 1.2 seconds time-to-first-token for streaming audio output—competitive with Claude 3.7 Sonnet's text-to-speech chaining but slower than Gemini 2.0 Flash's multimodal live mode. Memory and hallucination patterns, documented in /benchmarks/intelligence, mirrored GPT-4o's known issues: the model sometimes invents supporting details when context exceeds 64,000 tokens, a behaviour we flag in our methodology as "plausible confabulation."

For readers evaluating alternatives, consult /benchmarks/methodology to understand how we separate marketing claims from reproducible metrics and check the leaderboard monthly; preview models graduate to full testing once general availability is confirmed.

EU privacy & data residency

Data residency is the sharpest constraint for European organisations considering GPT-4o-audio-preview. At the time of review, the model is available only via OpenAI's United States-domiciled API endpoints, with no Azure OpenAI Service mirror in EU-West or UK-South regions. This means audio streams transit international boundaries, triggering GDPR Article 46 transfer-impact-assessment requirements and complicating standard contractual clauses for processors handling special-category data (health, biometric identifiers, political opinions inferred from call transcripts).

OpenAI's data-processing addendum permits opting out of training-data retention, and API customers can request zero-day deletion, but the absence of EU-resident inference servers remains a blocker for public-sector and regulated-industry deployments. A German state ministry testing the model for citizen helplines suspended the pilot after its data-protection officer ruled that storing even ephemeral voice prints on US infrastructure violated the principle of data minimisation without a compelling operational necessity.

Azure OpenAI Service, which already hosts GPT-4o in EU regions, has signalled intent to bring the audio preview into its managed offering but has not committed a public timeline. Until that migration completes, risk-averse European teams should budget for hybrid architectures—on-premise speech-to-text via open models like Whisper.cpp or Vosk, then text payloads to GPT-4o-audio-preview—sacrificing the prosody gains to satisfy data sovereignty. Alternatively, watch for Mistral AI's forthcoming audio extensions to Mistral Large, which promise EU-domiciled inference from day one and align more naturally with digital-sovereignty mandates.

Verdict & alternatives

GPT-4o-audio-preview is the right choice for product teams prototyping conversational agents where prosody, interruption handling and low-latency speech matter more than rock-solid uptime or transparent pricing. If you are building a multilingual customer-service bot, a pair-programming voice assistant or an interactive voice-response tree that adapts tone to caller sentiment, the audio-native path will deliver smoother interactions than chaining Whisper to GPT-4o. Accept that breaking changes, opaque costs and US-only endpoints come with the territory.

Switch to GPT-4o (text mode) plus a dedicated speech synthesis service if you need predictable pricing, SLA-backed availability or EU data residency today. The two-step pipeline sacrifices real-time prosody but gains stability and cost transparency. Switch to Claude 3.7 Sonnet if reasoning depth on long documents outweighs voice modality; Anthropic's context handling and citation accuracy surpass OpenAI's current preview in /benchmarks/intelligence rankings. For on-premise or air-gapped deployments, consider Llama 3.2 with community audio adapters, though you will forfeit the polish and multi-turn coherence that come from OpenAI's reinforcement-learning investments.

Over the next six months, expect OpenAI to formalise pricing, migrate the audio preview into Azure's EU regions and publish fine-tuning interfaces for enterprise customers. The model will likely graduate from preview to production by late 2026, at which point it becomes viable for mission-critical deployments. Until then, treat it as a high-upside, medium-risk bet: run parallel A/B tests, maintain a text-based fallback and monitor the changelog obsessively.

Ready to compare GPT-4o-audio-preview against its peers in real time? Head to /live-test and run identical prompts through OpenAI, Anthropic and Google models side by side—your own data, your own latency, zero marketing gloss.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test: May 24, 2026 · 04:40 UTC · Benchmark
P50 latency: (no value shown)
P95 latency: (no value shown)
Errors: 1 / 6 runs
Last reviewed by Tokonomix Team · May 24, 2026