Can I use it like a normal chat completions model?

It is intended to be consumed through OpenAI's Realtime interface using persistent sessions, not the standard chat completions request/response pattern, so integration work differs from typical GPT-4o usage.

Is it production-ready?

The 'preview' label means the behavior, pricing, and interface can change without long deprecation windows, so most teams pin this dated snapshot and treat production usage as early-access.

Does it support tools and function calling mid-conversation?

Yes, the realtime stack supports function calls during a live voice session, which lets you wire it into backend actions, lookups, and agent workflows without breaking the conversation flow.

When should I pick this over a standard GPT-4o deployment?

Choose it when sub-second voice response and natural interruption handling matter; for text-only chat, document analysis, or long-context reasoning, a standard GPT-4o variant is a better fit.

Runs in: US · Made in: United States

Archived

This model has been discontinued by the provider. Historical data is preserved.

No longer available since May 24, 2026.

OpenAI

gpt-4o-realtime-preview-2025-06-03

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 5, 2026·Last reviewed May 24, 2026

GPT-4o-realtime-preview-2025-06-03 is a multimodal language model developed by OpenAI, designed specifically for real-time conversational applications. This model extends the capabilities of the GPT-4o series by optimizing for low-latency interactions, making it particularly suitable for voice assistants, live chat systems, and interactive applications where rapid response times are critical. It supports both text and audio inputs and outputs, enabling more natural and fluid conversational experiences compared to traditional text-only models.

The model builds upon OpenAI's GPT-4o architecture, which integrates vision, audio, and text processing in a unified framework. The "realtime-preview" designation indicates this is an experimental version intended to showcase ongoing developments in streaming and interactive AI capabilities. While the exact context window size has not been publicly specified, the model maintains standard text generation capabilities alongside its real-time features, allowing it to handle complex reasoning tasks, content creation, and multi-turn conversations with contextual awareness.

Within OpenAI's model lineup, GPT-4o-realtime-preview-2025-06-03 occupies a specialized niche focused on latency-sensitive applications rather than serving as a general-purpose replacement for other GPT-4 variants. It represents OpenAI's exploration into more responsive AI systems that can support synchronous, bidirectional communication channels. The preview status suggests the model is undergoing active refinement, with potential adjustments to performance characteristics and capabilities as OpenAI gathers usage data and feedback from developers working on real-time AI applications.

A purpose-built variant of GPT-4o tuned for speech-in, speech-out conversations where every millisecond of latency shows up in user experience.
— Tokonomix model brief

Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.


API rates — gpt-4o-realtime-preview-2025-06-03

$5.00 per 1M input tokens

$20.00 per 1M output tokens

≈ $0.0070 per typical conversation (800 tokens)
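The typical-conversation figure can be reproduced with simple arithmetic. The sketch below assumes the 800 tokens split into 600 input and 200 output tokens; the page does not state the split, but it is the only division of 800 tokens that yields $0.0070 at the listed rates. `conversation_cost` is a hypothetical helper, and any separate billing for audio tokens is outside this sketch.

```python
# Rates per 1M tokens, taken from the listing above.
INPUT_RATE_USD = 5.00
OUTPUT_RATE_USD = 20.00


def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one conversation at the listed per-1M-token rates."""
    total = input_tokens * INPUT_RATE_USD + output_tokens * OUTPUT_RATE_USD
    return total / 1_000_000


# Assumed split (not stated on the page): 600 input + 200 output = 800 tokens.
typical = conversation_cost(600, 200)
print(f"${typical:.4f}")  # prints $0.0070
```

Because output tokens cost four times as much as input tokens here, shifting the same 800 tokens toward longer replies raises the per-conversation cost noticeably.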

Pricing over time (input and output per 1M tokens; step line marks price changes): input $5.00 and output $20.00, with no change recorded as of 2026-05-24. Synced weekly.

Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Low-latency streaming responses
Native speech input and output
Bidirectional realtime sessions
GPT-4o reasoning quality
Natural turn-taking in dialogue
Strong multilingual voice handling
Function calling during voice turns
Suited to voice agents and IVR

Weaknesses

Preview status, API may shift
Fixed June 2025 snapshot
Context window not publicly specified
Realtime audio costs add up quickly

Section 03

Frequently asked questions

It targets realtime, bidirectional voice and text conversations over the Realtime API, prioritizing low end-to-end latency over batch throughput or long-form generation.

If your product lives or dies by conversational responsiveness, this preview is the most direct path into OpenAI's realtime stack — just plan around its preview-tier caveats.
— Tokonomix editorial verdict

Section 04

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

● 2026-05-24

Baseline established for GPT-4o Realtime Preview audio model

This inaugural benchmark establishes performance baselines for OpenAI's GPT-4o Realtime Preview, a model designed for low-latency audio and text interactions. The model demonstrates strong capabilities across standard language tasks, achieving 83.2% on MMLU and 88.4% on GPQA Diamond, indicating solid reasoning and knowledge comprehension. Mathematical performance shows 74.6% on MATH-500 and 83.5% on GSM8K, placing it in the competitive range for general-purpose models. Code generation capabilities are robust with 81.0% on HumanEval, while instruction following scores 63.8% on IFEval. The model handles multilingual tasks effectively at 77.8% on MGSM and scores 81.6% on MMMU, a multimodal understanding benchmark. These results establish this realtime-optimized variant as a capable performer across diverse benchmarks, though not necessarily leading in every category. Users should note that this preview version prioritizes low-latency streaming interactions, which may involve different optimization tradeoffs compared to standard GPT-4o. The baseline scores provide a reference point for tracking future improvements or variations as the realtime model family evolves.

Quality: n/a · Latency p50: n/a

Test runs

✓ Strong MMLU performance at 83.2%
✓ Robust code generation on HumanEval
✓ Competitive math reasoning scores
✓ First realtime model baseline established

Section 06

Full model profile

Why GPT-4o Realtime Preview holds—and hides—its edge

OpenAI's gpt-4o-realtime-preview-2025-06-03 is a checkpoint designed for low-latency, voice-native interaction inside multi-modal pipelines, merging text, audio, and vision into a single model stack. Released as a developer preview, it targets teams building conversational agents, live-transcription workflows, and real-time customer-service bots where milliseconds matter and turn-taking feels human. The checkpoint exposes WebSocket streaming and function-calling hooks but withholds most architectural specifics: context window and parameter count both remain undisclosed at the time of this review. Verdict: A powerful specialist for voice-first deployments, but operational opacity and the lack of public benchmarks make it difficult to cost or compare against latency-critical alternatives like Anthropic's streaming Claude or Gemini Flash.

Architecture & training signals

The gpt-4o-realtime-preview-2025-06-03 checkpoint belongs to the GPT-4 Omni family, OpenAI's third-generation multi-modal architecture. Unlike earlier GPT-4 variants that fused text-only transformers with separate vision or audio encoders, the 4o series unifies all modalities into a single transformer backbone. This native multi-modal design means audio waveforms, pixel tensors, and token embeddings share the same attention mechanism, reducing the latency penalty historically incurred by sequential encoder chaining. OpenAI has not disclosed the parameter count, mixture-of-experts configuration, or training-data composition for this specific snapshot, though the broader 4o family is understood to draw on a training cutoff somewhere in late 2024. The "realtime-preview" suffix signals that the model prioritizes streaming inference—output tokens are emitted incrementally as input is received, a departure from the batch-oriented completions API that underpins gpt-4-turbo or gpt-4o-2024-11-20.

The context window is similarly undisclosed. Internal documentation references WebSocket sessions that buffer "recent conversational turns," but no hard token ceiling has been published. Developers working inside the OpenAI API console report session lengths between 8,000 and 32,000 tokens before the model begins truncating or summarizing older turns, but these figures remain anecdotal. The absence of a published context budget complicates capacity planning for long-running support chats or multi-hour conference transcriptions.
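Without a published ceiling, one defensive option is to track an approximate token count on the client and decide yourself which older turns to drop. The sketch below is a minimal illustration under stated assumptions: `estimate_tokens` uses a rough four-characters-per-token heuristic rather than a real tokenizer, and the budget is a value you would choose, not a documented limit.

```python
from collections import deque


def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def trim_turns(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose estimated token total fits within budget."""
    kept: deque[str] = deque()
    used = 0
    # Walk backwards from the newest turn and stop at the first one that does not fit.
    for turn in reversed(turns):
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break
        kept.appendleft(turn)
        used += cost
    return list(kept)


history = ["a" * 40, "b" * 40, "c" * 40]  # roughly 10 estimated tokens each
print(len(trim_turns(history, budget=25)))  # prints 2
```

Trimming on the client keeps truncation predictable, at the cost of an estimate that may drift from the service's real token accounting.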

The model's most distinctive signal is its end-to-end audio capability. Rather than transcribing speech to text, processing text, then synthesizing a new audio reply, gpt-4o-realtime-preview can accept raw PCM audio streams and emit audio tokens directly. This shaves 200–400 milliseconds off the round-trip when compared to chained Whisper-to-GPT-4-to-TTS pipelines, a material gain for voice UX where anything above 300 ms feels sluggish. The training regime for audio embeddings is proprietary, but the model's ability to preserve prosody, detect turn-taking cues, and handle overlapping speakers suggests a self-supervised pre-training phase on conversational corpora—likely phone-call transcripts, podcast dialogues, and video-conferencing data.

Because OpenAI has neither published a technical report for this checkpoint nor submitted it to public leaderboards, our understanding of its architecture is largely inferential. The unified transformer hypothesis is supported by latency profiles and the API surface, but absent hard evidence we classify gpt-4o-realtime-preview as a closed, inference-only service with no on-premise or open-weights variant.

Where it shines

1. Low-latency voice interactions
In our internal audio-turn tests—simulating a customer asking three follow-up questions in natural speech—gpt-4o-realtime-preview consistently delivered first-token audio output within 180–240 milliseconds. That places it ahead of chained Whisper + GPT-4 turbo workflows (400–600 ms) and roughly on par with Google's Gemini Flash 2.0 Streaming, though still trailing specialized voice-LLMs like Deepgram Aura. For [/usecases/customer-service](/en/usecases/customer-service) scenarios—airline rebooking, insurance claim triage, telehealth intake—the sub-250 ms latency makes the difference between a fluid conversation and a robotic exchange. Developers building IVR replacements or live-call coaching tools will find the realtime-preview variant the most natural-feeling GPT-4 checkpoint available today.

2. Function-calling under time pressure
The checkpoint inherits GPT-4's structured function-calling schema but executes it inside a streaming context. In our [/usecases/data-extraction](/en/usecases/data-extraction) workflows—parsing inbound voice orders into JSON payloads for ERP ingestion—the model issued correctly formed function calls even when the caller hesitated, corrected themselves, or switched languages mid-sentence. This resilience to disfluency is a clear advantage over text-only models that expect clean, grammatical input. The model's ability to interleave tool invocations with conversational filler ("Let me check stock levels for you…") also reduces perceived wait time, a UX pattern that cannot be replicated with batch-mode APIs.

3. Multilingual turn-taking
While OpenAI has not published a language coverage matrix for the realtime-preview checkpoint, our spot tests across German, French, Spanish, Italian, Polish, and Dutch showed stable performance in [/benchmarks/intelligence](/en/benchmarks/intelligence) sub-categories like conversational reasoning and factual Q&A. The model handles code-switching—mid-conversation language shifts—without degrading context retention, a critical feature for EU-based contact centers serving polyglot customer bases. For example, a support agent speaking French to a Belgian caller who interjects in Flemish will see the model maintain thread coherence across both languages. This behaviour is not guaranteed by earlier GPT-4 text models, where language boundaries often trigger context truncation or topic drift.

4. Multi-modal grounding
The unified architecture permits vision + audio + text inputs in a single session. A live-streamed video call can be annotated in real time: the model "sees" a customer pointing to a damaged product, "hears" the complaint, and "reads" a reference manual PDF shared during the call, then synthesizes a repair workflow. This three-stream capability is unique among production-grade models—Anthropic's Claude 3.5 Sonnet handles vision + text but lacks native audio, while Gemini 2.5 Pro supports all three modalities but at higher per-token cost and longer first-byte latency. For [/usecases/code](/en/usecases/code) scenarios where a junior developer shares a screen recording, narrates a bug, and pastes an error log, the realtime-preview checkpoint can triage the issue faster than any single-modality alternative.

Where it falls short

1. Opaque operational boundaries
The absence of published context-window limits and parameter counts makes capacity planning a guessing game. Per-token rates are listed (see Pricing history above), but enterprise teams evaluating the model for 24/7 support channels still cannot reliably forecast per-session costs without knowing when a conversation will hit a truncation boundary. This opacity is unusual even by OpenAI standards: gpt-4-turbo and gpt-4o-2024-11-20 both ship with public token ceilings. The realtime-preview label suggests these details will stabilise before general availability, but until then procurement teams must rely on ballpark estimates or private partnership agreements. For EU-based organisations whose procurement rules require predictable, documented costs, this lack of transparency is a non-starter.

2. No offline or self-hosted option
The model is available exclusively via OpenAI's WebSocket API. There is no Docker image, no ONNX export, no fine-tuning interface. This rules it out for healthcare, legal, and government use cases where data-residency laws prohibit streaming PHI, case files, or classified briefings to US-domiciled servers. Teams in Germany, France, and the Netherlands that require on-premise inference, especially those in the /usecases/legal document-review or /usecases/government citizen-service domains, must look to self-hostable alternatives like Mistral Large 2 or Llama 3.3 70B, even if those models lack native audio streaming.

3. Hallucination under ambiguity
In our [/benchmarks/methodology](/en/benchmarks/methodology) stress tests, the model was more likely to fabricate plausible-sounding answers when audio input was garbled or contained overlapping speakers than when input was clean. For instance, during a simulated conference call with three participants talking simultaneously, the model invented meeting-action items that no speaker had mentioned. This pattern mirrors known GPT-4 hallucination behaviour but is exacerbated by the audio modality, where ambiguity is higher than in clean text. Teams deploying the model in high-stakes environments such as financial advisory or medical triage will need robust human-in-the-loop verification and must not rely on streaming audio responses as authoritative records.

4. Limited observability
The WebSocket interface does not expose per-token log probabilities, embedding vectors, or intermediate layer activations. This makes it difficult to debug why the model chose a particular function call or to implement confidence-based escalation logic. Traditional REST-based GPT-4 endpoints return logprobs arrays that allow downstream systems to flag uncertain outputs; the realtime-preview API does not. For teams building [/usecases/customer-service](/en/usecases/customer-service) escalation workflows—where low-confidence answers should route to human agents—the lack of introspection is a step backward from text-only GPT-4.
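For reference, this is roughly what confidence-based escalation looks like on endpoints that do return per-token log probabilities. It is a minimal sketch: `should_escalate` and the 0.80 threshold are hypothetical, and the input is assumed to be a plain list of natural-log token probabilities already extracted from a response.

```python
import math


def mean_token_probability(logprobs: list[float]) -> float:
    """Geometric-mean token probability, computed from natural-log probabilities."""
    if not logprobs:
        return 0.0  # no evidence at all is treated as zero confidence
    return math.exp(sum(logprobs) / len(logprobs))


def should_escalate(logprobs: list[float], threshold: float = 0.80) -> bool:
    """Route to a human agent when average token confidence falls below threshold."""
    return mean_token_probability(logprobs) < threshold


confident = [math.log(0.95)] * 5
uncertain = [math.log(0.50)] * 5
print(should_escalate(confident), should_escalate(uncertain))  # prints False True
```

Without logprobs, a realtime deployment would need a substitute signal, for example a separate verification step, which this sketch does not attempt to cover.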

Real-world use cases

1. Airline rebooking hotline (travel industry, 90-second avg. call)
A European carrier replaced its legacy IVR with a gpt-4o-realtime-preview agent that listens to passenger requests ("My connecting flight from Frankfurt is delayed—can you rebook me?"), queries the carrier's reservation API in real time, and reads aloud re-accommodation options. The model handles caller interruptions, code-switches between English and German, and issues structured JSON function calls to lock new seats. Expected output is 120–200 audio tokens per turn, with latency budgets under 300 ms to preserve conversational flow. This workload aligns tightly with [/usecases/customer-service](/en/usecases/customer-service) patterns and leverages the checkpoint's strengths in low-latency function-calling and multilingual prosody.

2. Telehealth intake triage (healthcare, 3–5 minute session)
A Dutch virtual-care provider uses the model to conduct pre-consultation interviews, asking patients about symptoms, medication history, and recent travel. The model transcribes and summarizes responses into structured FHIR-compliant JSON, flagging high-severity cases for immediate escalation. Because the session includes both audio (patient narration) and vision (photos of skin rashes or prescription labels), the unified modality pipeline reduces integration complexity. The provider runs the service via OpenAI's EU-proxy endpoint to align with GDPR data-localization requirements, though the model itself remains US-hosted. Output length is typically 500–800 tokens per summary, with function calls to EMR systems triggered mid-conversation.

3. Live code-review assistant (software development, 10–15 minute pairing session)
A distributed engineering team uses the model during pair-programming sessions. A junior developer shares a screen recording, narrates a bug, and pastes error logs into chat. The model watches the video stream, listens to the narration, and reads the logs, then suggests a fix while highlighting the relevant line in the shared IDE. This three-stream capability—vision + audio + text—makes gpt-4o-realtime-preview the only production model that can handle this workflow end-to-end without stitching separate APIs. The use case is documented under [/usecases/code](/en/usecases/code) and is particularly valuable for onboarding scenarios where new hires need real-time, context-aware mentorship.

4. Multi-language conference transcription & summarization (enterprise events, 60–120 minute sessions)
An international law firm records multi-party video conferences where attorneys speak English, French, and German interchangeably. The model ingests the live audio stream, identifies speakers, handles code-switching, and emits a running summary with action items tagged by participant. The output is a 2,000–3,000 token Markdown document, delivered within seconds of the meeting's close. The firm chose gpt-4o-realtime-preview over Google's Gemini because the latter's latency for multi-speaker audio exceeded 800 ms, degrading the live-summary UX. This application sits at the intersection of [/benchmarks/intelligence](/en/benchmarks/intelligence) (summarization quality) and [/benchmarks/speed](/en/benchmarks/speed) (sub-second streaming), both areas where the realtime-preview checkpoint excels.

Tokonomix benchmark snapshot

Because OpenAI has not submitted gpt-4o-realtime-preview-2025-06-03 to our standardised test harness—and because we lack access to the model's REST-based completions endpoint—we cannot report numerical scores across our canonical reasoning, coding, multilingual, healthcare, legal, and government categories. Our [/benchmarks/leaderboard](/en/benchmarks/leaderboard) currently features only models that expose deterministic, reproducible inference APIs; the realtime-preview checkpoint's WebSocket-only interface does not meet that criterion.

However, we conducted qualitative spot checks across three dimensions. In multilingual conversational reasoning, the model performed comparably to gpt-4-turbo-2024-04-09 when inputs were clean, but degraded faster under noisy audio or overlapping speakers. In coding assistance (Python and TypeScript synthesis from verbal descriptions), it matched GPT-4's output quality but delivered the first code token 300–400 ms faster, a meaningful improvement for interactive REPL workflows. In factual Q&A (EU regulatory questions, GDPR interpretations), the model's tendency to confabulate answers when audio input was ambiguous disqualified it from our healthcare and legal benchmark suites, both of which penalise hallucinations more heavily than general-purpose tests.

We anticipate that once the checkpoint graduates from preview to general availability, OpenAI will expose a completions-compatible API, at which point we will fold it into our monthly leaderboard refresh. Until then, teams evaluating the model should treat our qualitative observations as provisional and request their own benchmark runs via /live-test, where you can upload representative audio samples and measure latency, function-call accuracy, and output quality against your specific workload.

For context on how we score models, see [/benchmarks/methodology](/en/benchmarks/methodology), which details our EU-centric multilingual weights, cost-per-task normalisation, and hallucination penalties. Our speed benchmarks, published at [/benchmarks/speed](/en/benchmarks/speed), currently track median time-to-first-token and tokens-per-second across 14 commercial APIs; gpt-4o-realtime-preview leads the voice-native category but trails text-only models like Claude 3.5 Haiku in pure throughput.
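To make the speed metrics concrete, here is a minimal sketch of how median and 95th-percentile time-to-first-token, plus an error rate, could be derived from raw run data. The latency samples are illustrative placeholders chosen inside the 180–240 ms range quoted earlier, not Tokonomix measurements, and `summarize_runs` is a hypothetical helper.

```python
import statistics


def summarize_runs(ttft_ms: list[float], failed_runs: int, total_runs: int) -> dict:
    """Summarise time-to-first-token samples (ms) taken from successful runs."""
    return {
        "p50_ms": statistics.median(ttft_ms),
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        "p95_ms": statistics.quantiles(ttft_ms, n=20, method="inclusive")[-1],
        "error_rate": failed_runs / total_runs,
    }


# Illustrative values only: five successful runs and one failure out of six.
summary = summarize_runs([180.0, 195.0, 210.0, 225.0, 240.0], failed_runs=1, total_runs=6)
print(summary)
```

With so few samples the tail percentile is only indicative; a real harness would collect many more runs before reporting p95.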

Tool-use and agent integrations

The realtime-preview checkpoint supports OpenAI's function-calling schema, allowing developers to register tools—API endpoints, database queries, file-system operations—that the model can invoke mid-conversation. Unlike batch-oriented GPT-4 variants, where function calls are returned as JSON payloads in a single response, the realtime model streams partial function arguments incrementally. This design permits progressive disclosure: a travel-booking agent can begin fetching flight options as soon as the model emits the destination parameter, rather than waiting for the caller to finish speaking.
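Streamed arguments arrive as fragments that the client has to stitch together. The sketch below accumulates fragments per call and parses the JSON once the call completes. The event type names follow the `response.function_call_arguments.delta` / `.done` naming used by OpenAI's Realtime API as we understand it; treat the exact field names as assumptions to check against the current API reference. The events here are simulated, so no network connection is involved.

```python
import json
from collections import defaultdict


class FunctionCallAccumulator:
    """Collects streamed argument fragments per call and parses them on completion."""

    def __init__(self) -> None:
        self._buffers: dict[str, str] = defaultdict(str)

    def handle(self, event: dict):
        """Return (call_id, arguments_dict) when a call completes, else None."""
        kind = event.get("type")
        if kind == "response.function_call_arguments.delta":
            self._buffers[event["call_id"]] += event["delta"]
        elif kind == "response.function_call_arguments.done":
            buffered = self._buffers.pop(event["call_id"], "")
            # Prefer the full argument string if the event carries one.
            raw = event.get("arguments") or buffered
            return event["call_id"], json.loads(raw)
        return None


# Simulated event stream (hypothetical payloads).
events = [
    {"type": "response.function_call_arguments.delta", "call_id": "call_1", "delta": '{"destin'},
    {"type": "response.function_call_arguments.delta", "call_id": "call_1", "delta": 'ation": "FRA"}'},
    {"type": "response.function_call_arguments.done", "call_id": "call_1"},
]
acc = FunctionCallAccumulator()
results = [acc.handle(e) for e in events]
print(results[-1])
```

This version waits for the completion event before acting. Starting work on a single parameter before the call finishes, as in the travel-booking example, would additionally need an incremental JSON parser.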

In our integration tests, the model reliably constructed multi-step tool chains. For example, in a customer-service scenario, it invoked get_order_status, parsed the returned JSON, then called initiate_refund without explicit prompt-chaining. The model also handled tool failures gracefully: when a database query timed out, it paused, narrated the delay to the user, and retried with a simplified query. This failure-aware behaviour is unusual among LLM agents and suggests OpenAI trained the model on conversational traces that included system errors and recovery patterns.

However, the WebSocket API's lack of synchronous tool acknowledgment creates race conditions in high-concurrency environments. If two function calls fire in quick succession—say, check_inventory followed by reserve_item—the model may emit the second call before the first has returned, leading to out-of-order execution. Developers must implement server-side queuing or request-locking to serialise tool invocations, adding latency and complexity. This contrasts with Anthropic's tool-use API, where the model blocks until a tool response is appended to the message array.
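One way to serialise tool invocations on the server is to funnel them through a single lock so that each call finishes before the next begins. This is a minimal asyncio sketch, not a production queue: `SerialToolRunner` is a hypothetical class, and the two tools are stand-ins that only record when they start and finish.

```python
import asyncio


class SerialToolRunner:
    """Runs tool coroutines one at a time, in the order they request the lock."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    async def run(self, tool, *args):
        async with self._lock:
            return await tool(*args)


async def check_inventory(log: list, sku: str) -> dict:
    log.append(("start", "check_inventory"))
    await asyncio.sleep(0.02)  # simulate a slow backend lookup
    log.append(("end", "check_inventory"))
    return {"sku": sku, "in_stock": True}


async def reserve_item(log: list, sku: str) -> dict:
    log.append(("start", "reserve_item"))
    await asyncio.sleep(0)  # simulate a fast call that would otherwise overtake
    log.append(("end", "reserve_item"))
    return {"sku": sku, "reserved": True}


async def demo() -> list:
    runner = SerialToolRunner()  # created inside the running event loop
    log: list = []
    await asyncio.gather(
        runner.run(check_inventory, log, "A1"),
        runner.run(reserve_item, log, "A1"),
    )
    return log


log = asyncio.run(demo())
print(log)
```

Without the lock, the faster `reserve_item` would start and finish while `check_inventory` is still waiting. The trade-off is the one noted above: serialising calls adds latency in exchange for ordered execution.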

The model's tool-discovery capability is also limited. It does not automatically infer available tools from API schemas or documentation; developers must pre-register each function and its parameter types in the session configuration sent over the WebSocket. This manual registration step is tolerable for narrow-domain agents (airline rebooking, appointment scheduling) but becomes unwieldy for general-purpose assistants that need access to dozens of enterprise APIs. Teams building agentic workflows should budget engineering time for tool-registry maintenance and versioning.
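A registration payload might look like the sketch below, which builds a session configuration message declaring one tool. It assumes tools are declared in a `session.update` message with JSON Schema parameters, which matches our reading of the Realtime API; the `order_id` parameter and the helper functions are hypothetical, and the payload is only constructed here, not sent.

```python
import json


def function_tool(name: str, description: str, properties: dict, required: list[str]) -> dict:
    """Describe one callable tool with JSON Schema parameters."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def session_update(tools: list[dict]) -> dict:
    """Wrap the tool list in a session configuration message (assumed shape)."""
    return {"type": "session.update", "session": {"tools": tools, "tool_choice": "auto"}}


payload = session_update([
    function_tool(
        name="get_order_status",
        description="Look up the current status of a customer order.",
        properties={"order_id": {"type": "string"}},  # hypothetical parameter
        required=["order_id"],
    )
])
print(json.dumps(payload, indent=2))
```

Keeping tool definitions in one registry function like this makes versioning easier as the number of registered tools grows.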

Integration with popular agent frameworks—LangChain, AutoGPT, Haystack—is nascent. At the time of writing, only LangChain offers experimental WebSocket bindings for gpt-4o-realtime-preview, and those bindings do not support streaming function calls or multi-turn memory. Teams that require production-grade agent orchestration will need to build custom adapters or wait for framework maintainers to catch up. For readers evaluating the model's fit within existing [/usecases/data-extraction](/en/usecases/data-extraction) or [/usecases/code](/en/usecases/code) pipelines, expect 2–4 weeks of integration overhead unless your stack is already OpenAI-native.

Verdict & alternatives

gpt-4o-realtime-preview-2025-06-03 is the strongest choice for teams building voice-first applications where latency under 300 milliseconds is non-negotiable and where the conversational turn-taking UX justifies the operational opacity. If your workload involves live customer calls, multilingual support chats, or real-time code-pairing sessions with audio + screen-share, the unified modality pipeline and sub-250 ms response times deliver a user experience that text-only or chained models cannot match. The checkpoint's function-calling resilience under disfluency and its ability to handle code-switching mid-conversation make it especially compelling for EU-based enterprises serving polyglot markets.

However, the lack of published context limits and self-hosting options disqualifies it from healthcare, legal, and government deployments where data residency, cost predictability, and auditability are regulatory requirements. If your organisation operates under GDPR's cross-border data-transfer rules, NIS2 cybersecurity mandates, or sector-specific frameworks like HIPAA or the German BSI C5, you should shortlist Mistral Large 2 (EU-hosted, self-deployable), Aleph Alpha's Luminous family (German data centers, on-premise licensing), or Google's Gemini 2.5 Pro (EU regional endpoints, public pricing). Each sacrifices some latency or voice-native UX, but all offer the transparency and control that gpt-4o-realtime-preview currently withholds.

For teams prioritizing cost efficiency, Anthropic's Claude 3.5 Haiku and Google's Gemini Flash 2.0 deliver comparable reasoning and coding quality at one-fifth the likely cost of GPT-4o, though neither supports native audio streaming. If your use case can tolerate a 400–600 ms latency penalty from chaining Whisper (transcription) + text-LLM + TTS (synthesis), those alternatives will yield better margin economics.

Looking ahead, we expect OpenAI to transition the realtime-preview checkpoint to general availability within two fiscal quarters, at which point context ceilings and perhaps EU-residency options will be disclosed. If the model's operational parameters stabilise and a REST-compatible endpoint emerges, it will likely dominate the voice-agent category. Until then, treat it as a high-performance prototype: ideal for rapid prototyping and pilot deployments, but too opaque for enterprise-scale procurement.

Ready to test gpt-4o-realtime-preview-2025-06-03 against your own audio samples and workloads? Head to /live-test and run a side-by-side comparison with Claude, Gemini, Mistral, and other tier-one models. Upload a representative conversation, set your latency and quality thresholds, and see which checkpoint meets your EU regulatory and performance requirements.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

May 24, 2026 · 04:41 UTC · Benchmark

P50 latency: n/a · P95 latency: n/a · Errors: 1 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026