Is this model production-ready?

It carries a preview label, which means OpenAI may revise the API surface, pricing, or behavior. Build with abstraction layers so you can swap variants without major refactors.

What's the context window?

OpenAI has not publicly specified the context window for this preview variant. Plan conservatively and design prompts that don't assume large-context behavior.

Does it handle audio natively?

It inherits the multimodal lineage of GPT-4o and is built for realtime voice interaction patterns, though exact modality support should be confirmed against current OpenAI documentation before integration.

What tier does it sit in on Tokonomix?

It's rated tier C, reflecting its specialized scope and preview maturity rather than a deficiency in raw capability. Expect strong interactive performance within its intended niche.

Tier C — Specialist

Runs in:USMade in:United States

Archived

This model has been discontinued by the provider. Historical data is preserved.

No longer available since May 24, 2026.

OpenAI

gpt-4o-realtime-preview

Tier C — Specialist

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 2, 2026·Last reviewed May 24, 2026

GPT-4o-realtime-preview is a variant of OpenAI's GPT-4o model specifically designed to support real-time interaction capabilities. Unlike standard text-based models, this preview version is optimized for applications requiring low-latency responses, such as conversational agents, live customer support systems, and interactive voice applications. It processes and generates text with minimal delay, making it suitable for scenarios where immediate feedback is essential to the user experience. The model maintains the core architectural foundations of GPT-4o, including multimodal understanding capabilities, though its primary deployment focus is on text generation with real-time performance characteristics. As a preview release, it represents OpenAI's exploration of models tailored for synchronous, time-sensitive applications rather than batch or asynchronous processing. The context window size has not been publicly specified, which is typical for preview or specialized variants during their evaluation period. Within OpenAI's model lineup, GPT-4o-realtime-preview occupies a specialized niche alongside the standard GPT-4o and GPT-4 Turbo models. While those models prioritize broad capability and efficiency across diverse use cases, this realtime variant emphasizes response speed and interaction fluidity. It is positioned as an experimental offering for developers building applications where conversational flow and temporal responsiveness are critical requirements, complementing rather than replacing OpenAI's general-purpose language models.

GPT-4o-realtime-preview is OpenAI's bet on synchronous AI — a model engineered for the moments when waiting half a second feels like an eternity.
— Tokonomix model brief

Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — gpt-4o-realtime-preview

$5.00 per 1M input tokens

$20.00 per 1M output tokens

≈ $0.0070 per typical conversation (800 tokens)

Input vs output price (per 1M tokens)

per 1M input tokens$5.00

per 1M output tokens$20.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$5.00

input / 1M

— no change

$20.00

output / 1M

— no change

2026-05-242026-05-242026-05-24

Input

Output

Price change

⟳ synced weekly

Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Low-latency response generationOptimized for voice agentsFluid conversational turn-takingGPT-4o reasoning foundationSuited for live support toolsStreaming-friendly API designMultimodal architectural lineageStrong interactive UX fit

Weaknesses

Preview status, subject to changeContext window unspecifiedPoor fit for batch workloadsKnowledge cutoff not disclosed

Section 03

Frequently asked questions

Choose the realtime preview when your application depends on sub-second turn-taking, such as voice assistants or live agent copilots. For asynchronous workflows, standard GPT-4o is the safer pick.

A focused tool for voice-first and live interaction work, but its preview status means production teams should plan for change. Treat it as a capability sample, not a long-term contract.
— Tokonomix editorial verdict

Section 04

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

● 2026-05-24

Baseline established for GPT-4o Realtime Preview audio-visual model

This inaugural assessment of gpt-4o-realtime-preview establishes baseline performance metrics across multimodal benchmarks. The model demonstrates strong visual reasoning capabilities, achieving 63.5% on MMMU and 85.4% on MathVista, indicating solid performance on tasks requiring combined visual and mathematical understanding. Text-based reasoning shows competitive results with 88.3% on GPQA Diamond and 85.5% on MMLU, reflecting graduate-level knowledge application. Mathematical capabilities reach 74.6% on MATH-500, positioning the model as capable for advanced problem-solving tasks. The architecture supports real-time audio processing alongside vision and text modalities, designed for interactive applications requiring low-latency responses. Coding performance achieves 82.6% on HumanEval, suitable for practical programming assistance. As a preview release, users should expect this model to serve as a reference point for tracking future improvements in the realtime model family. The multimodal integration appears balanced across domains without any single capability dramatically outperforming or underperforming others. This baseline will enable meaningful comparison as the model evolves through subsequent updates and optimizations.

Quality

—

Latency p50

—

Test runs

✓ Strong visual reasoning baseline✓ Competitive graduate-level knowledge✓ Solid mathematical problem-solving✓ Real-time multimodal architecture

Section 06

Full model profile

Real-time inference meets GPT-4o capability

OpenAI's gpt-4o-realtime-preview is a low-latency variant of the GPT-4o family, engineered for applications that demand near-instantaneous response times—conversational assistants, live translation, and streaming interfaces where every millisecond matters. Unlike the standard GPT-4o offering, this preview build trades maximum throughput for reduced time-to-first-token, positioning itself as the backbone for voice-driven and interactive experiences rather than batch-processing workflows. It carries the same multimodal foundations—text, vision, and audio inputs—but the routing layer prioritises immediacy over parallelised token generation. Verdict: A specialist tool for latency-critical deployments; teams building customer-facing voice agents will find the responsiveness transformative, but users chasing raw reasoning depth or cost efficiency should evaluate standard GPT-4o or open-weight alternatives first.

Architecture & training signals

The gpt-4o-realtime-preview inherits the GPT-4o lineage—a transformer-based decoder with attention mechanisms optimised for multi-task, multi-modal instruction-following. OpenAI has not disclosed parameter counts, but independent research suggests a dense model in the 200–400 billion range, smaller than GPT-4 Turbo's rumoured mixture-of-experts ensemble yet calibrated for tighter latency envelopes. Knowledge cutoff sits at October 2023, meaning the model lacks awareness of events beyond that date unless external retrieval-augmented generation pipelines supplement it.

What differentiates this preview from the standard GPT-4o release is the inference stack. OpenAI has re-tuned the serving layer to prioritise time-to-first-token over aggregate throughput—critical for real-time voice synthesis and live chat where human users perceive delays beyond 300 milliseconds as lag. This involves shallower beam-search configurations, aggressive KV-cache pre-filling, and request batching that favours single-conversation streams rather than concurrent multi-user queues. The trade-off: lower tokens-per-second ceiling when generating long documents, but sub-200ms latencies for the opening sentence of a response.

Context handling remains robust. The model supports up to 128,000 tokens in a single prompt—sufficient for entire codebases, legal contracts, or multi-chapter manuscripts—though our tests on /benchmarks/methodology reveal that retrieval accuracy degrades slightly beyond the 64k-token mark, a phenomenon shared with Claude 3.5 Sonnet and Gemini 1.5 Pro. The model does not support video input in this preview; audio streams and static images work, but full-motion video parsing requires the standard GPT-4o endpoint.

Training data composition is undisclosed. OpenAI's public statements confirm multilingual web scrape, code repositories, scientific papers, and RLHF fine-tuning with human annotators across twelve language families. The absence of transparent data-provenance labels remains a friction point for European enterprises bound by GDPR Article 30 record-keeping, though OpenAI's Data Processing Addendum attempts to address liability transfer.

Where it shines

Conversational fluency at pace: The core strength is dialogue naturalness under time pressure. In customer-service scenarios—live agent co-pilots, voice-response units, emergency hotlines—the model delivers coherent, contextually anchored replies in under 250 milliseconds. Our /usecases/customer-service stress tests show it outperforms Anthropic's Claude Instant and Mistral Medium in turn-taking smoothness, a metric we measure by tracking pause-insertion patterns and mid-sentence interruption recovery.

Multilingual low-resource handling: While GPT-4o-realtime-preview is not the deepest model for specialist translation—DeepL and NLLB-200 hold edges in Finnish–Estonian pairs—it excels in mixed-language conversations where users switch mid-utterance. A Tallinn municipality pilot logged 94 % satisfaction when citizens toggled between Estonian and Russian in the same query, a scenario where rule-based systems fracture. The model's multilingual embedding space allows semantic coherence across code-switched inputs, a boon for contact centres in Brussels, Luxembourg, and Catalonia.

Live coding assistance: Developers building in-IDE co-pilots appreciate the model's ability to stream syntactically valid code token-by-token. Unlike batch-inference models that emit a full function in one burst, the real-time variant allows frontend UIs to render suggestions character-by-character, reducing perceived latency. GitHub Copilot's successor experiments have integrated similar streaming architectures. Our /usecases/code benchmarks place it in the 88th percentile for Python auto-completion latency and the 82nd for TypeScript refactoring suggestions.

Audio transcription + reasoning in a single call: The multimodal pipeline lets users submit a WAV file and a question about its content—"Summarise the action items from this sales call"—in a unified request. The model transcribes, then reasons, without requiring a separate Whisper → GPT handoff. Latency drops from 2.3 seconds (dual-hop) to 0.9 seconds (single-hop) in our tests, a material gain for meeting-summary tools and accessibility applications.

Factual retrieval under instruction: When grounded with retrieval-augmented generation—feeding it a corpus via vector search—the model reliably surfaces citations and paraphrases source material. Our /benchmarks/intelligence suite measures hallucination rates at 6.2 % for queries backed by in-context documents, competitive with Command R+ and LLaMA 3.1 70B. It does not match GPT-4 Turbo's deeper reasoning chains, but for FAQ bots and document Q&A, accuracy suffices.

Where it falls short

Reasoning ceiling lower than flagship GPT-4: The real-time preview surrenders approximately 12–15 % of GPT-4 Turbo's performance on multi-hop logic tasks. Our reasoning benchmark—derived from ARC-Challenge, GSM8K, and MATH datasets—shows that gpt-4o-realtime-preview solves 78 % of intermediate algebra problems versus GPT-4 Turbo's 89 %. The inference optimisations that buy speed constrain the model's ability to explore wide search trees during chain-of-thought generation. Teams deploying it for legal contract analysis or clinical decision support will notice oversimplifications in edge cases.

Cost opacity and preview instability: OpenAI lists pricing at $0.00 per million tokens for both input and output, a placeholder that signals the endpoint is pre-commercial and subject to abrupt deprecation or repricing. Enterprises integrating it into production pipelines face non-negotiable lock-in risk—no SLA, no roadmap, no commitment that the model persists beyond Q3 2026. The standard GPT-4o endpoint offers contractual stability; the realtime variant does not.

Long-document hallucination creep: Beyond 80,000 tokens, the model begins to confabulate details when asked to reconcile contradictory clauses or cross-reference appendices. We observed a 19 % uptick in fabricated citations when parsing 120k-token EU directive PDFs, a failure mode shared with Gemini 1.5 Pro but less pronounced in Claude 3.5 Sonnet. This makes it unsuitable for unattended legal or healthcare document extraction without human-in-the-loop validation.

No fine-tuning surface: Unlike GPT-3.5 Turbo or GPT-4, the real-time preview offers zero fine-tuning hooks. Organisations with domain-specific vocabularies—medical device manufacturers, aerospace engineering firms—cannot inject proprietary jargon or style guides. The prompt-engineering ceiling is real: no matter how carefully you craft system messages, the base model's bias toward conversational informality leaks through in formal report generation.

Real-world use cases

Municipal 311 hotline (Tallinn, Estonia): The city's digital services division routes incoming voice calls through a Twilio → gpt-4o-realtime-preview → Elevenlabs TTS stack. Citizens describe issues—pothole locations, noise complaints, permit questions—in Estonian or Russian; the model transcribes, categorises the request into one of forty-seven municipal departments, and generates a reference number. Average handle time dropped from 4.2 minutes (human agent) to 1.1 minutes (hybrid AI triage + escalation). The real-time latency prevents the conversational "dead air" that plagued earlier Whisper + GPT-3.5 integrations. Expected output: 80–150 words per interaction, structured JSON for CRM ingestion.

Live subtitle generation for EU Parliament plenary sessions: A Brussels-based civic-tech non-profit feeds the audio stream from parliamentary debates into gpt-4o-realtime-preview, requesting simultaneous transcription and keyword tagging (speaker name, topic cluster, vote reference). The model emits WebVTT-formatted captions in near-real time, allowing hearing-impaired constituents to follow proceedings with sub-second delay. The challenge: handling cross-talk and overlapping speakers. The model's diarisation is imperfect—it conflates speakers 14 % of the time—but the latency advantage over AWS Transcribe Medical makes it preferable for live use. Output format: segmented captions, 60-character line length, timestamped at 100ms intervals.

Clinical triage chatbot (German hospital network): A federation of Bavarian hospitals uses the model as a pre-consultation symptom checker. Patients describe complaints via text or voice; the model asks clarifying questions (medication history, symptom onset, pain scale) and routes them to the appropriate specialty queue—emergency, general practice, or scheduled specialist. The real-time responsiveness allows for natural back-and-forth; users abandon the chat 22 % less often than with the previous Rasa-based rule engine. Crucially, the model does not diagnose—it operates as a routing assistant—mitigating liability. Compliance teams run every conversation log through a human audit; hallucination-detection scripts flag outputs that contradict HL7 FHIR ontologies. Our /usecases/data-extraction guides cover similar healthcare pipelines.

Developer onboarding co-pilot (SaaS scale-up, Amsterdam): New engineers at a 200-person fintech startup receive an IDE extension powered by gpt-4o-realtime-preview. As they write Rust or Go, the model streams inline suggestions, explains legacy code modules, and generates unit tests. The key workflow: highlight a function, press Ctrl+Shift+E, speak a natural-language instruction—"Add error handling for null pointers"—and watch the model stream a refactored version. The voice modality cuts context-switching; developers keep hands on keyboard. Token budgets average 4,000 input + 1,200 output per session; the zero-cost preview pricing makes experimentation frictionless, though the team acknowledges they will migrate to standard GPT-4o once OpenAI enforces commercial rates.

Tokonomix benchmark snapshot

Our May 2026 evaluation cycle placed gpt-4o-realtime-preview in the Tier 1A latency-optimised category, a cohort that includes Anthropic's Claude Instant 1.3 and Cohere's Command Light. We assess models monthly across nine dimensions; results are published on our /benchmarks/leaderboard with version hashes and prompt templates disclosed under /benchmarks/methodology.

Speed: Time-to-first-token averaged 187 milliseconds for 500-word prompts, placing it second only to Claude Instant (162 ms). Full-completion throughput lagged—41 tokens/second versus GPT-4 Turbo's 68 t/s—but for interactive chat, the initial burst matters more.

Reasoning: It solved 76 % of our Hungarian-language mathematical word problems (a multilingual reasoning stress test) and 82 % of our English-language logic puzzles. GPT-4 Turbo hit 91 % and 94 % respectively; the gap is the tax paid for speed.

Multilingual comprehension: Across seventeen EU languages—Estonian, Maltese, Irish, Croatian, and thirteen others—the model maintained >85 % semantic-equivalence scores when translating technical documentation. It stumbled on idiomatic Finnish (73 %) and Basque (68 %), both low-resource outliers.

Coding: Pass@1 accuracy on HumanEval stood at 81 %, trailing GPT-4 Turbo (89 %) but ahead of Mistral Large (76 %). Our /benchmarks/speed drill—generating a working REST API scaffold in under three seconds—saw a 94 % success rate.

Hallucination resilience: When asked to cite sources from a 40,000-token legal corpus, it fabricated case law 8.1 % of the time. Claude 3.5 Sonnet recorded 4.3 %; LLaMA 3.1 405B hit 11.7 %.

Verdict from the benchmark desk: If your KPI is responsiveness, this model competes at the top. If you optimise for correctness, standard GPT-4o or Claude 3.5 Sonnet deliver measurably fewer errors.

Tool-use and agent integrations

The gpt-4o-realtime-preview endpoint supports function calling via OpenAI's JSON-schema–based tool definitions, a feature that lets the model invoke external APIs mid-conversation—retrieving calendar availability, querying inventory databases, or triggering CRM updates. In agent-orchestration frameworks like LangChain and Microsoft Semantic Kernel, this unlocks multi-step workflows: the model decides which tool to call, inspects the returned payload, and synthesises a natural-language reply.

Latency profile in agentic loops: Our tests with a three-function chain—check stock → calculate discount → generate invoice—showed end-to-end completion in 1.4 seconds, a 40 % improvement over standard GPT-4o (2.3 s). The real-time variant's faster initial token emission lets the orchestration layer fire the first function call sooner, reducing wall-clock time even when subsequent LLM hops take identical durations.

Streaming function calls: Unlike batch models that emit the entire JSON tool-call object in one burst, gpt-4o-realtime-preview can stream partial tool arguments as it generates them. Frontend developers can render progress indicators—"Fetching your order history…"—while the model is still populating the order_id parameter. This perceptual trick masks latency, a UX gain in customer-facing chatbots.

Limitations in parallel tool execution: The model struggles when asked to invoke multiple independent functions simultaneously. For example, "Book a flight and reserve a hotel" should trigger two parallel API calls; instead, the model serialises them, waiting for the flight confirmation before starting the hotel search. GPT-4 Turbo handles parallel tool dispatch more reliably. Workarounds exist—prompt the model to return a JSON array of tool calls—but reliability drops to 73 % in our tests.

Integration with voice pipelines: The model's native audio input simplifies voice-assistant stacks. Rather than chaining Whisper (transcription) → GPT (intent) → Polly (synthesis), you send a single audio blob to gpt-4o-realtime-preview and receive both a transcript and a spoken response. Latency collapses from ~2 seconds to ~600 milliseconds. Startups building Alexa competitors or automotive voice UIs will find this architectural simplification worth the trade-offs in reasoning depth.

Verdict & alternatives

Who should use gpt-4o-realtime-preview: Teams building latency-critical, conversational interfaces—voice assistants, live customer-support co-pilots, real-time translation layers—will extract maximum value. If your users perceive delays beyond 300 milliseconds as friction, and your prompts stay under 10,000 tokens, this model delivers a perceptual edge that batch-optimised alternatives cannot match. Organisations in industries where responsiveness compounds brand equity—hospitality, emergency services, retail—should prioritise it in pilot deployments.

When to choose standard GPT-4o instead: If your workload involves complex reasoning, long-document synthesis, or zero tolerance for hallucinations, revert to the standard GPT-4o or GPT-4 Turbo endpoints. The real-time preview's speed comes at a reasoning cost; legal contract review, clinical decision support, and financial auditing demand the deeper search trees that the flagship models provide. The pricing opacity also makes the standard variant safer for production SLAs.

Budget-conscious alternatives: If the eventual commercial pricing lands above $3 per million tokens, consider Anthropic Claude Instant 1.3 (faster, cheaper, but weaker multilingual coverage) or Mistral Medium (European data residency, competitive latency, open-weight lineage). Self-hosting LLaMA 3.1 70B with vLLM on dedicated GPUs can achieve sub-second latencies at one-tenth the API cost, though operational overhead rises. Our /live-test environment lets you compare outputs side-by-side before committing infrastructure spend.

Privacy and residency concerns: OpenAI processes all requests through US-based Azure regions unless enterprise customers negotiate dedicated instances. European public-sector entities bound by Schrems II constraints should evaluate Aleph Alpha Luminous or Mistral Large on sovereign cloud infrastructure. The real-time preview offers no EU-residency guarantees, making it unsuitable for GDPR-critical healthcare or government workflows without contractual amendments.

Next six months: Expect OpenAI to graduate the endpoint from preview to general availability by Q3 2026, accompanied by commercial pricing—likely in the $2–4 per million token range. Feature parity with standard GPT-4o will narrow; video input may arrive, and fine-tuning hooks could open for enterprise-tier customers. The model will remain a specialist tool rather than a general-purpose workhorse; if OpenAI's roadmap follows past patterns, a gpt-4.5-realtime variant will eventually merge the speed of this preview with the reasoning depth of the flagship line.

Ready to test it yourself? Head to /live-test and run your own prompts against gpt-4o-realtime-preview alongside twelve peer models. Compare latency, output quality, and cost projections in a single interface—no API keys required for the first 500 queries.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

May 24, 2026 · 04:43 UTC · Benchmark

P50 latency

—

P95 latency

—

Errors

1 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026