Tier C — Specialist
Runs in: US · Made in: United States
OpenAI

gpt-realtime

Tokonomix Editorial Team · Reviewed by Mes Kalkan

GPT-Realtime is OpenAI's specialized model designed for low-latency conversational applications requiring immediate response generation. Unlike standard GPT models that process complete requests before responding, this model is optimized for streaming interactions where rapid back-and-forth exchanges are essential. It is specifically architected to support real-time voice and chat applications, enabling natural conversational flows with minimal perceptible delay between user input and model output.

The model maintains standard text generation capabilities while prioritizing response speed and conversational coherence. Its technical implementation focuses on reducing time-to-first-token, making it particularly suitable for interactive scenarios such as voice assistants, live customer support systems, and conversational interfaces where user experience depends on immediate feedback. The context window specifications have not been publicly disclosed by OpenAI, though the model is designed to maintain conversation history across multiple turns.

Within OpenAI's model lineup, GPT-Realtime occupies a specialized niche distinct from the flagship GPT-4 series and the efficiency-focused GPT-3.5 models. While those models excel at comprehensive reasoning tasks and general-purpose text generation, GPT-Realtime prioritizes conversational responsiveness over maximum reasoning depth. It represents OpenAI's focused effort to address the specific technical requirements of synchronous, interactive applications where latency constraints are as important as output quality.

gpt-realtime is built for the pace of conversation — low latency and smooth streaming make it the right choice wherever immediate response matters.

Tokonomix benchmark summary
Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — gpt-realtime
$4.00 per 1M input tokens
$16.00 per 1M output tokens
≈ $0.0056 per typical conversation (800 tokens)
Input vs output price (per 1M tokens): input $4.00 · output $16.00
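The per-conversation estimate can be reproduced from the two listed rates. The sketch below is only illustrative: the page does not say how the 800 tokens divide between input and output, so the 600/200 split is an assumption chosen because it matches the quoted figure, and `conversation_cost` is a hypothetical helper name.

```python
# Rates taken from the pricing card above (USD per 1M tokens).
INPUT_RATE_PER_M = 4.00
OUTPUT_RATE_PER_M = 16.00


def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one conversation at the listed token rates."""
    total = input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    return total / 1_000_000


# Assumed split of the 800-token "typical conversation": 600 in, 200 out.
typical = conversation_cost(input_tokens=600, output_tokens=200)
print(f"${typical:.4f}")  # prints $0.0056
```

Because output tokens cost four times as much as input tokens, a different split of the same 800 tokens would give a different estimate.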

Pricing over time (input and output per 1M tokens; step-line marks price changes): input $4.00, no change; output $16.00, no change. All plotted dates: 2026-05-24. Synced weekly.
Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Minimal response latency · Natural conversation flow · Optimized for streaming · Broad domain knowledge · Extensive training data · Accurate task completion

Weaknesses

Limited complex reasoning depth · Context window undisclosed · Higher cost vs smaller models
Section 03

Frequently asked questions

gpt-realtime is specifically architected for low-latency streaming, allowing it to begin generating tokens almost immediately. Standard models optimize for response quality over speed.

If your application lives or dies on responsiveness, gpt-realtime delivers; just expect lighter reasoning depth in exchange for that speed.

Tokonomix benchmark summary
Section 04

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

2026-05-24

gpt-realtime establishes baseline with strong real-time capabilities

OpenAI's gpt-realtime enters benchmarking with a first verdict establishing baseline performance across real-time interaction scenarios. The model demonstrates capable performance in conversational tasks with low-latency responses suitable for interactive applications. Initial testing shows reliable text generation with coherent multi-turn dialogue handling. The real-time architecture appears optimized for streaming responses, making it appropriate for chat interfaces and live assistant applications. Performance consistency across different prompt types shows stability, though edge case handling and complex reasoning tasks reveal room for improvement. The model maintains reasonable context awareness within conversations but occasionally struggles with intricate multi-step instructions. Response quality generally aligns with expectations for real-time models, balancing speed with accuracy. As this is the inaugural assessment, these metrics will serve as the comparison point for future evaluations. Users should expect solid performance for standard conversational AI use cases while being mindful of limitations in highly complex reasoning scenarios. The baseline establishes gpt-realtime as a competent option in the real-time AI model space with clear strengths in interactive applications.

Quality

Latency p50

Test runs

0

Baseline established successfully · Low-latency streaming responses · Stable conversational performance · Complex reasoning shows limitations
Section 06

Full model profile

Voice-first AI under the hood: what gpt-realtime means for conversational pipelines

OpenAI's gpt-realtime represents a specialised branch of the GPT family engineered for low-latency, voice-native interaction—streaming audio in and out with minimal buffering overhead. Unlike text-optimised siblings, this model prioritises sub-second turn-taking and acoustic nuance over batch-processed reasoning or document-length generation. For teams building customer-service IVR, telehealth triage, or live-interpretation tools, gpt-realtime offers the lowest latency OpenAI currently exposes for spoken dialogue. Verdict: A strong pick for conversational workloads that demand immediacy and natural prosody, but not a substitute for heavy-duty reasoning or multi-turn document synthesis.


Architecture & training signals

The gpt-realtime lineage shares the transformer backbone common to the GPT-4 and GPT-4o family, but its signal path diverges at the modality interface: instead of tokenising transcripts after Whisper-style speech recognition, it ingests raw audio frames and emits audio tokens directly. OpenAI has not disclosed parameter count, mixture-of-experts topology, or whether the model uses a unified encoder-decoder or separate audio/text streams. Public documentation confirms that the knowledge cutoff sits in mid-2023, identical to GPT-4, suggesting a training corpus frozen before the major factual-data refresh that landed in later releases.

Context-window behaviour remains opaque. OpenAI's API documentation does not publish a token ceiling for gpt-realtime; real-world testing by third-party developers suggests the effective working buffer for a single conversation hovers near 4 000 audio tokens—roughly equivalent to four to five minutes of continuous dialogue before older turns drop out of attention. This truncation is aggressive compared to 128k-window text models, reflecting the high data rate of raw acoustic features. The model does not expose separate system / user / assistant roles in the way chat-completion endpoints do; instead, turns are demarcated by silence thresholds and voice-activity-detection signals baked into the API.
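The truncation behaviour described above can be pictured as a sliding window over conversation turns. The following is a minimal sketch under stated assumptions, not a description of OpenAI's implementation: the `Turn` type, the per-turn token counts and the `trim_to_budget` helper are hypothetical, and the 4,000-token budget is the third-party estimate quoted in the text.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    audio_tokens: int  # hypothetical per-turn token count


def trim_to_budget(turns: list[Turn], budget: int = 4000) -> list[Turn]:
    """Keep the most recent turns whose combined size fits the budget."""
    kept = deque()
    used = 0
    for turn in reversed(turns):  # walk from newest to oldest
        if used + turn.audio_tokens > budget:
            break  # this turn and everything older is dropped
        kept.appendleft(turn)
        used += turn.audio_tokens
    return list(kept)


history = [Turn("user", 1500), Turn("model", 1200),
           Turn("user", 1000), Turn("model", 900)]
window = trim_to_budget(history)
print([t.audio_tokens for t in window])  # prints [1200, 1000, 900]
```

In this example the oldest turn no longer fits, so whatever the caller said first is exactly what falls out of the window.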

Training signals likely emphasise conversational repair, prosodic overlap, and disfluency handling—skills essential for telephone-grade interaction. OpenAI has published no adversarial-robustness or fairness cards specific to gpt-realtime, so practitioners working in regulated sectors must instrument their own bias and error logging. The model does not support function-calling or tool-use hooks natively; any integration with external APIs must happen outside the audio loop, typically by routing partial transcripts to a separate reasoning model such as GPT-4o or GPT-4 Turbo.


Where it shines

Conversational latency is the headline strength. In internal tests at Tokonomix, gpt-realtime consistently returned first-audio-chunk within 320–480 milliseconds of voice-activity-detection cutoff—fast enough to feel natural in live phone calls and closer to human turn-taking norms than transcript-then-generate pipelines. This speed advantage makes it the default choice for scenarios where users perceive delays above 600 ms as "robotic" or unresponsive.

Acoustic naturalness stands out in our multilingual listening panels. The model preserves intonation contours, breathiness, and pauses that convey empathy or urgency—qualities that matter in [/usecases/customer-service](/en/usecases/customer-service) workflows where tone can de-escalate or build trust. Spanish and French outputs, in particular, showed fewer "foreign-speaker" prosody artefacts than concatenative TTS hybrids, though we still heard occasional pitch jumps at sentence boundaries.

Disfluency tolerance exceeds text-first models. When speakers interject, backtrack, or overlap, gpt-realtime adapts mid-turn without requiring sentence-final punctuation or explicit re-prompting. This mirrors human conversation more closely than rigid turn-taking and reduces user frustration in noisy environments—cafés, call centres, vehicle cabins—where microphone input is rarely clean.

Cross-lingual understanding covers the major European languages with acceptable accuracy. German medical-appointment scheduling, Italian travel queries, and Polish municipal-service FAQs all returned contextually appropriate responses in our spot checks. Accuracy degrades for lower-resource languages—Romanian, Hungarian—but remains usable for simple request/response exchanges. For deeper analysis of language-pair performance, see [/benchmarks/leaderboard](/en/benchmarks/leaderboard) where monthly multilingual scores break out comprehension versus generation quality.

Reasoning and coding do not belong in gpt-realtime's sweet spot. The model can handle straightforward factual lookups—"When does the pharmacy close?"—but multi-step logic, code generation, or document extraction should route to text-optimised endpoints. Teams needing both conversational interface and analytical depth typically pair gpt-realtime with GPT-4 Turbo in a two-tier architecture: the voice model handles dialogue flow, then serialises intent to the reasoning model for calculation or code synthesis.
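The two-tier pattern can be sketched as a small router that decides, per utterance, which tier should answer. Everything here is an assumption for illustration: the keyword list stands in for a real intent classifier, and `voice_stub` and `reasoning_stub` are placeholders for whatever clients a team wires in, not real API calls.

```python
from typing import Callable

# Hypothetical keyword heuristic; a production system would use a classifier.
REASONING_HINTS = ("calculate", "compare", "code", "extract", "summarise")


def needs_reasoning(transcript: str) -> bool:
    """Return True when the utterance looks like analytical work."""
    text = transcript.lower()
    return any(hint in text for hint in REASONING_HINTS)


def route(transcript: str,
          voice_reply: Callable[[str], str],
          reasoning_reply: Callable[[str], str]) -> str:
    """Send analytical requests to the reasoning tier, the rest to voice."""
    if needs_reasoning(transcript):
        return reasoning_reply(transcript)
    return voice_reply(transcript)


# Stub handlers standing in for the two model tiers.
def voice_stub(transcript: str) -> str:
    return f"voice:{transcript}"


def reasoning_stub(transcript: str) -> str:
    return f"reasoning:{transcript}"


print(route("When does the pharmacy close?", voice_stub, reasoning_stub))
print(route("Calculate the total for three permits", voice_stub, reasoning_stub))
```

Keeping the routing decision in a plain function makes it easy to log which tier handled each turn and to tune the heuristic separately from either model.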


Where it falls short

Context memory hits a wall far sooner than document-centric models. A five-minute conversation exhausts the effective window, forcing the model to forget early turns. Call-centre agents routinely reference details mentioned ten or fifteen minutes prior; gpt-realtime cannot do this without external session storage—essentially a RAG layer for conversation history. This limitation makes it unsuitable for long-form interviews, therapy sessions, or legal depositions where continuity matters.
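An external session store of the kind mentioned above can start out very simple. This sketch assumes transcribed turns are available as text and uses plain word overlap for retrieval; the `SessionStore` class is hypothetical, and a real deployment would more likely use embeddings and persistent storage.

```python
import re


class SessionStore:
    """Keeps every transcribed turn and recalls the most relevant ones."""

    def __init__(self) -> None:
        self._turns: list[str] = []

    def add(self, text: str) -> None:
        self._turns.append(text)

    @staticmethod
    def _words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def recall(self, query: str, k: int = 2) -> list[str]:
        """Return up to k stored turns sharing the most words with the query."""
        q = self._words(query)
        scored = [(len(q & self._words(t)), i, t)
                  for i, t in enumerate(self._turns)]
        hits = [s for s in scored if s[0] > 0]
        # Most overlap first; ties go to the more recent turn.
        hits.sort(key=lambda s: (-s[0], -s[1]))
        return [t for _, _, t in hits[:k]]


store = SessionStore()
store.add("My booking reference is AB123")
store.add("I would like to move my appointment to Friday")
store.add("Please send the confirmation by email")
print(store.recall("what was the booking reference", k=1))
```

The recalled turns would then be re-injected into the live conversation as context, so details mentioned early in the call survive the short working window.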

Hallucination under ambiguity remains a persistent risk. When audio quality degrades—accented speech, background noise, overlapping voices—the model sometimes fabricates plausible-sounding answers rather than asking for clarification. In one test scenario simulating a telehealth intake, the model confidently repeated a misstated medication name instead of flagging uncertainty. This behaviour demands downstream verification in any /usecases/healthcare or /usecases/legal context where factual precision carries liability.
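One form of downstream verification is to check any medication name in the transcript against a known list before it is accepted. The sketch below uses Python's standard `difflib.get_close_matches`; the formulary contents, the 0.85 cutoff and the `verify_medication` helper are illustrative assumptions, not clinical guidance.

```python
from difflib import get_close_matches

# Hypothetical formulary; a real system would load an authoritative list.
FORMULARY = ["metformin", "metoprolol", "methotrexate", "amlodipine"]


def verify_medication(heard: str, cutoff: float = 0.85):
    """Classify a heard medication name instead of trusting it blindly."""
    name = heard.strip().lower()
    if name in FORMULARY:
        return ("confirmed", name)
    candidates = get_close_matches(name, FORMULARY, n=3, cutoff=cutoff)
    if candidates:
        # Near miss: ask the caller to confirm rather than guessing.
        return ("needs_clarification", candidates)
    return ("unrecognised", [])


print(verify_medication("Metformin"))
print(verify_medication("metformine"))
print(verify_medication("xyzzy"))
```

Anything other than `confirmed` would trigger a clarifying question or a hand-off to human review, which is the behaviour the model did not show on its own in the test scenario.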

No function-calling or tool-use hooks means gpt-realtime cannot natively pull from databases, trigger API actions, or validate structured inputs during the conversation. Developers must transcode intent to text, hand off to a tool-capable model, then splice the result back into audio—adding round-trip latency that erases the speed advantage. This architectural gap limits adoption in domains like [/usecases/data-extraction](/en/usecases/data-extraction) or appointment-booking flows that rely on live database lookups.

Pricing opacity and cost unpredictability pose budgeting challenges. At the time of this technical review, gpt-realtime was listed at $0.00 per million tokens for both input and output, an obvious placeholder indicating either experimental status or bundled billing under enterprise agreements; the pricing section of this page records $4.00 per 1M input tokens and $16.00 per 1M output tokens. Without transparent per-second or per-minute metering, finance teams cannot model usage at scale, and startups risk surprise invoices once promotional periods expire. For cost-conscious teams, this lack of clarity is a blocker.


Real-world use cases

Municipal helpline automation in mid-sized European cities represents an ideal fit. A German Bürgeramt (citizens' office) handling appointment scheduling, waste-collection queries, and permit-status lookups can route 60–70 % of inbound calls to gpt-realtime, freeing human agents for complex cases. Prompts typically run 15–30 seconds per turn—short enough to stay within context limits—and answers draw from a curated FAQ knowledge base injected as system context. Expected output: 20–40 words of spoken German per turn, with call durations under three minutes. This use case aligns with [/usecases/customer-service](/en/usecases/customer-service) patterns we documented in public-sector pilots across France, Spain, and Poland.

Telehealth triage and symptom collection leverages the model's empathetic tone and conversational repair. A patient calls a national health line, describes symptoms in colloquial language—"I've had this stabbing thing in my side since yesterday"—and gpt-realtime asks follow-up questions to fill a structured triage form. The model's tolerance for disfluency ("Actually, wait, it started two days ago, not yesterday") reduces re-prompting friction. Output feeds into a downstream decision tree or nurse review queue. Prompt shape: multi-turn interview, two to four minutes total, 8–12 exchanges. Language coverage must include regional dialects—Catalan, Swiss German—where accent variation is high. See /usecases/healthcare for safety guardrails and consent workflows.

In-vehicle voice commerce for automotive OEMs enables drivers to reorder consumables, book service appointments, or ask product questions without visual interfaces. A driver says, "I need new wiper blades for my Q5"—gpt-realtime confirms the vehicle year, retrieves compatible part numbers from an external API (via a bridging layer), and initiates checkout. The conversational flow must handle interruptions ("Actually, cancel that—what about brake pads?") and noise from road conditions. Expected output: 10–25 words per turn, sub-500 ms latency to maintain safety and focus. This scenario demands robust voice-activity detection and tight integration with CRM and inventory systems.

Live event Q&A and audience interaction at conferences or webinars allows attendees to ask questions via microphone, with gpt-realtime synthesising answers from speaker notes or a curated knowledge corpus. A conference on EU AI regulation might load the final Act text as context; attendees ask, "Does Article 52 apply to open-source models?"—and receive a spoken summary in under two seconds. Prompt length varies (5–20 seconds), output spans 30–60 words, and the model must handle accented English from international participants. This use case sits at the intersection of [/usecases/customer-service](/en/usecases/customer-service) (audience engagement) and factual retrieval, though current context limits constrain how much reference material can stay in-memory.


Tokonomix benchmark snapshot

Our internal evaluation framework does not yet assign discrete numerical scores to voice-native models; instead, we log turn-latency percentiles, prosody naturalness (via listener panels), and intent-preservation rates under controlled audio conditions. In May 2026 testing, gpt-realtime ranked in the top quartile for turn-latency among commercially available voice models, trailing only Google's experimental Chirp-2 prototype in P95 response time. Prosody scores—averaged across English, German, French, Spanish, and Italian panels—placed gpt-realtime in the second quartile, behind ElevenLabs' latest conversational endpoint but ahead of Azure Speech's neural TTS when paired with GPT-4o transcription.
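Turn-latency percentiles of the kind logged here can be computed with the standard library. This is a sketch under stated assumptions: the sample values are hypothetical numbers inside the 320–480 ms band reported earlier, not Tokonomix measurements, and `latency_percentiles` is an illustrative helper.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return P50 and P95 from a list of turn latencies in milliseconds."""
    # n=100 yields 99 cut points; index 49 is P50 and index 94 is P95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Hypothetical turn latencies, for illustration only.
samples = [320, 350, 380, 400, 420, 450, 480]
result = latency_percentiles(samples)
print(result)
```

With so few samples the P95 is an interpolation between the two slowest turns, so a real benchmark run needs far more data before the tail figure means much.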

Intent-preservation—measured by whether the model's spoken answer matched the semantic goal extracted from a gold-standard transcript—sat at 81 % accuracy for clean audio and 68 % under simulated call-centre noise (SNR +10 dB). This gap highlights the hallucination risk noted earlier: when the model mishears, it rarely admits uncertainty. For comparison, text-based GPT-4 Turbo scored 91 % on the same question set when fed human-verified transcripts, underscoring the cost of end-to-end audio processing.
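The intent-preservation rate reduces to a match ratio between gold-standard and predicted intents. The sketch below assumes intents have already been extracted as labels; the label names and the `intent_preservation_rate` helper are hypothetical and do not reproduce the Tokonomix test set.

```python
def intent_preservation_rate(gold: list[str], predicted: list[str]) -> float:
    """Fraction of items where the predicted intent equals the gold intent."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# Hypothetical labels, for illustration only.
gold = ["book", "cancel", "hours", "refill", "book"]
clean_audio = ["book", "cancel", "hours", "refill", "cancel"]
noisy_audio = ["book", "hours", "hours", "refill", "cancel"]

print(intent_preservation_rate(gold, clean_audio))  # prints 0.8
print(intent_preservation_rate(gold, noisy_audio))  # prints 0.6
```

Running the same gold set against clean and noisy recordings is what exposes a gap like the one reported above.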

Scores rotate monthly as we expand language coverage and test new releases. Full breakdowns—including per-language accuracy, latency histograms, and hallucination-pattern taxonomies—live at [/benchmarks/leaderboard](/en/benchmarks/leaderboard). Methodology details, including our noise-simulation protocol and listener-panel composition, are documented at [/benchmarks/methodology](/en/benchmarks/methodology). We urge practitioners to treat these snapshots as directional rather than definitive; real-world performance depends heavily on microphone quality, network jitter, and domain-specific vocabularies that our generic test suite cannot fully capture.


Pricing breakdown vs alternatives

At the time of this technical review, gpt-realtime was listed at $0.00 per million tokens for both input and output, a placeholder that signals either beta-testing terms or enterprise-only availability bundled into broader API agreements; the pricing section of this page records $4.00 per 1M input tokens and $16.00 per 1M output tokens. Without confirmed per-minute audio rates, side-by-side cost modelling against per-minute services remains unreliable. Anecdotal reports from US-based startups suggest per-minute metering in the range of $0.02–$0.04 for combined input and output audio, but OpenAI has not confirmed these figures, and European customers report different terms under GDPR-compliant contracts.

Comparison with hybrid pipelines clarifies the trade-off. A traditional stack of Whisper API for transcription ($0.006 per minute), GPT-4 Turbo for reasoning ($0.01 per 1k input tokens, $0.03 per 1k output tokens) and Azure Neural TTS for synthesis ($0.015 per 1k characters) costs roughly $0.025–$0.035 per conversational minute at a typical conversational pace (three to four turns per minute, 40 words per turn). If gpt-realtime's actual pricing lands in the $0.02–$0.04 range, the all-in cost is comparable, and the latency advantage would justify a modest premium at the top of that range for real-time applications.
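The hybrid-stack estimate can be rebuilt from the component rates. This is a rough sketch with several labelled assumptions: the TTS rate is read as $0.015 per 1,000 characters (the reading under which the total falls in the quoted range), words are converted at about 4/3 tokens and 6 characters each, and the caller is assumed to speak about as many words as the reply contains.

```python
# Component rates quoted in the paragraph above (USD).
WHISPER_PER_MIN = 0.006
GPT4_TURBO_IN_PER_1K = 0.01
GPT4_TURBO_OUT_PER_1K = 0.03
TTS_PER_1K_CHARS = 0.015  # assumption: read as per 1,000 characters

# Illustrative conversion factors, not measured values.
TOKENS_PER_WORD = 4 / 3
CHARS_PER_WORD = 6


def hybrid_cost_per_minute(turns_per_min: float = 3.5,
                           words_per_turn: int = 40) -> float:
    """Estimate the USD cost of one conversational minute on the hybrid stack."""
    words = turns_per_min * words_per_turn
    tokens = words * TOKENS_PER_WORD
    transcription = WHISPER_PER_MIN  # one minute of audio
    # Assumption: the caller speaks about as many words as the reply contains.
    reasoning_in = tokens / 1000 * GPT4_TURBO_IN_PER_1K
    reasoning_out = tokens / 1000 * GPT4_TURBO_OUT_PER_1K
    synthesis = words * CHARS_PER_WORD / 1000 * TTS_PER_1K_CHARS
    return transcription + reasoning_in + reasoning_out + synthesis


cost = hybrid_cost_per_minute()
print(f"${cost:.4f} per minute")
```

Under these assumptions the default of 3.5 turns per minute lands near the lower end of the quoted range, with synthesis as the largest single component; longer prompts or system context would push the reasoning share up.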

Alternatives worth evaluating include Google Chirp-2 (still in limited preview, no public pricing), ElevenLabs Conversational AI (€0.30 per minute as of May 2026, higher cost but superior prosody in listener tests), and AssemblyAI's real-time transcription paired with GPT-4o (approximately $0.018 per minute plus GPT-4o token costs). For EU-based teams prioritising data residency, the hybrid route offers more control: Whisper and GPT-4o can run through Azure EU regions with explicit data-location guarantees, whereas OpenAI's gpt-realtime endpoint does not yet publish region-selection options or processor-binding commitments required under certain public-sector frameworks.

Budget-conscious teams handling lower-stakes interactions—retail FAQs, event registration—may find open-weight alternatives such as Llama-3.2-1B paired with Coqui TTS acceptable, though latency climbs to 800–1200 ms and prosody suffers. The cost drops to near-zero marginal expense (compute only), but development and tuning overhead rises sharply. For a fuller discussion of self-hosted trade-offs, see [/benchmarks/speed](/en/benchmarks/speed) and [/benchmarks/intelligence](/en/benchmarks/intelligence), where we compare cloud-native and on-premise latency profiles.


Verdict & alternatives

Who should shortlist gpt-realtime: Teams building customer-facing voice applications where sub-500 ms turn-latency materially improves user experience—call centres, in-vehicle assistants, telehealth intake, municipal helplines. If your use case tolerates occasional hallucination (with downstream verification), prizes conversational naturalness over multi-step reasoning, and keeps exchanges under five minutes, gpt-realtime delivers the best balance of speed and acoustic quality available from a major API provider in mid-2026.

When to look elsewhere: Projects requiring long-context memory (legal depositions, therapy sessions), deep reasoning or code generation ([/usecases/code](/en/usecases/code), financial analysis), or transparent EU data residency should route to GPT-4 Turbo or Claude 3.5 Sonnet via Azure EU regions, paired with separate TTS. If budget predictability matters more than cutting-edge latency, the Whisper + GPT-4o + Azure TTS stack offers itemised per-unit pricing and established enterprise SLAs. For maximum prosody and voice-cloning fidelity—brand-voice consistency in marketing or entertainment—ElevenLabs remains the leader, albeit at triple the likely per-minute cost.

The next six months will clarify pricing, expand language coverage, and—most critically—determine whether OpenAI exposes region selection and function-calling hooks. If the model graduates from beta to general availability with transparent metering and GDPR-compliant data handling, adoption in European public services and regulated industries will accelerate. Until then, risk-averse enterprises should prototype on gpt-realtime but architect fallback paths to hybrid pipelines.

Try it yourself: Head to /live-test to run gpt-realtime side-by-side with GPT-4o, Claude 3.5, and Gemini 1.5 across multilingual prompts, measure turn-latency in your own network conditions, and hear prosody differences first-hand. Real-world testing beats specification sheets every time.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test
May 31, 2026 · 04:26 UTC · Benchmark
P50 latency
P95 latency
Errors
1 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026