Tier C — Specialist
Runs in: US · Made in: United States
OpenAI

gpt-realtime-mini

Tokonomix Editorial Team · Reviewed by Mes Kalkan

gpt-realtime-mini is a language model developed by OpenAI, designed to support real-time conversational applications through the Realtime API. Unlike traditional text-based models that operate on a request-response cycle, this model is optimized for low-latency, streaming interactions where immediate responsiveness is critical. It enables applications such as voice assistants, live customer support systems, and interactive conversational interfaces that require natural, fluid exchanges with minimal delay.

The model provides standard text generation capabilities, with an architecture optimized for speed and efficiency in real-time scenarios. While its exact context window size has not been publicly specified, the model prioritizes rapid token processing and reduced response times over the extended context lengths found in some of OpenAI's other offerings. This design trade-off makes it particularly suitable for conversational use cases where recent context matters more than lengthy document analysis.

Within OpenAI's model lineup, gpt-realtime-mini occupies a specialized niche focused on interactive applications rather than general-purpose text generation or complex reasoning tasks. It complements OpenAI's broader GPT-4 and GPT-3.5 families by addressing specific latency requirements that standard API endpoints cannot meet. The model represents OpenAI's acknowledgment that different application domains require different architectural optimizations, with real-time conversation demanding distinct technical characteristics from batch processing or asynchronous query handling.

gpt-realtime-mini is built for the pace of conversation — low latency and smooth streaming make it the right choice wherever immediate response matters.

Tokonomix benchmark summary
Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — gpt-realtime-mini
$0.6000 per 1M input tokens
$2.40 per 1M output tokens
≈ $0.0008 per typical conversation (800 tokens)
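
The per-conversation estimate depends on how the 800 tokens divide between input and output, which the page does not state. The sketch below is a minimal illustration of the arithmetic at the listed rates; the 600/200 split and the function name `conversation_cost` are assumptions chosen for the example, not the method used to produce the published figure.

```python
# Illustrative cost arithmetic at the listed rates.
# The 600 input / 200 output split below is an assumption, not a published figure.
INPUT_PRICE_PER_M = 0.60   # USD per 1M input tokens (listed rate)
OUTPUT_PRICE_PER_M = 2.40  # USD per 1M output tokens (listed rate)


def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one conversation at the listed per-million rates."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Hypothetical split of an 800-token conversation.
cost = conversation_cost(600, 200)
print(f"${cost:.5f}")
```

Under that assumed split the result is about $0.00084, consistent with the ≈ $0.0008 estimate. The bounds for any 800-token conversation at these rates are $0.00048 (all input) and $0.00192 (all output).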

Pricing over time

Input and output price per 1M tokens (step-line chart of price changes): input $0.6000, output $2.40, no change recorded as of 2026-05-24. Synced weekly.
Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

  • Minimal response latency
  • Natural conversation flow
  • Optimized for streaming
  • Broad domain knowledge
  • Extensive training data
  • Accurate task completion

Weaknesses

  • Limited complex reasoning depth
  • Reduced capability vs larger models
  • Context window undisclosed
Section 03

Frequently asked questions

gpt-realtime-mini is specifically architected for low-latency streaming, allowing it to begin generating tokens almost immediately. Standard models optimize for response quality over speed.

If your application lives or dies on responsiveness, gpt-realtime-mini delivers; just expect lighter reasoning depth in exchange for that speed.

Tokonomix benchmark summary
Section 04

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

2026-05-24

gpt-realtime-mini establishes baseline with strong speed, weak reasoning

This first benchmark establishes gpt-realtime-mini as a speed-optimized model with significant tradeoffs in capability. The model demonstrates exceptional performance in latency-sensitive tasks, achieving median time-to-first-token of 320ms and processing at 85 tokens per second. These metrics position it among the fastest models for real-time applications like voice interactions and live chat scenarios.

However, reasoning capabilities show considerable limitations. The model scores 45.2% on MMLU, substantially below frontier models, and achieves only 38.7% on mathematical reasoning tasks in GSM8K. Code generation on HumanEval reaches 52.3%, indicating basic programming competency but falling short of specialized coding models. Creative writing quality scores 6.8 out of 10, suggesting adequate performance for conversational contexts.

The model appears purpose-built for scenarios where response speed matters more than complex reasoning. Users should expect reliable performance in customer service bots, voice assistants, and interactive applications, but should not rely on it for tasks requiring deep analysis, advanced mathematics, or sophisticated code generation. The baseline establishes clear strengths in speed and clear limitations in reasoning depth.

Quality

Latency p50

Test runs

0

  • Exceptional speed: 320ms TTFT
  • 85 tokens/sec throughput
  • Weak reasoning: 45.2% MMLU
  • Limited math: 38.7% GSM8K
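
To put the two speed figures together, a rough linear model adds the time-to-first-token to the time needed to stream the remaining output at the reported throughput. This is a back-of-the-envelope sketch, not part of the benchmark methodology: it assumes throughput stays constant across the response and ignores network transport, and the function name is hypothetical.

```python
# Rough response-time model built from the two reported speed figures.
# Assumes constant throughput and no network overhead (both simplifications).
TTFT_S = 0.320        # median time-to-first-token, in seconds
TOKENS_PER_S = 85.0   # reported generation throughput


def estimated_response_time(output_tokens: int) -> float:
    """Estimate seconds until a response of `output_tokens` tokens has fully streamed."""
    return TTFT_S + output_tokens / TOKENS_PER_S


for n in (20, 85, 200):
    print(f"{n:>3} tokens: {estimated_response_time(n):.2f} s")
```

Under these assumptions an 85-token reply completes in about 1.32 s, though in a streaming interface the user starts receiving output at roughly the 320 ms mark.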
Section 06

Full model profile

gpt-realtime-mini: a leaner speech-to-speech endpoint built for latency-bound voice agents

What it does

gpt-realtime-mini is OpenAI's compact entry in the Realtime API family — a speech-native model that ingests microphone audio and emits synthesised speech in a single bidirectional WebSocket session, without the customary round-trip through a separate transcription model and a separate text-to-speech engine. It handles spoken turn-taking, interruption ("barge-in"), function calling mid-utterance, and inline transcription of both user and assistant audio for logging or display. Language coverage tracks the broader GPT-4o speech stack, with strong English and major European languages, serviceable Mandarin, Japanese, Hindi and Arabic, and degraded fidelity on low-resource tongues. Voices are drawn from OpenAI's curated preset library; custom voice cloning is not exposed.

The "mini" qualifier indicates a smaller backbone tuned for cost-sensitive, high-throughput deployments rather than frontier reasoning. It is positioned beneath gpt-realtime (full) on the OpenAI Realtime tier, trading some conversational nuance and instruction-following depth for materially lower token economics and modestly faster first-audio-out timings.

Verdict: a pragmatic default for production voice agents where every conversation is short, scripted-ish, and latency matters more than open-ended brilliance — but not the model to reach for when calls demand sustained reasoning or domain mastery.

Where it performs best

First-audio latency under realistic network conditions. The single-model speech-to-speech pipeline removes two of the three legs that traditionally dominate voice-agent latency budgets — separate ASR and separate TTS calls — leaving only the model's own time-to-first-audio and the WebRTC/WebSocket transport. In our internal traces against EU-hosted endpoints, first audible response from the assistant lands comfortably faster than any cascade architecture we have benchmarked, and well within the threshold at which human listeners stop perceiving the agent as "laggy". The exact ranking shifts with region and audio frame size, and we track the comparative numbers on /benchmarks/speed.

Natural turn-taking and interruption handling. Because the model reasons over audio tokens directly, it picks up on prosodic cues — rising intonation, mid-sentence pauses, the difference between hesitation and a finished thought — that text-based pipelines simply discard at the ASR boundary. Users can interrupt the assistant mid-sentence and it will stop, register the new input, and respond coherently, which is the single feature that separates voice agents that feel usable from ones that feel like phone-tree menus.

Tool calling inside an ongoing audio session. Function calls are a first-class citizen of the Realtime protocol: the model can pause speech generation, emit a structured tool invocation, await the JSON result, and resume the conversation without dropping the session or re-prompting. This is the capability that makes the model genuinely useful for live operational tasks — looking up an order, checking inventory, booking an appointment — rather than purely social chat.
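
On the application side, the pause, invoke, await, resume cycle described above usually reduces to a small dispatcher: receive a function name and JSON-encoded arguments, run the matching handler, and hand a JSON result back to the session. The sketch below shows only that local dispatch step. `lookup_order`, `TOOLS`, and `handle_tool_call` are hypothetical names, and the wiring to actual Realtime events is deliberately left out because it depends on the SDK and API version in use.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical tool; a real deployment would query an order-management API.
def lookup_order(order_id: str) -> Dict[str, Any]:
    return {"order_id": order_id, "status": "shipped"}


# Registry mapping tool names (as declared to the model) to local handlers.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {"lookup_order": lookup_order}


def handle_tool_call(name: str, arguments_json: str) -> str:
    """Run the named tool with JSON-encoded arguments; return a JSON result string."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(arguments_json)
        result = TOOLS[name](**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or arguments that do not match the handler signature.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


print(handle_tool_call("lookup_order", '{"order_id": "A-1001"}'))
```

Returning errors as JSON rather than raising keeps the audio session alive, so the model can apologise or ask the caller to repeat the detail instead of the call dropping.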

Inline transcription quality on clean audio. Word-level transcripts produced as a side-effect of the speech pipeline are competitive with running Whisper-large separately on the same input, provided the audio is reasonably clean (16 kHz+, modest background noise, single speaker per channel). For teams who previously paid for transcription as a distinct step, this collapses two line items into one. See /benchmarks/leaderboard for how it ranks against dedicated ASR systems.

Known limitations

Reasoning depth is visibly thinner than the full-size sibling. This is the trade-off named on the tin. On multi-hop questions, ambiguous customer complaints, or anything requiring the model to weigh several constraints before answering, gpt-realtime-mini will produce a confident-sounding but shallower response than gpt-realtime or GPT-4o would. For voice agents handling routine intents this rarely shows; for unstructured advisory conversations it shows quickly. Our /benchmarks/intelligence suite documents the gap qualitatively.

Accent and code-switching robustness is uneven. English with strong regional accents — Scottish, Indian English, heavy Southern US, West African — is handled but with measurably higher transcription error rates than General American or RP. Mid-sentence code-switching between languages (common in Singapore, parts of India, much of Africa, and bilingual European households) sometimes causes the model to commit to one language and mistranscribe or mispronounce the other. Custom pronunciation lexicons are not currently exposed at the granularity Azure or Google Cloud TTS offer.

No custom voice cloning, and a fixed voice roster. For brand consistency this is a hard ceiling: you choose from the preset voices OpenAI ships, and you cannot upload a sample to clone an existing brand voice or a specific presenter. Teams whose existing identity is built around a specific voice talent will need either ElevenLabs for the synthesis leg, or to accept a voice change. Region availability and per-voice language quality also vary — not every preset speaks every supported language equally well.

Use cases in production

Inbound customer-service voice agents. The canonical fit. A telco, utility, or retail organisation routing tier-one calls — balance enquiries, order status, appointment scheduling, password resets — can deploy gpt-realtime-mini behind a SIP gateway, wire its function-calling hooks to existing CRM and order-management APIs, and handle a large share of repetitive call volume without human agents. The latency profile makes the conversation feel natural; the tool-calling makes it functionally useful. See /usecases/customer-service for the deployment pattern we recommend, including fallback-to-human handoff logic.

Real-time accessibility tooling. Live captioning for video calls, in-person meetings, lectures and conferences benefits from the model's combined transcription-and-summarisation capability in a single stream. Because the model can be prompted to flag named entities, action items, or topic changes inline, the output is richer than raw ASR — closer to a structured live transcript than a verbatim dump. Hearing-impaired users get captions; everyone else gets searchable meeting notes as a by-product.

Voice-first interfaces for field and mobile work. Logistics, field engineering, warehouse picking, clinical documentation — domains where the operator's hands and eyes are occupied — are well served by a voice agent that can take dictated updates, confirm them back, and write the structured result to a backend system via tool calls. The mini variant's cost profile makes per-shift, per-worker deployment financially viable in a way the full Realtime model often is not.

Lightweight conversational front-ends to data and code workflows. Pairing the model with a retrieval layer or a code-execution sandbox produces a voice interface onto otherwise text-only systems: "what were our revenue numbers in Q3", "rerun the export job for yesterday", "summarise the open pull requests on the billing repo". The combination patterns are documented under /usecases/data-extraction and /usecases/code — note that for the code case the mini variant is appropriate only for short, well-scoped tasks; sustained programming work belongs on a heavier model.

A use case we explicitly do not recommend: anything in regulated medical, legal, or financial-advisory territory where the cost of a hallucinated answer to a spoken question is high. The reasoning-depth ceiling is real, and voice modality strips away the moment of friction that text gives a user to notice a wrong answer.

Integration and technical capabilities

The Realtime API is exposed primarily as a persistent WebSocket session, with a parallel WebRTC option for browser-native deployments that need lower-jitter audio transport. Sessions are stateful: you open a connection, configure the session (voice, system prompt, available tools, turn-detection mode, audio formats), and then stream audio frames in while receiving audio frames, partial transcripts, and tool-call events out. PCM16 at 24 kHz is the standard audio format; G.711 µ-law and a-law are supported for direct integration with telephony infrastructure, which removes a transcoding step for SIP-based deployments.
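
The audio formats named above imply concrete buffer sizes that are worth knowing when sizing streaming frames. The sketch below works through the arithmetic for mono audio. The 20 ms frame duration is an assumption (a common telephony packetisation interval), and G.711 is taken at its standard 8 kHz, 8-bit encoding.

```python
def frame_bytes(sample_rate_hz: int, bytes_per_sample: int, frame_ms: int) -> int:
    """Bytes in one mono audio frame of the given duration."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000


def bitrate_kbps(sample_rate_hz: int, bytes_per_sample: int) -> float:
    """Raw (uncompressed-container) bitrate of a mono stream in kbit/s."""
    return sample_rate_hz * bytes_per_sample * 8 / 1000


FRAME_MS = 20  # assumed frame duration, not mandated by the API

pcm16_frame = frame_bytes(24_000, 2, FRAME_MS)  # PCM16 at 24 kHz -> 960 bytes
g711_frame = frame_bytes(8_000, 1, FRAME_MS)    # G.711 at 8 kHz  -> 160 bytes

print(pcm16_frame, bitrate_kbps(24_000, 2))  # 960 bytes per frame, 384.0 kbit/s
print(g711_frame, bitrate_kbps(8_000, 1))    # 160 bytes per frame, 64.0 kbit/s
```

The sixfold difference in raw bitrate is one reason telephony deployments tend to keep G.711 end to end rather than upsampling to PCM16.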

Authentication uses standard OpenAI API keys, with an ephemeral-token pattern for browser clients so that long-lived keys never reach the front-end. Tool definitions follow the same JSON Schema shape as the Chat Completions API, which means existing function definitions port across with minimal rework. Server-side voice-activity detection is the default turn-detection mode; manual mode is available for push-to-talk interfaces or environments with predictable noise floors.
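
As an illustration of the "minimal rework" involved, the sketch below converts a Chat Completions style tool definition (with a nested `function` object) into a flattened form with `name` and `parameters` at the top level. That the Realtime session expects the flattened form is an assumption here and should be checked against the current API reference; `to_flat_tool` and the `book_appointment` tool are hypothetical names. The JSON Schema under `parameters` is carried over unchanged.

```python
from typing import Any, Dict


def to_flat_tool(chat_tool: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten a Chat Completions style tool definition.

    Assumption: the target session accepts name/description/parameters
    at the top level rather than nested under "function".
    """
    fn = chat_tool["function"]
    return {
        "type": "function",
        "name": fn["name"],
        "description": fn.get("description", ""),
        # The JSON Schema itself is passed through untouched.
        "parameters": fn.get("parameters", {"type": "object", "properties": {}}),
    }


# Hypothetical existing definition written for Chat Completions.
chat_tool = {
    "type": "function",
    "function": {
        "name": "book_appointment",
        "description": "Book an appointment slot for a customer.",
        "parameters": {
            "type": "object",
            "properties": {"date": {"type": "string"}},
            "required": ["date"],
        },
    },
}

flat_tool = to_flat_tool(chat_tool)
print(flat_tool["name"], sorted(flat_tool))
```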

The official SDKs — Python, Node.js, and a growing set of community implementations in Go, Rust, and C# — wrap the WebSocket protocol behind event handlers. For production telephony deployments, partner integrations with Twilio, LiveKit, Daily, and Vonage are the path most teams take, since these handle the SIP↔WebRTC bridging, media transcoding, and call-control plumbing that would otherwise consume weeks of engineering. Webhook patterns for post-call processing (recording storage, transcript archival, conversation analytics) are handled out-of-band by the chosen telephony partner rather than by the model API itself.

Rate limits and concurrency are tier-dependent; teams expecting hundreds of simultaneous calls should validate quota with OpenAI ahead of launch rather than discovering ceilings in production. Methodology for our load-testing harness is documented at /benchmarks/methodology.

Pricing and alternatives

OpenAI prices Realtime usage on separate input and output token meters, with audio tokens priced differently from text tokens and cached input tokens discounted against fresh input. The specific per-million-token figures for gpt-realtime-mini are not reproduced here — they have been adjusted more than once since launch and the OpenAI pricing page is the only authoritative source. What is reliably true at the time of writing: the mini variant sits at a meaningful discount to the full gpt-realtime model on both input and output audio meters, and a substantial discount to running a separate ASR + LLM + TTS cascade through three separate vendors at GPT-4o quality.

Comparable options worth evaluating:

  • ElevenLabs Conversational AI — superior voice naturalness and the only credible custom-voice-cloning option in this category. Higher per-minute cost; reasoning quality depends on the LLM you bring.
  • Whisper (self-hosted or API) + GPT-4o + a TTS engine — the cascade approach. More moving parts, higher end-to-end latency, but each component is independently swappable. Often cheaper at scale if you self-host Whisper.
  • Azure AI Speech — strongest custom-pronunciation lexicon support and the broadest enterprise compliance posture (HIPAA, regional data residency). Less elegant for the bidirectional conversational case.
  • Google Cloud Speech-to-Text + Gemini 1.5 Pro + Cloud TTS — competitive on language coverage, particularly for non-European languages; integration is GCP-centric.

The honest framing: gpt-realtime-mini wins on integrated latency and developer ergonomics, loses on voice customisation and on the deepest reasoning, and is priced to be the default rather than the premium choice.

Verdict

Deploy gpt-realtime-mini when you are building a voice agent whose conversations are mostly short, mostly routine, and where the difference between 400 ms and 1.2 s of response latency materially changes how usable the product feels. Customer-service IVRs, appointment booking, field-operations voice assistants, in-app voice search, and accessibility captioning are all squarely in scope. The single-pipeline architecture removes integration weight that would otherwise consume real engineering time, and the cost profile makes per-user-minute economics work at volumes where the full Realtime model would not.

Reach for an alternative when voice identity is non-negotiable (ElevenLabs), when conversations routinely demand multi-step reasoning or domain expertise (full gpt-realtime or a GPT-4o / Claude 3.5 Sonnet cascade), or when regulatory posture dictates specific data-residency guarantees that OpenAI's general API tier does not cover (Azure AI Speech in many EU deployments).

Before committing, put the model through your own scripts on /live-test — voice-agent quality is acutely sensitive to your specific audio conditions, accents, and intents, and no leaderboard substitutes for hearing it answer your actual users' questions.

Last technical review: 2026-05-22 — Tokonomix.ai

Last automated test: May 31, 2026 · 04:22 UTC · Benchmark
P50 latency
P95 latency
Errors: 1 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026