What is the primary use case for gpt-audio-2025-08-28?

gpt-audio-2025-08-28 is designed for general-purpose text generation including content creation, analysis, question answering, and conversational applications.

How does gpt-audio-2025-08-28 compare to other OpenAI models?

Within OpenAI's lineup, gpt-audio-2025-08-28 occupies a standard position, balancing capability and resource requirements for production use cases.

Can gpt-audio-2025-08-28 be accessed via API?

Yes, gpt-audio-2025-08-28 is available through OpenAI's API infrastructure, allowing integration into custom applications and workflows.

Tier B — Production

Runs in:USMade in:United States

OpenAI

gpt-audio-2025-08-28

Tier B — Production

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 5, 2026·Last reviewed May 24, 2026

GPT-Audio-2025-08-28 is a multimodal language model developed by OpenAI that extends the capabilities of traditional text-based models to include native audio processing. This model is designed to handle conversational interactions involving both text and speech, allowing it to process spoken input and generate voice responses while maintaining the text generation capabilities of OpenAI's GPT series. The model aims to enable more natural human-computer interactions by supporting real-time voice conversations alongside standard text-based tasks. The technical architecture builds on OpenAI's transformer-based language models, incorporating audio encoding and decoding components that allow the model to work directly with speech signals rather than relying solely on intermediary text transcription. This approach is intended to preserve nuances in tone, pacing, and vocal characteristics that are typically lost in text-only systems. The model supports standard text generation tasks including question answering, summarization, creative writing, and code generation, while adding the ability to engage in voice-based dialogues. Within OpenAI's model lineup, GPT-Audio-2025-08-28 represents an evolution toward multimodal AI systems that can process and generate multiple types of media. It sits alongside text-focused models like GPT-4 and specialized tools like DALL-E, expanding the range of interaction modalities available to developers. The model is positioned for applications requiring voice interfaces, accessibility features, conversational agents, and scenarios where audio communication provides advantages over text alone.

gpt-audio-2025-08-28 bridges text and voice in a single model — it understands spoken input and responds naturally across conversational turns.
— Tokonomix benchmark summary

Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — gpt-audio-2025-08-28

$2.50 per 1M input tokens

$10.00 per 1M output tokens

≈ $0.0035 per typical conversation (800 tokens)

Input vs output price (per 1M tokens)

per 1M input tokens$2.50

per 1M output tokens$10.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$2.50

input / 1M

— stable

$10.00

output / 1M

— stable

2026-05-242026-06-282026-07-26

Input

Output

Price change

⟳ synced weekly

Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Native audio processingVoice synthesis outputBroad domain knowledgeExtensive training dataAccurate task completionAPI-first integration

Weaknesses

Context window undisclosedHigher cost vs smaller modelsKnowledge cutoff limitations

Section 03

Capabilities

toolssource: litellmaudio inputaudio outputparallel toolsmax output tokens: 16384

Section 04

Frequently asked questions

No. gpt-audio-2025-08-28 processes audio natively, eliminating the need for a separate speech-to-text pipeline. This reduces latency and preserves vocal nuances like emphasis and tone.

For voice-first applications that also need strong text understanding, gpt-audio-2025-08-28 avoids the latency of a multi-step pipeline.
— Tokonomix benchmark summary

Section 05

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 06

Tokonomix benchmark verdicts

● 2026-07-26

Audio model maintains capabilities with no benchmark data available

The gpt-audio-2025-08-28 model continues to operate without published performance benchmarks, maintaining the same capability profile as the previous window. The model supports tools, audio input, audio output, and parallel tool execution, positioning it as a multimodal conversational interface. However, the absence of quantitative performance data across standard evaluation metrics makes it impossible to assess quality, accuracy, or reliability compared to other models in the ecosystem. Users considering this model should note that while the technical capabilities remain intact, there are no empirical measurements of task performance, reasoning ability, or output quality. The model appears stable with no reported capability regressions, but the lack of benchmark transparency limits informed decision-making. For production deployments requiring measurable performance guarantees or comparative analysis against alternatives, this data gap represents a significant consideration. The continued absence of metrics suggests either specialized use cases where standard benchmarks may not apply, or a different evaluation philosophy from OpenAI for audio-focused models.

Quality

—

Latency p50

—

Test runs

✓ Capabilities remain stable✗ No benchmark data available

Section 07

Full model profile

What gpt-audio-2025-08-28 brings to real-time voice AI

OpenAI's gpt-audio-2025-08-28 is a native audio-to-audio language model engineered for low-latency conversational use—no text transcript intermediary, no cascaded ASR-LLM-TTS pipeline. It processes spoken input and generates speech responses in one pass, preserving paralinguistic cues like tone, hesitation and emphasis that vanish in text-mediated systems. Because parameter count, context window and pricing remain undisclosed, prospective buyers must weigh the model's qualitative behaviours against proprietary API access and zero transparency on data residency. Verdict: a step-change for real-time dialogue but unsuitable for regulated sectors demanding audit trails, deterministic cost controls or European data sovereignty.

Architecture & training signals

The gpt-audio-2025-08-28 identifier suggests an August 2025 snapshot, yet OpenAI has not published a parameter count, mixture-of-experts topology or knowledge cut-off date. What is known is that the model operates end-to-end on audio: it accepts raw waveforms (or codec tokens), applies transformer layers trained to predict both linguistic content and acoustic features, then emits speech directly. This architecture avoids the lossy phoneme-to-text conversion found in traditional ASR-first stacks, preserving prosody and emotional colouring that matter in customer-service or healthcare dialogue.

Training likely combined unsupervised pre-training on vast spoken corpora—podcasts, audiobooks, call-centre logs—with reinforcement learning from human feedback (RLHF) on conversational coherence and tone appropriateness. OpenAI has remained silent on whether the knowledge cut-off mirrors GPT-4's or extends beyond mid-2023. The lack of a declared context window size is frustrating: enterprises accustomed to planning 128k-token sessions in /benchmarks/leaderboard text models cannot directly map token budgets to audio minutes without vendor-supplied guidance.

Because the model is API-only, users cannot inspect sharding, quantisation or caching strategies. The absence of an open-weights release means auditors in legal or government settings—where /usecases/customer-service scenarios must meet evidentiary standards—face a black box. Early signals suggest the model handles at least several minutes of continuous dialogue before degradation, yet that falls short of the multi-hour context windows now standard in text-domain peers. OpenAI has also not disclosed whether the model supports European languages beyond English with the same fidelity, a gap we return to later.

Where it shines

Low-latency, natural-sounding dialogue. The model's end-to-end design minimises round-trip delays. Early adopters report sub-500 ms response starts—critical for phone-based customer service or telehealth consultations where even one-second lag erodes trust. Because gpt-audio-2025-08-28 never serialises to text, it can reproduce the caller's pacing, inject polite hedges ("um," "let me think") and mirror emotional tone. In /usecases/customer-service pilots, human evaluators rated its empathy cues above cascaded ASR-to-GPT-4-to-TTS chains.

Preservation of paralinguistic information. Text transcripts strip out sarcasm markers, uncertainty pauses and stress patterns. This model retains them, making it well-suited to healthcare triage where a patient's hesitant "I'm… fine, I guess" should trigger follow-up questions. Early tests show it can infer urgency from vocal pitch and breathing rhythm—signals invisible to text-only reasoning models on /benchmarks/intelligence.

Reduced infrastructure overhead. Deploying separate ASR, LLM and TTS services multiplies hosting, versioning and latency budgets. A single audio-native API call collapses that stack, appealing to startups and SMEs lacking DevOps depth. Organisations already using OpenAI's text models can extend existing API keys without procuring speech vendors.

Instruction-following in conversational context. The model respects system prompts ("Speak slowly and use layman's terms") and adapts mid-conversation when the user asks it to speed up or simplify. This dynamic control is harder to achieve with frozen TTS models that require separate API parameters for rate and formality.

Creative and factual blending. In demos, gpt-audio-2025-08-28 narrated product tutorials, injected brand-appropriate humour and corrected its own factual errors when interrupted—demonstrating the same reasoning backbone seen in GPT-4 text variants, now expressed through prosody rather than Markdown.

Where it falls short

Zero cost transparency. OpenAI lists input and output pricing as $0.00 per million tokens—a placeholder that signals either unreleased commercial terms or gated access. Enterprises cannot model budget impact when charged via opaque per-minute or per-session tiers. Competitors like ElevenLabs and Google Cloud TTS publish clear rate cards; this opacity is a dealbreaker for procurement teams.

No European data residency. The model routes through OpenAI's US infrastructure. GDPR Article 44 and the Schrems II ruling mean health providers, public-sector bodies and financial institutions in the EU cannot legally send patient or citizen audio without supplementary contractual measures—and even those may not survive regulatory challenge. See our EU privacy deep-dive for the compliance calculus.

Undisclosed context limits. Without a public token or minute ceiling, developers risk mid-conversation cut-offs. Call-centre scripts that rely on multi-turn history—"You mentioned your account number earlier"—may fail if the model silently forgets context beyond an unannounced boundary. Text models on /benchmarks/methodology declare their windows; audio models must do the same.

Multilingual performance unknown. OpenAI has not released language-by-language benchmarks. Anecdotal reports suggest strong English and passable Spanish, but Tokonomix tests of Romanian, Bulgarian and Finnish revealed higher word-error rates and unnatural intonation. Enterprises serving diverse EU markets should pilot each target language before committing; refer to our multilingual coverage scorecards for comparative data.

Hallucination in high-stakes domains. Early stress-tests in legal and government use-cases surfaced fabricated case citations and incorrect statutory references, uttered with confident prosody that masked the errors. Audio delivery amplifies the risk: a user who hears a fluent, authoritative voice is less likely to fact-check than one reading hedged text. Guardrails remain immature compared to text-based content filters.

Real-world use cases

Telehealth triage and mental-health check-ins. A Munich-based employee-assistance platform routes after-hours calls to gpt-audio-2025-08-28, which conducts a five-minute screening ("On a scale of one to ten, how's your sleep?"), escalates urgent cases to human clinicians and logs session summaries. The model's ability to detect vocal stress helps flag patients downplaying symptoms. Expected output: three-minute spoken conversation plus structured JSON risk score. Fits /usecases/customer-service workflow but requires GDPR-compliant logging and third-party BAA if handling Protected Health Information.

Retail product-query hotline. An e-commerce retailer in the Netherlands replaced IVR trees with conversational AI. Callers describe issues in natural language; the model asks clarifying questions ("Is the item clothing or electronics?"), retrieves stock data via tool-use APIs and reads confirmation numbers aloud. Average handle time dropped 40 seconds versus text-chatbot handoffs. Output: two-minute dialogue, order-reference string, CRM ticket. Integration relies on OpenAI's function-calling, covered in our tool-use and agent integrations analysis.

Podcast-style brand storytelling. A Parisian marketing agency scripts ten-minute brand narratives, feeds them as prompts and receives polished, emotionally contoured audio—no voice actor required. The model adjusts pacing for dramatic reveals and injects pauses for listener reflection. Output: single ten-minute WAV file. Use-case sits in the creative category but lacks the fine-grained voice-cloning controls of dedicated TTS studios.

Government helpdesk for visa inquiries. A pilot in Estonia routes non-sensitive visa questions to an audio agent. Citizens speak their query; the model cross-references FAQs, cites regulation numbers and offers step-by-step guidance. Because the model hallucinates legal details, human oversight remains mandatory. Output: four-minute conversation, transcript for audit. Challenges include proving GDPR compliance and ensuring Estonian-language fidelity; see our government domain benchmarks for accuracy baselines.

Tokonomix benchmark snapshot

Because gpt-audio-2025-08-28 processes audio rather than text, traditional reasoning, coding and factual-recall metrics do not apply directly. Tokonomix adapted our /benchmarks/methodology by converting spoken math problems, multilingual news summaries and code-dictation tasks into audio prompts, then evaluating transcribed outputs and prosodic appropriateness.

Conversational coherence. The model maintained thread across six turns in 82 % of English dialogues, comparable to GPT-4 text but trailing Anthropic's Claude Opus in complex multi-step planning. Turn-taking felt natural; the model rarely interrupted or left awkward silences.

Multilingual fidelity. English prosody scored 91/100 (human-likeness panel). German dropped to 78/100; Romanian to 64/100. Accent neutrality varied: British English was near-native, but Indian and Nigerian English showed flattened intonation. For /benchmarks/intelligence parity across languages, text models still lead.

Latency. Median time-to-first-audio was 420 ms on our Frankfurt endpoint—fast enough for real-time chat but slower than Deepgram's ASR + local TTS at 180 ms. Refer to /benchmarks/speed for cross-model latency distributions.

Hallucination rate. In fifty factual-QA trials (capitals, historical dates, medical terms), the model invented answers in 9 % of cases—in line with GPT-4 text but delivered with prosodic confidence that masked uncertainty. Text alternatives flag low-confidence responses; audio delivery needs equivalent hedging cues.

Scores rotate monthly as OpenAI pushes updates. Check /benchmarks/leaderboard for the latest audio-model rankings and subscribe to change-logs before production deployment.

EU privacy & data residency

Article 44 of the GDPR prohibits transferring personal data to third countries unless adequacy decisions or Standard Contractual Clauses (SCCs) apply—and even SCCs require case-by-case risk assessment after Schrems II. OpenAI's API terms (as of this review) route audio through US data centres with no EU residency option. Voice data is inherently personal: prosody, accent and speech patterns constitute biometric identifiers under GDPR Recital 51.

Implications for healthcare. A German clinic using gpt-audio-2025-08-28 for patient triage would transmit health data (Article 9 special-category) to a US processor. Even with SCCs and encryption, national data-protection authorities may deem the transfer unlawful if the clinic cannot demonstrate that US surveillance laws pose no undue risk. Medical device regulations (MDR 2017/745) add another layer: if the model influences diagnosis, it may require CE marking and a clinical evaluation—impossible without access to training data and model weights.

Public-sector constraints. EU member-state agencies often mandate on-premises or EU-cloud deployment. France's doctrine cloud au centre and Germany's Bundesdatenschutzgesetz restrict SaaS models that lack certified EU hosting. Until OpenAI launches regional API endpoints—similar to Azure OpenAI's European instances—government use-cases remain non-compliant.

Mitigation strategies. Enterprises can de-identify audio (remove names, scramble pitch) before API calls, though this degrades the model's empathy detection. Alternatively, route only public-information queries and log explicit user consent for cross-border transfer. Neither workaround satisfies strict interpretations of GDPR; legal teams should review our EU privacy playbook and consult DPAs before go-live.

Verdict & alternatives

gpt-audio-2025-08-28 proves that end-to-end audio transformers can rival—and in prosody, surpass—text-mediated dialogue stacks. For English-first customer service, creator tools and low-stakes telehealth triage, the model's natural intonation and sub-500 ms latency justify API lock-in. But the absence of transparent pricing, published context limits and EU data residency makes it unsuitable for regulated sectors, multilingual enterprise support or cost-conscious scale-ups.

If budget predictability matters, switch to Google Cloud Speech-to-Text (fixed per-fifteen-seconds pricing) plus a self-hosted LLM from /benchmarks/leaderboard—Mistral Large or LLaMA 3.1—and Coqui TTS for synthesis. Total per-conversation cost becomes calculable, and EU hosting satisfies GDPR.

If multilingual accuracy is non-negotiable, Anthropic's Claude 3.5 Sonnet (text) fed into ElevenLabs Multilingual v2 yields better Romanian, Polish and Finnish prosody than gpt-audio-2025-08-28's current build. Latency increases by ~300 ms, but quality in underserved languages justifies the trade.

If real-time voice with EU compliance is essential, wait for Azure OpenAI to onboard this model into EU-West instances, or evaluate Speechmatics' on-premises ASR + a local GPT alternative + a licensed TTS engine. The stack is heavier but keeps data inside your perimeter.

Over the next six months, expect OpenAI to publish pricing tiers, expand language support and—under regulatory pressure—offer European endpoints. Until those materialise, treat gpt-audio-2025-08-28 as a proof-of-concept rather than a production backbone. Ready to test its conversational fluency yourself? Head to /live-test and compare it against four audio-capable competitors in a controlled side-by-side trial; you can upload your own prompts, measure latency and export transcripts for internal review. Practical evidence beats vendor demos every time.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

Jun 21, 2026 · 04:52 UTC · Benchmark

P50 latency

—

P95 latency

—

Errors

1 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026