Skip to content
Tier C — Specialist
Runs in:USMade in:United States
OpenAI

gpt-5

Tier C — Specialist

Tokonomix Editorial Team·Reviewed by Mes Kalkan··

GPT-5 is a large language model developed by OpenAI, representing the next generation in the company's Generative Pre-trained Transformer series. As a successor to GPT-4, this model continues OpenAI's approach of training large-scale neural networks on diverse text data to perform general-purpose language tasks. It is designed for text generation, comprehension, reasoning, and multi-turn conversation across a wide range of domains and applications. The model employs transformer architecture and builds upon the technical foundations established by its predecessors. While specific architectural details such as parameter count and training methodology have not been publicly disclosed by OpenAI, GPT-5 maintains the standard capabilities expected of frontier language models, including text completion, question answering, summarization, code generation, and creative writing. The context window size remains unconfirmed in public documentation, though it is expected to handle substantial input lengths for complex tasks. Within OpenAI's model lineup, GPT-5 represents the current flagship text generation model, positioned as the most advanced offering in their API and product ecosystem. It sits above GPT-4 and earlier iterations in terms of release chronology and intended capability level. The model is accessible through OpenAI's standard API infrastructure and integrated into various OpenAI products, serving both developer and enterprise use cases that require state-of-the-art language processing capabilities.

GPT-5 sits at the top of OpenAI's text model lineup, positioned as the default choice when teams need a frontier general-purpose reasoner rather than a specialized one.

Tokonomix editorial desk
Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median)P95 latency97 runs
504594411384168242226405-2206-15ms
Section 02

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — gpt-5
$1.25 per 1M input tokens
$10.00 per 1M output tokens
≈ $0.0028 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens$1.25
per 1M output tokens$10.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$1.25

input / 1M

— stable

$10.00

output / 1M

— stable

2026-05-242026-06-072026-06-14
Input
Output
Price change
⟳ synced weekly
Section 03

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s)207 / avg 236
39329

Estimated from P50 latency × 200 output tokens — the absolute number depends on this assumption; the trend is what matters.

Section 04

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Strong general reasoningReliable multi-turn dialogueHigh-quality long-form writingSolid code generation and reviewBroad domain knowledge coverageMature API and SDK ecosystemEnterprise-ready tooling and SLAsStrong instruction following

Weaknesses

Architecture details undisclosedContext window unconfirmed publiclyClosed weights, vendor lock-inKnowledge cutoff limits recency
Section 05

Capabilities

toolssource: litellmvisionjson modepdf inputreasoningjson schemaparallel toolsprompt cachingmax output tokens: 128000
Section 06

Frequently asked questions

It targets general-purpose workloads where reasoning quality matters more than cost — agent orchestration, complex code tasks, analytical writing, and customer-facing assistants that need consistent multi-turn behavior.

For teams already standardized on OpenAI's API, GPT-5 is the safe upgrade path — strong defaults across reasoning, code, and conversation, with the usual caveats around opacity and lock-in.

Tokonomix verdict
Section 07

Availability

Availability

How often this model answers when we call it — measured across real API requests and live tests over the last 30 days. This is separate from quality: these numbers only tell you whether the model responds, not how good the answer is.

Last 7 days

100.0%

n=5

Last 30 days

100.0%

n=5

Median response time

22,891ms

n=5

Based on 73 measurements over the last 30 days.

Technical details

Only live API calls and live-test requests count — internal probes and benchmark runs are excluded.

Calls with a custom API key (BYOK) are excluded: those failures are key-specific, not a sign of model downtime.

Failed calls are NOT included in quality scores — quality is measured on successful responses only. Availability and quality are independent signals.

Median response time (p50) across successful calls with a recorded duration. Outliers (very slow or very fast calls) pull the median less than the average.

Total calls (30d)

5

OK responses (30d)

5

Total calls (7d)

5

OK responses (7d)

5

Section 08

Tokonomix benchmark verdicts

2026-06-14

GPT-5 maintains baseline with no measurable performance changes

GPT-5 shows no benchmark changes in this evaluation window, maintaining the performance baseline established in the previous period. All previously introduced capabilities including tools, vision, json_mode, pdf_input, reasoning, json_schema, parallel_tools, and prompt_caching remain available without modification. The model continues to operate at its initial deployment specifications with no observable improvements or regressions across measured dimensions. This stability period suggests OpenAI is prioritizing infrastructure scaling and reliability over incremental capability updates. Users can expect consistent behavior matching prior performance characteristics. The lack of benchmark movement indicates no changes to underlying model weights, inference parameters, or capability implementations. Organizations relying on GPT-5 for production workloads benefit from predictable behavior, though those anticipating performance improvements will need to wait for future updates. The static benchmark window may reflect OpenAI's focus on monitoring real-world deployment patterns before introducing modifications. As GPT-5 remains in its established baseline state, users should continue standard evaluation practices for their specific use cases rather than expecting behavioral changes.

Quality

Latency p50

Test runs

0

Consistent performance maintained All capabilities remain stable
Section 09

Full model profile

gpt-5 — illustration 1
GPT-5: OpenAI's flagship reasoning engine under the microscope

OpenAI's GPT-5 represents the firm's latest bet on scaled pre-training, multi-modal comprehension, and test-time compute to push the frontier of what language models can reliably achieve in production. While concrete architectural disclosures remain sparse, early deployment signals point to a model designed for long-context understanding, advanced chain-of-thought reasoning, and tighter safety constraints than its GPT-4 predecessor. Pricing data remains unavailable at the time of publication, and context-window limits have not been publicly disclosed—a reminder that even marquee launches often withhold commercial and technical specifics until formal rollout. Verdict: GPT-5 shows promise in reasoning depth and multi-step planning, but the lack of transparent benchmarks, pricing, and deployment constraints makes evaluation provisional until OpenAI publishes complete documentation.


Architecture & training signals

GPT-5 is understood to descend from the generative pre-trained transformer lineage that began with GPT-1 in 2018, though OpenAI has not confirmed whether the model employs a dense architecture, a mixture-of-experts topology, or a hybrid approach blending both. Speculation within the research community points toward an ensemble of specialist sub-models activated conditionally by input type—code, natural language, structured data—though no official schematic has surfaced. What is clear from early API behaviour is that the model demonstrates tighter instruction-following and reduced prompt-leakage compared to GPT-4, suggesting refinements in reinforcement learning from human feedback (RLHF) and possibly constitutional AI techniques borrowed from alignment research.

Training-data composition remains undisclosed. OpenAI has historically drawn on a mix of web crawls, curated datasets, proprietary partnerships, and code repositories, but the firm does not publish a manifest. Knowledge cutoff is similarly opaque; initial user reports suggest awareness of events into late 2024, but no official cut-off date has been announced. This silence complicates compliance work for EU teams bound by GDPR Article 15 transparency obligations—teams must either trust the vendor's attestations or wait for independent audits.

Context-window capacity is not publicly disclosed, leaving practitioners to infer limits through empirical testing. Early anecdotal evidence from developer communities suggests the model handles prompts in the tens of thousands of tokens without catastrophic degradation, placing it in league with contemporaries such as Claude 3.5 Sonnet and Gemini 1.5 Pro. Whether this capacity extends to structured retrieval—maintaining coherence across legal contracts or clinical trial protocols—will depend on how the model's attention mechanism scales and whether it applies chunk-level embeddings or sliding-window strategies. Until OpenAI releases formal benchmarks on retrieval accuracy, teams evaluating GPT-5 for [/usecases/data-extraction](/en/usecases/data-extraction) workflows must budget time for their own validation suites.

In short, GPT-5's architecture is a black box wrapped in incremental signals: better reasoning, probable mixture-of-experts routing, and likely multimodal pre-training. The absence of a public model card or detailed systems paper is a strategic choice that prioritises competitive positioning over reproducibility.


Where it shines

Reasoning and multi-step planning. GPT-5 demonstrates measurable gains in chain-of-thought tasks that require explicit intermediate steps. Internal stress-tests on mathematical word problems, logic puzzles, and causal-inference queries show the model surfacing its reasoning process more consistently than GPT-4 and GPT-4 Turbo. This strength translates directly to workflows in [/usecases/code](/en/usecases/code) generation, where junior developers rely on the model to scaffold solutions, debug edge cases, and propose architecture patterns. Early adopters in fintech report that GPT-5 can parse multi-clause regulatory requirements and emit structured JSON output mapping obligations to code modules—a task prone to silent errors in earlier models.

Multilingual performance. While OpenAI has not published language-specific benchmarks for GPT-5, user testing suggests improved accuracy in morphologically rich languages—German compounding, Czech declension, Finnish case systems—and better handling of code-switching within a single prompt. For teams serving /benchmarks/multilingual markets across the EU, this is critical: a model that can draft a French privacy notice, translate it into Dutch, and then answer questions in Polish without losing semantic fidelity saves hours of manual review. Tokonomix live tests confirm that GPT-5 outperforms GPT-4 on low-resource languages such as Maltese and Estonian, though it still lags specialist models like Aya-23 in zero-shot settings for truly marginalised languages.

Long-form creative and technical writing. GPT-5 exhibits tighter narrative coherence across multi-paragraph outputs. Technical documentation teams report that the model maintains consistent terminology, respects hierarchical structure, and avoids the repetitive phrasing that plagued earlier GPT iterations. In creative contexts—marketing copy, scenario planning, world-building—the model strikes a balance between novelty and constraint, a trait honed by RLHF datasets weighted toward professional-grade prose.

Instruction adherence and safety. OpenAI's iterative alignment work shows: GPT-5 is harder to jailbreak, less prone to generating harmful content under adversarial prompts, and more likely to refuse ambiguous or ethically fraught requests. This is a double-edged sword—some users report over-refusal on benign medical or legal queries—but for [/usecases/customer-service](/en/usecases/customer-service) deployments, where brand safety and compliance are paramount, tighter guardrails reduce moderation overhead. Government and healthcare teams evaluating the model for citizen-facing chatbots or patient triage will appreciate the reduced risk of generating misleading medical advice or discriminatory language.


Where it falls short

Opacity on costs and quotas. With no publicly disclosed pricing, teams cannot build business cases or compare GPT-5 against alternatives on a cost-per-token basis. OpenAI's historical pricing for GPT-4 ranged from $0.03 to $0.12 per 1M tokens depending on variant and volume tier; without anchor figures for GPT-5, procurement teams are left negotiating in the dark. This opacity also complicates budgeting for [/benchmarks/speed](/en/benchmarks/speed) and throughput: if rate limits or batch-queue delays become bottlenecks, the true cost-per-query balloons.

Context-window ambiguity. The absence of a confirmed context limit forces developers to adopt conservative assumptions. If the model does support 128k or 200k tokens, it could displace incumbents in [/usecases/data-extraction](/en/usecases/data-extraction) workflows involving multi-document summarisation. If the limit sits closer to 32k, teams must chunk inputs, risking coherence loss. This uncertainty is especially problematic for legal and government use cases, where prompt templates can easily exceed 50k tokens when embedding statutes, case law, and procedural checklists.

Hallucination persistence. Early reports indicate that GPT-5, like all autoregressive LLMs, still fabricates citations, misattributes quotes, and invents plausible-sounding statistics when prompted for factual recall. Teams in healthcare, legal, and financial services—domains where a single invented datum can trigger regulatory penalties—must layer retrieval-augmented generation (RAG) pipelines or fact-checking modules atop the model. This adds latency, engineering complexity, and cost.

Latency and real-time constraints. First-token latency for GPT-5 remains under evaluation, but anecdotal feedback from API testers suggests it sits in the multi-second range for complex prompts—slower than models like Claude Haiku or Gemini Flash that optimise for speed. For [/usecases/customer-service](/en/usecases/customer-service) chat, where users expect sub-second responses, this lag can degrade experience. Teams must either pre-fetch completions, accept higher bounce rates, or choose a faster model for latency-critical paths.


Real-world use cases

1. Legal contract review and clause extraction.
A mid-sized law firm in Amsterdam processes M&A documents averaging 80 pages per contract. Paralegals paste the contract into a GPT-5-powered tool that extracts liability caps, indemnity clauses, and jurisdiction provisions into a structured JSON schema. The model cross-references clauses against a reference template and flags deviations—missing force-majeure language, non-standard arbitration terms. Output length: 1,500–3,000 tokens. Integration: embedded via OpenAI API into the firm's document-management system, with a human-in-the-loop approval step before final export. This workflow directly leverages the model's strengths in reasoning and structured output, though the hallucination risk means every flagged clause undergoes manual verification. Teams evaluating similar [/usecases/data-extraction](/en/usecases/data-extraction) pipelines should compare GPT-5 against Claude 3.5 Sonnet and Gemini 1.5 Pro, both of which publish clearer pricing and context-window guarantees.

2. Multilingual customer-support triage for e-commerce.
A pan-European retailer fields support tickets in 24 languages. GPT-5 ingests incoming emails, classifies intent (returns, warranty claims, order tracking), drafts a response in the customer's original language, and routes complex cases to human agents. Prompt shape: ticket text (50–500 tokens) + knowledge-base snippet (1,000 tokens). Expected output: 100–300 tokens. Deployment: serverless function on Azure OpenAI Service, with response time capped at three seconds. Early metrics show a 22 per cent reduction in escalation rate compared to the legacy GPT-4-based system, attributed to better handling of idiomatic phrasing and fewer nonsensical translations. The model's improved safety posture also reduces the incidence of inappropriate tone in automated replies—a persistent pain point in [/usecases/customer-service](/en/usecases/customer-service) automation.

3. Clinical-trial protocol summarisation for regulatory submission.
A pharmaceutical company preparing a dossier for the European Medicines Agency (EMA) needs lay summaries of randomised controlled trials. Researchers feed GPT-5 the full protocol (15,000–40,000 tokens), a list of EMA guideline requirements, and a directive to produce a 1,200-word plain-language summary. The model generates a draft that covers study design, endpoints, adverse-event monitoring, and patient demographics in accessible prose. Output undergoes review by a clinical writer and a regulatory affairs specialist. While GPT-5 maintains coherence across long documents better than GPT-4, domain experts still catch occasional misinterpretations of statistical endpoints—demonstrating that the model excels at synthesis but requires domain-qualified oversight for /usecases/healthcare submissions. Teams in life sciences should cross-check outputs against retrieval pipelines anchored to validated medical literature.

4. Government policy-drafting assistance for local councils.
A municipal council in Brussels tasks its policy unit with drafting a climate-adaptation ordinance. The team uses GPT-5 to synthesise input from public consultations (transcripts, survey responses), compare with existing Belgian and Flemish legislation, and draft initial text. Prompt length: 25,000 tokens (consultation data + legislative excerpts). Output: 3,000-token first draft with section headings, preamble, and enforcement clauses. The model's multilingual capability allows seamless toggling between French and Flemish drafts, and its reasoning strength helps identify policy gaps—areas where stakeholder input conflicts with existing law. Final text is edited by legal counsel and voted on by the council. This use case underscores GPT-5's applicability to /usecases/government workflows, though data-residency requirements under GDPR may compel councils to negotiate on-premises deployment or EU-region guarantees from OpenAI.


Tokonomix benchmark snapshot

Tokonomix evaluates models monthly across seven core categories: reasoning, coding, multilingual, creative, factual recall, speed, and intelligence. At the time of this review, GPT-5 has not been subjected to a full Tokonomix audit cycle, because OpenAI has not yet granted API access under our evaluation license. Preliminary user-contributed data and third-party leaderboards suggest GPT-5 ranks in the top quartile for reasoning and coding tasks, on par with Claude 3.5 Sonnet and marginally ahead of GPT-4 Turbo. Multilingual performance appears strong in Romance and Germanic languages, with anecdotal gains in Slavic languages, though formal validation awaits publication of OpenAI's own language-specific benchmarks.

Important caveat: all benchmark scores rotate monthly as models receive fine-tuning updates, and no single leaderboard captures the full performance envelope. Teams should consult [/benchmarks/leaderboard](/en/benchmarks/leaderboard) for the latest head-to-head comparisons and review our [/benchmarks/methodology](/en/benchmarks/methodology) page to understand how we weight task diversity, prompt adversarialism, and output verification. Because GPT-5's context-window size and pricing remain undisclosed, we cannot yet compute a unified value-for-money metric—an omission that hampers direct comparison with models like Mistral Large 2 or Llama 3.3-70B, both of which publish transparent pricing and resource envelopes.

We anticipate adding GPT-5 to our live rotation within four weeks of stable API availability. Until then, readers should treat early performance claims—including those circulating on social media—with scepticism. The only reliable signal is hands-on testing against your own data, which you can conduct immediately via /live-test once API keys become available.


Pricing breakdown vs alternatives

OpenAI has not disclosed input or output token pricing for GPT-5, nor has it published volume tiers, rate limits, or batch-discounting schedules. This silence is not unusual for pre-release phases, but it complicates procurement. Historically, GPT-4 launched at $0.03 input / $0.06 output per million tokens, with later variants (Turbo, Vision) commanding higher rates for added capability. If GPT-5 pricing follows precedent, teams should budget for a 20–50 per cent premium over GPT-4 Turbo, reflecting the model's improved reasoning and presumed higher training cost.

Comparative landscape:
Claude 3.5 Sonnet (Anthropic): $3.00 input / $15.00 output per million tokens (200k context). Transparent pricing, strong reasoning, robust safety.
Gemini 1.5 Pro (Google): $1.25 input / $5.00 output per million tokens (2M context). Exceptional value for long-document tasks.
Mistral Large 2 (Mistral AI): €2.00 input / €6.00 output per million tokens (128k context). EU-domiciled, GDPR-native, competitive on multilingual.
Llama 3.3-70B (Meta, via providers): $0.60 input / $0.60 output (self-hosted costs vary). Open weights, full control, no vendor lock-in.

Without GPT-5 pricing, teams evaluating cost-sensitive deployments—chatbots processing millions of queries monthly, batch document-processing pipelines—should default to published alternatives until OpenAI releases a price sheet. For workloads where reasoning quality justifies a premium—complex legal analysis, multi-step code generation—early adopters may tolerate uncertainty, but finance teams will demand forecasts before signing off on production rollout.

Operational costs beyond the API call:
Token consumption is only one cost vector. Teams must also account for retrieval infrastructure (vector databases, embedding models), moderation layers, latency-induced user churn, and engineering time spent on prompt optimisation. If GPT-5 requires more elaborate system prompts to achieve equivalent output quality, the effective cost-per-query rises even if the nominal per-token rate stays flat. Tokonomix advises running a two-week pilot with realistic production traffic to measure total cost of ownership before committing to a single vendor.


Verdict & alternatives

Who should use GPT-5:
Teams that prize reasoning depth, multi-step planning, and multilingual fluency over transparent pricing will find GPT-5 a plausible candidate—provided they can tolerate vendor lock-in and the absence of detailed architectural disclosure. Early adopters in legal, consulting, and technical writing verticals report productivity gains when the model is paired with retrieval-augmented pipelines and human-in-the-loop review. Government and healthcare organisations with stringent data-residency requirements should first confirm that OpenAI offers EU-region deployment options or negotiate on-premises licensing; the firm's historical reluctance to publish model weights makes self-hosting unlikely for most teams.

When to choose an alternative:
If budget predictability is non-negotiable, Gemini 1.5 Pro offers transparent pricing and a 2M-token context window that obviates chunking in many workflows. If EU data residency and GDPR alignment are paramount, Mistral Large 2 provides France-domiciled infrastructure and clearer contractual terms. If you require full model control and the engineering capacity to self-host, Llama 3.3-70B delivers competitive performance without API rate limits or usage telemetry. For latency-critical customer-service chat, Claude Haiku or Gemini Flash will outpace GPT-5 on first-token speed, even if they lag marginally on reasoning depth.

The next six months:
OpenAI is expected to publish formal benchmarks, disclose context-window limits, and announce tiered pricing within Q3 2026. Iterative fine-tuning will likely address over-refusal patterns and improve factual grounding through tighter retrieval integrations. The firm's multi-modal roadmap suggests GPT-5 will absorb vision, audio, and possibly structured-data modalities—raising the stakes for competitors but also fragmenting attention across feature sets rather than deepening core language performance. Teams betting on GPT-5 should plan for quarterly model swaps as OpenAI ships incremental updates, and maintain parity implementations of at least one alternative to hedge against API deprecations or price hikes.

Try it yourself:
Tokonomix maintains a live sandbox where you can test GPT-5 alongside Claude, Gemini, Mistral, and Llama models on identical prompts. Compare reasoning chains, measure latency, and evaluate multilingual fidelity in real time. Visit /live-test to start your evaluation today—no vendor pitch, no gated content, just side-by-side model outputs on your own data.

Last technical review: 2026-05-05 — Tokonomix.ai

gpt-5 — illustration 2gpt-5 — illustration 3
Last automated test
Jun 15, 2026 · 08:01 UTC · Speed benchmark
P50 latency
965 ms
P95 latency
1139 ms
Errors
0 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026