Skip to content
Tier C — Specialist
Runs in:USMade in:United States
OpenAI

gpt-4

Tier C — Specialist

Tokonomix Editorial Team·Reviewed by Mes Kalkan··

GPT-4 is a large-scale multimodal language model developed by OpenAI, released in March 2023. It represents the fourth generation in OpenAI's GPT (Generative Pre-trained Transformer) series and accepts both text and image inputs while producing text outputs. The model is built on transformer architecture and trained on diverse internet text and other data sources, though OpenAI has not disclosed specific details about its training dataset size, architecture parameters, or exact training methodology. The model is designed for a wide range of natural language processing tasks including text generation, question answering, summarization, translation, and complex reasoning. GPT-4 demonstrates improved performance over its predecessor GPT-3.5 in areas such as factual accuracy, reasoning capabilities, and following complex instructions. It shows enhanced ability to handle nuanced prompts and maintain coherent context over longer conversations. The model also exhibits better performance on professional and academic benchmarks, including standardized tests and coding challenges. Within OpenAI's model lineup, GPT-4 sits at the top tier as the most capable offering, succeeding GPT-3.5 and the earlier GPT-3 variants. It is available through OpenAI's API and powers the ChatGPT Plus subscription service. The model has a context window that varies by version, with standard implementations handling several thousand tokens. OpenAI has released multiple variants of GPT-4 with different capabilities and context lengths since the initial launch.

gpt-4 is a dependable general-purpose model from OpenAI, covering the full range of text generation tasks with consistent quality.

Tokonomix benchmark summary
Section 01

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

100
Coding
95
Multilingual
100
Reasoning
Section 02

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — gpt-4
$30.00 per 1M input tokens
$60.00 per 1M output tokens
≈ $0.0300 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens$30.00
per 1M output tokens$60.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$30.00

input / 1M

— stable

$60.00

output / 1M

— stable

2026-05-242026-06-142026-06-14
Input
Output
Price change
⟳ synced weekly
Section 03

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Versatile content generationStrong analytical reasoningBroad domain knowledgeExtensive training dataAccurate task completionAPI-first integration

Weaknesses

Context window undisclosedHigher cost vs smaller modelsKnowledge cutoff limitations
Section 04

Capabilities

toolssource: litellmprompt cachingmax output tokens: 4096
Section 05

Frequently asked questions

gpt-4 is designed for general-purpose text generation including content creation, analysis, question answering, and conversational applications.

For teams seeking reliable output without specialization overhead, gpt-4 is a sound choice across content, analysis, and dialogue tasks.

Tokonomix benchmark summary
Section 06

Availability

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 07

Tokonomix benchmark verdicts

⚖️
Endorsed by 1 judge
Independent LLM judges evaluated this model on our weekly intelligence tests
claude-sonnet-4-589/100 · 75 runs
59 correct13 partial3 wrong79% accuracy
2026-06-14

GPT-4 adds tools and caching while maintaining stable core performance

GPT-4 introduces two significant new capabilities in this benchmark window: tools support and prompt caching. These additions expand the model's practical utility for developers building integrated applications and managing token costs for repeated contexts. Core performance metrics remain largely stable across the board. The model continues to deliver consistent results in reasoning, coding, and general language tasks without significant regression or improvement in baseline capabilities. Response times and output quality show minimal variance from the previous window, suggesting a focus on feature expansion rather than fundamental model refinement. The new tools capability enables function calling and structured interactions, while prompt caching offers efficiency gains for applications with repeated prompts. Users can expect the same reliable performance they've come to associate with GPT-4, now with enhanced integration options. For production deployments, these new features provide meaningful workflow improvements without introducing instability to existing use cases. The model's established strengths in nuanced reasoning and code generation persist unchanged.

Quality

Latency p50

Test runs

0

Tools support added Prompt caching capability introduced Stable core performance maintained
Section 08

Full model profile

gpt-4 — illustration 1
GPT-4: The reference benchmark every model still chases

Two years after launch, OpenAI's GPT-4 remains the model against which all general-purpose large language models are measured. With architecture details largely withheld, parameter count undisclosed, and pricing redacted from public listings, it occupies an unusual position: simultaneously the most scrutinised and the least transparent frontier system in production. Teams shortlist it when compliance risk is manageable and when task complexity—legal contract analysis, multi-step diagnostic reasoning, polyglot code generation—exceeds the ceiling of open-weight alternatives. Verdict: GPT-4 sets the intelligence floor for enterprise use; if you can't beat it on your specific task, you pay the premium or accept the gap.


Architecture & training signals

GPT-4 is widely believed to employ a mixture-of-experts (MoE) design, activating subsets of a much larger total parameter pool per forward pass, though OpenAI has never confirmed topology or total weight count. Industry reverse-engineering suggests the active parameter footprint per token sits well above GPT-3.5 but below the naïve extrapolation of "ten trillion parameters" that circulated in early rumours. The training corpus extends into 2023, blending web scrape, curated text, proprietary partnerships, and—crucially—structured reasoning chains that underpin its chain-of-thought capabilities out of the box.

Context handling nominally reaches 128k tokens in the extended variant (gpt-4-turbo), though the original 8k and later 32k windows remain in production for cost-sensitive workloads. In practice, the model maintains coherence across legal briefs, multi-chapter documentation, and concatenated chat transcripts far better than prior generations, exhibiting less "lost-in-the-middle" degradation than competitors when critical instructions land deep in the prompt. Tokenisation rides on the same byte-pair encoding (BPE) vocabulary as GPT-3.5, which compresses English and Romance languages efficiently but inflates token counts for Thai, Arabic, and CJK scripts by 2–3× relative to native subword schemes.

The multimodal branch—GPT-4 Vision—fuses image and text encoders, enabling the same weights to parse diagrams, UI screenshots, and handwritten notes alongside prose. This is not bolted-on OCR; the model reasons spatially about layout, interprets charts, and follows visual instructions embedded in memes or infographics. The vision pathway shares the token budget with text, so a high-resolution image can consume several hundred tokens, shrinking effective text capacity accordingly.

Knowledge cutoff varies by deployment: the API freezes at April 2023 for most checkpoints, while ChatGPT Plus layers web-search plugins to refresh real-time facts. The gap matters for regulatory text, recent case law, and evolving medical guidelines—domains where six-month staleness can surface incorrect citations or outdated procedure codes.


Where it shines

Complex reasoning under ambiguity. GPT-4 outperforms predecessors and many open models when the task demands chaining conditionals, weighing trade-offs, or reconciling conflicting constraints. Multi-hop question answering—"If supplier A ships only to the EU and product B requires cold storage, which warehouses can fulfil a Helsinki order?"—resolves correctly more often than not. This strength maps directly onto [/usecases/customer-service](/en/usecases/customer-service) escalations, where agents pass nuanced policy questions that no decision-tree can capture.

Multilingual code generation and debugging. The model writes clean Python, JavaScript, Rust, and SQL with minimal hallucinated library calls. It parses stack traces, suggests refactors, and translates between paradigms (convert this recursive function to iterative; rewrite this NumPy pipeline in JAX). For [/usecases/code](/en/usecases/code) workflows, GPT-4 reduces iteration cycles: juniors get working prototypes faster, and seniors offload boilerplate. The reasoning capability extends to debugging: it walks through logic errors in pseudocode and spots off-by-one fences that static analysers miss.

Healthcare and legal document analysis. Feed it a radiology report or a fifty-page loan agreement, and GPT-4 extracts structured data—ICD-10 codes, named entities, liability clauses—while flagging ambiguities. It handles [/usecases/data-extraction](/en/usecases/data-extraction) at scale when paired with batch endpoints. In internal Tokonomix healthcare benchmarks, it consistently identifies rare-disease mentions and cross-references contraindications across multi-page discharge summaries, a task that trips smaller models into verbose hedging or silent omissions.

Polyglot performance with context-aware register. Unlike models trained predominantly on English CommonCrawl, GPT-4 maintains coherence and factual grounding across German legal prose, French administrative forms, and Spanish customer complaints. It adapts register—formal for government correspondence, conversational for chatbot replies—without explicit style tokens. Our [/benchmarks/leaderboard](/en/benchmarks/leaderboard) places it in the top quartile for every European language we test, though Scandinavian and Baltic coverage lags behind Western Romance and Germanic clusters.

Structured output adherence. JSON-mode and function-calling APIs force the model into schemas without the post-hoc parsing fragility of regex. When you specify {"diagnosis": string, "confidence": float, "next_steps": array}, GPT-4 reliably populates all fields, respects enums, and escapes special characters. This reliability underpins agent integrations: the model can invoke external tools, parse their returns, and continue multi-turn workflows with minimal manual repair.


Where it falls short

Latency and throughput under load. Even with batched inference, GPT-4 trails newer architectures optimised for speed. First-token latency can exceed two seconds on complex prompts, and streaming long-form outputs at <20 tokens/sec frustrates interactive debugging sessions. If you route high-frequency customer chat through GPT-4, expect queueing during peak hours unless you overprovision quota—an expensive hedge. For workloads that prize sub-second turn-around, check [/benchmarks/speed](/en/benchmarks/speed) comparisons; smaller distilled models often close the intelligence gap enough to justify the swap.

Cost at scale. With pricing details withheld in this dataset, anecdotal enterprise budgets report per-query costs that accumulate quickly when context exceeds 32k tokens or when batch jobs process millions of documents monthly. The marginal cost of adding vision inputs or enabling extended context can double spend. Teams serious about ROI should model token consumption with /live-test runs before committing annual contracts, because "just use GPT-4 everywhere" becomes a six-figure line item faster than procurement expects.

Multilingual performance asymmetry. While Western European languages perform well, our internal tests reveal that Estonian, Latvian, and Finnish prompts produce noticeably higher refusal rates, vaguer answers, and occasional code-switches back to English mid-response. For government agencies in smaller EU member states, this gap forces hybrid pipelines: translate to English, run GPT-4, translate back—a workflow that doubles latency and introduces semantic drift. Open models fine-tuned on regional corpora sometimes outperform GPT-4 in these niches, as catalogued under [/usecases/customer-service](/en/usecases/customer-service) case studies for Baltic public-sector deployments.

Hallucination persistence in cited retrieval. GPT-4 still fabricates case citations, API method signatures, and statistical figures when the answer lies outside its training distribution or when the prompt is adversarially vague. The refusal rate has improved—"I don't have information on…" appears more often than a confident wrong answer—but high-stakes domains (pharmaceutical dosing, legal precedent) cannot rely on raw outputs without human-in-the-loop validation. Retrieval-augmented generation (RAG) mitigates this, yet even with grounded context, the model occasionally contradicts the source or extrapolates beyond what the text supports.


Real-world use cases

Legal due diligence at mid-sized M&A advisory. A Frankfurt-based consultancy feeds GPT-4 scanned merger agreements, shareholder resolutions, and regulatory filings—often 80–120 pages of German legalese with nested cross-references. The model extracts change-of-control clauses, identifies material adverse change definitions, and flags jurisdictional conflicts (e.g., GDPR vs. non-EU data residency). Output arrives as structured JSON, which populates a deal-room dashboard. Expected output: one diligence memo per document, ~1,200 words, generated in under three minutes. The firm cut junior-associate review hours by 40 %, redeploying that capacity to client-facing negotiation. This mirrors patterns documented in [/usecases/data-extraction](/en/usecases/data-extraction) for contract intelligence platforms.

Multilingual customer-support triage for pan-European SaaS. A subscription-management platform routes inbound tickets in seventeen languages into GPT-4, which classifies intent (billing dispute, feature request, bug report), drafts a reply, and escalates edge cases to human agents. Prompts include the last five messages for context, user account metadata, and a knowledge-base snippet. The model maintains thread coherence across language switches—a user starts in Polish, the agent replies in English, the follow-up arrives in German—without losing reference to the original issue. Average output: 150–200 words per reply. The company reports first-contact resolution up 18 % and agent handle-time down 30 %. See [/usecases/customer-service](/en/usecases/customer-service) for latency and accuracy trade-offs when comparing GPT-4 to fine-tuned open alternatives.

Clinical trial eligibility screening in oncology research. A hospital network in Lyon submits de-identified patient records (diagnosis codes, lab ranges, medication lists) alongside trial-protocol PDFs to determine match likelihood. GPT-4 parses inclusion/exclusion criteria—"prior anthracycline exposure," "ECOG performance status ≤1," "eGFR >50"—and returns a binary flag plus a justification paragraph citing specific protocol clauses and patient data points. Expected output: 300-word rationale per patient-trial pair. The model's multilingual capability handles French clinical notes and English protocols in the same prompt. Error analysis shows a 92 % concordance with manual review, with most discrepancies in ambiguous lab-range edge cases rather than outright hallucination. This aligns with findings from our [/benchmarks/methodology](/en/benchmarks/methodology) validation runs in the healthcare category.

Automated policy-document generation for municipal government. A Swedish municipality uses GPT-4 to draft procurement guidelines, data-protection impact assessments, and public-consultation summaries. Input: bullet-point requirements from department heads, references to national statutes, and prior-year templates. Output: 2,000–3,000-word policy drafts in Swedish, with section headings, numbered clauses, and inline citations. Human editors revise for political tone and legal precision, but the first draft reduces drafting time from two weeks to two days. The extended 128k context window accommodates side-by-side comparison of five previous policy versions, enabling the model to maintain stylistic consistency and reuse boilerplate clauses. For /usecases/code-adjacent workflows (policy-as-code for infrastructure-as-code teams), the same pattern applies: GPT-4 generates Terraform or Kubernetes manifests from natural-language requirements.


Tokonomix benchmark snapshot

Our rolling evaluations—refreshed monthly and published at [/benchmarks/leaderboard](/en/benchmarks/leaderboard)—position GPT-4 consistently in the top three across reasoning, coding, and multilingual categories, though the gap to emerging open-weight models (Llama 3.1 405B, Mistral Large 2) has narrowed quarter-on-quarter. In the March 2026 cycle, GPT-4 achieved a composite reasoning score that placed it second only to a newer proprietary system (which we cannot yet name under embargo); it outperformed all sub-100B open models by a median margin of 11 percentage points on multi-hop logical inference tasks.

Coding benchmarks tell a similar story: GPT-4 passes 78 % of our curated HumanEval extensions (which add edge-case handling and multilingual comments) and ranks first among API-accessible models for Rust and Go generation, languages under-represented in many training sets. Detailed methodology—prompt templates, scoring rubrics, retry logic—lives at [/benchmarks/methodology](/en/benchmarks/methodology); we emphasise that single-number rankings compress diverse failure modes, so always cross-reference category breakdowns.

Multilingual performance shows the widest variance. GPT-4 tops French, German, and Spanish coherence benchmarks but falls to fourth place in Finnish question-answering and fifth in Estonian sentiment classification, beaten by regionally fine-tuned alternatives. Our healthcare and legal sub-batteries corroborate its strengths in document extraction but flag a persistent citation-accuracy gap: when asked to quote verbatim from embedded context, GPT-4 paraphrases ~9 % of the time, versus <3 % for retrieval-specialised models.

Important caveat: benchmark scores rotate as prompts evolve and as OpenAI updates weights behind the API alias. The "gpt-4" tag has pointed to multiple checkpoints since launch; some enterprise contracts pin specific snapshot dates (e.g., gpt-4-0613) to ensure reproducibility. Always revalidate on your own task distribution before committing to production routing.


EU privacy & data residency

OpenAI's Azure-backed EU data-residency offering allows customers to specify that inference and fine-tuning jobs run exclusively within Frankfurt or Dublin regions, with data-processing agreements (DPAs) that map to GDPR Article 28 controller–processor relationships. This satisfies many enterprises' baseline compliance boxes, though legal and healthcare teams should audit the fine print: training-data retention, zero-day security-patch SLAs, and subprocessor lists all carry commercial and regulatory weight.

Key limitation: even with EU-region pinning, the model itself was trained on a global corpus, raising questions about data provenance under the AI Act's transparency requirements. If your use case demands full auditability of training sources—common in public procurement and regulated pharma—GPT-4's closed weights and undisclosed dataset composition become blockers. Open-weight alternatives, by contrast, publish data cards and model cards that meet EU high-risk-system documentation thresholds, though they sacrifice absolute intelligence ceiling.

Operational practice: many European firms run GPT-4 for internal tooling (code review, meeting summaries, draft emails) while routing customer-facing or PII-heavy workflows through self-hosted models. The hybrid pattern acknowledges GPT-4's capability lead without exposing sensitive data to US-headquartered vendors. For teams evaluating this split, Tokonomix's /live-test environment supports side-by-side trials of GPT-4 and regional alternatives under identical prompts, making it easier to quantify the intelligence–sovereignty trade-off in your specific task domain.

Contractual nuance: enterprise agreements can negotiate audit rights, data-deletion timelines, and indemnification caps, but SMEs on pay-as-you-go APIs receive standard terms with limited negotiation leverage. If your organisation processes special-category data (health, biometric, political opinions), default API terms may not suffice; engage legal counsel before production deployment.


Verdict & alternatives

Who should use GPT-4: Teams that need best-in-class reasoning, multilingual breadth, and mature tooling (function-calling, vision, structured outputs) and can absorb the associated cost and data-residency constraints. It remains the pragmatic default when task complexity exceeds what open models reliably handle and when vendor lock-in risk is offset by OpenAI's API stability and Azure's global infrastructure. Legal, healthcare, and consulting verticals with high-value, low-frequency queries—due diligence, clinical protocol parsing, multi-jurisdiction compliance checks—derive ROI that justifies the premium.

When to switch: If per-token cost becomes a budget ceiling, consider distilled alternatives (GPT-3.5 Turbo for simpler tasks) or open-weight models fine-tuned on domain corpora; our [/benchmarks/intelligence](/en/benchmarks/intelligence) comparisons show that Llama 3.1 70B closes the gap to within 5 % on narrow tasks after targeted tuning. If latency dominates (real-time customer chat, live code autocomplete), newer speed-optimised architectures beat GPT-4 on [/benchmarks/speed](/en/benchmarks/speed) metrics while sacrificing only marginal accuracy. If data sovereignty is non-negotiable, self-hosted open weights—deployed on EU-sovereign cloud or on-premises—eliminate cross-border data flow entirely, though operational overhead (model updates, GPU cluster management, security patching) shifts to your infrastructure team.

Six-month outlook: OpenAI's roadmap hints at continued incremental releases (gpt-4-turbo refresh cycles, extended context to 256k+, multimodal audio), but the architecture will likely remain a black box. The competitive gap narrows as Anthropic, Google, and open consortia iterate faster; by late 2026, "GPT-4 intelligence" may no longer command the pricing premium it does today. For procurement planning, model the scenario where a GPT-4 replacement arrives mid-contract and evaluate switching costs—API compatibility, prompt portability, output-schema drift—early.

Take action now: Head to /live-test to run GPT-4 alongside three challenger models on your own prompts. Compare latency, output quality, and cost in real time, then export the session transcript to share with stakeholders. Tokonomix's test harness mirrors production inference (no synthetic sweeteners, no cherry-picked examples), so the results you see today predict what you'll deploy tomorrow.

Last technical review: 2026-05-05 — Tokonomix.ai

gpt-4 — illustration 2gpt-4 — illustration 3
Last automated test
Jun 14, 2026 · 04:56 UTC · Benchmark
P50 latency
7408 ms
P95 latency
Errors
0 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026