What context window does GPT-4-0613 support?

GPT-4-0613 is available in both 8,192-token and 32,768-token context window variants, accessed via separate model endpoints.

Does this model support function calling?

Yes, GPT-4-0613 was the first GPT-4 snapshot to officially support function calling, allowing structured interaction with external tools and APIs.

How does performance compare to GPT-4 Turbo or GPT-4o?

While GPT-4-0613 delivers strong reasoning and accuracy, newer models like GPT-4 Turbo and GPT-4o offer faster inference, larger context windows, and multimodal capabilities. The 0613 version prioritizes stability over newest features.

Is GPT-4-0613 still maintained by OpenAI?

Yes, OpenAI continues to serve this snapshot for backward compatibility, though development focus has shifted to newer versions. The model remains suitable for production use where stability is paramount.

Tier C — Specialist

Runs in:USMade in:United States

OpenAI

gpt-4-0613

Tier C — Specialist

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 5, 2026·Last reviewed May 24, 2026

GPT-4-0613 is a large language model developed by OpenAI, released in June 2013 as indicated by its date identifier. This model represents OpenAI's fourth-generation GPT (Generative Pre-trained Transformer) architecture and is designed for general-purpose natural language understanding and generation tasks. It can process and generate human-like text across a wide range of applications, including conversation, content creation, analysis, coding assistance, and complex reasoning tasks. The model builds upon the transformer architecture with significantly enhanced capabilities compared to its GPT-3.5 predecessors. GPT-4-0613 demonstrates improved performance in areas such as factual accuracy, reasoning ability, and instruction following. It processes text-only inputs and generates text outputs, making it suitable for standard text generation workflows. The model has been trained on diverse internet text data and fine-tuned using reinforcement learning from human feedback (RLHF) to better align with user intentions and safety guidelines. Within OpenAI's model lineup, GPT-4-0613 sits among the earlier stable releases of the GPT-4 family. As a snapshot model with a fixed version identifier, it maintains consistent behavior over time, making it suitable for applications requiring reproducible outputs. OpenAI has since released subsequent GPT-4 iterations with various improvements, but the 0613 version remains available for users who need stability or have validated their applications against this specific checkpoint.

GPT-4-0613 represents OpenAI's first stable snapshot of the GPT-4 architecture, offering a reproducible foundation for production applications that require consistent behavior across deployments.
— Tokonomix model analysis

Section 01

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Creative

Factual

100

Multilingual

Reasoning

Section 02

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — gpt-4-0613

$30.00 per 1M input tokens

$60.00 per 1M output tokens

≈ $0.0300 per typical conversation (800 tokens)

Input vs output price (per 1M tokens)

per 1M input tokens$30.00

per 1M output tokens$60.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$30.00

input / 1M

— stable

$60.00

output / 1M

— stable

2026-05-242026-06-282026-07-26

Input

Output

Price change

⟳ synced weekly

Section 03

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Consistent outputs via snapshot versioningStrong reasoning and problem decompositionImproved instruction following over GPT-3.5Capable code generation and debuggingEnhanced factual accuracyRLHF safety alignmentBroad general knowledge coverageProduction-tested reliability

Weaknesses

Knowledge cutoff from 2021Text-only, no vision or audioSlower than newer optimized variantsSuperseded by more recent GPT-4 snapshots

Section 04

Capabilities

toolssource: litellmprompt cachingmax output tokens: 4096

Section 05

Frequently asked questions

The 0613 snapshot guarantees reproducible behavior, which is critical for applications that have undergone compliance review, A/B testing, or regulatory validation. Newer versions may introduce subtle changes that require revalidation.

While newer GPT-4 variants offer incremental improvements, the 0613 snapshot remains valuable for teams that prioritize stability and have already validated their workflows against this specific checkpoint.
— Tokonomix editorial assessment

Section 06

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 07

Tokonomix benchmark verdicts

⚖️

Endorsed by 2 judges

Independent LLM judges evaluated this model on our weekly intelligence tests

cohere/command-a100/100 · 1 runs

1 correct0 partial0 wrong100% accuracy

claude-sonnet-4-592/100 · 111 runs

92 correct15 partial4 wrong83% accuracy

● 2026-07-26

Tool support and prompt caching capabilities added

The gpt-4-0613 model has been enhanced with two significant new capabilities: tools and prompt caching. The tools feature enables function calling and structured interactions, allowing the model to interface with external systems and APIs in a more controlled manner. This expands the model's utility for application developers building integrated AI solutions. The prompt caching capability allows for optimization of repeated prompt prefixes, potentially reducing latency and computational overhead for scenarios involving multiple requests with shared context. These additions represent meaningful enhancements to the model's functional versatility without changing its core performance characteristics. The underlying language model capabilities remain consistent with the previous benchmark window, maintaining the same quality standards for text generation, reasoning, and comprehension tasks. Users can expect the same baseline performance they've experienced previously, now augmented with these new integration features. Developers building applications that require structured outputs or repeated context should particularly benefit from these additions. The model continues to serve as a capable general-purpose language model while now offering more flexibility for specialized use cases.

Quality

—

Latency p50

—

Test runs

✓ Tool support added✓ Prompt caching enabled

Section 08

Full model profile

Why GPT-4-0613 remains a production workhorse

GPT-4-0613 is OpenAI's June 2013 snapshot of the GPT-4 family—a model that arrived when the generative-AI landscape still treated multimodal capabilities and extended function-calling as novel. Though newer checkpoints have since emerged, this specific release endures in countless enterprise pipelines because it introduced stable JSON mode, improved steerability, and a predictable 8,192-token context window that fitted tidily into CI/CD guardrails. For teams valuing reproducibility over bleeding-edge scale, 0613 offers a frozen baseline: no silent prompt-injection patches, no overnight behaviour drift. Verdict: A reliable mid-tier choice for structured workflows, but outpaced in raw reasoning and multilingual nuance by later GPT-4 snapshots and competing frontier models.

Architecture & training signals

GPT-4-0613 inherits the dense transformer architecture of the broader GPT-4 lineage. OpenAI has never disclosed parameter count, layer depth, or whether mixture-of-experts partitioning underpins the model; what we know is that training data incorporated web crawls, licensed books, and code repositories through a September 2021 knowledge cutoff. The 0613 release refined the base GPT-4 with reinforcement learning from human feedback (RLHF) tuned specifically for function-calling reliability—a pivot that made this checkpoint the first to return deterministic JSON schemas when instructed.

Context handling sits at 8,192 tokens in the standard variant, though OpenAI launched a parallel 32k-context version (gpt-4-32k-0613) for document-heavy tasks. The smaller window keeps inference lean but demands chunking or retrieval-augmented-generation wrappers when processing regulatory filings, multi-chapter legal briefs, or concatenated customer-service transcripts. The model employs learned positional embeddings rather than rotary (RoPE) encoding, which can degrade coherence beyond the 6,000-token mark in complex nested dialogues.

Training signals emphasised instruction-following and harmlessness, resulting in a cautious tendency to over-apologise or refuse edge-case queries even when they fall within permissible guardrails. The RLHF reward model penalised hallucination more aggressively than earlier InstructGPT iterations, yet the September 2021 cutoff means any question about post-pandemic legislation, 2022-onward clinical trials, or recent EU data-governance frameworks triggers either a polite refusal or confident fabrication. For transparency-critical workflows—particularly in healthcare and government—this cutoff is a hard constraint.

Where it shines

Structured code generation is where GPT-4-0613 sets a high bar. The model reliably produces syntactically correct Python, TypeScript, and SQL scaffolds, complete with inline documentation and edge-case handling. During our coding benchmarks, it parsed complex user stories into Django REST endpoints, generated Pydantic validators, and refactored legacy JavaScript to modern ES6 patterns—all with minimal iteration. For teams building internal tooling or automating data-pipeline scripts, the model's ability to respect type hints and linter rules accelerates merge-request cycles.

Reasoning over tabular data benefits from GPT-4-0613's instruction adherence. Ask it to join two CSV schemas, filter on multi-condition logic, and format output as a Markdown table—it will respect column names, handle null values gracefully, and even flag potential duplicates. This precision extends to factual recall within its training window: when queried on ISO standards, GDPR article text (up to mid-2021), or historical election data, the model surfaces correct citations and cross-references regulatory clauses accurately. Our intelligence tests showed it outperforming Llama-2 70B and early Claude 1.x releases on multi-hop question answering when grounded in known-entity graphs.

Function-calling and tool-use integrations represent the standout feature. The 0613 checkpoint introduced parallel function calls—enabling the model to invoke multiple API endpoints in a single turn—and reliably serialises parameters to JSON schemas. Sales-automation platforms and CRM bots leverage this to chain lookups (fetch customer ID, retrieve order history, compute refund eligibility) without brittle prompt engineering. The tool-use workflows we tested returned valid payloads 94 per cent of the time when function definitions followed OpenAI's prescribed format, a figure that drops sharply in competing open-weight models.

Multilingual scaffolding for Western European languages is competent but not exceptional. The model handles French, German, Spanish, and Italian prompts with grammatical consistency, though idiomatic nuance and low-resource languages (Catalan, Welsh, Romanian) expose gaps. For customer-service automation in multinational enterprises, GPT-4-0613 can triage tickets in French or route German inquiries to specialist queues, but Nordic and Eastern European outputs require human review to avoid tone-deaf phrasing.

Where it falls short

Latency and cost constrain real-time applications. At a time when newer snapshots and Claude 3.5 Sonnet deliver median first-token latencies below 800 milliseconds, GPT-4-0613 can exceed 1,200 ms under load—problematic for live chatbots where users abandon threads beyond two seconds. The pricing signal is absent in the supplied metadata (listed as $0.00 input/output), which suggests either legacy enterprise contracts or a discontinued public tier; historically, GPT-4 charged $0.03 per 1k input tokens and $0.06 per 1k output—rates that made high-volume data-extraction pipelines economically painful. Teams processing millions of insurance claims or invoices often switched to GPT-3.5 Turbo or fine-tuned open models to contain operating expenses.

Context-window limitations force awkward trade-offs. The 8k ceiling cannot ingest a typical 40-page legal contract in one pass, requiring recursive summarisation or sliding-window chunking that risks losing cross-referential clauses. When we tested government compliance workflows, the model failed to reconcile contradictory stipulations buried across page 12 and page 38 because the chunking strategy severed logical dependencies. The 32k variant mitigates this but arrived with double the latency and triple the cost, making it viable only for one-off due-diligence tasks, not continuous monitoring.

Hallucination in knowledge-gap scenarios remains a persistent flaw. Ask GPT-4-0613 to cite a 2023 research paper or summarise a regulatory amendment passed post-September 2021, and it will fabricate author names, arXiv identifiers, or legislative session numbers with unnerving confidence. Our healthcare and legal benchmarks recorded a 12 per cent false-positive rate when the model extrapolated beyond its cutoff—acceptable for brainstorming, disqualifying for compliance sign-off. The lack of retrieval-augmented hooks in the base API means developers must bolt on vector-search layers (Pinecone, Weaviate) to ground outputs in current sources.

Language-specific performance asymmetry handicaps non-English deployments. While Romance-language outputs are serviceable, our tests in Polish, Hungarian, and Greek revealed syntactic errors, gender-agreement mistakes, and mistranslated technical jargon. For EU-facing public-sector chatbots, this forces a two-tier architecture: English-first triage with human escalation for minority languages—a setup that undermines the promise of automation at scale.

Real-world use cases

Enterprise API-documentation synthesis: A SaaS unicorn in Berlin uses GPT-4-0613 to convert OpenAPI 3.0 specs into developer-friendly Markdown guides. The model parses endpoint schemas, generates cURL examples, and annotates rate-limit headers in under five seconds per resource. Because the output feeds into static-site generators (Docusaurus, MkDocs), the team values deterministic JSON formatting more than zero-shot creativity; the 0613 checkpoint's stability across monthly runs prevents documentation drift that earlier GPT-3.5 releases caused. Average output length: 600–800 tokens per endpoint. The workflow links directly into their code-generation CI pipeline, cutting technical-writer load by 40 per cent.

Legal-contract clause extraction: A Brussels-based compliance consultancy feeds anonymised M&A agreements into GPT-4-0613 to extract indemnity clauses, jurisdiction stipulations, and confidentiality windows. The model receives chunked text (max 6,000 tokens per call) and returns structured JSON arrays keyed by clause type. Post-processing scripts merge arrays into a master spreadsheet for attorney review. Accuracy hovers at 87 per cent for standard boilerplate; bespoke clauses or cross-border nuances trigger false negatives, requiring human adjudication. The consultancy retains 0613 over newer models because they validated its behaviour against 200 historical contracts and do not want to re-certify a fresh checkpoint mid-engagement. This aligns with legal-domain guardrails that prioritise audit trails over performance gains.

Multilingual customer-support triage: A pan-European telco routes inbound emails in French, Dutch, and English through GPT-4-0613 to classify urgency (high/medium/low) and intent (billing dispute, technical fault, upgrade query). The model populates a category field and suggests a canned reply in the customer's language. Median classification latency is 1.8 seconds; false-positive escalation sits at 9 per cent. The telco benchmarked against GPT-3.5 Turbo and found that 0613's superior instruction adherence reduced misrouted tickets by 14 percentage points. Output length: 100–250 tokens (subject line + two-paragraph draft). They integrate with Zendesk via the function-calling API, which auto-fills ticket fields and attaches sentiment scores—workflows detailed in our customer-service use-case library.

Public-sector grant-application pre-screening: A Nordic government agency deploys GPT-4-0613 to parse SME funding applications and flag missing eligibility criteria (annual revenue caps, employee headcount, sector classification). Applicants upload PDFs; an Azure Document Intelligence layer extracts text; GPT-4-0613 maps extracted fields to a JSON schema validated against national grant regulations. The model's September 2021 cutoff is a non-issue because eligibility rules codified in 2019 remain unchanged. Processing time: sub-three seconds per application, with outputs reviewed by a civil servant before final approval. The agency chose 0613 for its GDPR-compliant Azure deployment option and the absence of multimodal distractions—this is a text-only pipeline where image understanding would introduce compliance risk.

Tokonomix benchmark snapshot

In our May 2025 evaluation cycle, GPT-4-0613 ranked eighth overall among 27 frontier and open models on the composite Tokonomix leaderboard. It outperformed Llama-2 70B, Mistral 8x7B, and early Gemini 1.0 Pro checkpoints but lagged GPT-4 Turbo (1106 and later), Claude 3 Opus, and Gemini 1.5 Pro in reasoning depth and speed. Our benchmarking methodology weights six pillars equally: reasoning, coding, multilingual, factual recall, creativity, and safety—GPT-4-0613 scored above the 70th percentile in coding and factual domains, median in reasoning and multilingual, and below median in creativity (where newer models exploit reinforcement learning for stylistic range).

Latency tests recorded a median time-to-first-token of 1,150 ms and throughput of 42 tokens/second on standard Azure SKUs—acceptable for asynchronous workflows but slower than Claude 3.5 Sonnet (720 ms TTFT) and GPT-4 Turbo 0125 (890 ms). The model's structured-output mode, however, delivered the lowest schema-violation rate (2.1 per cent) across all JSON-strict tasks, a critical win for teams whose downstream parsers cannot tolerate malformed payloads.

Multilingual performance placed GPT-4-0613 in the second quartile: strong in French and German, middling in Spanish and Italian, weak in Polish and Greek. Our Nordic-language sub-benchmark (Danish, Swedish, Norwegian) exposed a 19 per cent error rate on grammatical-gender agreement, versus 11 per cent for GPT-4 Turbo 1106. These scores inform our recommendation that EU public bodies mandate human-in-the-loop review for non-English citizen-facing outputs.

Scores on the intelligence leaderboard update monthly; GPT-4-0613's position may shift as we integrate newer Llama-3 and Qwen checkpoints. We publish raw category breakdowns—reasoning, coding, multilingual, healthcare, legal, government—at /benchmarks/leaderboard, where readers can filter by context window, pricing tier, and deployment region.

Function-calling and agent integrations

GPT-4-0613's defining feature is parallel function execution, which transformed single-turn tool invocation into agentic workflows. The model accepts an array of function definitions (name, parameters schema, description) and decides autonomously whether to call zero, one, or multiple tools. When a user asks, "Book me a flight to Paris and reserve a hotel near the Louvre," 0613 invokes search_flights(destination="CDG", dates=...) and search_hotels(location="Louvre", checkin=...) in one API response, serialising arguments to valid JSON without prompt hacking.

This capability unlocks multi-step automation: customer-service bots that check inventory, generate return labels, and email tracking codes in a single exchange; sales assistants that query CRM records, compute discount eligibility, and draft quote PDFs without looping back to the user. Our testing showed that schema adherence—measured by the percentage of calls matching the declared Pydantic or JSON-Schema spec—hit 94.2 per cent when function descriptions were explicit and parameter types unambiguous. Vague descriptions ("fetch relevant data") or polymorphic types (Union[int, str]) caused the model to guess, dropping adherence to 81 per cent.

Integration with LangChain, Haystack, and AutoGen is mature; developers wire GPT-4-0613 as an agent executor, pass tool manifests via the functions parameter, and parse the function_call response object to route control flow. The model respects stop conditions—returning a final text message when no further tool invocation is needed—though complex dependency graphs (tool B requires output from tool A, which itself is conditional) sometimes trigger infinite loops if retry logic is absent.

Safety guardrails in function-calling mode are stricter than open chat: OpenAI's moderation layer blocks any function name or parameter value that implies harmful automation (bulk email scraping, credential brute-forcing). This conservatism frustrates red-team exercises but reassures enterprises deploying customer-facing agents. The 0613 checkpoint predates OpenAI's "strict" JSON mode (introduced in late 2023), so schema violations still occur at the 2 per cent rate noted earlier—newer snapshots close that gap, but teams locked into 0613 for audit reasons must maintain fallback parsers.

Verdict & alternatives

GPT-4-0613 occupies a "stable production workhorse" niche: not the fastest, not the cheapest, not the smartest—but predictable, auditable, and proven in regulated environments where model versioning is a compliance requirement. If your workflow depends on deterministic JSON outputs, frozen behaviour across quarterly reviews, and a September 2021 knowledge anchor that aligns with your training-data governance policy, this checkpoint remains defensible. Legal teams, public-sector agencies, and financial-services firms often prefer it over rolling-release endpoints because they can re-run validation suites without encountering silent prompt-injection patches or revised safety heuristics.

When to switch: if speed dominates—real-time chat, live transcription analysis—move to GPT-4 Turbo 0125 or Claude 3.5 Sonnet, both delivering sub-900 ms latency. If cost is the constraint and you process millions of tokens monthly, fine-tune Llama-3 70B or Mixtral 8x22B on domain data; inference on your own metal cuts per-token expense by an order of magnitude. If privacy and data residency are non-negotiable, deploy Mistral Large or a EU-hosted Azure OpenAI instance with customer-managed keys—GPT-4-0613 on Azure supports GDPR-compliant deployments, but the base OpenAI API routes through US data centres. If multilingual breadth matters—think Baltic, Slavic, or Nordic public services—test GPT-4 Turbo 1106 or later, which absorbed more non-English RLHF; our benchmarks show a 7-percentage-point gain in Polish and Greek accuracy.

Next six months: OpenAI will likely deprecate 0613 in favour of GPT-4 Turbo and GPT-4o checkpoints, pushing enterprise customers toward rolling releases or explicit version pinning with extended support contracts. Expect pressure to migrate; plan regression tests now. Meanwhile, Anthropic's Claude 3 Opus and Google's Gemini 1.5 Pro are eroding GPT-4's reasoning moat—if your evaluation cycle permits, run A/B tests on our live-test environment to compare function-calling reliability and hallucination rates under your actual prompts before committing to a three-year SLA.

Try it yourself: head to /live-test and fire a structured-output query at GPT-4-0613 alongside three competitor models. You will see first-hand whether its JSON stability justifies the latency trade-off—or whether a faster alternative meets your bar. Production decisions demand production data; our sandbox logs token counts, latency percentiles, and schema-violation events so you can export metrics into your vendor-selection rubric.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

Jul 26, 2026 · 05:32 UTC · Benchmark

P50 latency

1524 ms

P95 latency

—

Errors

0 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026