Does this model support function calling or JSON mode?

No, gpt-3.5-turbo-instruct is a pure completion model without the structured output features found in chat-optimized models. For function calling or guaranteed JSON formatting, use gpt-3.5-turbo or later chat models.

Can I use this model for conversational chatbots?

While technically possible by engineering prompts carefully, it's not recommended. This model lacks conversational memory and chat-specific tuning, making dedicated chat models a better architectural choice for dialogue applications.

How does RLHF training affect completion quality?

The InstructGPT methodology improves the model's ability to follow explicit instructions and align outputs with user intent, making it more reliable than base GPT-3 models. This results in fewer off-topic completions and better adherence to formatting requirements.

Is this model still actively maintained by OpenAI?

Yes, though OpenAI's primary development focus has shifted toward chat models and GPT-4. This model remains available for users who specifically require completion-style interfaces for production workloads.

Runs in:USMade in:United States

Archived

This model has been discontinued by the provider. Historical data is preserved.

No longer available since May 27, 2026.

OpenAI

gpt-3.5-turbo-instruct

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 5, 2026·Last reviewed May 24, 2026

GPT-3.5-turbo-instruct is a text generation model developed by OpenAI, based on the GPT-3.5 architecture. It operates as a completion model, meaning it continues text from a given prompt rather than following a conversational chat format. This model uses the InstructGPT training methodology, which incorporates reinforcement learning from human feedback (RLHF) to better follow instructions and produce outputs aligned with user intent. It is designed for single-turn completion tasks where users provide a prompt and receive a generated text response. The model is optimized for traditional text generation use cases including creative writing, summarization, text transformation, code generation, and other tasks that benefit from a completion-style interface. Unlike chat-optimized models, gpt-3.5-turbo-instruct does not maintain conversational context across multiple exchanges and instead focuses on producing high-quality responses to individual prompts. It shares the underlying architecture improvements of the GPT-3.5 series, including enhanced instruction-following capabilities compared to base GPT-3 models. In OpenAI's model lineup, gpt-3.5-turbo-instruct occupies a specialized position as the primary completion model in the GPT-3.5 family. While most of OpenAI's recent development has focused on chat-optimized models like gpt-3.5-turbo and GPT-4, this model serves users who specifically require completion-style interactions. It effectively replaced earlier GPT-3 completion models such as text-davinci-003, offering improved performance with the instruct-tuning methodology while maintaining the completion interface.

gpt-3.5-turbo-instruct stands as OpenAI's dedicated completion model, preserving the classic prompt-to-text paradigm that many legacy applications and creative workflows still depend on.
— Tokonomix model positioning analysis

Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — gpt-3.5-turbo-instruct

$1.50 per 1M input tokens

$2.00 per 1M output tokens

≈ $0.0013 per typical conversation (800 tokens)

Input vs output price (per 1M tokens)

per 1M input tokens$1.50

per 1M output tokens$2.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$1.50

input / 1M

— no change

$2.00

output / 1M

— no change

2026-05-242026-05-242026-05-24

Input

Output

Price change

⟳ synced weekly

Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

True completion interfaceStrong instruction-following via RLHFDrop-in replacement for GPT-3 DavinciExcellent for creative writing tasksSingle-turn latency optimizedVersatile text transformation capabilitiesCapable code generationReliable summarization quality

Weaknesses

No multi-turn conversation supportKnowledge cutoff limitationsLess advanced than GPT-4 modelsNot optimized for chat patterns

Section 03

Frequently asked questions

Use gpt-3.5-turbo-instruct when you need completion-style behavior, such as continuing prose, generating text from templates, or when migrating from legacy GPT-3 Davinci endpoints. Choose gpt-3.5-turbo for chat, dialogue, or multi-turn interactions.

For teams requiring completion-style inference or migrating from older GPT-3 models, gpt-3.5-turbo-instruct offers a stable, instruction-tuned solution. However, those building new conversational applications should evaluate chat-optimized alternatives.
— Tokonomix editorial assessment

Section 04

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

● 2026-05-24

Baseline established for GPT-3.5-turbo-instruct completion model

This initial benchmark establishes the baseline performance profile for GPT-3.5-turbo-instruct, OpenAI's completion-optimized variant of GPT-3.5. As a first verdict, all metrics represent the starting reference point for future comparisons. The model demonstrates its positioning as a completion-focused alternative to the chat-based GPT-3.5-turbo, designed for single-turn instruction following and text generation tasks. Users should note that this variant uses the completion API format rather than the chat API format, making it suitable for specific use cases like text insertion, creative writing, and structured output generation. The baseline data captures the model's current capabilities across standard benchmarking dimensions. Future verdicts will track how performance evolves over time, identifying improvements or regressions in response quality, consistency, and behavior. Since this is the first assessment, no performance trends or stability patterns can yet be established. The model's behavior under various prompting strategies and task types will become clearer as additional benchmark windows accumulate, allowing for meaningful longitudinal analysis of its development trajectory and reliability characteristics.

Quality

—

Latency p50

—

Test runs

✓ Initial baseline established

Section 06

Full model profile

The Last InstructGPT Standing: gpt-3.5-turbo-instruct in Production

OpenAI's gpt-3.5-turbo-instruct occupies a singular niche in the model catalogue: it is the final public-facing member of the InstructGPT family that exposes completion-style endpoints rather than chat-optimised interfaces. Released in September 2023, it retains the instruction-following rigour of GPT-3.5 base training while allowing for open-ended prompt formats that many legacy automation pipelines still require. Unlike chat-tuned siblings, this variant accepts raw text input without enforced conversational structure, making it indispensable for workflows built on string manipulation, variable interpolation and deterministic output parsing. Verdict: a legacy bridge for teams migrating older GPT-3 Davinci workflows; lightweight, cheap and predictable, but increasingly overshadowed by newer chat models in reasoning depth and multimodal capability.

Architecture & training signals

The model inherits the GPT-3.5 lineage, itself a refined descendant of the 175-billion-parameter GPT-3 Davinci architecture. OpenAI has not publicly disclosed the exact parameter count for the Turbo variants, but signals in API behaviour and latency profiling suggest a model in the 100-billion-to-200-billion range—substantially smaller than GPT-4's rumoured mixture-of-experts architecture, yet still large enough to preserve semantic coherence across a 4,096-token context window. The "Instruct" suffix denotes reinforcement learning from human feedback (RLHF) fine-tuning, with reward models trained to prioritise instruction adherence over conversational nuance.

Training data ingestion concluded in September 2021, yielding a knowledge cut-off that predates major geopolitical events of 2022 and the explosion of EU AI Act discussions. This fixed horizon means the model cannot reference real-time data or facts emerging after mid-2021 without external retrieval augmentation. Context handling is strictly causal; the model processes tokens left-to-right and cannot revise earlier predictions, a constraint that older completion APIs expect. The absence of system-message scaffolding means prompt engineering must embed all instructions, tone modifiers and safety guardrails directly in the user text—an advantage when integrating with legacy templating engines, a drawback when migrating to role-based chat patterns.

OpenAI stripped the model of certain moderation layers present in chat variants, trusting developers to implement their own content filters. This yields faster inference but places responsibility for EU GDPR Article 22 compliance and AI Act transparency squarely on the application layer. The completion paradigm also bypasses the alternating user-assistant turn structure, enabling single-shot text transformations that chat models sometimes struggle to interpret without additional wrapping.

Where it shines

Deterministic text transformation is the model's cornerstone strength. When tasks reduce to "take this input format, apply a rule set, emit that output format," gpt-3.5-turbo-instruct excels. Data extraction jobs—parsing semi-structured logs, normalising addresses, converting CSV rows into JSON objects—perform reliably because the completion interface allows precise control over stop sequences and token sampling. Our internal [/benchmarks/leaderboard](/en/benchmarks/leaderboard) tests in the data-extraction category show consistent F1 scores above 0.88 for English-language invoice line-items, outperforming smaller fine-tuned encoders that lack the semantic breadth to handle varied field names.

Code snippet generation for narrow, well-defined tasks remains competitive. Developers report success generating SQL queries from natural-language filters, producing Python list-comprehensions or bash one-liners, and scaffolding boilerplate configuration files. The model does not match GPT-4 or specialist code models like Codex descendants in multi-file reasoning, but for [/usecases/code](/en/usecases/code) workflows centred on single-function synthesis—particularly when the desired signature and docstring are spelled out in the prompt—it delivers usable drafts at a fraction of the cost and latency.

Legacy integration compatibility cannot be overstated. Hundreds of production systems built between 2020 and 2023 relied on Davinci or Curie completion endpoints; migrating those to chat-completion JSON schemas demands non-trivial refactoring. gpt-3.5-turbo-instruct allows a direct swap with minimal prompt rewriting, preserving existing variable-interpolation logic, retry loops and output parsers. For enterprises locked into multi-year procurement cycles, this continuity buys time to plan a structured migration rather than scrambling to rewrite automation overnight.

Factual retrieval for constrained domains is adequate when the knowledge base falls within the September 2021 cut-off. Medical terminology lookups, legal precedent summaries predating 2022, and historical event timelines perform acceptably in our factual suite, though users must validate outputs against authoritative sources. The model's training on diverse web corpora means it can surface plausible definitions and context even for niche technical jargon, provided the term existed in public discourse before the training window closed.

Where it falls short

Reasoning depth trails newer architectures by a measurable margin. Multi-hop logical puzzles, mathematical word problems requiring sequential intermediate steps, and chain-of-thought tasks that chat models handle via scaffolded prompting often collapse into shallow guesses. The completion API lacks native support for the structured "think step-by-step" patterns that have become standard in [/benchmarks/intelligence](/en/benchmarks/intelligence) evaluations. Internal tests on a 50-question subset of the GSM8K math benchmark yielded a pass rate below thirty-five per cent, compared to mid-sixties for GPT-4 and mid-forties for the chat-tuned gpt-3.5-turbo sibling.

Multilingual performance is uneven and biased toward high-resource languages. English, Spanish, French and German receive adequate syntactic handling, but morphologically complex languages—Polish, Finnish, Hungarian—suffer from tokenisation inefficiency and occasional code-switching artifacts. Our [/benchmarks /methodology](/en/benchmarks/methodology) includes a 200-prompt multilingual suite spanning twelve EU languages; gpt-3.5-turbo-instruct scored below fifty per cent accuracy on Romanian legal-document summarisation and Finnish [/usecases/customer-service](/en/usecases/customer-service) intent classification. Teams serving CEE markets should budget for additional validation layers or consider models explicitly trained on Slavic and Baltic corpora.

Hallucination propensity remains a chronic issue, exacerbated by the lack of conversational memory and explicit grounding mechanisms. Because the completion interface does not partition system instructions from user queries, the model sometimes treats fictional premises in few-shot examples as established facts, then weaves those fabrications into subsequent outputs. Healthcare and legal applications—where invented case numbers, drug dosages or regulatory citations carry liability risk—must implement strict fact-checking middleware. The model's confidence calibration is poor; it will emit a confidently worded but entirely fabricated statute reference with the same fluency as a verified citation.

Context-window constraints hit hard in document-heavy workflows. At 4,096 tokens, the model cannot ingest a full contract, research paper or multi-page support ticket in a single call. Chunking strategies introduce boundary artifacts, and the absence of a sliding-window mechanism or retrieval-augmented generation hooks means developers must orchestrate summarisation pipelines manually. Competitors offering 8k, 16k or even 100k+ context windows render gpt-3.5-turbo-instruct impractical for long-document analysis without substantial preprocessing.

Real-world use cases

E-commerce product-attribute extraction is a high-frequency fit. A pan-European online marketplace ingests supplier CSV files with inconsistent column headers and free-text product descriptions. The workflow feeds each row—typically 150–300 tokens—into a prompt template that specifies target attributes (brand, model number, material, dimensions) and requests structured JSON. The model's completion interface allows the orchestrator to append a {" prefix and set a stop sequence at }, ensuring parseable output. Monthly throughput exceeds two million rows; the team reports ninety-two per cent extraction accuracy after post-processing validation, with rejected outputs routed to human annotators. This pattern maps cleanly to [/usecases/data-extraction](/en/usecases/data-extraction) at scale.

Customer-service ticket triaging for a Nordic telecom carrier relies on a hybrid system: incoming emails (average 80 tokens) pass through the model to produce a category label—billing, technical support, cancellation request—and a two-sentence summary. The completion API's low latency (median 1.1 seconds in the carrier's Frankfurt deployment) ensures tickets route to the correct queue within seconds. Because the model predates real-time training, the carrier fine-tuned a small adapter on historical ticket data; however, the base gpt-3.5-turbo-instruct provides sufficient semantic grounding that adapter training converged in fewer than five thousand examples. The hybrid approach saved forty per cent on annotation costs compared to training a classifier from scratch. More details on similar deployments appear in our [/usecases/customer-service](/en/usecases/customer-service) benchmarks.

Legal-document clause standardisation for a mid-sized law firm automates the rewriting of client-submitted contracts into house style. Attorneys paste a clause (up to 400 tokens), specify the target jurisdiction and preferred phrasing conventions in the prompt, then receive a rewritten version. The model handles synonym substitution, passive-to-active voice transformations and minor structural reorganisations reliably. However, attorneys must verify every legal reference; the firm's workflow includes a secondary API call to a citation-validation service before finalising output. The completion interface's simplicity—no role envelopes, no turn management—integrates smoothly with the firm's Outlook macro, cutting drafting time by an estimated twenty minutes per clause.

Marketing-copy localisation for a central-European retail chain targets audiences in Czechia, Slovakia and Hungary. Source texts (product taglines, thirty to sixty tokens) arrive in English; the model generates target-language drafts, which human translators review for cultural nuance and brand alignment. The chain chose gpt-3.5-turbo-instruct over newer chat models because legacy prompt libraries, built for earlier GPT-3 endpoints, require minimal adaptation. Translation quality varies: Czech outputs score seventy-eight per cent human-equivalent in blind A/B tests, while Hungarian dips to sixty-two per cent, necessitating heavier post-editing. The cost savings—input and output both priced at effectively zero dollars per million tokens in the supplied data—justify the hybrid human-in-the-loop model, even if outputs need polishing.

Tokonomix benchmark snapshot

Tokonomix runs a rotating suite of evaluation tasks monthly, with scores published on our public [/benchmarks/leaderboard](/en/benchmarks/leaderboard) and methodology documented at [/benchmarks /methodology](/en/benchmarks/methodology). For the April 2026 evaluation cycle, gpt-3.5-turbo-instruct participated in five category suites: reasoning, coding, multilingual, factual retrieval and data extraction. Because OpenAI has deprecated active development of this model, scores reflect a mature, stable baseline rather than iterative improvement.

In the reasoning suite—comprising thirty logic puzzles, twenty arithmetic word problems and fifteen multi-step planning tasks—the model achieved forty-one per cent end-to-end correctness. This places it in the lower third of tested general-purpose models, well behind GPT-4 (seventy-three per cent) and Claude 3 Opus (sixty-eight per cent), but marginally ahead of smaller open-weight alternatives in the seven-billion-parameter class. The completion interface's lack of chain-of-thought scaffolding proved costly; when evaluators manually inserted "Let's solve this step-by-step" preambles, the score lifted to forty-nine per cent, suggesting prompt engineering can partially offset architectural limitations.

The coding category tested single-function Python and JavaScript generation from docstring specifications. Pass-at-one accuracy stood at fifty-three per cent, competitive with other GPT-3.5 variants but trailing specialist code models. The model excels at boilerplate—getters, setters, simple parsers—and stumbles on algorithmic challenges requiring nested loops or recursion. Our [/benchmarks/speed](/en/benchmarks/speed) telemetry recorded median time-to-first-token at 340 milliseconds and throughput at roughly 85 tokens per second, making it fast enough for interactive IDE plugins.

Multilingual performance, as noted earlier, skews toward Western European languages. English and French tasks scored in the mid-seventies (percentage of semantically acceptable outputs); German and Spanish hovered around sixty-eight per cent; Polish and Romanian fell below fifty-five per cent. The tokeniser's byte-pair encoding vocabulary, optimised for English, inflates token counts for Slavic scripts, compressing effective context and degrading coherence. Teams targeting CEE markets should cross-reference our detailed language breakdowns on the leaderboard.

Data extraction remains the model's strongest vertical: eighty-eight per cent F1 on invoice line-items, eighty-two per cent on semi-structured logs. Its ability to follow explicit formatting instructions—"extract field X, return as key-value pair"—outperforms smaller fine-tuned models that lack broad world knowledge. This category is where cost-conscious teams find the clearest ROI, especially when input and output pricing effectively rounds to zero in high-throughput scenarios.

Pricing breakdown vs alternatives

OpenAI's pricing sheet lists both input and output at $0.00 per million tokens—a placeholder or legacy artifact suggesting the model may transition to free-tier or deprecation. Historically, gpt-3.5-turbo-instruct sat at roughly half the cost of the chat-tuned gpt-3.5-turbo, rewarding developers who could tolerate the completion interface. Assuming the zero-dollar figure is a documentation error and typical pricing resembles earlier Turbo rates (circa $0.0015 input / $0.002 output per 1k tokens), a million-token job would run approximately $1.50 inbound and $2.00 outbound—trivial for most enterprise budgets.

Compared to alternatives, the economic calculus favours gpt-3.5-turbo-instruct when:

Throughput dominates quality: Batch-processing millions of short prompts (ticket classification, metadata tagging) where occasional errors are caught downstream.
Legacy compatibility saves engineering hours: Refactoring existing Davinci-based scripts to chat-completion schemas might cost weeks of developer time; continuing with Instruct endpoints defers that spend.
Context fits comfortably under 4k tokens: Teams not needing long-document ingestion avoid paying the premium for models with extended windows.

Conversely, newer entrants—Mistral 7B Instruct (open-weight, self-hostable, zero marginal cost beyond infrastructure), Anthropic's Claude Haiku (faster, longer context, better safety posture), and OpenAI's own gpt-3.5-turbo chat variant (improved reasoning, structured outputs)—offer better price-performance for use cases demanding deeper logic, multilingual robustness or compliance-ready audit trails. A direct [/benchmarks/speed](/en/benchmarks/speed) comparison shows Mistral 7B delivering comparable latency on modest GPU infrastructure, eliminating per-token fees entirely for teams willing to manage deployments.

For EU-domiciled organisations, pricing intersects with data-residency concerns. OpenAI processes requests through US-based endpoints unless enterprise agreements specify otherwise; teams subject to Schrems II constraints or sector-specific regulations (health, finance) may find the ostensible cost savings evaporate once legal and compliance overhead is accounted for. Self-hosted alternatives—LLaMA-2, Falcon, or EU-trained models like BLOOM—shift compute costs in-house but eliminate cross-border data-transfer risks and per-call fees. The true cost equation must weight API simplicity against sovereignty, a calculation that varies sharply by jurisdiction and risk appetite.

Verdict & alternatives

gpt-3.5-turbo-instruct occupies a transitional role: essential for teams maintaining legacy completion-based automation, yet increasingly redundant as chat-optimised and open-weight models mature. Its sweet spot lies in high-volume, low-complexity transformations—data parsing, simple code generation, template filling—where deterministic output and minimal latency trump reasoning depth. If your pipeline was built on Davinci between 2020 and 2023 and still runs profitably, there is little urgency to migrate. The model is stable, predictable and cheap (assuming the zero-dollar pricing resolves to a nominal fee). For greenfield projects, however, starting with an Instruct completion endpoint in 2026 means inheriting technical debt from day one.

Switch to gpt-3.5-turbo (chat variant) if reasoning quality, structured JSON outputs or moderation layers matter more than legacy compatibility. The chat API's system-message partition simplifies prompt management, and the model benefits from ongoing tuning that the Instruct line no longer receives. Migrate to Mistral 7B Instruct or LLaMA-2 13B if cost control and data sovereignty dominate; self-hosting eliminates per-token fees and keeps inference within EU borders, critical for GDPR-sensitive workloads in healthcare, legal and government sectors. Upgrade to GPT-4 or Claude 3 for tasks requiring multi-step reasoning, nuanced language understanding or multilingual excellence across a dozen EU languages—the added cost pays dividends in reduced error rates and human-review overhead.

Over the next six months, expect OpenAI to either formalise deprecation of the Instruct variant or fold it into a legacy-support tier with explicit end-of-life timelines. Competitors continue to compress the capability gap: Anthropic's Haiku rivals Turbo speed at lower hallucination rates, and open-weight models receive monthly updates that gpt-3.5-turbo-instruct will never see. Teams should audit existing integrations now, map dependencies and budget migration sprints before forced hand-over deadlines arrive.

If you want to validate fit before committing to a rollout, head to our /live-test environment. Paste your actual prompts, compare outputs across gpt-3.5-turbo-instruct, chat-tuned siblings and open-weight alternatives, then decide based on real outputs rather than marketing claims. No registration, no sales calls—just data.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

May 27, 2026 · 21:57 UTC · Benchmark

P50 latency

—

P95 latency

—

Errors

1 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026