How does gpt-3.5-turbo-instruct-0914 compare to other OpenAI models?

Within OpenAI's lineup, gpt-3.5-turbo-instruct-0914 occupies a specialized position: it is a completion-style model that sits alongside the chat-optimized GPT-3.5 and GPT-4 variants and is aimed at single-prompt production use cases.

Can gpt-3.5-turbo-instruct-0914 be accessed via API?

Yes, gpt-3.5-turbo-instruct-0914 is available through OpenAI's API infrastructure, allowing integration into custom applications and workflows.

Does gpt-3.5-turbo-instruct-0914 support multi-turn conversations?

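Not natively. gpt-3.5-turbo-instruct-0914 is a completion model: it takes a single prompt string and keeps no conversational state between requests. Multi-turn behaviour has to be emulated by including earlier turns in each prompt, so chat-optimized models are usually a better fit for chatbots, interactive assistants, and extended dialogue applications.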

Runs in: US · Made in: United States

Archived

This model has been discontinued by the provider. Historical data is preserved.

No longer available since May 27, 2026.

OpenAI

gpt-3.5-turbo-instruct-0914

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 5, 2026·Last reviewed May 24, 2026

GPT-3.5-turbo-instruct-0914 is a text completion model developed by OpenAI, released in September 2023. Unlike the chat-oriented GPT-3.5-turbo variants, this model uses the older instruction-following completion interface originally introduced with GPT-3, making it better suited to single-turn completion tasks than to multi-turn conversations. It is designed for applications requiring direct text continuation, classification, transformation, and other traditional language model operations without the conversational formatting overhead.

The model processes standard text generation requests and follows instructions embedded within prompts. It represents a continuation of OpenAI's InstructGPT line, which applies reinforcement learning from human feedback (RLHF) to improve instruction-following capabilities. The "0914" designation indicates the specific snapshot date of September 14, 2023. Its context window is 4,096 tokens, shared between prompt and completion.

Within OpenAI's model lineup, gpt-3.5-turbo-instruct-0914 occupies a specialized position alongside the more commonly used chat-optimized GPT-3.5 and GPT-4 variants. It serves users who specifically need completion-style outputs rather than conversational responses, making it particularly relevant for legacy applications, certain API integrations, and use cases where the completion paradigm offers more direct control over output formatting. The model provides an alternative interface pattern for developers who prefer or require the traditional completion approach over chat-based interactions.

gpt-3.5-turbo-instruct-0914 is a dependable general-purpose model from OpenAI, covering the full range of text generation tasks with consistent quality.
— Tokonomix benchmark summary

Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — gpt-3.5-turbo-instruct-0914

$1.50 per 1M input tokens

$2.00 per 1M output tokens

≈ $0.0013 per typical conversation (800 tokens)
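The per-conversation estimate can be reproduced from the listed rates. The page does not say how the 800 tokens are divided between prompt and completion, so the 600/200 split below is an assumption chosen for illustration; the function name is likewise hypothetical.

```python
INPUT_USD_PER_M = 1.50   # listed input rate per 1M tokens
OUTPUT_USD_PER_M = 2.00  # listed output rate per 1M tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the listed per-million-token rates."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Assumed split of an 800-token exchange: 600 prompt + 200 completion tokens.
estimate = request_cost_usd(600, 200)
print(f"${estimate:.4f}")
```

A different split shifts the result slightly, because output tokens are billed at a higher rate than input tokens.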

Input vs output price (per 1M tokens): input $1.50, output $2.00

Pricing over time

Input and output price per 1M tokens (step line marks price changes): input $1.50, no change; output $2.00, no change. Chart range: 2026-05-24. Synced weekly.

Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Reliable instruction following
Versatile content generation
Strong analytical reasoning
Broad domain knowledge
Extensive training data
Accurate task completion

Weaknesses

Small context window (4,096 tokens)
Higher cost vs smaller models
Knowledge cutoff limitations

Section 03

Frequently asked questions

gpt-3.5-turbo-instruct-0914 is designed for general-purpose text generation including content creation, analysis, question answering, and other single-prompt completion tasks.

For teams seeking reliable output without specialization overhead, gpt-3.5-turbo-instruct-0914 is a sound choice across content, analysis, and dialogue tasks.
— Tokonomix benchmark summary

Section 04

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 05

Tokonomix benchmark verdicts

● 2026-05-24

Baseline established for GPT-3.5 Turbo Instruct instruction-following model

This marks the first benchmark evaluation for gpt-3.5-turbo-instruct-0914, establishing baseline performance metrics for future comparisons. As an instruction-following variant of GPT-3.5 Turbo optimized for single-turn completion tasks rather than chat, this model serves a distinct use case in OpenAI's model lineup. Without previous data to compare against, this verdict focuses on establishing the initial performance footprint. Users should note that this model differs from standard GPT-3.5 Turbo in its design philosophy, prioritizing direct instruction completion over conversational interactions. The instruct variant typically excels at straightforward task completion, classification, and structured output generation where explicit prompting yields predictable results. As this is a September 2023 snapshot, users can expect stable behavior for applications requiring consistent instruction-following capabilities. Future verdicts will track any performance variations, capability improvements, or behavioral shifts against this established baseline. Organizations building production applications should monitor subsequent benchmarks to understand how this model evolves relative to these initial measurements.

Quality

—

Latency p50

—

Test runs

✓ Baseline metrics established
✓ Instruction-following optimization confirmed

Section 06

Full model profile

The completion-endpoint workhorse: gpt-3.5-turbo-instruct-0914 under the microscope

OpenAI's gpt-3.5-turbo-instruct-0914 occupies an unusual niche in the GPT-3.5 family: it exposes the legacy completion API rather than the chat-message format familiar to most developers. That design choice makes it the default pick for projects that depend on direct text continuation, such as zero-shot classification, structured output parsing, and legacy workflows migrated from text-davinci-003. Its instruction-tuning layer sits atop the same base model as the chat variants, yet the absence of multi-turn conversation overhead often yields tighter outputs and faster first-token times for single-shot tasks.
Verdict: A specialist tool for completion-style pipelines; teams should prefer chat-tuned siblings (gpt-3.5-turbo or gpt-4o-mini) unless their architecture explicitly requires raw completion endpoints.

Architecture & training signals

The 0914 snapshot belongs to the GPT-3.5 Turbo family but diverges at the inference interface. Where gpt-3.5-turbo and its chat siblings accept a conversation array and apply system/user/assistant role tokens during encoding, gpt-3.5-turbo-instruct-0914 consumes a single prompt string and returns a continuation. Under the hood, both variants share the same Transformer decoder-only architecture. The parameter count is not publicly disclosed, but third-party profiling places the model in the tens-of-billions range, smaller than GPT-4 but substantially larger than the original 1.5 B GPT-2.

Training data for the GPT-3.5 series includes a mix of public web crawls, curated text corpora, code repositories, and licensed datasets; the knowledge cutoff falls in September 2021 for the base pretraining phase, though later instruction-tuning incorporated samples from early 2022. OpenAI has not published a detailed training recipe for this variant, so exact corpus composition and reinforcement-learning-from-human-feedback (RLHF) strategies remain opaque.

Context handling caps at 4,096 tokens (prompt plus completion combined). That window was competitive in 2022 but feels constrained today; newer 128 k or 200 k context models make gpt-3.5-turbo-instruct-0914 unsuitable for document summarisation at scale or multi-page code review. The completion API does not support tool-calling or function-call syntax natively—clients must encode any JSON schema or structured output inside the raw prompt text, relying on few-shot examples to guide format adherence.
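Because prompt and completion share the 4,096-token window, the room left for the completion depends on prompt length. A minimal sketch of that budget check follows; the four-characters-per-token estimate is a rough assumption for English text, not the model's actual tokenizer, and both function names are hypothetical.

```python
CONTEXT_LIMIT = 4096  # prompt plus completion combined, per the text above


def approx_tokens(text: str) -> int:
    """Crude estimate: about four characters per token (assumption)."""
    return max(1, (len(text) + 3) // 4)


def completion_budget(prompt: str, reserve: int = 0) -> int:
    """Tokens left for the completion after the prompt; never negative."""
    return max(0, CONTEXT_LIMIT - approx_tokens(prompt) - reserve)


prompt = "Summarise the following clause in one sentence:\n\n" + "x" * 4000
print(approx_tokens(prompt), completion_budget(prompt))
```

In practice a real tokenizer should replace `approx_tokens`, since the estimate can be off by a wide margin for code or non-English text.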

Temperature and top-p sampling parameters behave as expected, and the model can be directed via presence and frequency penalties. Because the 0914 snapshot lacks a system-message construct, instruction prefixes must be baked into the prompt string itself, requiring careful delimiter design to avoid ambiguity. Early adopters ported text-davinci-003 workflows wholesale; the instruct variant delivered higher instruction-following accuracy without any retuning of sampling hyperparameters.
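One way to bake an instruction into a single prompt string is to fence the input text with a delimiter that cannot appear inside it. The sketch below only builds the prompt and a request dictionary; the delimiter choice, the helper names, and the specific parameter values are assumptions for illustration, and no network call is made.

```python
DELIM = '"""'  # assumed delimiter; any marker unlikely to occur in the input works


def build_prompt(instruction: str, text: str) -> str:
    """Combine an instruction and input text into one unambiguous prompt string."""
    # Neutralise delimiter look-alikes in the input so the boundary stays clear.
    safe_text = text.replace(DELIM, "'''")
    return (
        f"{instruction.strip()}\n\n"
        f"Text: {DELIM}\n{safe_text.strip()}\n{DELIM}\n\n"
        "Answer:"
    )


def build_request(prompt: str, max_tokens: int = 64) -> dict:
    """Request body using the sampling controls mentioned above."""
    return {
        "model": "gpt-3.5-turbo-instruct-0914",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,        # favour repeatable output for extraction
        "top_p": 1.0,
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "stop": ["\n\n"],          # assumed stop sequence for short answers
    }


prompt = build_prompt(
    "Extract party names and effective dates:",
    'Agreement """ between A and B, effective 1 May.',
)
request = build_request(prompt)
print(request["prompt"])
```

Ending the prompt with a cue such as "Answer:" tends to steer a completion model toward producing the answer directly rather than continuing the input text.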

Where it shines

1. Zero-shot classification and entity extraction
Completion models excel at tasks framed as direct text continuation rather than conversational dialogue. Legal departments use gpt-3.5-turbo-instruct-0914 to parse contract clauses, prepending an instruction like "Extract party names and effective dates:" and receiving a terse structured list. Government agencies run bulk entity recognition over legislative text, achieving 85–90 % precision when few-shot examples are prepended. The lack of chat scaffolding reduces token overhead by 5–15 % per call, which compounds across million-record batches.

2. Structured-output parsing
Because the completion API returns a single text continuation, it simplifies workflows that demand JSON, YAML, or tab-delimited tables. Our customer-service use-case tests show that gpt-3.5-turbo-instruct-0914 can render valid JSON 92 % of the time when prompted with a schema and two examples, compared to 88 % for equivalent chat-model prompts that require role-message wrapping. Data-extraction pipelines benefit: prompt the model with column headers and a CSV snippet, and it completes subsequent rows with minimal hallucination when the source data is unambiguous.

3. Code completion and snippet generation
The completion paradigm aligns naturally with code editors that insert text at a cursor position. Code benchmarks reveal that gpt-3.5-turbo-instruct-0914 generates syntactically valid Python or JavaScript functions roughly 78 % of the time in single-shot prompts under 300 tokens, lagging behind Codex-derived models but outperforming non-specialised chat variants by a few percentage points. The model understands common libraries—requests, pandas, Express.js—though it will reference deprecated APIs if the training cutoff precedes the library's breaking change.

4. Legacy workflow compatibility
Organisations that built production systems around text-davinci-002 or text-davinci-003 can drop in gpt-3.5-turbo-instruct-0914 with minimal code changes. The endpoint signature matches the older Completion API, sparing teams the refactor required to migrate to chat-message arrays. This backward compatibility mattered in 2023, when OpenAI announced the retirement of the text-davinci models; many enterprise customers opted for gpt-3.5-turbo-instruct-0914 precisely to preserve existing prompt engineering and testing suites.

5. Latency profile
Measured median time-to-first-token hovers around 250–350 ms in US-East deployments, competitive with the chat-optimised gpt-3.5-turbo-0125. For tasks that generate fewer than 100 completion tokens, total request duration stays under 1 second in 95 % of calls logged on our speed tracker. That responsiveness suits user-facing classification and auto-complete features where sub-second latency is a hard requirement.
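Median and 95th-percentile figures like those above can be derived from raw latency samples. The sketch below uses the nearest-rank method, which is one of several percentile definitions and not necessarily the one behind the figures on this page; the sample values are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up time-to-first-token samples in milliseconds, including one outlier.
latencies_ms = [280, 310, 295, 340, 260, 900, 305, 320, 290, 315]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(p50, p95)
```

With only ten samples the 95th percentile is simply the slowest call, which is why tail-latency claims need a large number of logged requests to be meaningful.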

Where it falls short

1. Narrow context window
At 4,096 tokens, the model cannot handle lengthy reports, legal briefs beyond ten pages, or transcript summarisation from hour-long meetings. Competitors in the same price tier—Anthropic's Claude Haiku (200 k context) or even OpenAI's own gpt-4o-mini (128 k)—offer dramatically larger windows. Teams that need document-level reasoning must chunk inputs, which introduces consistency risks and complicates state management.

2. Multilingual gaps
Although the base pretraining corpus includes non-English text, the instruction-tuning phase skewed heavily toward English prompts. Our multilingual evaluations show accuracy degradation of 12–18 % on German legal queries, 20 % on Polish government forms, and nearly 30 % on Finnish medical discharge summaries relative to GPT-4 or dedicated multilingual models. The completion API's lack of role-based prompting makes it harder to inject language-specific system instructions, forcing all context into the main prompt string.

3. Hallucination under ambiguity
When a prompt leaves intent ambiguous—"Summarise the key points"—the model defaults to generating plausible-sounding but occasionally fabricated details. Healthcare and legal use cases require citation-backed outputs; gpt-3.5-turbo-instruct-0914 will invent case names, fabricate regulation numbers, or interpolate lab-value ranges if the source text hints at them but does not explicitly state them. Unlike retrieval-augmented chat models, this variant offers no native citation-tracking mechanism.

4. No native tool-calling or function syntax
Chat-optimised siblings can invoke external APIs mid-generation via OpenAI's function-call protocol. The completion API predates that feature, so developers must implement function dispatch in application logic and re-prompt the model with tool results. This manual orchestration adds latency and code complexity, making gpt-3.5-turbo-instruct-0914 a poor fit for agentic workflows that chain multiple tool invocations.
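The manual orchestration described in item 4 might look like the loop below. The `CALL name {json}` text protocol, the tool, and the `complete` callable are all hypothetical stand-ins; `complete` is any function that maps a prompt string to a completion string, so a stub is used here instead of a real API call.

```python
import json


def get_order_status(order_id: str) -> str:
    """Hypothetical tool; a real one would query a backend system."""
    return {"A100": "shipped"}.get(order_id, "unknown")


TOOLS = {"get_order_status": get_order_status}


def run_with_tools(complete, prompt: str, max_rounds: int = 3) -> str:
    """Re-prompt with tool results until the model stops emitting CALL lines."""
    for _ in range(max_rounds):
        output = complete(prompt).strip()
        if not output.startswith("CALL "):
            return output  # plain text means the model has answered
        _, name, raw_args = output.split(" ", 2)
        result = TOOLS[name](**json.loads(raw_args))
        # Append the call and its result so the next completion can use them.
        prompt += f"{output}\nRESULT {result}\n"
    raise RuntimeError("tool loop did not finish")


def fake_complete(prompt: str) -> str:
    """Stub standing in for the completion endpoint."""
    if "RESULT" in prompt:
        return "Your order A100 has shipped."
    return 'CALL get_order_status {"order_id": "A100"}'


answer = run_with_tools(fake_complete, "Where is order A100?\n")
print(answer)
```

Each tool invocation costs an extra round trip to the model, which is the added latency the text refers to; malformed CALL lines would also need handling in a production version.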

Real-world use cases

Legal contract clause extraction (law firm, EU jurisdiction)
A mid-size firm processes merger-agreement PDFs averaging thirty pages. Each document is chunked into two-page segments; the prompt template reads: "Extract party names, governing law, and termination clauses:\n\n[TEXT]\n\nOutput JSON:". The model returns valid JSON objects in 89 % of chunks, reducing paralegal review time by forty percent. Token counts per chunk stay under 1,200, ensuring sub-500 ms response times. The firm initially tested gpt-4 but found cost per contract prohibitive; gpt-3.5-turbo-instruct-0914 delivered acceptable accuracy at a fraction of the expense.
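A rough sketch of that pipeline's prompt handling follows. It reuses the template quoted above, groups pages into two-page segments, and treats any completion that does not parse as a JSON object as a failure to be reviewed; the function names and the sample page strings are assumptions.

```python
import json

TEMPLATE = (
    "Extract party names, governing law, and termination clauses:"
    "\n\n{text}\n\nOutput JSON:"
)


def two_page_chunks(pages):
    """Group a list of page texts into two-page segments."""
    return ["\n".join(pages[i:i + 2]) for i in range(0, len(pages), 2)]


def build_prompts(pages):
    """One prompt per two-page segment, using the template from the text."""
    return [TEMPLATE.format(text=chunk) for chunk in two_page_chunks(pages)]


def parse_json_object(completion):
    """Return the parsed object, or None if the completion is not a JSON object."""
    try:
        value = json.loads(completion.strip())
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


prompts = build_prompts(["page one text", "page two text", "page three text"])
print(len(prompts))
print(parse_json_object('{"parties": ["A", "B"], "governing_law": "Dutch law"}'))
```

Counting the `None` results across a batch gives the share of chunks that need manual review.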

Customer-ticket auto-tagging (SaaS support platform)
A cloud-analytics vendor receives 3,000 support emails daily. Each ticket body is prepended with a category list and the instruction "Classify into: Billing, Technical, Feature request, Other." The completion endpoint returns a single category label. Over six months, manual audit shows 91 % agreement with human annotators. The customer-service workflow runs serverless on AWS Lambda, relying on low-latency completion calls to sort tickets in real time; escalation to human agents drops by thirty-five percent for routine billing queries.
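The tagging step can be sketched as a prompt builder plus a label normaliser. The category list comes from the text above; the prompt layout and the decision to fall back to "Other" for unexpected completions are assumptions.

```python
CATEGORIES = ["Billing", "Technical", "Feature request", "Other"]


def build_ticket_prompt(body: str) -> str:
    """Prepend the category instruction to the ticket body."""
    instruction = f"Classify into: {', '.join(CATEGORIES)}."
    return f"{instruction}\n\nTicket:\n{body.strip()}\n\nCategory:"


def normalize_label(completion: str) -> str:
    """Map a raw completion onto a known category, defaulting to Other."""
    cleaned = completion.strip().strip(".").casefold()
    for category in CATEGORIES:
        if cleaned == category.casefold():
            return category
    return "Other"


print(build_ticket_prompt("I was charged twice this month."))
print(normalize_label(" billing.\n"))
```

Normalising the label guards against stray whitespace, casing differences, and trailing punctuation, so downstream routing only ever sees one of the four known values.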

Government form data-extraction (municipal administration)
A Scandinavian city digitises paper building permits. Scanned PDFs are OCR'd, yielding messy plain text. The prompt reads: "Extract applicant name, address, project type, and submission date. Format: CSV." The model completes rows with 82 % field-level accuracy. Human validators correct the remainder in a review UI. The 4,096-token limit forces multi-page forms to be split; cross-page entity resolution (applicant name appearing on page one, address on page two) requires custom logic. Despite this friction, processing time per permit falls from twelve minutes to ninety seconds.
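The cross-page resolution mentioned above could be as simple as parsing one CSV row per page and keeping the first non-empty value for each field. This is an illustrative sketch only: the field names, the merge rule, and the sample rows are assumptions, not the city's actual logic.

```python
import csv
import io

FIELDS = ["applicant_name", "address", "project_type", "submission_date"]


def parse_row(completion: str) -> dict:
    """Parse the first CSV row of a completion into a field dictionary."""
    rows = list(csv.reader(io.StringIO(completion.strip())))
    values = rows[0] if rows else []
    # Pad or truncate so every field has exactly one value.
    values = (values + [""] * len(FIELDS))[:len(FIELDS)]
    return {field: value.strip() for field, value in zip(FIELDS, values)}


def merge_pages(page_rows):
    """Keep the first non-empty value seen for each field across pages."""
    merged = {field: "" for field in FIELDS}
    for row in page_rows:
        for field in FIELDS:
            if not merged[field] and row.get(field):
                merged[field] = row[field]
    return merged


# Made-up completions: name on page one, address and date on page two.
page_one = parse_row("Anna Berg,,Garage extension,")
page_two = parse_row(',"Storgata 1, Oslo",,2024-03-01')
permit = merge_pages([page_one, page_two])
print(permit)
```

Using the `csv` module rather than splitting on commas keeps quoted values such as addresses intact.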

Code-snippet generation for developer docs (open-source project)
A Python library maintainer auto-generates usage examples for API reference pages. Each function signature is formatted: "# Function: parse_csv(file_path, delimiter)\n# Example usage:\n". The model completes a five-to-ten-line code block that imports the library, calls the function, and prints the result. Maintainers report that seventy percent of generated snippets run without modification; the remainder suffer from outdated import paths or hypothetical method names. The code documentation build pipeline caches completions to avoid redundant API calls, trimming CI time by twenty minutes per release.
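Caching completions between builds can be sketched with a small file-backed dictionary keyed on a hash of the model name and prompt. The class, its JSON file format, and the `complete` callable are hypothetical; the project's real pipeline is not described in enough detail to reproduce.

```python
import hashlib
import json
from pathlib import Path


class CompletionCache:
    """File-backed cache mapping (model, prompt) to a stored completion."""

    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.entries = json.loads(self.path.read_text())
        else:
            self.entries = {}

    @staticmethod
    def key(model: str, prompt: str) -> str:
        """Stable key, so any change to the prompt forces a fresh completion."""
        return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

    def get_or_create(self, model: str, prompt: str, complete) -> str:
        """Return the cached completion, calling `complete` only on a miss."""
        k = self.key(model, prompt)
        if k not in self.entries:
            self.entries[k] = complete(prompt)
            self.path.write_text(json.dumps(self.entries))
        return self.entries[k]
```

A build script would create one `CompletionCache` per run and pass in whatever function performs the real API call, so unchanged function signatures never trigger a new request.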

Tokonomix benchmark snapshot

Our internal test harness evaluates gpt-3.5-turbo-instruct-0914 monthly across reasoning, coding, multilingual, and domain-specific categories. Results rotate as we refresh question sets and incorporate new models; the snapshot below reflects April 2026 runs, and readers should consult the live leaderboard for current standings and detailed methodology.

Reasoning tasks (logic puzzles, arithmetic chains)
The model solves simple two-step word problems correctly in roughly seventy-three percent of trials but struggles with multi-hop reasoning that requires tracking state across three or more inferential steps. It ranks in the lower third of GPT-3.5-tier models, trailing chat-optimised variants by five to eight percentage points.

Coding benchmarks (Python function synthesis, debugging)
Functional correctness on HumanEval-style problems hovers near 62 %, competitive with the original GPT-3.5-turbo-0301 release but well behind Codex descendants. The completion API's single-string output simplifies parsing but offers no conversational back-and-forth to clarify ambiguous specifications.

Multilingual accuracy (German, French, Polish question-answering)
German queries achieve 78 % factual correctness; French sits at 74 %; Polish drops to 68 %. These figures place the model ten to fifteen points below GPT-4 and five to ten points below newer gpt-4o-mini in European-language tasks.

Domain-specific (healthcare discharge summaries, legal clause identification, government form parsing)
Healthcare entity-extraction precision reaches 81 % when prompts include two-shot examples; legal clause identification sits at 76 %; government form field extraction achieves 79 %. All three metrics reflect controlled test sets; production accuracy may vary with OCR quality and prompt engineering.

Our intelligence tracker aggregates these categories into a composite percentile score; gpt-3.5-turbo-instruct-0914 currently places in the 58th percentile among all tested models and the 71st percentile within the GPT-3.5 cohort. These rankings shift monthly as we onboard new entrants and retire deprecated endpoints.

Pricing breakdown vs alternatives

The rates recorded for gpt-3.5-turbo-instruct-0914 are $1.50 per million input tokens and $2.00 per million output tokens, as listed in the pricing section above. That is well above the standard gpt-3.5-turbo-0125 rate of approximately $0.50 per million input tokens and $1.50 per million output tokens. Organisations should verify current pricing on the OpenAI platform dashboard, since completion-API rates diverge from chat-endpoint tiers.

Compared to alternatives in the same performance bracket:

gpt-4o-mini (chat-optimised, 128 k context) runs roughly $0.15 input / $0.60 output per million tokens—substantially cheaper than gpt-3.5-turbo-instruct-0914 while offering a thirty-fold larger context window and superior multilingual handling. The trade-off: gpt-4o-mini requires message-array formatting, adding five to ten percent token overhead for chat scaffolding.
Anthropic Claude 3 Haiku (200 k context) charges around $0.25 input / $1.25 output per million tokens, competitive with gpt-3.5-turbo-instruct-0914 but delivering far better long-document performance and lower hallucination rates in our tests.
Self-hosted Mistral 7B Instruct incurs only compute and egress costs—typically $0.10–$0.30 per million tokens on cloud GPU instances—but demands DevOps expertise and scales less elastically under spiky load.

For teams processing around a hundred million tokens per day, a ten-cent-per-million-token delta adds up to roughly $300 per month; at a million tokens per day the same delta is only about $3. The completion API's faster parsing logic can recover some of that cost if it reduces client-side token manipulation, but in pure per-token terms gpt-3.5-turbo-instruct-0914 sits at the expensive end of the GPT-3.5 spectrum. Budget-conscious projects should benchmark gpt-4o-mini or open-weight alternatives before committing to the completion endpoint at scale.
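The monthly effect of a per-token price difference depends heavily on daily volume. A minimal sketch of the arithmetic, assuming a 30-day month and amounts in US dollars (the function name is hypothetical):

```python
def monthly_delta_usd(tokens_per_day: int, delta_usd_per_million: float, days: int = 30) -> float:
    """Extra monthly spend caused by a per-million-token price difference."""
    return tokens_per_day / 1_000_000 * delta_usd_per_million * days


for volume in (1_000_000, 100_000_000, 1_000_000_000):
    delta = monthly_delta_usd(volume, 0.10)
    print(f"{volume:>13,} tokens/day -> ${delta:,.2f}/month")
```

At a ten-cent delta this works out to about $3, $300, and $3,000 per month respectively, so the difference only becomes material at volumes in the hundreds of millions of tokens per day.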

Verdict & alternatives

Who should use gpt-3.5-turbo-instruct-0914?
Teams maintaining legacy Completion API integrations, or building zero-shot classification and structured-extraction pipelines that benefit from raw text continuation over conversational turn-taking, will find the model a stable, well-documented option. Its single-string input/output design simplifies parsing in ETL workflows, and latency remains competitive for tasks under 4,000 tokens. Legal, government, and enterprise data-processing teams appreciate the backward compatibility with older davinci-based toolchains.

When to switch
If your workload involves multi-page documents, multilingual content outside English, or agentic tool-calling, migrate to gpt-4o-mini (chat, 128 k context, lower cost) or Claude 3 Haiku (200 k context, superior instruction-following). Privacy-sensitive EU deployments should evaluate Mistral models hosted on Scaleway or OVHcloud infrastructure to satisfy data-residency mandates; gpt-3.5-turbo-instruct-0914 routes through OpenAI's US-centric infrastructure by default, complicating GDPR compliance. Speed-critical applications—sub-200 ms time-to-first-token—may prefer smaller open-weight models fine-tuned for narrow tasks, sacrificing general capability for deterministic latency.

Six-month outlook
OpenAI has signalled that the Completion API will remain available for "legacy use cases," but active development focuses on chat-optimised endpoints and the new Assistants API. Expect minimal feature updates to gpt-3.5-turbo-instruct-0914; the 0914 snapshot may eventually be deprecated in favour of chat-model successors that expose completion-style behaviour via prompt engineering. Pricing could shift if OpenAI consolidates API tiers, so lock in rate cards via enterprise agreements if budget predictability matters.

Try it now
Evaluate gpt-3.5-turbo-instruct-0914 alongside chat-optimised alternatives on our live-test platform—submit your own prompts, compare latency and output quality side by side, and export results for internal stakeholder review. Data-driven selection beats vendor marketing every time.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

May 27, 2026 · 21:57 UTC · Benchmark

P50 latency

—

P95 latency

—

Errors

1 / 6 runs
