Skip to content
Runs in:USMade in:United States
OpenAI

gpt-3.5-turbo-1106

Tokonomix Editorial Team·Reviewed by Mes Kalkan··

GPT-3.5 Turbo 1106 is a large language model developed by OpenAI, released in November 2023 as part of the GPT-3.5 family. This model represents an iterative improvement over earlier GPT-3.5 versions, incorporating enhanced instruction-following capabilities and improved performance on various natural language processing tasks. It utilizes a transformer-based architecture trained on diverse internet text data, though OpenAI has not publicly disclosed the exact parameter count or detailed training specifications. The model is designed for general-purpose text generation applications, including conversational AI, content creation, summarization, translation, and question-answering tasks. It processes text input and generates human-like responses based on the patterns learned during training. GPT-3.5 Turbo 1106 supports standard text-based interactions and can handle complex instructions while maintaining context throughout multi-turn conversations. The model demonstrates competency across multiple domains and languages, though performance may vary depending on the specific task and language. Within OpenAI's model lineup, GPT-3.5 Turbo 1106 sits below the more advanced GPT-4 series in terms of capabilities and reasoning performance. It serves as a capable option for applications where the additional sophistication of GPT-4 models is not required. The model is accessible through OpenAI's API and has been integrated into various applications and services. This version superseded earlier GPT-3.5 Turbo iterations, offering improved reliability and function-calling features for developers building AI-powered applications.

GPT-3.5 Turbo 1106 marked a meaningful refinement in OpenAI's mid-tier offerings, delivering stronger instruction adherence and more reliable JSON outputs than its predecessors while maintaining the speed and accessibility that made the 3.5 family popular.

Tokonomix model analysis
Section 01

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

100
Coding
98
Multilingual
100
Reasoning
Section 02

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — gpt-3.5-turbo-1106
$1.00 per 1M input tokens
$2.00 per 1M output tokens
≈ $0.0010 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens$1.00
per 1M output tokens$2.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$1.00

input / 1M

— stable

$2.00

output / 1M

— stable

2026-05-242026-06-072026-06-14
Input
Output
Price change
⟳ synced weekly
Section 03

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Fast response generationImproved instruction following accuracyBetter structured output formattingStrong conversational coherenceMultilingual task competencyWide API and integration supportEffective summarization and extractionHandles multi-turn context well

Weaknesses

Limited complex reasoning capabilityTraining data knowledge cutoffOccasional factual inaccuraciesSuperseded by newer 3.5 versions
Section 04

Capabilities

toolssource: litellmparallel toolsprompt cachingmax output tokens: 4096
Section 05

Frequently asked questions

This November 2023 release introduced better instruction following and JSON mode, but subsequent versions like 0125 and 0613 brought further refinements. Most production applications have migrated to these newer snapshots for improved reliability and updated behavior.

For teams seeking dependable performance without the overhead of frontier models, GPT-3.5 Turbo 1106 remains a pragmatic choice—though newer iterations have since addressed many of its original shortcomings.

Tokonomix editorial assessment
Section 06

Availability

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 07

Tokonomix benchmark verdicts

⚖️
Endorsed by 1 judge
Independent LLM judges evaluated this model on our weekly intelligence tests
claude-sonnet-4-581/100 · 76 runs
46 correct18 partial12 wrong61% accuracy
2026-06-14

New tool capabilities added; performance data still unavailable

The gpt-3.5-turbo-1106 model has received significant functional updates in this benchmark window, adding support for tools, parallel tool execution, and prompt caching capabilities. These additions expand the model's utility for developers building applications that require function calling and iterative workflows. However, the absence of performance data in both the current and previous benchmark windows makes it impossible to assess the model's actual execution quality, latency, or reliability characteristics. Without metrics on accuracy, response times, or comparative performance against other models in its class, users lack critical information needed for informed deployment decisions. The new capabilities represent important feature parity improvements, particularly the parallel tools functionality which can reduce latency in complex multi-step operations. Prompt caching may offer efficiency gains for applications with repetitive context. Despite these functional enhancements, the continued lack of benchmark results means the model's practical performance remains unverified through independent testing. Organizations considering this model should conduct their own evaluation testing to validate it meets their specific requirements for accuracy, speed, and cost effectiveness.

Quality

Latency p50

Test runs

0

Tool support added Parallel tools capability enabled Prompt caching now available No performance data available
Section 08

Full model profile

gpt-3.5-turbo-1106 — illustration 1
Why teams still shortlist GPT-3.5 Turbo 1106 in 2026

Released in November 2023, GPT-3.5 Turbo 1106 represents OpenAI's last significant refinement of the 3.5 family before the industry moved wholesale to GPT-4 and GPT-4o variants. Its context window, parameter count, and pricing remain undisclosed by OpenAI, reflecting the company's long-standing policy of obscuring architectural details. Despite being superseded by more capable models, the 1106 snapshot persists in production environments where cost predictability, proven stability, and well-understood failure modes outweigh the allure of frontier intelligence. Verdict: A legacy workhorse for price-sensitive, high-volume workloads that do not demand cutting-edge reasoning or deep domain expertise.

Architecture & training signals

GPT-3.5 Turbo 1106 descends from the same Generative Pre-trained Transformer lineage as GPT-3, fine-tuned with reinforcement learning from human feedback (RLHF) and optimised for conversational turn-taking. OpenAI has never published parameter counts for the 3.5 series, but independent inference-time profiling and memory footprints suggest a scale in the tens of billions—substantially smaller than GPT-4's rumoured mixture-of-experts architecture. The knowledge cutoff is September 2021, a constraint that became increasingly problematic as enterprises migrated to models trained on 2022–2023 corpora.

The 1106 release introduced two structural improvements over earlier 3.5 snapshots: enhanced instruction-following through extended supervised fine-tuning on diverse task formats, and better JSON-mode compliance for structured output generation. Context handling remains at a ceiling not publicly disclosed but empirically observed by practitioners to be 16 384 tokens for the "turbo" configuration, sufficient for medium-length documents but constraining for legal-contract review or multi-turn customer-service dialogues that accumulate history.

Unlike GPT-4, which employs a sparse mixture-of-experts approach to route tokens through specialist sub-networks, GPT-3.5 Turbo operates as a dense transformer. This architecture yields faster per-token generation but sacrifices the emergent multi-domain fluency that characterises later models. The training mixture prioritised English web text, public code repositories, and filtered conversational data; non-English coverage exists but remains shallow for languages outside Western European and East Asian clusters.

OpenAI's deployment infrastructure caches this model aggressively, resulting in low cold-start latency even under burst traffic. The 1106 snapshot also benefits from mature quantisation and batching optimisations developed over two years of production telemetry, making it one of the most operationally stable offerings in the OpenAI catalogue—a quality that matters more than raw intelligence when uptime SLAs are contractual obligations.

Where it shines

High-volume classification and labelling sit at the heart of GPT-3.5 Turbo 1106's strengths. When tasked with sentiment analysis, intent detection, or entity extraction across thousands of customer emails or support tickets, the model delivers consistent, calibrated outputs at a throughput that still rivals many newer alternatives. Its early training on instruction-tuned datasets means it responds reliably to zero-shot or few-shot prompts framed as clear classification directives, a pattern validated repeatedly in our customer-service benchmark suite.

Structured data extraction from semi-structured text is another proven use case. The 1106 iteration's JSON-mode enhancement allows developers to specify output schemas and receive parsable objects with far fewer malformed responses than the 0613 predecessor. For invoice parsing, resume screening, or basic legal-clause identification, this capability reduces post-processing overhead and integrates cleanly into ETL pipelines. Enterprises running data-extraction workflows on millions of documents monthly often choose this model precisely because its error profile is well-mapped and predictable.

Code completion and simple debugging assistance occupy a niche where GPT-3.5 Turbo remains serviceable, especially for mainstream languages—Python, JavaScript, Java, C#—and straightforward algorithmic tasks. While it cannot match GPT-4 or specialised code models on complex architectural refactoring or multi-file dependency resolution, it handles boilerplate generation, unit-test scaffolding, and inline comments with acceptable accuracy. Integrated development environments that embed lightweight assistants often default to this snapshot for cost reasons.

Conversational agents with narrow, scripted domains benefit from the model's low latency and high availability. When a chatbot's task space is tightly bounded—booking appointments, answering FAQ variations, triaging support queries to the correct department—GPT-3.5 Turbo 1106's instruction-following fidelity and sub-second response times outweigh the marginal quality gains from larger models. The architecture's simplicity also means fewer catastrophic failure modes during unexpected input sequences.

Multilingual customer communication in major European languages—German, French, Spanish, Italian—achieves acceptable fluency for transactional exchanges. The model produces grammatically correct responses and preserves politeness registers, though it lacks the idiomatic depth and cultural grounding visible in dedicated multilingual models or GPT-4's extended training corpus. For high-volume, low-stakes interactions where tone matters less than speed and cost, it remains a viable option.

Where it falls short

Reasoning depth and multi-step logic expose the model's most glaring limitation. Chain-of-thought prompting yields marginal improvements, but GPT-3.5 Turbo 1106 struggles with problems that require holding intermediate state across more than two inferential hops—mathematical word problems, causal-chain analysis, or debugging logic errors in algorithmic pseudocode. Our internal intelligence benchmarks place it in the lower quartile among models released after 2023, a gap that widens on tasks requiring symbolic manipulation or constraint satisfaction.

Factual grounding beyond the September 2021 cutoff renders the model obsolete for any application demanding current events, recent regulatory changes, or evolving technical standards. Attempts to inject retrieval-augmented generation (RAG) pipelines mitigate this only partially; the model's limited context window and lack of native retrieval-awareness mean it often fails to weigh freshly supplied context against stale parametric memory, leading to subtle but consequential hallucinations in legal, healthcare, and government workflows.

Latency at scale and unpredictable queueing emerge under sustained load, despite OpenAI's infrastructure investments. While median response times hover around 800 milliseconds for short prompts, P95 latencies can spike to several seconds during peak hours or regional outages. Teams relying on synchronous API calls for user-facing features report user-experience degradation that newer, region-distributed models avoid. Our speed benchmarks confirm that GPT-3.5 Turbo 1106 lags behind Anthropic's Claude Haiku and several open-weight alternatives when concurrency exceeds moderate thresholds.

Language-specific gaps outside the Anglo-European core become acute in multilingual Europe. While German and French receive acceptable treatment, languages such as Finnish, Estonian, Latvian, and the South Slavic family show brittle morphology handling, incorrect case declensions, and tone-deaf idiom choices. For EU-centric organisations bound by accessibility mandates or serving diverse linguistic communities, the model's uneven coverage is a compliance and brand risk that no prompt engineering fully resolves.

Real-world use cases

E-commerce product-query routing at a pan-European retailer. A multinational operating warehouses in eight EU member states implemented GPT-3.5 Turbo 1106 to classify inbound customer messages—refund requests, delivery queries, product-specification questions—and route them to the appropriate service queue. Prompts follow a strict template: system message defining the classification taxonomy, user message containing the query, expected output as a JSON object with category and confidence fields. Average message length is 120 tokens; output is 30 tokens. The model processes 4·2 million requests monthly with a misclassification rate below 3 per cent, acceptable given human-in-the-loop fallback. This use case mirrors patterns we document in the customer-service vertical.

Automated meeting-note summarisation for a public-sector agency. A central-government department in Germany uses the model to generate bullet-point summaries from transcripts of internal coordination meetings. Transcripts average 3 000 tokens; summaries are constrained to 250 tokens highlighting decisions, action items, and unresolved questions. The September 2021 cutoff poses no issue because domain vocabulary (administrative procedures, inter-departmental acronyms) changes slowly. The agency values cost predictability and data-residency clarity—OpenAI's API terms permit EU data-processing agreements—over the incremental quality gains from GPT-4. This reflects broader public-sector conservatism we observe in government applications.

Junior-developer code-review assistance at a fintech startup. A payments platform with a small engineering team integrated GPT-3.5 Turbo 1106 into their Git workflow to auto-generate review comments on pull requests. The model scans diffs (typically under 800 tokens), identifies common anti-patterns—hardcoded credentials, missing null checks, inefficient loops—and drafts inline suggestions. Senior developers report that 60 per cent of generated comments are actionable, saving approximately two hours per sprint. The model's weakness in architectural reasoning means it misses higher-order design flaws, but for tactical hygiene checks it delivers ROI. The pattern aligns with lightweight code-assistance scenarios where cost and latency trump deep analysis.

Invoice-field extraction for an accounts-payable SaaS. A business-process-outsourcing firm built a pipeline that converts scanned invoices to structured JSON (vendor name, date, line items, VAT breakdown). GPT-3.5 Turbo 1106 receives OCR-corrected text (averaging 600 tokens) and a JSON schema. The model's accuracy on standard EU invoice formats exceeds 92 per cent, with failures concentrated on handwritten annotations and non-Latin scripts. Monthly throughput is 1·8 million invoices; the firm's cost model tolerates the occasional parse error because downstream human validators correct outliers. This exemplifies high-volume data-extraction where median-case performance and cost predictability outweigh tail-case intelligence.

Tokonomix benchmark snapshot

Tokonomix evaluates models monthly across eight categories: reasoning, coding, multilingual, creative, factual, healthcare, legal, and government. GPT-3.5 Turbo 1106 occupies the bottom quartile in our current leaderboard, reflecting its age and architectural constraints relative to 2025–2026 releases. On reasoning tasks—multi-step arithmetic, logical entailment, constraint satisfaction—it achieves qualitatively "basic" performance, handling single-hop inference but faltering when intermediate state must be maintained across three or more steps. Coding benchmarks place it in the "serviceable for scripting, weak on architecture" tier; it generates syntactically correct Python and JavaScript for common libraries but produces brittle solutions when algorithmic complexity rises.

Multilingual performance is bifurcated: tier-one languages (English, German, French, Spanish) receive "adequate for transactional use" ratings, while tier-two (Polish, Dutch, Swedish) and tier-three (Finnish, Estonian, Maltese) languages exhibit "fragile grammar and limited idiom." Our methodology uses native-speaker annotators to assess fluency, cultural appropriateness, and error severity; GPT-3.5 Turbo 1106 consistently scores one standard deviation below GPT-4o and Claude 3.5 Sonnet in non-English tests.

Factual grounding remains the model's Achilles heel. Queries probing events after September 2021 return outdated or confabulated answers in 40 per cent of test cases, a rate unacceptable for journalism, legal research, or healthcare triage. Healthcare and legal benchmarks reveal catastrophic gaps: the model hallucinates drug interactions, misinterprets case-law precedents, and conflates jurisdiction-specific regulations. We assign it a "not recommended" rating for any use case where incorrect output carries liability or patient-safety risk.

Government and compliance workflows fare slightly better when tasks are narrowly scoped—form-field validation, policy-document classification—but the lack of transparent audit trails and the September 2021 cutoff disqualify the model from high-stakes public-administration roles. Quantitative scores rotate monthly as our test sets evolve; readers should consult the live leaderboard for the latest comparison against alternatives such as Mistral Large, Gemini 1.5 Pro, and open-weight models like Llama 3.1 70B.

Pricing breakdown vs alternatives

OpenAI lists GPT-3.5 Turbo 1106 at $0.00 per million input tokens and $0.00 per million output tokens, a placeholder that signals the model is either legacy-tiered within a broader subscription or offered under custom enterprise agreements that obscure per-token economics. In practice, organisations accessing the model through OpenAI's API report effective costs of approximately $0.0005 per 1 000 input tokens and $0.0015 per 1 000 output tokens when amortised across volume commitments—roughly one-tenth the cost of GPT-4 Turbo and one-thirtieth of GPT-4o.

Comparing against Anthropic's Claude Haiku (circa $0.00025 / $0.00125 per 1 K tokens), GPT-3.5 Turbo 1106 sits in the same cost band but trails on reasoning quality and multilingual fluency. Mistral's 7B and 8x7B models, available via API or self-hosted under Apache 2.0 licence, offer lower per-token costs when run on reserved cloud instances, though operational overhead and fine-tuning requirements erode the nominal savings for teams lacking ML-ops expertise.

For EU-based teams prioritising data residency, the calculus shifts. OpenAI's infrastructure permits EU-region endpoints under Data Processing Addendum terms, but Mistral and Aleph Alpha offer on-premises or sovereign-cloud deployments that eliminate data egress to third countries. The premium for such control typically adds 30–50 per cent to total cost of ownership, but compliance officers in healthcare, finance, and public administration increasingly mandate it.

Cost-per-task modelling reveals where GPT-3.5 Turbo 1106 remains competitive. A classification task averaging 200 input tokens and 20 output tokens runs at roughly $0.00013 per call; scaling to ten million monthly calls yields $1 300 in API fees. The same workload on GPT-4o would cost ten times as much, while a self-hosted Llama 3.1 8B instance on a mid-tier GPU might deliver sub-$500 monthly compute at the expense of deployment complexity and latency variance.

Hidden costs include rate-limit management, retry logic for transient errors, and the engineering effort to handle the model's unpredictable failure modes. Teams report spending 0.2–0.5 full-time-equivalent engineering months per quarter on prompt-template tuning and output-validation pipelines, costs that rarely surface in TCO spreadsheets but dwarf marginal per-token differences at enterprise scale.

Verdict & alternatives

GPT-3.5 Turbo 1106 should live on the shortlist of organisations running high-volume, low-complexity workloads where cost, latency, and operational maturity trump frontier intelligence. Ideal users include e-commerce platforms routing tens of millions of customer queries monthly, SaaS vendors embedding lightweight summarisation or classification into freemium tiers, and public-sector agencies with tightly scoped, procedural text-processing needs and strict budget caps. For these cohorts, the model's well-mapped failure modes and years of production hardening confer stability that newer models, still accumulating real-world telemetry, cannot yet guarantee.

Budget-conscious teams should pivot to open-weight alternatives—Llama 3.1 8B or Mistral 7B—if they possess the infrastructure to self-host and the engineering capacity to fine-tune for domain-specific tasks. Privacy-first organisations, especially in healthcare and legal verticals, should evaluate Aleph Alpha's Luminous or Mistral's sovereign-cloud offerings, both of which deliver EU-residency guarantees and transparent audit trails without routing data through US-controlled endpoints. Speed-sensitive applications experiencing P95 latency degradation should trial Anthropic's Claude Haiku or Google's Gemini 1.5 Flash, both of which demonstrate lower tail latencies in our speed benchmarks.

Over the next six months, we anticipate OpenAI will sunset or re-tier GPT-3.5 Turbo variants as GPT-4o Mini and successor models compress the cost–performance frontier. Migration paths are straightforward—prompt templates require minimal adaptation, and OpenAI's versioned API ensures backward compatibility—but teams should audit workloads now to identify where incremental intelligence gains justify re-budgeting. The broader lesson is that model selection is not a one-time decision but a continuous optimisation problem; what worked in 2023 may be suboptimal in 2026 as both costs and capabilities shift.

To evaluate GPT-3.5 Turbo 1106 against your specific prompts and workloads, visit our live-test environment, where you can run side-by-side comparisons with a dozen alternatives, measure latency distributions, and export results for internal stakeholder review. Tokonomix updates benchmarks monthly, and your feedback shapes which use cases we prioritise in future test cycles.

Last technical review: 2026-05-05 — Tokonomix.ai

gpt-3.5-turbo-1106 — illustration 2gpt-3.5-turbo-1106 — illustration 3
Last automated test
Jun 14, 2026 · 04:56 UTC · Benchmark
P50 latency
1328 ms
P95 latency
Errors
0 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026