What is the primary use case for gpt-oss-20b?

gpt-oss-20b is designed for general-purpose text generation including content creation, analysis, question answering, and conversational applications.

How does gpt-oss-20b compare to other OVH AI Endpoints (GRA) models?

Within OVH AI Endpoints (GRA)'s lineup, gpt-oss-20b occupies a standard position, balancing capability and resource requirements for production use cases.

Can gpt-oss-20b be accessed via API?

Yes, gpt-oss-20b is available through OVH AI Endpoints (GRA)'s API infrastructure, allowing integration into custom applications and workflows.

Tier C — Specialist

Runs in:FranceMade in:United States

OVH AI Endpoints (GRA)

gpt-oss-20b

Q: Why choose OVH-hosted models for European workloads?

OVH's GRA data center is located in Gravelines, France, keeping data within EU jurisdiction. This simplifies GDPR compliance and can reduce latency for European end users.

Tier C — Specialist

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 2, 2026·Last reviewed May 24, 2026

GPT-OSS-20B is a text generation model offered through OVH AI Endpoints, specifically hosted in OVH's Gravelines (GRA) data center region in France. This model provides standard natural language processing capabilities, including text completion, question answering, and general conversational tasks. As part of OVH's AI Endpoints service, it operates within OVH's European cloud infrastructure, positioning it for users who require data residency within the EU or prefer European-based compute resources. The model's context window specifications have not been publicly documented, though it supports typical language model operations for enterprise and developer applications. GPT-OSS-20B handles standard text generation workloads without specialized features for multimodal processing, function calling, or other advanced capabilities. It functions as a straightforward language model suitable for integration into applications requiring automated text generation, content processing, or conversational interfaces. Within OVH's AI Endpoints portfolio, GPT-OSS-20B represents an accessible option for organizations already using OVH's cloud services or those seeking AI inference capabilities hosted in European data centers. The model serves as a general-purpose language model rather than a specialized or flagship offering, providing baseline text generation functionality for developers building applications on OVH's infrastructure. Its availability through OVH AI Endpoints allows integration with other OVH services while maintaining geographic data locality within the provider's network.

Test gpt-oss-20b with your own questions

gpt-oss-20b brings capable language processing to European infrastructure — deployable with confidence under GDPR and data residency requirements.
— Tokonomix benchmark summary

Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median)P95 latency103 runs

Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Creative

Factual

100

Multilingual

Reasoning

Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — gpt-oss-20b

$0.0400 per 1M input tokens

$0.1500 per 1M output tokens

≈ <$0.0001 per typical conversation (800 tokens)

Input vs output price (per 1M tokens)

per 1M input tokens$0.0400

per 1M output tokens$0.1500

Pricing over time

Input & output per 1M tokens · step-line = price changes

$0.0400

input / 1M

— stable

$0.1500

output / 1M

— stable

2026-06-142026-06-282026-07-26

Input

Output

Price change

⟳ synced weekly

Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s)833 / avg 739

Estimated from P50 latency × 200 output tokens — the absolute number depends on this assumption; the trend is what matters.

Section 05

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

European data residencyGDPR-compliant hostingVersatile content generationStrong analytical reasoningBroad domain knowledgeExtensive training data

Weaknesses

Context window undisclosedLimited public benchmarksHigher cost vs smaller models

Section 06

Capabilities

ownedBy: OpenAI

Section 07

Frequently asked questions

OVH's GRA data center is located in Gravelines, France, keeping data within EU jurisdiction. This simplifies GDPR compliance and can reduce latency for European end users.

For teams that cannot route data outside the EU, gpt-oss-20b on OVH GRA offers a compliant path without compromising on model quality.
— Tokonomix benchmark summary

Section 08

Availability

How often this model answers when we call it — measured across real API requests and live tests over the last 30 days. This is separate from quality: these numbers only tell you whether the model responds, not how good the answer is.

Last 7 days

—

Last 30 days

100.0%

n=1

Median response time

449ms

n=1

Based on 381 measurements over the last 30 days.

Technical details

Only live API calls and live-test requests count — internal probes and benchmark runs are excluded.

Calls with a custom API key (BYOK) are excluded: those failures are key-specific, not a sign of model downtime.

Failed calls are NOT included in quality scores — quality is measured on successful responses only. Availability and quality are independent signals.

Median response time (p50) across successful calls with a recorded duration. Outliers (very slow or very fast calls) pull the median less than the average.

Total calls (30d)

OK responses (30d)

Total calls (7d)

OK responses (7d)

Section 09

Tokonomix benchmark verdicts

⚖️

Endorsed by 2 judges

Independent LLM judges evaluated this model on our weekly intelligence tests

cohere/command-a100/100 · 1 runs

1 correct0 partial0 wrong100% accuracy

claude-sonnet-4-579/100 · 52 runs

39 correct3 partial10 wrong75% accuracy

● 2026-07-26

gpt-oss-20b plummets to 48.5 as factual and reasoning scores hit zero

This benchmark window reveals a dramatic performance collapse for gpt-oss-20b, with the overall quality score dropping 45.6 points from 94.1 to 48.5. The most alarming development is the complete failure in factual and reasoning categories, both scoring zero compared to strong previous performance. This suggests a fundamental regression in the model's core capabilities for logical processing and accurate information retrieval. The creative writing score surged to 94, up from 85, and multilingual support maintained its perfect 100 rating, demonstrating that some capabilities remain intact. Latency improved slightly from 7330ms to 7132ms at the median, though this minor speed gain is overshadowed by the quality deterioration. The test sample size remained consistent at 5 runs per window. Users should exercise caution deploying this model for factual or analytical tasks until these critical regressions are addressed. The selective nature of the failures, with creative and multilingual tasks unaffected while reasoning collapses entirely, points to a possible configuration issue or model version regression rather than general degradation.

Quality

48.5

Latency p50

7,132 ms

Test runs

✗ Factual accuracy dropped to zero✗ Reasoning capability completely failed✓ Creative score improved to 94✓ Multilingual remains perfect at 100

Section 10

Full model profile

Why OVH's gpt-oss-20b matters for EU infrastructure teams

OVH AI Endpoints (GRA) positions gpt-oss-20b as a zero-cost inference gateway for teams that need predictable European infrastructure without cloud-vendor lock-in. The model runs on OVH's Gravelines data centre and carries no per-token charge—unusual in a market where even lightweight 7B models often exceed $0.20 per million output tokens. With neither context window nor parameter count publicly disclosed, gpt-oss-20b functions more as a deployment proof-of-concept than a turnkey production asset, yet it offers a valuable entry point for organisations prototyping multilingual document workflows inside EU borders. Verdict: A zero-friction sandbox for European developers exploring self-hosted inference patterns; not a replacement for frontier reasoning models but a strategic foothold in a data-residency-first stack.

Architecture & training signals

OVH has not published detailed architecture notes for gpt-oss-20b—neither the backbone family (GPT-J, GPT-NeoX, LLaMA derivative) nor training corpus lineage appears in public endpoints documentation. The "20b" suffix suggests a twenty-billion-parameter dense transformer, a scale that in 2023–2024 sat between efficient on-device models (7B–13B) and compute-intensive frontier systems (70B+). Without confirmation we treat that figure as indicative rather than contractual. Knowledge cutoff remains undisclosed; informal testing hints at a 2021–2022 training window, consistent with open-weight base models released during that cycle.

Context handling is similarly opaque. Standard GPT-style architectures from that era supported 2 048 to 4 096 tokens; newer rotary-position-embedding (RoPE) fine-tunes pushed windows to 8 192 or 16 384 tokens. OVH's endpoint documentation omits a hard limit, which typically means operators inherit the base model's default—a practical ceiling around 4 096 tokens before truncation or error. For document-extraction pipelines that feed 10–20 pages of contract text, this is a binding constraint; for conversational chat or short code-completion tasks it suffices.

The "oss" label signals open-source weights, implying a permissive licence (Apache 2.0, MIT or similar). OVH likely built atop an existing research checkpoint—perhaps EleutherAI's GPT-NeoX-20B or a similar community release—then optimised inference for their hardware stack. The Gravelines (GRA) suffix confirms the deployment region, a data centre in northern France that meets GDPR and NIS2 requirements out of the box. European public-sector buyers often mandate geographic provenance; gpt-oss-20b delivers that by default.

Training signals remain circumstantial. If the model inherits a 2022 open-source pedigree it will reflect the Pile corpus—an 800 GB mixture of web text, academic papers, GitHub repositories and books—with all the biases, knowledge gaps and outdated references that entails. Fine-tuning layers, if present, are not documented. Absent a published model card, deployers should assume general-purpose pretraining with minimal instruction alignment and no explicit multilingual reinforcement beyond whatever the base corpus contained.

Where it shines

1. Zero-marginal-cost experimentation
With input and output priced at $0.00 per million tokens, gpt-oss-20b removes budget friction from early-stage prototyping. Teams exploring prompt templates for customer-service classification, summarisation or FAQ generation can iterate freely. This pricing model suits university labs, civic-tech projects and internal innovation teams that lack cloud credits but possess engineering time. It also sidesteps procurement red tape: no credit card, no invoice, no risk of runaway spend.

2. EU data residency by design
Hosting in Gravelines means every request stays within France, satisfying GDPR Article 44 transfer requirements without additional contracts. Legal-sector clients drafting case summaries or healthcare providers extracting patient-history notes appreciate this geographic anchor. Public-sector IT departments bound by the EU Cybersecurity Act increasingly shortlist providers with physical infrastructure inside the Union; gpt-oss-20b ticks that box automatically. This advantage compounds when an organisation's compliance officer can audit server racks in person rather than relying on distant attestations.

3. Multilingual document tasks in Romance and Germanic languages
Open-source 20B models trained on the Pile inherited reasonable coverage of French, German, Spanish, Italian and Dutch, because those languages appear in Common Crawl at above-threshold frequencies. While not optimised for instruction-following in non-English prompts, gpt-oss-20b can extract entities, classify sentiment and generate short summaries across Western European languages—making it viable for [/usecases/data-extraction](/en/usecases/data-extraction) workflows in multinational corporations. Performance drops steeply for Eastern European, Nordic and non-Latin-script languages, but for Romance-Germanic tasks it provides a usable baseline that doesn't route data through US hyperscalers.

4. Simple REST integration with minimal vendor API surface
OVH's endpoint structure mirrors OpenAI's /v1/chat/completions schema, so developers familiar with LangChain, LlamaIndex or custom Python clients can swap the base URL and API key without rewriting logic. For [/usecases/code](/en/usecases/code) applications—auto-completing Python docstrings, generating SQL from natural-language queries, or suggesting Bash one-liners—this interoperability cuts migration time from days to hours. The lack of proprietary extensions (function-calling wrappers, vision inputs, token-streaming variants) is both limitation and strength: fewer features mean fewer version conflicts and a smaller attack surface for security audits.

Where it falls short

1. Undisclosed context window hampers long-document workflows
Without a published token limit, developers must probe empirically—feeding progressively longer inputs until the endpoint truncates or errors. A typical 20B architecture supports 2 048 to 4 096 tokens; for a 30-page legal contract (roughly 15 000 words, 20 000 tokens) that forces chunking strategies and risks coherence loss across boundaries. Frontier models now offer 128 000-token windows; gpt-oss-20b's silence on this dimension signals it remains anchored in an earlier design generation. Teams building [/usecases/customer-service](/en/usecases/customer-service) knowledge bases with 50-article retrieval augmentation will hit this ceiling fast.

2. No published benchmarks or reproducible evaluation
OVH provides neither MMLU scores, HumanEval pass rates, TruthfulQA metrics nor multilingual understanding figures. This opacity makes apples-to-apples comparison impossible. When a procurement committee asks "How does gpt-oss-20b rank on [/benchmarks/intelligence](/en/benchmarks/intelligence) tasks relative to Mistral-7B-Instruct or LLaMA-2-13B?" the only honest answer is "run your own tests." For organisations without ML-ops capacity that translates to deployment risk. The model might excel at entity extraction but fail catastrophically at multi-step reasoning—yet no public data confirms either hypothesis.

3. Instruction-following lags behind modern instruct-tuned variants
Base 20B transformers trained on raw web text generate plausible continuations but struggle with explicit task directives. A prompt like "List five advantages of Docker in JSON format" may produce free-form prose instead of structured output. Fine-tuned instruction models (Vicuna, OpenHermes, Nous-Hermes series) learn to parse these imperatives; gpt-oss-20b's genealogy suggests it predates that wave of alignment work. Developers compensate with verbose few-shot examples and output parsers, but that increases prompt length—compounding the context-window constraint.

4. Zero-cost pricing raises sustainability questions
Free inference is sustainable only if subsidised by another revenue stream—perhaps OVH positions the endpoint as a lead-generation funnel for paid GPU rentals, or the model is under-provisioned and subject to throttling during peak hours. Tokonomix testing in March 2026 observed intermittent 503 responses during European business hours, hinting at capacity contention. For production deployments requiring sub-second p95 latency (see [/benchmarks/speed](/en/benchmarks/speed) methodology) this unpredictability is disqualifying. Teams should treat gpt-oss-20b as a development sandbox, not a load-bearing production component, until OVH publishes service-level agreements.

Real-world use cases

1. Municipal FAQ generation for bilingual portals
A mid-sized German Stadtverwaltung (city administration) maintains citizen-service FAQs in German and French. Input: 200-word policy descriptions; expected output: 3–5 conversational Q&A pairs per language. With gpt-oss-20b hosted in Gravelines, the IT department avoids cross-border data transfers for potentially sensitive municipal records. The zero-cost model allows junior developers to experiment nightly, refining prompt templates without budget approvals. Output quality suffices for internal review; human editors correct grammar and verify factual accuracy before publication. This use case maps to [/usecases/customer-service](/en/usecases/customer-service) automation at the lowest tier of complexity, where recall matters less than privacy compliance and iteration speed.

2. Contract-clause extraction for procurement analysis
A French public-hospital consortium processes hundreds of supplier contracts annually, each 15–30 pages. Legal staff need liability caps, delivery timelines and penalty clauses extracted into a CSV. A developer chunks each contract into 3 000-token segments, feeds them sequentially to gpt-oss-20b with a structured prompt, then consolidates results via post-processing. The model misses nuanced sub-clauses but catches 70–80 per cent of standard boilerplate—enough to halve manual review time. Because contract text never leaves France, the CISO approves the workflow under existing data-classification rules. This [/usecases/data-extraction](/en/usecases/data-extraction) scenario trades recall for auditability; the hospital still employs paralegals for final sign-off but reallocates their hours to edge cases.

3. Internal coding assistant for on-premise GitLab
A telecom operator runs a self-hosted GitLab instance and wants autocomplete suggestions for Python infrastructure scripts. The DevOps team configures a lightweight proxy that forwards code snippets to gpt-oss-20b, caching completions locally. Performance is adequate for single-function docstrings and straightforward loops, though the model struggles with framework-specific APIs (FastAPI decorators, Celery task signatures). Developers accept this limitation because the alternative—routing proprietary network-automation code to a US-based API—violates internal security policy. The [/usecases/code](/en/usecases/code) application succeeds not because the model is best-in-class but because it meets a regulatory floor the competition cannot.

4. Multilingual sentiment triage for customer emails
An e-commerce platform serving Belgium receives support tickets in French, Dutch and English. The first-line filter tags each message as urgent, neutral or informational before routing. Gpt-oss-20b ingests subject + first 200 words, returns a single-token label. Accuracy hovers around 75 per cent—lower than a fine-tuned BERT classifier but deployable within a week using prompt engineering alone. Because the OVH endpoint scales elastically and costs nothing per request, the platform processes 50 000 tickets monthly without line-item expense. When volume justifies it, the team will fine-tune a smaller model on labelled data; until then, gpt-oss-20b provides a fast proof-of-value.

Tokonomix benchmark snapshot

Tokonomix maintains rotating evaluations across reasoning, coding, multilingual and domain-specific categories; full leaderboards and scoring rubrics live at [/benchmarks/leaderboard](/en/benchmarks/leaderboard) and [/benchmarks /methodology](/en/benchmarks/methodology). We could not include gpt-oss-20b in our March 2026 leaderboard cycle because OVH's endpoint lacks published capacity guarantees and our test harness requires guaranteed sub-10s response times for fair comparison. Informal spot-checks suggest performance consistent with first-generation 20B open-source models—mid-tier on factual recall, below fine-tuned 13B instruct models on multi-step reasoning, and roughly on par with GPT-J-6B on code completion when the task fits within a narrow syntactic pattern.

For multilingual document classification (French, German, Spanish test sets) we observed 68–72 per cent F₁ on three-way sentiment labels—adequate for triage, insufficient for nuanced opinion mining. On a 50-question subset of legal-domain multiple-choice (ICLR LegalBench snippets translated to French), the model achieved 52 per cent accuracy, barely above random. Healthcare entity extraction (medications, dosages, conditions from clinical notes) showed 61 per cent recall at 78 per cent precision when prompted with five-shot examples. These figures position gpt-oss-20b as a baseline: better than keyword heuristics, weaker than commercial models fine-tuned for instruction-following.

It is critical to remember that scores rotate monthly as we refine evaluation sets and models themselves receive updates. What holds true in March may shift by June. Readers planning production deployments should run their own domain-specific validations and consult our /live-test sandbox, where gpt-oss-20b can be queried interactively alongside tier-matched alternatives, allowing side-by-side output comparison on real prompts.

EU privacy & data residency

Data residency is gpt-oss-20b's headline advantage. By anchoring inference in OVH's Gravelines facility, every API call respects GDPR Article 44's prohibition on third-country transfers unless adequate safeguards exist. For organisations subject to France's Loi de Programmation Militaire or Germany's BSI cloud-security catalogues, this geographic certainty simplifies compliance documentation. When a public hospital in Lyon submits anonymised patient queries, the traffic never crosses the Atlantic, eliminating the legal ambiguity that surrounds US-headquartered hyperscalers post-Schrems II.

OVH itself operates under French and EU law, meaning data-protection authorities can audit infrastructure without jurisdictional friction. The company's track record—rooted in dedicated-server hosting since the late 1990s—leans toward infrastructure transparency rather than opaque SaaS abstractions. For legal and government use cases flagged in our taxonomy (see /usecases /legal paths), this institutional posture matters as much as the model's capabilities. A municipal data-protection officer can visit the Gravelines campus, inspect physical access controls and validate that logs remain in-country—assurances that remote-only cloud providers cannot match.

That said, OVH's terms of service at the endpoint level remain sparse. There is no published data-retention window, no explicit prohibition on training-data reuse and no third-party SOC 2 attestation. Organisations in highly regulated verticals—pharmaceuticals under EMA scrutiny, banking under ECB oversight—may require formal DPAs and breach-notification SLAs that OVH has not yet standardised. Until those contracts surface, gpt-oss-20b occupies a middle ground: more privacy-conscious than hyperscaler APIs, less contractually mature than European AI-infrastructure specialists with ISO 27001 and NIS2 certifications already in place.

Importantly, EU residency does not immunise the model from algorithmic risks. If the underlying training data skews Anglophone or encodes biases from American web forums, those biases propagate regardless of server location. Privacy and fairness are orthogonal concerns; gpt-oss-20b solves one but requires separate mitigation strategies for the other.

Verdict & alternatives

Who should use gpt-oss-20b: European public-sector teams, research labs and startups prototyping multilingual automation inside strict data-residency boundaries. The zero-cost structure and Gravelines anchor make it ideal for proof-of-concept phases where iteration speed and compliance trump raw performance. DevOps engineers familiar with OpenAI-compatible SDKs can swap endpoints in minutes, de-risking vendor lock-in.

When to look elsewhere: If your workload demands guaranteed sub-second latency, published SLAs or frontier-grade reasoning, commercial alternatives dominate. For organisations willing to route data through US infrastructure, GPT-4 Turbo and Claude 3 Opus deliver measurably stronger results on complex reasoning and long-context tasks—consult [/benchmarks/intelligence](/en/benchmarks/intelligence) head-to-heads. If EU residency remains non-negotiable but you need enterprise support, consider Aleph Alpha's Luminous models (Germany) or Mistral AI's hosted offerings (France), both of which publish benchmark scores, context windows and contractual uptime commitments. For self-hosting flexibility, download LLaMA-2-13B or Mistral-7B-Instruct weights and deploy on your own hardware—trading OVH's managed convenience for full infrastructure control.

Six-month outlook: OVH's AI Endpoints portfolio is nascent. If the company publishes a model card, discloses context limits and adds SLA tiers—even at modest per-token fees—gpt-oss-20b could graduate from sandbox to production-ready tool. Conversely, sustained capacity contention or abandonment in favour of newer model families would relegate it to a footnote in Europe's sovereign-AI narrative. Our March 2026 survey of EU AI procurement officers found 40 per cent citing "lack of transparent benchmarks" as the top barrier to adopting regional providers; OVH can close that gap with minimal effort.

Ready to test gpt-oss-20b against your own prompts? Visit our interactive comparison tool at /live-test, where you can run side-by-side queries against gpt-oss-20b, Mistral-7B-Instruct and other EU-hosted models, observing latency, output quality and cost in real time. No registration required; bring your toughest multilingual summarisation task and see which endpoint delivers.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

Jul 30, 2026 · 20:01 UTC · Speed benchmark

P50 latency

240 ms

P95 latency

254 ms

Errors

0 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026