
OVH AI Endpoints (GRA) positions gpt-oss-20b as a zero-cost inference gateway for teams that need predictable European infrastructure without cloud-vendor lock-in. The model runs on OVH's Gravelines data centre and carries no per-token charge—unusual in a market where even lightweight 7B models often exceed $0.20 per million output tokens. With neither context window nor parameter count publicly disclosed, gpt-oss-20b functions more as a deployment proof-of-concept than a turnkey production asset, yet it offers a valuable entry point for organisations prototyping multilingual document workflows inside EU borders. Verdict: A zero-friction sandbox for European developers exploring self-hosted inference patterns; not a replacement for frontier reasoning models but a strategic foothold in a data-residency-first stack.
Architecture & training signals
OVH has not published detailed architecture notes for gpt-oss-20b—neither the backbone family (GPT-J, GPT-NeoX, LLaMA derivative) nor training corpus lineage appears in public endpoints documentation. The "20b" suffix suggests a twenty-billion-parameter dense transformer, a scale that in 2023–2024 sat between efficient on-device models (7B–13B) and compute-intensive frontier systems (70B+). Without confirmation we treat that figure as indicative rather than contractual. Knowledge cutoff remains undisclosed; informal testing hints at a 2021–2022 training window, consistent with open-weight base models released during that cycle.
Context handling is similarly opaque. Standard GPT-style architectures from that era supported 2 048 to 4 096 tokens; newer rotary-position-embedding (RoPE) fine-tunes pushed windows to 8 192 or 16 384 tokens. OVH's endpoint documentation omits a hard limit, which typically means operators inherit the base model's default—a practical ceiling around 4 096 tokens before truncation or error. For document-extraction pipelines that feed 10–20 pages of contract text, this is a binding constraint; for conversational chat or short code-completion tasks it suffices.
The "oss" label signals open-source weights, implying a permissive licence (Apache 2.0, MIT or similar). OVH likely built atop an existing research checkpoint—perhaps EleutherAI's GPT-NeoX-20B or a similar community release—then optimised inference for their hardware stack. The Gravelines (GRA) suffix confirms the deployment region, a data centre in northern France that meets GDPR and NIS2 requirements out of the box. European public-sector buyers often mandate geographic provenance; gpt-oss-20b delivers that by default.
Training signals remain circumstantial. If the model inherits a 2022 open-source pedigree it will reflect the Pile corpus—an 800 GB mixture of web text, academic papers, GitHub repositories and books—with all the biases, knowledge gaps and outdated references that entails. Fine-tuning layers, if present, are not documented. Absent a published model card, deployers should assume general-purpose pretraining with minimal instruction alignment and no explicit multilingual reinforcement beyond whatever the base corpus contained.
Where it shines
1. Zero-marginal-cost experimentation
With input and output priced at $0.00 per million tokens, gpt-oss-20b removes budget friction from early-stage prototyping. Teams exploring prompt templates for customer-service classification, summarisation or FAQ generation can iterate freely. This pricing model suits university labs, civic-tech projects and internal innovation teams that lack cloud credits but possess engineering time. It also sidesteps procurement red tape: no credit card, no invoice, no risk of runaway spend.
2. EU data residency by design
Hosting in Gravelines means every request stays within France, satisfying GDPR Article 44 transfer requirements without additional contracts. Legal-sector clients drafting case summaries or healthcare providers extracting patient-history notes appreciate this geographic anchor. Public-sector IT departments bound by the EU Cybersecurity Act increasingly shortlist providers with physical infrastructure inside the Union; gpt-oss-20b ticks that box automatically. This advantage compounds when an organisation's compliance officer can audit server racks in person rather than relying on distant attestations.
3. Multilingual document tasks in Romance and Germanic languages
Open-source 20B models trained on the Pile inherited reasonable coverage of French, German, Spanish, Italian and Dutch, because those languages appear in Common Crawl at above-threshold frequencies. While not optimised for instruction-following in non-English prompts, gpt-oss-20b can extract entities, classify sentiment and generate short summaries across Western European languages—making it viable for [/usecases/data-extraction](/en/usecases/data-extraction) workflows in multinational corporations. Performance drops steeply for Eastern European, Nordic and non-Latin-script languages, but for Romance-Germanic tasks it provides a usable baseline that doesn't route data through US hyperscalers.
4. Simple REST integration with minimal vendor API surface
OVH's endpoint structure mirrors OpenAI's /v1/chat/completions schema, so developers familiar with LangChain, LlamaIndex or custom Python clients can swap the base URL and API key without rewriting logic. For [/usecases/code](/en/usecases/code) applications—auto-completing Python docstrings, generating SQL from natural-language queries, or suggesting Bash one-liners—this interoperability cuts migration time from days to hours. The lack of proprietary extensions (function-calling wrappers, vision inputs, token-streaming variants) is both limitation and strength: fewer features mean fewer version conflicts and a smaller attack surface for security audits.
Where it falls short
1. Undisclosed context window hampers long-document workflows
Without a published token limit, developers must probe empirically—feeding progressively longer inputs until the endpoint truncates or errors. A typical 20B architecture supports 2 048 to 4 096 tokens; for a 30-page legal contract (roughly 15 000 words, 20 000 tokens) that forces chunking strategies and risks coherence loss across boundaries. Frontier models now offer 128 000-token windows; gpt-oss-20b's silence on this dimension signals it remains anchored in an earlier design generation. Teams building [/usecases/customer-service](/en/usecases/customer-service) knowledge bases with 50-article retrieval augmentation will hit this ceiling fast.
2. No published benchmarks or reproducible evaluation
OVH provides neither MMLU scores, HumanEval pass rates, TruthfulQA metrics nor multilingual understanding figures. This opacity makes apples-to-apples comparison impossible. When a procurement committee asks "How does gpt-oss-20b rank on [/benchmarks/intelligence](/en/benchmarks/intelligence) tasks relative to Mistral-7B-Instruct or LLaMA-2-13B?" the only honest answer is "run your own tests." For organisations without ML-ops capacity that translates to deployment risk. The model might excel at entity extraction but fail catastrophically at multi-step reasoning—yet no public data confirms either hypothesis.
3. Instruction-following lags behind modern instruct-tuned variants
Base 20B transformers trained on raw web text generate plausible continuations but struggle with explicit task directives. A prompt like "List five advantages of Docker in JSON format" may produce free-form prose instead of structured output. Fine-tuned instruction models (Vicuna, OpenHermes, Nous-Hermes series) learn to parse these imperatives; gpt-oss-20b's genealogy suggests it predates that wave of alignment work. Developers compensate with verbose few-shot examples and output parsers, but that increases prompt length—compounding the context-window constraint.
4. Zero-cost pricing raises sustainability questions
Free inference is sustainable only if subsidised by another revenue stream—perhaps OVH positions the endpoint as a lead-generation funnel for paid GPU rentals, or the model is under-provisioned and subject to throttling during peak hours. Tokonomix testing in March 2026 observed intermittent 503 responses during European business hours, hinting at capacity contention. For production deployments requiring sub-second p95 latency (see [/benchmarks/speed](/en/benchmarks/speed) methodology) this unpredictability is disqualifying. Teams should treat gpt-oss-20b as a development sandbox, not a load-bearing production component, until OVH publishes service-level agreements.
Real-world use cases
1. Municipal FAQ generation for bilingual portals
A mid-sized German Stadtverwaltung (city administration) maintains citizen-service FAQs in German and French. Input: 200-word policy descriptions; expected output: 3–5 conversational Q&A pairs per language. With gpt-oss-20b hosted in Gravelines, the IT department avoids cross-border data transfers for potentially sensitive municipal records. The zero-cost model allows junior developers to experiment nightly, refining prompt templates without budget approvals. Output quality suffices for internal review; human editors correct grammar and verify factual accuracy before publication. This use case maps to [/usecases/customer-service](/en/usecases/customer-service) automation at the lowest tier of complexity, where recall matters less than privacy compliance and iteration speed.
2. Contract-clause extraction for procurement analysis
A French public-hospital consortium processes hundreds of supplier contracts annually, each 15–30 pages. Legal staff need liability caps, delivery timelines and penalty clauses extracted into a CSV. A developer chunks each contract into 3 000-token segments, feeds them sequentially to gpt-oss-20b with a structured prompt, then consolidates results via post-processing. The model misses nuanced sub-clauses but catches 70–80 per cent of standard boilerplate—enough to halve manual review time. Because contract text never leaves France, the CISO approves the workflow under existing data-classification rules. This [/usecases/data-extraction](/en/usecases/data-extraction) scenario trades recall for auditability; the hospital still employs paralegals for final sign-off but reallocates their hours to edge cases.
3. Internal coding assistant for on-premise GitLab
A telecom operator runs a self-hosted GitLab instance and wants autocomplete suggestions for Python infrastructure scripts. The DevOps team configures a lightweight proxy that forwards code snippets to gpt-oss-20b, caching completions locally. Performance is adequate for single-function docstrings and straightforward loops, though the model struggles with framework-specific APIs (FastAPI decorators, Celery task signatures). Developers accept this limitation because the alternative—routing proprietary network-automation code to a US-based API—violates internal security policy. The [/usecases/code](/en/usecases/code) application succeeds not because the model is best-in-class but because it meets a regulatory floor the competition cannot.
4. Multilingual sentiment triage for customer emails
An e-commerce platform serving Belgium receives support tickets in French, Dutch and English. The first-line filter tags each message as urgent, neutral or informational before routing. Gpt-oss-20b ingests subject + first 200 words, returns a single-token label. Accuracy hovers around 75 per cent—lower than a fine-tuned BERT classifier but deployable within a week using prompt engineering alone. Because the OVH endpoint scales elastically and costs nothing per request, the platform processes 50 000 tickets monthly without line-item expense. When volume justifies it, the team will fine-tune a smaller model on labelled data; until then, gpt-oss-20b provides a fast proof-of-value.
Tokonomix benchmark snapshot
Tokonomix maintains rotating evaluations across reasoning, coding, multilingual and domain-specific categories; full leaderboards and scoring rubrics live at [/benchmarks/leaderboard](/en/benchmarks/leaderboard) and [/benchmarks/methodology](/en/benchmarks/methodology). We could not include gpt-oss-20b in our March 2026 leaderboard cycle because OVH's endpoint lacks published capacity guarantees and our test harness requires guaranteed sub-10s response times for fair comparison. Informal spot-checks suggest performance consistent with first-generation 20B open-source models—mid-tier on factual recall, below fine-tuned 13B instruct models on multi-step reasoning, and roughly on par with GPT-J-6B on code completion when the task fits within a narrow syntactic pattern.
For multilingual document classification (French, German, Spanish test sets) we observed 68–72 per cent F₁ on three-way sentiment labels—adequate for triage, insufficient for nuanced opinion mining. On a 50-question subset of legal-domain multiple-choice (ICLR LegalBench snippets translated to French), the model achieved 52 per cent accuracy, barely above random. Healthcare entity extraction (medications, dosages, conditions from clinical notes) showed 61 per cent recall at 78 per cent precision when prompted with five-shot examples. These figures position gpt-oss-20b as a baseline: better than keyword heuristics, weaker than commercial models fine-tuned for instruction-following.
It is critical to remember that scores rotate monthly as we refine evaluation sets and models themselves receive updates. What holds true in March may shift by June. Readers planning production deployments should run their own domain-specific validations and consult our /live-test sandbox, where gpt-oss-20b can be queried interactively alongside tier-matched alternatives, allowing side-by-side output comparison on real prompts.
EU privacy & data residency
Data residency is gpt-oss-20b's headline advantage. By anchoring inference in OVH's Gravelines facility, every API call respects GDPR Article 44's prohibition on third-country transfers unless adequate safeguards exist. For organisations subject to France's Loi de Programmation Militaire or Germany's BSI cloud-security catalogues, this geographic certainty simplifies compliance documentation. When a public hospital in Lyon submits anonymised patient queries, the traffic never crosses the Atlantic, eliminating the legal ambiguity that surrounds US-headquartered hyperscalers post-Schrems II.
OVH itself operates under French and EU law, meaning data-protection authorities can audit infrastructure without jurisdictional friction. The company's track record—rooted in dedicated-server hosting since the late 1990s—leans toward infrastructure transparency rather than opaque SaaS abstractions. For legal and government use cases flagged in our taxonomy (see /usecases/legal paths), this institutional posture matters as much as the model's capabilities. A municipal data-protection officer can visit the Gravelines campus, inspect physical access controls and validate that logs remain in-country—assurances that remote-only cloud providers cannot match.
That said, OVH's terms of service at the endpoint level remain sparse. There is no published data-retention window, no explicit prohibition on training-data reuse and no third-party SOC 2 attestation. Organisations in highly regulated verticals—pharmaceuticals under EMA scrutiny, banking under ECB oversight—may require formal DPAs and breach-notification SLAs that OVH has not yet standardised. Until those contracts surface, gpt-oss-20b occupies a middle ground: more privacy-conscious than hyperscaler APIs, less contractually mature than European AI-infrastructure specialists with ISO 27001 and NIS2 certifications already in place.
Importantly, EU residency does not immunise the model from algorithmic risks. If the underlying training data skews Anglophone or encodes biases from American web forums, those biases propagate regardless of server location. Privacy and fairness are orthogonal concerns; gpt-oss-20b solves one but requires separate mitigation strategies for the other.
Verdict & alternatives
Who should use gpt-oss-20b: European public-sector teams, research labs and startups prototyping multilingual automation inside strict data-residency boundaries. The zero-cost structure and Gravelines anchor make it ideal for proof-of-concept phases where iteration speed and compliance trump raw performance. DevOps engineers familiar with OpenAI-compatible SDKs can swap endpoints in minutes, de-risking vendor lock-in.
When to look elsewhere: If your workload demands guaranteed sub-second latency, published SLAs or frontier-grade reasoning, commercial alternatives dominate. For organisations willing to route data through US infrastructure, GPT-4 Turbo and Claude 3 Opus deliver measurably stronger results on complex reasoning and long-context tasks—consult [/benchmarks/intelligence](/en/benchmarks/intelligence) head-to-heads. If EU residency remains non-negotiable but you need enterprise support, consider Aleph Alpha's Luminous models (Germany) or Mistral AI's hosted offerings (France), both of which publish benchmark scores, context windows and contractual uptime commitments. For self-hosting flexibility, download LLaMA-2-13B or Mistral-7B-Instruct weights and deploy on your own hardware—trading OVH's managed convenience for full infrastructure control.
Six-month outlook: OVH's AI Endpoints portfolio is nascent. If the company publishes a model card, discloses context limits and adds SLA tiers—even at modest per-token fees—gpt-oss-20b could graduate from sandbox to production-ready tool. Conversely, sustained capacity contention or abandonment in favour of newer model families would relegate it to a footnote in Europe's sovereign-AI narrative. Our March 2026 survey of EU AI procurement officers found 40 per cent citing "lack of transparent benchmarks" as the top barrier to adopting regional providers; OVH can close that gap with minimal effort.
Ready to test gpt-oss-20b against your own prompts? Visit our interactive comparison tool at /live-test, where you can run side-by-side queries against gpt-oss-20b, Mistral-7B-Instruct and other EU-hosted models, observing latency, output quality and cost in real time. No registration required; bring your toughest multilingual summarisation task and see which endpoint delivers.
Last technical review: 2026-05-05 — Tokonomix.ai
