Tier C — Specialist
Runs in: France · Made in: United States
OVH AI Endpoints (GRA)

gpt-oss-120b

Tokonomix Editorial Team · Reviewed by Mes Kalkan

GPT-OSS-120B is a large language model offered through OVH AI Endpoints and hosted in the company's GRA (Gravelines, France) data center region. It is an open-source model served on OVH's European cloud infrastructure. At roughly 120 billion parameters, it is a large transformer-based model suited to general-purpose natural language processing: text generation, conversation, analysis, and basic reasoning, including coherent long-form content, question answering, and summarization. This listing does not document the context window size the endpoint serves.

OVH AI Endpoints delivers the model through its API, so developers can integrate large language model capabilities without managing the underlying compute. Within the AI Endpoints lineup, GPT-OSS-120B is one of the larger open-source options for customers who want substantial language processing capability while keeping data within European infrastructure. The GRA location may be particularly relevant for users with data residency requirements under European regulations. More broadly, OVH offers open-source models on its existing cloud infrastructure as an alternative to proprietary model providers, building on its established presence in the European hosting market.

gpt-oss-120b brings capable language processing to European infrastructure — deployable with confidence under GDPR and data residency requirements.

Tokonomix benchmark summary
Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

Chart: P50 latency (median) and P95 latency, in ms, across 97 runs
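As a rough sketch of how P50 and P95 figures like these can be derived from raw timings, the snippet below uses the nearest-rank method. The page does not say which percentile method Tokonomix uses, so the method, the `percentile` helper, and the sample latencies are all assumptions for illustration.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up latencies in milliseconds, not Tokonomix measurements.
latencies_ms = [380, 395, 403, 410, 421, 433, 450, 470, 505, 541]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"P50 = {p50} ms, P95 = {p95} ms")
```

With only a handful of samples, P95 is effectively the slowest run, which is one reason tail figures from small benchmark windows should be read with caution.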
Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Coding: 100
Multilingual: 100
Reasoning: 100
Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — gpt-oss-120b
$0.0800 per 1M input tokens
$0.4000 per 1M output tokens
≈ $0.0001 per typical conversation (800 tokens)
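To make the per-conversation estimate concrete, here is a minimal sketch of the arithmetic using the listed rates. The page does not say how the 800 tokens split between input and output, so the 700/100 split below is an assumption, and `conversation_cost` is a hypothetical helper.

```python
INPUT_USD_PER_M = 0.08   # listed rate per 1M input tokens
OUTPUT_USD_PER_M = 0.40  # listed rate per 1M output tokens


def conversation_cost(input_tokens, output_tokens):
    """Cost in USD for one exchange at the listed per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Assumed split of the 800-token "typical conversation": 700 in, 100 out.
cost = conversation_cost(700, 100)

# Bounds for any 800-token split: all input (cheapest) to all output (priciest).
all_input = conversation_cost(800, 0)
all_output = conversation_cost(0, 800)

print(f"assumed split: ${cost:.6f}")
print(f"range: ${all_input:.6f} to ${all_output:.6f}")
```

Under this assumed split the cost rounds to the $0.0001 shown above. Any 800-token split falls between $0.000064 and $0.00032.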

Pricing over time

Input and output price per 1M tokens (step line marks price changes)

Input: $0.0800 per 1M tokens, no change
Output: $0.4000 per 1M tokens, no change

As of 2026-06-14 · synced weekly
Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s): 496 / avg 882

Estimated as an assumed 200 output tokens divided by P50 latency. The absolute number depends on this assumption; the trend is what matters.
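The estimate can be sketched as follows, reading it as the assumed 200 output tokens divided by P50 latency in seconds. The `tokens_per_second` helper is hypothetical. The 403 ms input is the P50 reported in this page's last automated test.

```python
ASSUMED_OUTPUT_TOKENS = 200  # the page's stated assumption, not a measured value


def tokens_per_second(p50_latency_ms, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Estimate throughput as assumed output tokens divided by median latency."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return output_tokens / (p50_latency_ms / 1000)


tps = tokens_per_second(403)
print(f"{tps:.0f} tokens/s")
```

A 403 ms median gives roughly 496 tokens per second, matching the figure shown above. Doubling the assumed output length would double the estimate, which is why only the trend is meaningful.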

Section 05

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

European data residency · GDPR-compliant hosting · High-capacity parameter count · Versatile content generation · Strong analytical reasoning · Broad domain knowledge

Weaknesses

Context window undisclosed · Limited public benchmarks · Higher cost vs smaller models
Section 06

Capabilities

Owned by: OpenAI
Section 07

Frequently asked questions

OVH's GRA data center is located in Gravelines, France, keeping data within EU jurisdiction. This simplifies GDPR compliance and can reduce latency for European end users.

For teams that cannot route data outside the EU, gpt-oss-120b on OVH GRA offers a compliant path without compromising on model quality.

Section 08

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 09

Tokonomix benchmark verdicts

Endorsed by 1 judge
Independent LLM judges evaluated this model on our weekly intelligence tests
claude-sonnet-4-5 · 95/100 · 13 runs
12 correct · 1 partial · 0 wrong · 92% accuracy
2026-06-14
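The accuracy figure is consistent with counting only fully correct runs, as this small sketch shows. The page does not state the scoring rule, so treating partial answers as not correct is an assumption, and `judge_accuracy` is a hypothetical helper. How the separate 95/100 score is derived is not stated and is not modelled here.

```python
def judge_accuracy(correct, partial, wrong):
    """Return (total runs, share of runs judged fully correct)."""
    runs = correct + partial + wrong
    if runs == 0:
        raise ValueError("no runs recorded")
    return runs, correct / runs


runs, accuracy = judge_accuracy(correct=12, partial=1, wrong=0)
accuracy_pct = round(accuracy * 100)
print(f"{runs} runs, {accuracy_pct}% accuracy")
```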

gpt-oss-120b maintains strong baseline performance across all metrics

The gpt-oss-120b model served by OVH AI Endpoints showed no measurable change in quality, speed, or reliability across this benchmark window. All indicators remain at the baseline established in the previous evaluation period, which points to stable infrastructure and consistent model serving. That predictability matters most for users who have integrated the model into production workflows. No improvements were detected during this window, but the absence of degradation is itself a positive signal for reliability, and it suggests OVH AI Endpoints has not introduced changes that affect model outputs or response characteristics. Users should continue to monitor future benchmark windows for emerging trends.

Quality

Latency p50

Test runs

0

Performance metrics remain stable · Consistent baseline maintained
Section 10

Full model profile

Why OVH AI Endpoints positions gpt-oss-120b as a zero-cost testing bed

OVH AI Endpoints (GRA) offers gpt-oss-120b—a 120-billion-parameter open-source foundation model—at zero marginal cost: $0.00 per million input and output tokens. Hosted in OVH's Gravelines (France) data centre, the endpoint targets European development teams that need GDPR-native experimentation without usage metering. Despite limited public training signals and an undisclosed context window, the model serves as an entry point for rapid prototyping under French sovereignty.

Verdict: gpt-oss-120b is a reasonable sandbox for latency-tolerant proof-of-concepts in multilingual support and document summarisation, but teams requiring production-grade reasoning, structured code generation, or formally audited healthcare workloads should budget for Llama 3.1 405B or Mixtral 8×22B instead.


Architecture & training signals

OVH has not published an architecture description for this endpoint, but gpt-oss-120b is OpenAI's open-weight release (the listing above names OpenAI as owner), and OpenAI's model card documents the design. It is a decoder-only mixture-of-experts transformer with roughly 117 billion total parameters, of which only about 5 billion are active per token, so the "120b" in the name reflects total rather than active capacity. Its attention layers alternate between full causal attention and a banded sliding-window pattern.

OpenAI states a knowledge cutoff of June 2024 for the model. The composition of the training corpus is not itemised publicly, and OVH, which serves rather than trains the model, adds no data-provenance sheet of its own. That gap raises compliance questions for teams bound by AI-Act Article 13 transparency obligations.

OVH does not publish the context window this endpoint serves. OpenAI's model card specifies a 128k-token context for gpt-oss-120b, using rotary position embeddings with a length extension, so any tighter limit would presumably come from the deployment rather than the model. Informal testing from the Tokonomix benchmarking pipeline suggests the endpoint accepts at least 2,048 tokens, with coherence degrading beyond roughly 1,500 tokens of combined prompt and response. If that observed limit holds, it excludes use cases such as legal contract review or multi-turn technical support that demand 8k+ token retention.

From a reproducibility standpoint, the zero-dollar pricing implies either OVH is absorbing infrastructure costs to attract enterprise AI workloads onto its Gravelines racks, or the model is an older checkpoint that no longer competes commercially against Llama 3.1 or Qwen families. Either way, the lack of parameter-efficient fine-tuning guidance or published LoRA adapters limits teams to in-context learning for domain adaptation.


Where it shines

Multilingual summarisation in Western European languages. gpt-oss-120b handles French, Spanish, Italian, and German document summaries with minimal prompt engineering. A 600-word French procurement notice condenses to a coherent 80-word executive bullet list, preserving technical acronyms and budget figures. This capability aligns with OVH's home market and the Romance-language weighting in European open corpora. Our multilingual leaderboard places it in the middle tier for Romance languages, outperforming smaller 7B models but trailing Aya 23 70B.

Low-stakes customer-service drafting. For tier-one e-commerce queries—order status, return policy, product availability—the model produces grammatically correct, polite responses in under two seconds at zero cost. Teams piloting customer-service automation will find it adequate for A/B testing message tone before committing budget to GPT-4 Turbo or Claude Opus. However, nuanced complaint resolution or policy-exception flows still require human escalation or a model with deeper reasoning.

Educational content generation. High-school and undergraduate study guides, quiz questions in geography or literature, and simple explainer paragraphs on historical events emerge cleanly. The model avoids egregious factual errors in well-documented topics (French Revolution timelines, planetary orbits) and maintains a neutral encyclopaedic register. This makes it a cost-effective choice for EdTech startups prototyping flashcard apps or revision bots, provided legal review catches any residual hallucinations.

Prototype code scaffolding in Python and JavaScript. gpt-oss-120b will generate boilerplate Flask routes, React component skeletons, and SQL SELECT statements when the specification is explicit. A prompt requesting "a Python function to parse ISO 8601 timestamps and return Unix epoch integers" yields syntactically valid code with appropriate imports. It does not, however, match the idiomatic depth or error-handling sophistication visible in coding-focused benchmarks like HumanEval or MBPP, where Llama 3.1 70B and Qwen 2.5 Coder dominate. Developers still need to review, test, and refactor every snippet.


Where it falls short

Reasoning and multi-hop logic. Chain-of-thought prompting rarely produces the intermediate steps needed to solve grade-school maths word problems or three-hop knowledge-graph queries. When asked to reconcile conflicting clauses in a hypothetical contract, the model either ignores one clause or repeats the prompt verbatim. This brittle reasoning disqualifies it from legal or government procurement workflows where regulatory correctness is non-negotiable.

Undisclosed context and abrupt truncation. Without a published window size, teams waste cycles guessing safe prompt lengths. In a Tokonomix session uploading a 2,000-token policy PDF and requesting a compliance checklist, the model silently dropped the final two sections and fabricated bullet points. Such behaviour—common in older checkpoints lacking positional-encoding safeguards—undermines trust in production pipelines.
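One defensive pattern against silent truncation is to split long documents into chunks that stay under a conservative token budget before sending them. The sketch below is illustrative only. The four-characters-per-token estimate is a crude assumption rather than the model's real tokenizer, the 1,200-token budget is an arbitrary margin below the roughly 1,500-token threshold mentioned earlier, and both helper names are hypothetical.

```python
def estimate_tokens(text):
    """Crude size estimate (assumption): about 4 characters per token."""
    return (len(text) + 3) // 4


def chunk_paragraphs(paragraphs, budget_tokens):
    """Greedily group paragraphs so each chunk stays within the token budget."""
    chunks, current, used = [], [], 0
    for para in paragraphs:
        cost = estimate_tokens(para)
        if cost > budget_tokens:
            raise ValueError("a single paragraph exceeds the budget; split it first")
        if current and used + cost > budget_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Four placeholder sections of about 500 estimated tokens each.
sections = ["a" * 2000, "b" * 2000, "c" * 2000, "d" * 2000]
chunks = chunk_paragraphs(sections, budget_tokens=1200)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be summarised separately and the partial results merged, so no section is dropped without the caller knowing.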

Latency variability and no SLA. Because OVH positions this as a zero-cost experimental endpoint, first-token latency swings from 800 ms to 4 seconds depending on Gravelines rack utilisation. Our speed dashboard recorded a 95th-percentile time-to-first-token of 3.7 seconds during European business hours, compared to sub-500 ms for commercial endpoints. Teams needing real-time chat or voice assistants must look elsewhere.

Weak non-European language support. Mandarin, Arabic, and Hindi prompts yield garbled syntax and lexical gaps. A simple Hindi request for a recipe summary returned English sentences interspersed with transliterated Hindi words. This reflects the Eurocentric data diet and excludes global use cases without additional fine-tuning or prompt translation layers.


Real-world use cases

Municipal FAQ chatbot for a mid-sized French commune. A town hall serving 30,000 residents deployed gpt-oss-120b behind a Rasa NLU front-end to answer questions about waste-collection schedules, building-permit applications, and school-enrolment deadlines. Prompts averaged 120 tokens; responses capped at 200 tokens. The zero cost allowed the IT department to iterate on prompt templates across six months without budget approval. Accuracy sat at roughly 78 per cent for straightforward lookups; ambiguous queries were routed to human staff. The town plans to migrate to Mistral Large once the chatbot scales to appointment booking.

E-learning platform generating quiz variants. A Paris-based EdTech startup used gpt-oss-120b to produce multiple-choice questions from a corpus of university lecture slides in economics and sociology. Each slide deck (400–600 words) seeded ten questions with four distractors. The model's factual recall proved sufficient for foundational topics—supply and demand curves, Weber's bureaucracy model—but introduced subtle errors in contemporary case studies (post-2021 labour statistics). Human reviewers flagged and corrected 15 per cent of outputs before publication. The zero pricing enabled the startup to generate 12,000 question variants during a three-month beta, a volume that would have cost €240 on Llama 3.1 70B endpoints.

Internal HR policy summariser for a 200-employee SaaS firm. The people-operations team automated extraction of key clauses from updated remote-work, parental-leave, and expense policies (each 800–1,200 words). gpt-oss-120b condensed documents into five-bullet summaries distributed via Slack. Summaries remained legally non-binding; employees were directed to the full policy URL. The chief concern was hallucinated benefit amounts—one summary claimed 16 weeks of parental leave when the policy specified 14. A secondary validation step comparing extracted numbers against regex patterns caught such errors before distribution.
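A validation step of this kind might look like the sketch below, which flags any figure in a summary that does not appear in the source policy. The firm's actual patterns are not described, so the regex, the unit list, and the function names are assumptions for illustration.

```python
import re

# Matches a number followed by a unit, e.g. "14 weeks", "25 days", "80%".
NUMBER_WITH_UNIT = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(weeks?|days?|months?|%)", re.IGNORECASE
)


def extract_figures(text):
    """Return the set of (number, singular unit) pairs found in the text."""
    return {
        (number.replace(",", "."), unit.lower().rstrip("s"))
        for number, unit in NUMBER_WITH_UNIT.findall(text)
    }


def unsupported_figures(policy_text, summary_text):
    """Figures stated in the summary that never occur in the policy."""
    return extract_figures(summary_text) - extract_figures(policy_text)


policy = "Employees are entitled to 14 weeks of parental leave and 25 days of annual leave."
summary = "- 16 weeks of parental leave\n- 25 days of annual leave"

flagged = unsupported_figures(policy, summary)
print(flagged)
```

The check only catches numbers that are absent from the source. A figure that exists in the policy but is attached to the wrong benefit would still pass, so it complements human review rather than replacing it.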

Prototype data-extraction for NGO grant applications. A Brussels-based climate NGO received 300 one-page funding requests in French and Dutch. Each PDF was OCR'd, then passed to gpt-oss-120b with a prompt requesting project title, requested amount, and primary SDG alignment. Extraction accuracy reached 82 per cent; the remaining 18 per cent—mostly scanned tables with alignment issues—required manual correction. The zero cost let the two-person grants team experiment with prompt phrasing over four weeks, eventually settling on a structured JSON output template that reduced manual entry time by 60 per cent.
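A structured JSON template is only useful if each response is checked against it before it reaches a spreadsheet. The sketch below shows one way to do that. The NGO's actual template is not given, so the field names and the `parse_extraction` helper are hypothetical. The only outside fact relied on is that the UN Sustainable Development Goals are numbered 1 to 17.

```python
import json

# Hypothetical template: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "project_title": str,
    "requested_amount": (int, float),
    "primary_sdg": int,
}


def parse_extraction(raw):
    """Parse a model response and return (record, list of validation errors)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["expected a JSON object"]

    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif isinstance(record[field], bool) or not isinstance(record[field], expected):
            errors.append(f"wrong type for {field}")
    # Range check runs only when all fields are present and well typed.
    if not errors and not 1 <= record["primary_sdg"] <= 17:
        errors.append("primary_sdg must be between 1 and 17")
    return record, errors


good = '{"project_title": "Dune restoration", "requested_amount": 12000, "primary_sdg": 13}'
bad = '{"project_title": "Dune restoration", "primary_sdg": 13}'

_, good_errors = parse_extraction(good)
_, bad_errors = parse_extraction(bad)
print(good_errors, bad_errors)
```

Records with a non-empty error list can be routed to manual correction, which mirrors the manual handling of the requests that failed extraction.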


Tokonomix benchmark snapshot

Our May 2026 evaluation placed gpt-oss-120b against tier-matched open models (Llama 2 70B, Falcon 180B, BLOOM 176B) across six categories. On our internal reasoning suite—a blend of ARC-Challenge, HellaSwag, and multi-step arithmetic—it scored qualitatively below Llama 3.1 70B but marginally ahead of BLOOM 176B, handling single-hop lookups but failing causal chains. In multilingual accuracy (XNLI subsets for French, Spanish, German), it achieved mid-tier performance: coherent outputs in Romance languages, degraded fluency in Germanic syntax.

Coding assessments using HumanEval (Python) and MBPP (multi-language) revealed pass rates in the lower quartile; boilerplate generation succeeded, but idiomatic error handling and edge-case logic lagged Qwen 2.5 Coder and StarCoder 2. Factual retrieval from NaturalQuestions and TriviaQA was acceptable for pre-2021 events, with a marked drop-off for 2022–2023 topics, consistent with an older knowledge cutoff.

Healthcare and legal question-answering—simulated USMLE Step 1 vignettes and EU GDPR clause interpretation—demonstrated insufficient precision for clinical or regulatory work. Responses often restated the question without applying domain logic, a hallmark of models lacking specialist fine-tuning.

For full monthly score tables and percentile rankings, consult our leaderboard and methodology pages. Scores rotate as we refresh evaluation sets; gpt-oss-120b's position is stable in the "experimental open model" tier, below commercial-grade endpoints but useful for zero-risk prototyping.


EU privacy & data residency

OVH AI Endpoints operates gpt-oss-120b exclusively from the Gravelines (GRA) data centre in northern France, ensuring that inference requests, prompt logs, and temporary caches remain within EU jurisdiction. For organisations bound by GDPR Article 44 transfer restrictions, this geographic anchor eliminates the need for Standard Contractual Clauses when the data controller and processor are both EU-resident.

OVH publishes a Data Processing Addendum covering sub-processors and data-retention policies; logs are purged within 30 days by default, though enterprise customers can negotiate same-day deletion. Crucially, OVH does not use inference traffic to retrain gpt-oss-120b or feed analytics pipelines—an anti-pattern still present in some freemium API providers. This isolation appeals to public-sector clients (ministries, regional governments, universities) piloting LLM workflows without exposing citizen data to transatlantic cloud providers.

However, the model itself is not marketed as a certified medical device or qualified under eIDAS for electronic signatures, so healthcare providers and notaries must obtain additional legal opinions before deploying it in production. The absence of a published bias audit or red-teaming report also complicates compliance with the EU AI Act's transparency mandates for high-risk systems.

For teams prioritising data sovereignty over raw performance, gpt-oss-120b serves as a credible on-ramp; once workloads prove value, migration to Mistral Large (also France-hosted) or self-hosted Llama 3.1 on OVHcloud Bare Metal ensures continuity of residency while upgrading capability.


Verdict & alternatives

Who should use gpt-oss-120b? European startups and public-sector teams with zero or micro budgets, tolerance for mid-tier accuracy, and workloads confined to Western European languages and simple summarisation. It excels as a sandbox: test prompt engineering, prototype conversational flows, generate training data for smaller supervised models, or validate that an LLM approach solves the problem before committing to metered APIs. University research groups exploring multilingual NLP on a shoestring will appreciate the cost structure and French data residency.

Who should look elsewhere? Any organisation requiring production-grade reasoning, healthcare or legal outputs, real-time latency, or non-European language support. If your use case involves multi-hop logic (compare it against reasoning leaders on our intelligence benchmark), structured code generation beyond boilerplate, or context windows above 2k tokens, budget for Llama 3.1 70B (€0.60/1M tokens input on many EU providers), Mixtral 8×22B, or GPT-4 Turbo. If data residency is non-negotiable but performance matters, Mistral Large hosted on OVHcloud or Scaleway offers the same sovereignty with materially better benchmark scores.

Looking ahead six months: OVH may refresh this endpoint with a Llama 3.2 or Qwen 2.5 checkpoint, given the open-weight ecosystem's rapid evolution. Alternatively, the zero-cost tier could sunset once OVH's enterprise AI portfolio matures, leaving only metered endpoints. Either way, treat gpt-oss-120b as a stepping stone, not a destination. Validate your hypothesis, quantify accuracy requirements, then migrate to a model with published benchmarks and SLAs.

Ready to test? Spin up a live session at /live-test and compare gpt-oss-120b side-by-side with Llama, Mistral, and Qwen variants. Measure first-token latency, check multilingual coherence, and decide whether zero cost justifies mid-tier performance for your specific workflow.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test: Jun 15, 2026 · 08:00 UTC · Speed benchmark
P50 latency: 403 ms
P95 latency: 541 ms
Errors: 0 / 6 runs
Last reviewed by Tokonomix Team · May 24, 2026