Which open-weight model should you self-host?
Self-hosting a language model is the option that gets dismissed too early and adopted too late. Teams write it off as "behind the hosted frontier" when they could be running quality that was state-of-the-art twelve months ago at a fraction of the recurring cost. They adopt it in a panic after a compliance review finds a deal-breaker in someone else's terms of service. This guide picks the five open-weight models we would build a self-hosted stack on today, and the dimensions that decide which one matches your hardware.

Why self-hosted deserves a second look
The pitch against open-weight models used to be simple: the hosted frontier is so far ahead that running anything else is a false economy. That argument got weaker every quarter through 2024 and 2025. The strongest open models today match what was hosted-flagship a year ago, which is enough quality for almost every production workload that is not customer-facing chat. The gap to the cutting edge is real, but the gap to "good enough" is gone.
The case for going local is rarely about quality. It is about data residency, recurring cost, latency in regions the major vendors do not serve, and the ability to run a model that does not change underneath you when the provider deprecates a generation. A team that processes ten million internal documents a month for classification can save six figures a year on self-hosted infrastructure versus pay-per-token. A team that handles regulated data can avoid an entire procurement nightmare. A team in a region with high latency to US datacenters can serve users an order of magnitude faster.
The cost equation is not as simple as "model weights are free". You pay for GPUs, bought or rented, and for the engineering time to operate them. The break-even depends on token volume: below roughly a hundred million tokens a month, hosted APIs almost always win on total cost; above a billion, self-hosting almost always does. In the middle range, workload-specific details decide.
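The break-even arithmetic can be sketched in a few lines. This is a minimal illustration, assuming a flat monthly cost for the self-hosted stack (GPU rental plus an allowance for engineering time) and one blended hosted price per million tokens. The function names and every figure in the example are placeholders, not quotes from any provider.

```python
def break_even_tokens_per_month(fixed_monthly_cost_usd: float,
                                hosted_price_per_million_usd: float) -> float:
    """Monthly token volume above which a flat self-hosted cost is cheaper."""
    if hosted_price_per_million_usd <= 0:
        raise ValueError("hosted price must be positive")
    return fixed_monthly_cost_usd / hosted_price_per_million_usd * 1_000_000


def monthly_costs(tokens_per_month: float,
                  fixed_monthly_cost_usd: float,
                  hosted_price_per_million_usd: float) -> dict:
    """Compare the two bills for a given monthly volume."""
    hosted = tokens_per_month / 1_000_000 * hosted_price_per_million_usd
    return {"hosted": hosted, "self_hosted": fixed_monthly_cost_usd}


if __name__ == "__main__":
    # Placeholder inputs: 1,500 USD/month for the stack, 2.00 USD per 1M hosted tokens.
    volume = break_even_tokens_per_month(1500.0, 2.0)
    print(f"break-even at {volume:,.0f} tokens per month")
    print(monthly_costs(100_000_000, 1500.0, 2.0))
```

With these placeholder inputs the break-even lands at 750 million tokens a month, inside the middle range described above. The flat-cost assumption is the weak point: a second GPU or an on-call rota moves the line considerably.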
Five constraints define the choice: how much VRAM the model needs, how much quality it keeps at the quantisation you can fit, the licence terms for your use case, the maturity of the surrounding ecosystem, and the latency the model can actually deliver on your hardware. The right model is the one that fits all five, not the one with the best paper benchmark score.

The five dimensions that decide which model fits
These are the axes our scorecard weights when picking an open-weight model for production self-hosting. The relative weighting shifts with your hardware budget, your jurisdiction and your tolerance for ecosystem rough edges — but every credible contender clears a minimum bar on all five.
- 01 — Hardware fit
Does it run on the cards you actually have?
A model that needs a multi-GPU node is a different proposition from one that runs on a single consumer card. Always calculate the VRAM requirement at the quantisation you plan to deploy and add a comfortable margin for the KV cache at your target context length. The cheapest mistake is over-buying hardware; the most expensive is under-buying.
- 02 — Quality at quantisation
How much does it lose at the quant level you can fit?
Quantisation trades quality for memory and speed. Some models hold up well to four-bit quants; others degrade noticeably below eight. The published full-precision benchmarks tell you little — measure on the quant level your hardware actually allows, and accept that the answer can flip the leaderboard.
- 03 — Licence terms
Can you actually use it the way you intend?
Open weights are not all open licences. Some allow broad commercial use with no obligations; others carry usage thresholds, attribution clauses or redistribution restrictions. Read the licence before you build, not after. A model with a friendly licence and slightly weaker quality usually beats a stronger one whose licence your legal team will eventually veto.
- 04 — Ecosystem support
Is the serving stack rough or polished?
A model with first-class support in vLLM, Ollama and llama.cpp will be far cheaper to operate than one that ships with a single reference script and a hopeful README. Tooling maturity is the hidden cost most teams underestimate; it shows up in the engineer-hours you spend on incidents.
- 05 — Latency on your hardware
Does it generate fast enough for the use case?
A self-hosted model that produces ten tokens a second on the GPU you can afford is a model you cannot use for chat. Measure tokens-per-second under realistic concurrency on the exact card you plan to deploy on; numbers from someone else's H100 will not transfer to your L40S.
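The hardware-fit check from the first dimension can be approximated before any hardware is borrowed. The sketch below estimates weight memory from parameter count and bits per weight, and KV-cache memory from the attention shape. It ignores quantisation metadata, activations and serving-framework overhead, so treat the result as a lower bound. The layer and head counts in the example are assumed values for a 70B-class model with grouped-query attention; read the real ones from the model's config file.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, n_sequences: int = 1,
                bytes_per_value: int = 2) -> float:
    """KV-cache memory: one key and one value tensor per layer (hence the 2)."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_len * n_sequences
    return values * bytes_per_value / 1e9


def fits(total_vram_gb: float, needed_gb: float, margin: float = 0.10) -> bool:
    """True if the estimate plus a safety margin fits in the available VRAM."""
    return needed_gb * (1 + margin) <= total_vram_gb


if __name__ == "__main__":
    # Assumed shape: 70e9 parameters, 80 layers, 8 KV heads, head dimension 128.
    weights = weights_gb(70e9, bits_per_weight=4)
    cache = kv_cache_gb(80, 8, 128, context_len=8192, n_sequences=4)
    print(f"weights {weights:.1f} GB, KV cache {cache:.1f} GB")
    print("fits on 48 GB:", fits(48, weights + cache))
```

The KV cache scales linearly with both context length and the number of concurrent sequences, which is why a model that loads comfortably can still run out of memory under load.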
Tokonomix top 5 picks for self-hosted today
What follows is the set we would actually deploy on metal next week. Self-hosting rewards a different kind of selection than the hosted-API world does — the right main model is usually the largest one that still leaves you headroom on the GPU at the quant level you can tolerate. Add a smaller second model behind a router for the queries that do not need the big one, and the economics start to bend in your favour.
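The two-model arrangement can be illustrated with a deliberately naive router. This is a sketch under stated assumptions: the model names, the hint phrases and the length threshold are all hypothetical, and a production router would more likely use a trained classifier or confidence signals from the small model than keyword matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


# Hypothetical phrases that suggest a prompt needs the larger model.
HARD_HINTS = ("explain why", "step by step", "prove", "refactor")


def route(prompt: str,
          small_model: str = "small-model",
          large_model: str = "large-model",
          max_small_chars: int = 600) -> Route:
    """Send long or reasoning-heavy prompts to the large model, the rest to the small one."""
    if len(prompt) > max_small_chars:
        return Route(large_model, "long prompt")
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return Route(large_model, "reasoning hint")
    return Route(small_model, "short, simple prompt")


if __name__ == "__main__":
    print(route("Classify this ticket: printer offline"))
    print(route("Explain why the build fails, step by step"))
```

Even a crude rule like this shows the shape of the saving: every request the small model absorbs frees capacity on the GPU that hosts the large one.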
Meta-Llama-3_3-70B-Instruct
via OVH AI Endpoints (GRA)
The de-facto baseline that every open-weight discussion now starts from. Strong instruction-following, broad language coverage, and a community ecosystem (Ollama, vLLM, llama.cpp) deeper than any alternative. Needs serious hardware — two consumer GPUs or one datacenter card — but the quality at that size pays for itself.
- Input / 1M tokens: $0.6700
- Output / 1M tokens: $0.6700
- Context: not listed
Qwen3-32B
via OVH AI Endpoints (GRA)
Fits comfortably on a single high-end consumer GPU at sensible quantisation, with quality close to the larger Llama for most workloads. The right pick when the budget is one card, not a cluster, and English is not the only language the model has to handle well.
- Input / 1M tokens: $0.0800
- Output / 1M tokens: $0.2300
- Context: not listed
Mistral-Small-3.2-24B-Instruct-2506
via OVH AI Endpoints (GRA)
Permissively-licensed open weights from a European vendor, hosted on EU-resident infrastructure, and tuned for languages the US-origin models often handle thinly. A natural pick for teams whose procurement rules favour EU-origin models or whose users speak something other than the top three. Always re-read the licence note on the model card before shipping commercial use.
- Input / 1M tokens: $0.0900
- Output / 1M tokens: $0.2800
- Context: not listed
gpt-oss-120b
via OVH AI Endpoints (GRA)
Strong general-purpose instruct model with permissive (Apache 2.0) licensing. It is a text-only mixture-of-experts model: the largest pick here by total parameter count, but only a small fraction of those parameters is active for each token, so it generates faster than its size suggests. A sensible step up from the Llama baseline when the quality gap matters and a datacenter card is available to hold it.
- Input / 1M tokens: $0.0800
- Output / 1M tokens: $0.4000
- Context: not listed
Hosted price reference (when you do not self-host)
Self-hosting is one option; the other is buying inference from a provider that runs the same open-weight models on your behalf. The chart shows the live hosted price per million output tokens for the picks that publish one — useful as a sanity-check for your own self-hosted unit economics.
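To make that sanity-check concrete, a self-hosted unit price can be derived from the hourly cost of a GPU and the throughput it sustains. The sketch below assumes a single GPU, an aggregate tokens-per-second figure measured under load, and a utilisation factor for the hours the card sits idle. The function name and all input figures are illustrative assumptions.

```python
def self_hosted_price_per_million(gpu_cost_per_hour_usd: float,
                                  tokens_per_second: float,
                                  utilisation: float = 0.5) -> float:
    """Effective USD per 1M generated tokens on one GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_cost_per_hour_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Placeholder inputs: 2.00 USD/hour GPU, 1,500 tokens/s aggregate, busy half the time.
    price = self_hosted_price_per_million(2.0, 1500, utilisation=0.5)
    print(f"self-hosted: ${price:.2f} per 1M tokens")
```

With these placeholder inputs the result is roughly $0.74 per million tokens. Utilisation dominates the outcome: the same card at full load would halve that figure, which is why bursty workloads often favour per-token pricing.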

A field guide: which model for which hardware
The mapping below is the one we would use to advise a team picking their first self-hosted model. Treat it as a default, not a verdict — measuring tokens-per-second on your own GPU will beat any general recommendation.
Single consumer GPU (24-32 GB VRAM)
Workstation or developer laptop with one strong card. Mistral Small 3.2 or Qwen3-32B at four-bit quant give the best quality-per-card in this range. Serve via Ollama for ease of use or vLLM for higher throughput.
Datacenter inference node
An L40S, A100 or H100 dedicated to inference. Llama 3.3 70B is the safe default; step up to gpt-oss-120b if the quality gap matters and the hardware can hold it. Serve with vLLM, whose paged attention keeps KV-cache memory use efficient under concurrent load.
CPU-only or edge device
Embedded device, on-laptop privacy mode, or a server without a GPU. Stick to small models — Gemma 3 4B or Mistral 7B — served via llama.cpp. Set realistic expectations: the quality will not match a hosted tier-A model.
Managed open-weight inference
You want the licence and provenance of open models without operating the GPUs. Providers like OVH AI Endpoints serve Llama, Mistral, Qwen and Gemma on EU-resident infrastructure with per-token pricing — middle ground between full self-host and hosted frontier.

Benchmark on your own hardware before you commit
Borrow the GPU you intend to deploy on. Load two candidates onto it at the quant level you actually plan to ship — not the full-precision version on a borrowed H100 — and run the same hundred prompts through both at realistic concurrency. You will learn more about which one fits you in an afternoon than any benchmark page can tell you in a quarter.
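A throughput comparison of this kind needs very little harness code. The sketch below is one possible shape, assuming you supply a `generate` callable that sends one prompt to your serving stack and returns the number of tokens produced. That callable, and the `fake_generate` stand-in used here so the example runs without a GPU, are hypothetical and not part of any serving library.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def measure_throughput(generate: Callable[[str], int],
                       prompts: Sequence[str],
                       concurrency: int = 8) -> dict:
    """Run all prompts with a fixed number of in-flight requests and time the batch."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    total = sum(token_counts)
    return {
        "requests": len(token_counts),
        "total_tokens": total,
        "seconds": elapsed,
        "tokens_per_second": total / elapsed if elapsed > 0 else float("inf"),
    }


def fake_generate(prompt: str) -> int:
    """Stand-in for a real client call: sleeps briefly, then reports a token count."""
    time.sleep(0.01)
    return len(prompt.split())


if __name__ == "__main__":
    prompts = ["summarise this internal document please"] * 100
    print(measure_throughput(fake_generate, prompts, concurrency=8))
```

Run it once per candidate with the same prompts and the same concurrency, and compare the aggregate tokens-per-second figures rather than single-request latency.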
Then read what comes out. Did it cope with the quant? Did the throughput hold under concurrent load? Did the licence survive your legal team's first reading? Does your chosen serving stack treat it as a first-class citizen or as an afterthought? The model that wins on your hardware is the one to put into production — even when it is not the one any leaderboard puts at the top.
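Those questions can be recorded as a simple weighted scorecard across the five dimensions, with a minimum bar that disqualifies a candidate outright. The sketch below is an illustration only: the 0 to 5 rating scale, the dimension keys, the weights and the minimum are assumptions, not the weighting used for the picks in this guide.

```python
from typing import Optional

DIMENSIONS = ("hardware_fit", "quality_at_quant", "licence", "ecosystem", "latency")


def score(ratings: dict, weights: dict, minimum: float = 2.0) -> Optional[float]:
    """Weighted mean of 0-5 ratings, or None if any dimension falls below the minimum bar."""
    missing = [d for d in DIMENSIONS if d not in ratings or d not in weights]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    if any(ratings[d] < minimum for d in DIMENSIONS):
        return None  # fails the minimum bar on at least one dimension
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(ratings[d] * weights[d] for d in DIMENSIONS) / total_weight


if __name__ == "__main__":
    equal = {d: 1.0 for d in DIMENSIONS}
    candidate = {"hardware_fit": 5, "quality_at_quant": 4, "licence": 4,
                 "ecosystem": 3, "latency": 4}
    print(score(candidate, equal))
```

The minimum bar matters more than the weights: a model that fails the licence check is out regardless of how well it scores elsewhere.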