Tier B: Production

Runs in: France · Made in: United States

Meta-Llama-3_3-70B-Instruct


Tokonomix editorial team · Reviewed by Mes Kalkan · Published 27 May 2026 · Last reviewed 30 July 2026

Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

[Chart: P50 latency (median) and P95 latency · 101 runs]
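As a minimal sketch of how P50 and P95 can be derived from raw latency samples: the function name `latency_percentiles` and the sample values are made up for illustration, and the interpolation method is an assumption, since the page does not say which one Tokonomix uses.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return (p50, p95) for a list of latency samples in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points P1..P99;
    # "inclusive" treats the samples as the full population (an assumption).
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]


# Hypothetical samples: mostly fast responses plus a slow tail.
samples = [120, 135, 140, 142, 150, 155, 160, 900, 1500, 1900]
p50, p95 = latency_percentiles(samples)
print(f"P50 = {p50} ms, P95 = {p95} ms")
```

The slow tail barely moves P50 but dominates P95, which is why the two figures are reported side by side.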

Section 02

Quality scores

Evaluation results from judge-model assessments across a range of task categories. Scores reflect coherence, accuracy, and instruction following.

Creative

Factual

100

Multilingual

100

Reasoning

Section 03

Price history

Direct provider rates per million tokens, plus a cost estimate for a typical conversation.


API rates: Meta-Llama-3_3-70B-Instruct

$0.6700 per 1M input tokens

$0.6700 per 1M output tokens

≈ $0.0005 per typical conversation (800 tokens)
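The per-conversation estimate follows directly from the listed rates. The sketch below reproduces it, assuming the 800 tokens split evenly between input and output; the page does not state the split, and because both rates are equal the split does not change the result. The name `conversation_cost` is illustrative.

```python
from decimal import Decimal

# Rates as listed on this page, in USD per 1M tokens.
INPUT_PRICE_PER_M = Decimal("0.67")
OUTPUT_PRICE_PER_M = Decimal("0.67")


def conversation_cost(input_tokens, output_tokens):
    """Cost in USD of one conversation at the listed per-million-token rates."""
    per_input_token = INPUT_PRICE_PER_M / 1_000_000
    per_output_token = OUTPUT_PRICE_PER_M / 1_000_000
    return input_tokens * per_input_token + output_tokens * per_output_token


# Assumed split: 400 input + 400 output tokens = 800 tokens in total.
cost = conversation_cost(400, 400)
print(f"${cost} per conversation, about ${round(cost, 4)}")
```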


[Chart: Pricing over time. Input and output price per 1M tokens, 2026-06-14 to 2026-07-26; step line marks price changes. Input: $0.6700 (stable). Output: $0.6700 (stable). Synced weekly.]

Section 04

Tokens per second

Throughput in tokens per second, derived from the measured P50 latency. Higher values are better; fluctuations reflect server load at the provider.

Throughput (tokens/s): 1429 (average 1555)

Estimated as 200 assumed output tokens divided by the P50 latency. The absolute figure depends on this assumption; the trend is what matters.
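The estimate amounts to dividing the assumed output length by the P50 latency. As a sketch (the function name is illustrative), using the 140 ms P50 reported by the latest automated test on this page reproduces the 1429 tokens/s shown above:

```python
def estimated_throughput(p50_latency_ms, assumed_output_tokens=200):
    """Tokens per second, assuming a fixed number of output tokens per response."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    seconds = p50_latency_ms / 1000.0
    return assumed_output_tokens / seconds


tps = estimated_throughput(140)
print(round(tps))
```

Because the 200-token figure is fixed rather than measured, doubling the assumption doubles the result, so only changes over time are informative.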

Section 05

Capabilities

ownedBy: meta-llama

Section 06

Availability

How often this model responds when we call it, measured over real API requests and live tests in the past 30 days. This is separate from quality: these figures only show whether the model responds, not how good the response is.

Past 7 days

—

Past 30 days

100.0%

n=83

Median response time

119,496ms

n=83

Based on 473 measurements in the past 30 days.

Technical details

Only real API calls and live-test requests are counted; internal probes and benchmark runs are excluded.

Calls made with a customer's own API key (BYOK) are excluded: those errors are key-specific and do not indicate a model outage.

Failed calls are NOT counted in quality scores: quality is measured on successful responses. Availability and quality are independent signals.

Median response time (p50) over successful calls with a recorded duration. Outliers pull the median less than they pull the mean.

Total calls (30d)

OK responses (30d)

Total calls (7d)

OK responses (7d)
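The counting rules above can be sketched as a small filter-and-aggregate step. The `Call` record, its field names, and the source labels are hypothetical; the page describes the rules but not the underlying data model.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Call:
    source: str                   # hypothetical labels: "api", "live_test", "probe", "benchmark"
    byok: bool                    # True if the caller used their own API key
    ok: bool                      # True if the model returned a response
    duration_ms: Optional[float]  # None when no duration was recorded


COUNTED_SOURCES = {"api", "live_test"}


def availability_report(calls):
    """Return (availability in percent, median response time in ms)."""
    # Keep real API calls and live tests; drop probes, benchmark runs, and BYOK calls.
    counted = [c for c in calls if c.source in COUNTED_SOURCES and not c.byok]
    if not counted:
        return None, None
    ok_calls = [c for c in counted if c.ok]
    availability = 100.0 * len(ok_calls) / len(counted)
    # Median only over successful calls that have a recorded duration.
    durations = [c.duration_ms for c in ok_calls if c.duration_ms is not None]
    return availability, (median(durations) if durations else None)


calls = [
    Call("api", False, True, 100.0),
    Call("api", False, True, 300.0),
    Call("live_test", False, False, None),
    Call("live_test", False, True, 200.0),
    Call("probe", False, False, None),  # excluded: internal probe
    Call("api", True, False, None),     # excluded: BYOK
]
availability, p50 = availability_report(calls)
print(availability, p50)
```

In this made-up sample, four calls are counted and three succeed; the failed probe and the failed BYOK call do not lower the availability figure.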

Section 07

Tokonomix benchmark verdicts


Endorsed by 2 judges

Independent LLM judges evaluated this model on our weekly intelligence tests

cohere/command-a: 100/100 · 1 run

1 correct · 0 partial · 0 wrong · 100% accuracy

claude-sonnet-4-5: 94/100 · 48 runs

44 correct · 1 partial · 3 wrong · 92% accuracy
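The accuracy percentages above are consistent with counting only fully correct runs against all runs. That is an inference from the numbers, not a documented formula, and the sketch below does not attempt to reproduce the separate 0 to 100 judge score.

```python
def accuracy_percent(correct, partial, wrong):
    """Share of fully correct runs, rounded to a whole percent.

    Assumption: partial answers earn no credit toward accuracy.
    """
    total = correct + partial + wrong
    if total == 0:
        raise ValueError("no runs to score")
    return round(100 * correct / total)


print(accuracy_percent(1, 0, 0))   # cohere/command-a tallies
print(accuracy_percent(44, 1, 3))  # claude-sonnet-4-5 tallies
```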

● 2026-07-26

Quality drops 9.7 points to 88.0, factual performance weakens significantly

Meta-Llama-3.3-70B-Instruct on OVH AI Endpoints declined in quality in this benchmark window, falling from 97.7 to 88.0 overall. The largest shift is in factual performance, which scored just 57. Creative writing holds its previous level at 95, and multilingual capability remains perfect at 100. Reasoning is now tracked and scores 100. The coding category, which scored 98 in the previous window, is not represented in the current results, which makes a direct comparison difficult. Latency is essentially unchanged at 7,649 ms versus 7,683 ms previously, so there is no regression in response times. The 9.7-point drop is substantial and is pulled down mainly by the weak factual score. Users who rely on this model for fact-based tasks should be aware of this limitation, while those focused on creative, multilingual, or reasoning work can expect continued strong performance. With only 5 test runs in this window, these results are preliminary but indicative of current capabilities.

Quality

88.0

Latency p50

7,649 ms

Test runs

✗ Quality dropped 9.7 points

✗ Factual performance weak at 57

✓ Reasoning excellence at 100

✓ Latency remains stable
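The four category scores quoted in this verdict average to exactly 88.0. This suggests that the overall score is an unweighted mean of the categories present in the window, although the page does not state the formula. A short check under that assumption:

```python
# Category scores quoted in the 2026-07-26 verdict.
category_scores = {
    "creative": 95,
    "factual": 57,
    "multilingual": 100,
    "reasoning": 100,
}

# Assumption: overall quality is the unweighted mean of the tracked categories.
overall = round(sum(category_scores.values()) / len(category_scores), 1)

previous_overall = 97.7  # overall score of the previous window, as quoted
drop = round(previous_overall - overall, 1)

print(overall, drop)
```

Under this reading, the absence of the coding category (98 in the previous window) also changes which scores enter the average, which is one reason the two windows are hard to compare directly.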

Latest automated test

30 Jul 2026 · 08:04 UTC · Speed test

P50 latency

140 ms

P95 latency

1892 ms

Errors

0 / 6 runs

Last reviewed by the Tokonomix team · 30 July 2026