Tier B — Production

Runs in: France · Made in: United States

Meta-Llama-3_3-70B-Instruct

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 27, 2026·Last reviewed July 30, 2026

Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median) · P95 latency · 102 runs
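As a rough illustration of how P50 and P95 can be read off a set of latency samples, here is a minimal sketch using the nearest-rank method. The sample values and the `percentile` helper are hypothetical, and the page does not state which percentile method Tokonomix uses.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds (illustrative only, not Tokonomix data).
latencies_ms = [120, 131, 133, 134, 135, 140, 150, 160, 900, 2400]

p50 = percentile(latencies_ms, 50)  # typical response time
p95 = percentile(latencies_ms, 95)  # slow tail, dominated by the worst calls
print(p50, p95)
```

In this made-up sample the two slow calls barely move P50 but fully determine P95, which is why the two figures are reported together.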

Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Creative

Factual

100

Multilingual

100

Reasoning

Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — Meta-Llama-3_3-70B-Instruct

$0.6700 per 1M input tokens

$0.6700 per 1M output tokens

≈ $0.0005 per typical conversation (800 tokens)
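The typical-conversation estimate can be reproduced from the per-million rates. The sketch below assumes the 800 tokens are split evenly between input and output; the page does not give the split, and it does not affect the result here because both rates are equal. The function name is illustrative.

```python
# Rates quoted above, in USD per 1M tokens.
INPUT_PRICE_PER_M = 0.67
OUTPUT_PRICE_PER_M = 0.67


def conversation_cost(input_tokens, output_tokens):
    """Cost in USD for one conversation at the listed per-million-token rates."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Assumed split: 400 input + 400 output tokens = 800 tokens total.
cost = conversation_cost(400, 400)
print(f"${cost:.6f}")  # about $0.000536, which rounds to the $0.0005 shown
```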

Input vs output price (per 1M tokens)

per 1M input tokens: $0.6700

per 1M output tokens: $0.6700

Pricing over time

Input & output per 1M tokens · step-line = price changes

Input: $0.6700 / 1M (stable)

Output: $0.6700 / 1M (stable)

Chart dates: 2026-06-14, 2026-06-28, 2026-07-26

Legend: Input, Output, Price change

⟳ synced weekly

Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s): 1504 / avg 1554

Estimated by dividing an assumed 200 output tokens by the P50 latency. The absolute number depends on this assumption; the trend is what matters.
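Read as a rate, the estimate amounts to the assumed output length divided by the P50 latency. A minimal sketch follows, with a hypothetical function name. Plugging in the 133 ms P50 from the last automated speed benchmark gives roughly the 1504 tokens/s shown, which suggests, but does not confirm, that this is the calculation used.

```python
ASSUMED_OUTPUT_TOKENS = 200  # fixed output length assumed by the estimate


def estimated_tokens_per_second(p50_latency_ms, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Throughput estimate: assumed output tokens divided by P50 latency in seconds."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return output_tokens / (p50_latency_ms / 1000)


tps = estimated_tokens_per_second(133)
print(round(tps))
```

Because the 200-token figure is an assumption rather than a measured output length, doubling it would double the reported throughput without any change in the model's actual speed.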

Section 05

Capabilities

ownedBy: meta-llama

Section 06

Availability

How often this model answers when we call it — measured across real API requests and live tests over the last 30 days. This is separate from quality: these numbers only tell you whether the model responds, not how good the answer is.

Last 7 days

—

Last 30 days

100.0%

n=82

Median response time

123,720 ms

n=82

Based on 472 measurements over the last 30 days.

Technical details

Only live API calls and live-test requests count — internal probes and benchmark runs are excluded.

Calls with a custom API key (BYOK) are excluded: those failures are key-specific, not a sign of model downtime.

Failed calls are NOT included in quality scores — quality is measured on successful responses only. Availability and quality are independent signals.

Median response time (p50) across successful calls with a recorded duration. Outliers (very slow or very fast calls) pull the median less than the average.
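To illustrate why the median is reported rather than the average, here is a small sketch with hypothetical call durations (not Tokonomix measurements) in which a single very slow call distorts the mean but not the median.

```python
from statistics import mean, median

# Hypothetical durations in milliseconds; the last call is an extreme outlier.
durations_ms = [110, 120, 125, 130, 140, 60_000]

typical = median(durations_ms)  # midpoint of the two middle values
average = mean(durations_ms)    # pulled far upward by the single slow call

print(typical, round(average))
```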

Total calls (30d)

OK responses (30d)

Total calls (7d)

OK responses (7d)

Section 07

Tokonomix benchmark verdicts

⚖️

Endorsed by 2 judges

Independent LLM judges evaluated this model on our weekly intelligence tests

cohere/command-a: 100/100 · 1 run

1 correct · 0 partial · 0 wrong · 100% accuracy

claude-sonnet-4-5: 94/100 · 48 runs

44 correct · 1 partial · 3 wrong · 92% accuracy
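The accuracy percentages are consistent with counting only fully correct runs against the total. The sketch below makes that assumption explicit; the page does not define how partial answers are treated, and the function name is illustrative.

```python
def accuracy_pct(correct, partial, wrong):
    """Share of fully correct runs, in percent. Assumes partial answers earn no credit."""
    total = correct + partial + wrong
    if total == 0:
        raise ValueError("no runs recorded")
    return 100 * correct / total


claude_accuracy = accuracy_pct(44, 1, 3)  # 44 of 48 runs
cohere_accuracy = accuracy_pct(1, 0, 0)   # 1 of 1 run
print(round(claude_accuracy), round(cohere_accuracy))
```

Note that the 100% figure rests on a single run, so it carries far less weight than the 48-run result.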

● 2026-07-26

Quality drops 9.7 points to 88.0, factual performance weakens significantly

Meta-Llama-3.3-70B-Instruct on OVH AI Endpoints declined in quality in this benchmark window, falling from 97.7 to 88.0 overall. The drop is concentrated in factual performance, which scored just 57. Creative writing remains excellent at 95 and multilingual remains perfect at 100. Reasoning is now tracked and scores 100. The coding category, which scored 98 in the previous window, is absent from the current results, so the two windows are not directly comparable.

Latency is essentially unchanged at 7,649 ms, compared with 7,683 ms previously, so response times have not regressed.

A drop of nearly 10 points is substantial and warrants attention, particularly because the weak factual score pulls down the overall rating. Users who rely on this model for fact-based tasks should be aware of this limitation; those focused on creative, multilingual, or reasoning work can expect continued strong performance. With only 5 test runs in this window, these results are preliminary but indicative of current capabilities.
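The overall figure is consistent with an unweighted mean of the four category scores reported in this window. The sketch below assumes that aggregation; the page does not document how the overall score is computed, so treat it as a sanity check rather than the scoring method.

```python
# Category scores reported for the current window.
category_scores = {
    "creative": 95,
    "factual": 57,
    "multilingual": 100,
    "reasoning": 100,
}

# Assumption: the overall score is a plain average of the categories.
overall = sum(category_scores.values()) / len(category_scores)

previous_overall = 97.7
drop = round(previous_overall - overall, 1)

print(overall, drop)
```

Under this assumption a single weak category out of four moves the overall score by a quarter of its shortfall, which is why the factual result alone explains most of the decline.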

Quality

88.0

Latency p50

7,649 ms

Test runs

✗ Quality dropped 9.7 points

✗ Factual performance weak at 57

✓ Reasoning excellence at 100

✓ Latency remains stable

Last automated test

Jul 30, 2026 · 14:04 UTC · Speed benchmark

P50 latency

133 ms

P95 latency

134 ms

Errors

0 / 6 runs

Last reviewed by Tokonomix Team·July 30, 2026