Skip to content
Tier A — Frontier
Runs in:USMade in:United States
Anthropic

Claude Opus 4.8

Tier A — Frontier · 1M tokens

Tokonomix Editorial Team·Reviewed by Mes Kalkan··

Claude Opus 4.8 is Anthropic's most capable flagship to date, built specifically for long-horizon autonomous work where code quality and self-correction matter most. Its 4× improvement in catching code faults before they propagate marks a meaningful shift from Opus 4.7's already-strong baseline.

Tokonomix model analysis
Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median)P95 latency48 runs
687839016094237973150005-2906-09ms
Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

97
Coding
100
Creative
95
Factual
100
Multilingual
Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — Claude Opus 4.8
$5.00 per 1M input tokens
$25.00 per 1M output tokens
≈ $0.0080 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens$5.00
per 1M output tokens$25.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$5.00

input / 1M

— stable

$25.00

output / 1M

— stable

2026-05-312026-06-072026-06-07
Input
Output
Price change
⟳ synced weekly
Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s)230 / avg 198
28842

Estimated from P50 latency × 200 output tokens — the absolute number depends on this assumption; the trend is what matters.

Section 05

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

4× fewer code faults slip throughSharper self-judgment on task progressLonger autonomous runs without derailing1M-token context window for large codebasesVision input for multimodal pipelinesRobust tool-use and function-callingAdaptive thinking adjusts depth per task

Weaknesses

Higher cost than Sonnet 4.6 or Haiku 4.5Slower latency than lighter model tiersNo extended-thinking mode availableKnowledge cutoff still applies; live data needs tools
Section 06

Capabilities

toolssource: litellmvisionjson modepdf inputreasoningjson schemaprompt cachingmax output tokens: 128000
Section 07

Frequently asked questions

The headline change is a roughly 4× reduction in the rate at which code defects pass unnoticed through the model's own review. Alongside that, Opus 4.8 shows improved self-assessment — it is less likely to declare a task complete when sub-steps are still failing — and it sustains coherent autonomous runs over longer horizons. Pricing is unchanged from 4.7.

For engineering teams running multi-step pipelines or agentic workflows where a missed bug compounds quickly, Opus 4.8 is the clearest upgrade path from 4.7 — at identical pricing. Teams running lighter, latency-sensitive tasks will still find Sonnet 4.6 a better fit.

Tokonomix editorial team
Section 08

Tokonomix benchmark verdicts

⚖️
Endorsed by 1 judge
Independent LLM judges evaluated this model on our weekly intelligence tests
claude-sonnet-4-598/100 · 5 runs
5 correct0 partial0 wrong100% accuracy
2026-06-07

Claude Opus 4.8 adds multimodal and tooling capabilities to baseline

Claude Opus 4.8 expands significantly beyond its previous text-only baseline with the addition of vision, PDF input, tool use, JSON modes, reasoning capabilities, and prompt caching. These represent substantial functional enhancements to the model's utility across diverse workflows. The core academic performance established in the baseline appears maintained, though no new benchmark scores are available for this window to confirm performance trends. The additions of structured output formats through json_mode and json_schema address common integration needs, while tool support enables agentic workflows that were previously unavailable. Vision and PDF input capabilities extend the model's applicability to multimodal tasks. Prompt caching should improve efficiency for repetitive workflows with shared context. Users gain a notably more versatile model compared to the baseline, though the absence of updated performance metrics means stability of core capabilities cannot be verified. The breadth of new features positions this release as a major capability expansion rather than an incremental refinement.

Quality

Latency p50

Test runs

0

Vision and PDF support added Tool use capability introduced JSON output modes available Prompt caching efficiency feature
Section 09

Full model profile

Claude Opus 4.8: built for longer autonomous runs

Anthropic released Claude Opus 4.8 on 28 May 2026 as its newest flagship. The headline is not a single dramatic leap but a tightening of the qualities that matter when you point a model at real work and walk away: roughly four times less likely to let code flaws slip through than Opus 4.7, sharper judgement about its own progress on a task, and the stamina to sustain longer autonomous runs. Those three gains reinforce each other — and they tell you exactly what this model was built to be better at.

What the 1M context actually buys you

Opus 4.8 carries a 1,000,000-token context window. A number this size gets quoted more often than it gets used well, so it is worth being concrete about what it changes.

The payoff is range. A window this wide lets you hold a large codebase, a long document set, or an extended interaction history in front of the model at once, instead of stitching context together across calls and hoping nothing important falls out of scope. For the autonomous-agent workflows this model is aimed at, that matters more than it does for one-shot prompting: long-running tasks accumulate state — files read, decisions made, intermediate results — and a wide window keeps more of that state addressable instead of summarised away.

Two of the model's stated improvements pair naturally with the context. It sustains longer autonomous runs than Opus 4.7, and it judges its own progress more accurately. A model that can keep more of the task in view and reason more honestly about how far it has gotten is one you can leave running unattended for longer without it drifting. The window is the raw capacity; the improved self-assessment is what stops that capacity turning into expensive wandering.

Pair the window with prompt caching, which Opus 4.8 supports, and the economics of feeding it large, stable context repeatedly become defensible. When a substantial block of input does not change between calls — a repository, a reference document, a system brief — caching is what keeps you from paying to reprocess it every time.

Vision and structured output

On the input side, Opus 4.8 is multimodal: it accepts image input and PDF input, covering most of the document-and-screenshot ingestion patterns that come up in practice — reading a diagram, parsing a scanned form, working through a PDF spec without a separate extraction step in front of it.

On the output side, it supports tool use, JSON mode, and JSON-schema structured output. That last one earns its place. JSON mode gets you valid JSON; JSON-schema structured output gets you JSON that conforms to a shape you define, with the right fields and types. For anyone wiring a model into a system that expects a specific contract, schema-constrained output is the difference between a response you can hand straight to the next service and one you have to validate, repair, and second-guess. Combined with tool use, it makes Opus 4.8 a model you can put inside a pipeline rather than beside one: ingest a PDF or image, reason over it, and return schema-conformant output a downstream service consumes without a cleanup layer.

Reasoning is supported through adaptive thinking, and there is no separate extended-thinking mode. That is a design choice, not an omission. Instead of asking you to flip a switch between a fast path and a slow, deliberate one, the model scales its own effort within a single mode — one less routing decision and one less behavioural branch to test.

Where it lands against the field

Tokonomix has not yet benchmarked Opus 4.8. Our intelligence and speed runners have not scored it, and we are not going to invent a rank for it. Grounded scores will appear on this page automatically once the next test cycle picks the model up — that is the only number-shaped claim we will stand behind right now, and it is a promise of forthcoming data, not a placeholder for it.

What we can say is qualitative, and it comes straight from how the model is positioned against its own predecessor. Against Opus 4.7, Opus 4.8 is roughly four times less likely to let code flaws slip through. That is the cleanest, most decision-relevant claim Anthropic makes about it, and it points at a specific job: code the model is expected to produce, review, or modify with less human checking behind it. A four-fold drop in defects that get past the model changes how much trust you can extend to an automated coding loop — not because the model is now infallible, but because the rate at which you get burned drops materially.

The other two gains — longer autonomous runs and sharper self-assessment — are the agentic complement to that story. Many failures in unsupervised systems are not single-step failures; they are failures of self-monitoring, premature confidence, or drift. Read together, the picture is consistent: a model tuned to do more on its own, for longer, with fewer errors escaping review. None of that is a benchmark, and we will not pretend it is. But it tells you what the model was built to be better at — and when our runners score it, you will be able to check that positioning against grounded numbers here.

Where it is the wrong tool

A premium flagship is the wrong default for most of what most teams run. Opus 4.8 sits at the premium tier — priced identically to Opus 4.7 — and that pricing is the first filter. If your workload is high-volume, latency-sensitive, or simple enough that a smaller model handles it cleanly, paying for flagship capability you never exercise is a self-inflicted cost. The improvements here pay off only on hard, long-running, error-sensitive work; on a short classification call or a templated extraction, they are invisible.

The adaptive-thinking design has an implication worth naming. Because there is no separate extended-thinking mode to toggle, you do not get a manual lever for forcing maximum deliberation on demand or suppressing it to claw back latency. For most users that is a simplification; for anyone whose architecture was built around switching a model between a cheap fast mode and an expensive deliberate one, it is a behavioural change to design around.

It is also a wait-for-data case if your selection policy requires benchmark-confirmed standing before rollout — the grounded scores are not in yet. And if your task never approaches the limits of a normal context window, the 1M window is capacity you are carrying but not using. The window is a reason to choose this model, not a reason on its own.

Deployment notes

The integration surface is broad and, for the most part, pipeline-friendly: tool use, image input, PDF input, JSON mode, JSON-schema structured output, reasoning, and prompt caching. In practice Opus 4.8 can sit at the centre of an agentic system — reading documents and images, calling tools, and returning schema-conformant output downstream services consume directly.

Three practical notes. Lean on prompt caching wherever your input has a stable prefix; it is the lever that keeps the large window affordable under repeated calls. Prefer JSON-schema structured output over bare JSON mode anywhere a downstream service has real expectations about shape — constraining to a schema is cheaper than validating after the fact. And plan around adaptive thinking as the single reasoning behaviour, since there is no separate extended-thinking mode to route to.

Picking it

Choose Claude Opus 4.8 when the work is hard, long, and unforgiving of errors — particularly code work, where its roughly four-fold reduction in flaws slipping through over Opus 4.7 is the strongest single argument for it. Choose it when you want a model to run autonomously for longer and reason more accurately about its own progress, and when a million-token window earns its keep against genuinely large context.

Skip it when a smaller, cheaper model clears the bar, since it is priced at the premium tier, and skip it as a blind buy if you need hard comparative scores first. Watch this page for our forthcoming grounded intelligence and speed numbers; until those land, the case for Opus 4.8 rests on what it was built to do better — and that case is specific enough to act on.


Authored through a 3-model cross-family consensus run — Claude Opus 4.8, OpenAI GPT-5.4, and Google Gemini 2.5 Pro — on the Tokonomix consensus engine, then editorially synthesised. Every claim is grounded in Anthropic's published release data; benchmark scores remain pending Tokonomix's own test runners.

Last automated test
Jun 9, 2026 · 20:03 UTC · Speed benchmark
P50 latency
870 ms
P95 latency
964 ms
Errors
0 / 6 runs
Last reviewed by Tokonomix Team·June 5, 2026