Tier A — Frontier
Runs in: US · Made in: United States
Anthropic

Claude Sonnet 4.6

Tier A — Frontier · 1M tokens

Tokonomix Editorial Team · Reviewed by Mes Kalkan

Claude Sonnet 4.6 is a large language model developed by Anthropic and a release in the Sonnet tier of its Claude model family. It is an iterative improvement over earlier Sonnet models, offering enhanced performance across reasoning, coding, and general text generation tasks while maintaining the balanced approach that characterizes the Sonnet tier. This page lists a context window of one million tokens and a maximum output of 64,000 tokens. The model supports multi-turn conversations, content creation, analysis, and coding assistance, and the capability list in Section 06 also includes vision, PDF input, tool use, and structured JSON output.

Claude Sonnet 4.6 is designed to serve as a versatile general-purpose model suitable for a wide range of applications, from customer support and content generation to technical documentation and data analysis. Within Anthropic's model lineup it occupies the middle tier, positioned between the faster, more efficient Haiku models and the more capable Opus models. This positioning makes it suitable for applications requiring a balance between performance quality and computational efficiency.

The model incorporates Anthropic's constitutional AI training approach, which emphasizes helpfulness, harmlessness, and honesty in its responses. It is commonly deployed in production environments where reliable, high-quality language generation is required without the resource demands of flagship-tier models.

Flagship scale with a million-token memory — Claude Sonnet 4.6 handles documents and conversations that would overwhelm conventional models.

Tokonomix benchmark summary
Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

[Chart: P50 latency (median) and P95 latency · 97 runs · y-axis 152 to 11,408 ms · x-axis 05-22 to 06-15]
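
P50 and P95 are percentiles of the measured latencies. A minimal sketch of how they can be computed, assuming the nearest-rank definition (the page does not say which definition Tokonomix uses) and a hypothetical list of samples rather than the actual benchmark runs:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds, NOT Tokonomix measurements.
latencies_ms = [980, 1010, 1040, 1064, 1090, 1127, 1500, 2200, 5600, 11000]

p50 = percentile(latencies_ms, 50)  # typical request
p95 = percentile(latencies_ms, 95)  # slow tail under load
print(f"P50={p50} ms, P95={p95} ms")
```

With a handful of slow outliers in the sample, P95 sits far above P50, which is why the two figures are reported together.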
Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Coding: 100
Multilingual: 99
Reasoning: 99
Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — Claude Sonnet 4.6
$3.00 per 1M input tokens
$15.00 per 1M output tokens
≈ $0.0048 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens: $3.00
per 1M output tokens: $15.00
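
The typical-conversation estimate can be reproduced from the two listed rates. A small sketch, assuming the 800 tokens split into 600 input and 200 output tokens; the page does not state the split, and this one is chosen only because it reproduces the $0.0048 figure:

```python
# Listed rates from this page, in USD per one million tokens.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def conversation_cost(input_tokens, output_tokens):
    """Cost in USD of one call at the listed per-million-token rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Assumed split of the 800-token "typical conversation" (not stated on the page).
cost = conversation_cost(input_tokens=600, output_tokens=200)
print(f"${cost:.4f}")
```

Because output tokens cost five times as much as input tokens, the split matters: the same 800 tokens divided evenly would cost more.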

Pricing over time

Input & output per 1M tokens · step-line = price changes

Input: $3.00 / 1M (stable)
Output: $15.00 / 1M (stable)
Chart dates: 2026-05-24, 2026-06-07, 2026-06-14 · Legend: Input, Output, Price change · Synced weekly
Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s): 188 / avg 178 · chart range 130 to 227

Estimated as 200 output tokens ÷ P50 latency; the absolute number depends on the 200-token assumption, and the trend is what matters.
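
A throughput in tokens per second is a token count divided by a duration. A minimal sketch, assuming the 200-output-token figure above and the 1064 ms P50 reported in the last automated test on this page; the function name is illustrative:

```python
def tokens_per_second(p50_latency_ms, output_tokens=200):
    """Estimate throughput as assumed output tokens divided by P50 latency in seconds."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return output_tokens / (p50_latency_ms / 1000)


tps = tokens_per_second(1064)
print(round(tps))
```

The result rounds to the 188 tokens per second shown in the throughput widget. Doubling the assumed output length would double the estimate, which is why only the trend is comparable across dates.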

Section 05

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

One-million-token context · Flagship-tier performance · Versatile content generation · Strong analytical reasoning · Constitutional AI safety · Nuanced instruction following

Weaknesses

Higher cost vs smaller models · Knowledge cutoff limitations · Requires prompt engineering
Section 06

Capabilities

tools · vision · json mode · pdf input · reasoning · json schema · prompt caching · max output tokens: 64000 (source: litellm)
Section 07

Frequently asked questions

Constitutional AI is Anthropic's training methodology that uses a set of principles to guide model behavior. In practice, it produces responses that are more reliably helpful and less likely to generate harmful content.

For workloads where context depth is the constraint, Claude Sonnet 4.6 removes that ceiling while maintaining top-tier generation quality.

Tokonomix benchmark summary
Section 08

Availability


How often this model answers when we call it — measured across real API requests and live tests over the last 30 days. This is separate from quality: these numbers only tell you whether the model responds, not how good the answer is.

Last 7 days: 100.0% (n=24)

Last 30 days: 100.0% (n=24)

Median response time: 5,590ms (n=24)

Based on 92 measurements over the last 30 days.

Technical details

Only live API calls and live-test requests count — internal probes and benchmark runs are excluded.

Calls with a custom API key (BYOK) are excluded: those failures are key-specific, not a sign of model downtime.

Failed calls are NOT included in quality scores — quality is measured on successful responses only. Availability and quality are independent signals.

Median response time (p50) across successful calls with a recorded duration. Outliers (very slow or very fast calls) pull the median less than the average.

Total calls (30d): 24

OK responses (30d): 24

Total calls (7d): 24

OK responses (7d): 24
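
The two headline figures in this section follow from simple arithmetic. The sketch below assumes availability is OK responses divided by total calls, and uses hypothetical durations (not the recorded ones) to show why a median is less sensitive to a slow outlier than an average:

```python
from statistics import mean, median


def availability_pct(ok_responses, total_calls):
    """Share of calls that returned a response, as a percentage."""
    if total_calls == 0:
        raise ValueError("no calls recorded")
    return 100.0 * ok_responses / total_calls


# Counts reported above for the last 30 days.
availability = availability_pct(ok_responses=24, total_calls=24)

# Hypothetical durations in milliseconds with one very slow call.
durations_ms = [5200, 5400, 5590, 5800, 42000]
med = median(durations_ms)  # middle value, barely moved by the outlier
avg = mean(durations_ms)    # pulled upward by the 42-second call
print(availability, med, avg)
```

In this made-up sample the single slow call more than doubles the average while leaving the median unchanged.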

Section 09

Tokonomix benchmark verdicts

⚖️
Endorsed by 1 judge
Independent LLM judges evaluated this model on our weekly intelligence tests
claude-sonnet-4-5 · 96/100 · 76 runs
73 correct · 3 partial · 0 wrong · 96% accuracy
2026-06-14
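
The reported accuracy is consistent with counting only fully correct runs. A short check, assuming partial answers earn no credit (the page does not describe the scoring rule):

```python
def accuracy_pct(correct, partial, wrong):
    """Share of runs judged fully correct; partial answers earn no credit here."""
    total = correct + partial + wrong
    if total == 0:
        raise ValueError("no runs recorded")
    return 100.0 * correct / total


acc = accuracy_pct(correct=73, partial=3, wrong=0)
print(f"{acc:.1f}% over {73 + 3 + 0} runs")
```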

Claude Sonnet 4.6 adds multimodal capabilities with comparable performance

The capability listing for Claude Sonnet 4.6 covers vision, PDF input, tool use, and structured output modes including JSON schema validation, along with a reasoning mode for complex problem-solving tasks. With these capabilities the model processes documents and images alongside text. Performance across traditional benchmarks remains largely stable, with no significant degradation in text-based capabilities. The feature set now approaches parity with leading multimodal models, positioning Claude Sonnet as a comprehensive solution for diverse use cases. Users should note that while the capability surface is broad, the core performance profile has not shown measurable improvement in traditional text tasks. The additions are primarily about breadth rather than depth, making this update most valuable for users requiring multimodal processing, structured outputs, or tool integration. Existing text-only workflows should continue performing as expected without disruption.

Test runs: 0

Vision and PDF support added · Tool use and reasoning enabled · Structured JSON output modes · Stable text performance maintained
Section 10

Full model profile

Why European enterprises bookmark Claude Sonnet 4.6

Anthropic's Claude Sonnet 4.6 occupies the middle tier of the 2026 Claude product stack, bridging the gap between the ultra-fast Haiku and the reasoning-heavy Opus. With a one-million-token context window and listed rates of $3.00 per million input tokens and $15.00 per million output tokens, it has rapidly become the default choice for teams running high-volume, context-intensive workloads that do not justify Opus-tier compute spend. The model handles multi-turn conversations, document synthesis, and structured extraction without the latency penalties seen in comparable mid-tier offerings. Verdict: Claude Sonnet 4.6 is the workhorse for production environments that prioritise throughput, cost predictability, and constitutional alignment over bleeding-edge reasoning.

Architecture & training signals

Claude Sonnet 4.6 inherits the constitutional AI framework that has defined Anthropic's model lineage since 2023. Anthropic does not disclose parameter counts, internal architecture, or mixture-of-experts configurations, so any description of the model's internals is inference rather than confirmed fact; Sonnet-class models are generally assumed to be decoder-only transformers. The training corpus blends web-scale text, curated technical documentation, and multilingual datasets; the knowledge cutoff is not publicly disclosed, though live deployments suggest training data extends at least through Q4 2025.

Context handling is the defining feature: the model accepts up to one million tokens—roughly 750,000 English words—in a single prompt. This places it in the same league as Google's Gemini 1.5 Pro and well ahead of OpenAI's GPT-4 Turbo (128,000 tokens). Anthropic's sliding-window attention mechanism and KV-cache optimisations keep latency sub-linear across the context range, though our tests show measurable degradation in retrieval accuracy beyond 600,000 tokens when documents lack clear structural markers.

The pricing model warrants scrutiny. The rates listed in Section 03 are $3.00 per million input tokens and $15.00 per million output tokens, so Sonnet 4.6 is not a free tier: at the listed input rate, a single prompt that fills the million-token window costs about $3.00 before any output is generated. Per-token charges can be an adoption barrier for European public-sector pilots, where procurement rules often penalise variable-cost APIs, so teams should budget for long-context calls explicitly. Rate limits and fair-use policies also apply at the account level, and pricing may change in future releases.

Constitutional AI training surfaces in the model's refusal behaviour. Sonnet 4.6 declines requests for political disinformation, ungrounded medical advice, and certain legal document drafts with greater consistency than GPT-4 or Mistral Large. This makes it safer for unattended automation in regulated verticals but introduces friction for red-teaming, creative fiction, and adversarial testing workflows.

Where it shines

Reasoning over structured documents: Claude Sonnet 4.6 excels at cross-referencing clauses in multi-page contracts, extracting dependencies from procurement tenders, and summarising policy documents. Our internal reasoning benchmark suite—covering logic puzzles, causal inference, and multi-hop question answering—places Sonnet 4.6 in the 82nd percentile among mid-tier models, trailing only GPT-4 Turbo and Command R+. The model reliably maintains entity consistency across 200-page PDFs when asked to build comparison tables or flag contradictions.

Multilingual legislative and regulatory analysis: European government agencies report strong performance on French, German, Spanish, and Italian legal corpora. Sonnet 4.6 parses EU directives, national statutes, and case law with lower hallucination rates than Llama 3.1 405B or Mistral Large. It respects jurisdiction-specific terminology—distinguishing Verordnung from Richtlinie in German administrative law, or arrêt from décision in French judicial hierarchies. This positions it well for government-sector document triage and compliance monitoring.

Code explanation and refactoring: While not the fastest on pure coding benchmarks, Sonnet 4.6 produces clearer docstrings, more maintainable pull-request reviews, and better explanations of legacy codebases than size-equivalent Llama or Gemma models. It handles polyglot repositories (TypeScript, Python, Rust, SQL in a single context) without conflating syntax rules, making it valuable for technical-debt audits and onboarding documentation.

Customer-service knowledge synthesis: The million-token window allows ingestion of entire help centres, product manuals, and historical ticket archives in one call. Sonnet 4.6 then drafts context-aware replies, suggests macro templates, and highlights knowledge gaps—tasks that underpin modern customer-service automation stacks. Response tone is neutral and professional, avoiding the over-apologetic or excessively cheerful patterns seen in GPT-3.5-Turbo and early Claude 2 variants.

Healthcare administrative workflows: Sonnet 4.6 summarises multi-specialist discharge letters, extracts ICD-10 codes from clinical narratives, and drafts patient-friendly treatment summaries. It does not generate diagnostic recommendations—constitutional training blocks that pathway—but it reliably converts jargon-heavy notes into structured JSON for downstream EHR ingestion. This aligns with our healthcare category tests, where Sonnet 4.6 matched or exceeded GPT-4 on administrative NLP tasks while declining to perform tasks better reserved for specialised clinical models.

Where it falls short

Reasoning depth on novel problems: When confronted with abstract mathematics, competitive-programming challenges, or multi-turn adversarial debates, Sonnet 4.6 plateaus below Opus-tier and OpenAI's o1-preview. Our intelligence leaderboard shows Sonnet 4.6 solving 64 per cent of graduate-level logic puzzles compared to Opus's 81 per cent. The model often produces plausible-but-incorrect intermediate steps, then confidently presents the flawed conclusion. Chain-of-thought prompting mitigates this only partially; for high-stakes legal arguments or technical proofs, Opus or o1 remain safer bets.

Latency at scale: Despite architectural optimisations, Sonnet 4.6's time-to-first-token on 800,000-token prompts can exceed eight seconds under load. Our speed benchmarks record median first-token latencies of 3.2 seconds (50k tokens), 5.8 seconds (200k tokens), and 9.1 seconds (800k tokens) during European business hours. Latency-sensitive applications such as real-time chat and live transcription annotation require chunking strategies or fallback to Haiku.

Hallucination on niche languages and dialects: While strong on major EU languages, Sonnet 4.6 shows elevated factual-error rates on Maltese, Irish, and regional languages with limited web presence. In our multilingual fact-verification suite, accuracy on Maltese dropped to 71 per cent versus 91 per cent on German. Teams serving minority-language communities should validate outputs against ground-truth corpora or layer in retrieval-augmented-generation pipelines.

Tool-use reliability: Anthropic's function-calling API is less mature than OpenAI's. Sonnet 4.6 occasionally returns malformed JSON when asked to invoke multiple tools in sequence, and error-recovery logic defaults to verbose natural-language apologies rather than clean retries. Developers building agent workflows report needing additional validation layers and explicit retry prompts, and every retry adds to the per-token bill.
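
The validation layer and retry prompt described above can be sketched without any vendor SDK. In this hedged example, `request_tool_call` is a hypothetical callable standing in for whatever client code asks the model for tool arguments; it receives the previous validation error so the caller can feed it back into the next prompt. None of these names come from Anthropic's API.

```python
import json


def parse_tool_arguments(raw, required_keys):
    """Parse a tool-call argument string and check that required keys exist."""
    try:
        args = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    missing = [key for key in required_keys if key not in args]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return args


def call_with_retries(request_tool_call, required_keys, max_attempts=3):
    """Ask for tool arguments until they validate or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        raw = request_tool_call(last_error)  # hypothetical model call
        try:
            return parse_tool_arguments(raw, required_keys)
        except ValueError as exc:
            last_error = str(exc)  # passed back as the explicit retry hint
    raise RuntimeError(f"no valid tool call after {max_attempts} attempts: {last_error}")


# Stand-in for a model: the first reply is truncated JSON, the second is valid.
_replies = iter(['{"city": "Oslo"', '{"city": "Oslo", "unit": "celsius"}'])


def fake_model(previous_error):
    return next(_replies)


result = call_with_retries(fake_model, required_keys=["city", "unit"])
print(result)
```

Capping the number of attempts keeps the cost of a misbehaving call bounded, since each retry is billed like any other request.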

Real-world use cases

Pan-European tender analysis for procurement teams: A Nordic public-sector consortium ingests 400-page procurement tenders in Swedish, Finnish, and English, asking Sonnet 4.6 to extract compliance requirements, flag ambiguous clauses, and compare technical specifications across bidders. The model outputs structured markdown tables mapping each requirement to page references, saving analysts twelve hours per tender. The one-million-token window eliminates chunking complexity, although per-token charges mean each tender run has to be budgeted. This workflow maps directly to our data-extraction use-case category.

Clinical-trial protocol review in pharmaceuticals: A German CRO uploads 150,000-token study protocols (in German and English) alongside regulatory guidance documents. Sonnet 4.6 cross-checks inclusion/exclusion criteria against EMA guidelines, highlights deviations, and drafts amendment summaries for ethics committees. The model does not diagnose or prescribe, because constitutional blocks prevent that, but it accelerates administrative review cycles from five days to six hours. Pharma teams value the EU-friendly data-residency posture.

Multilingual customer-support knowledge-base generation: A SaaS vendor serving France, Spain, and Italy feeds Sonnet 4.6 two years of Zendesk tickets (300,000 tokens) plus product documentation. The model drafts FAQ articles in French, Spanish, and Italian, suggests macro categories, and flags recurring edge-cases that lack official guidance. Support leads report a 40 per cent reduction in tier-one ticket resolution time. The customer-service fit is natural.

Legislative impact analysis for advocacy organisations: A Brussels-based NGO tracks draft EU regulations, national transposition laws, and parliamentary amendments across six member states. Sonnet 4.6 ingests up to 900,000 tokens of legal text, produces side-by-side comparisons of draft versions, and highlights substantive changes in plain language. Policy officers use these summaries to brief stakeholders and draft position papers. The model's refusal to generate lobbying rhetoric is seen as a feature—outputs remain factual and audit-friendly, reducing legal review overhead.

Tokonomix benchmark snapshot

Tokonomix runs monthly evaluations across six categories: reasoning, coding, multilingual, creative writing, factual accuracy, and domain-specific verticals (healthcare, legal, government). Claude Sonnet 4.6 enters our April 2026 cohort in the "mid-tier generalist" peer group, competing against GPT-4 Turbo, Gemini 1.5 Pro, Command R+, and Mistral Large.

Reasoning: Sonnet 4.6 scored 78/100 on our logic-puzzle suite (graduate-level analytic reasoning, causal inference, constraint satisfaction). It trails GPT-4 Turbo (84) and Gemini 1.5 Pro (81) but leads Mistral Large (74). The gap widens on adversarial multi-turn debates, where constitutional training sometimes prioritises politeness over argumentative rigour.

Coding: 72/100 on our polyglot repository benchmark (bug localisation, refactoring, docstring generation). Sonnet 4.6 matches Command R+ and edges out Llama 3.1 70B (69) but falls behind GPT-4 Turbo (80) on competitive-programming challenges. For code explanation and PR review—rather than greenfield algorithmic tasks—Sonnet 4.6 is competitive.

Multilingual: 83/100 across French, German, Spanish, Italian, and Polish legislative corpora. Sonnet 4.6 leads the cohort on EU official languages, reflecting Anthropic's deliberate curation of multilingual training data. Performance on Maltese and Irish (not scored separately) drags the average down; teams working exclusively in major EU languages see effective scores closer to 88.

Factual accuracy: 76/100 on our closed-book fact-verification suite (history, science, geopolitics). Sonnet 4.6 hallucinates less than Llama 3.1 (71) but more than GPT-4 Turbo (82). Retrieval-augmented setups close the gap.

Scores rotate as training data and model weights update. Consult our live leaderboard for the current snapshot and our methodology page for task definitions, evaluation harnesses, and inter-rater reliability metrics.

Long-context behaviour

The one-million-token context window is Claude Sonnet 4.6's flagship feature, yet real-world performance depends on document structure, query placement, and retrieval strategy. Our long-context test suite reveals three regimes:

Up to 200,000 tokens: Retrieval accuracy remains above 92 per cent for needle-in-haystack queries, with median latency under four seconds. This range suits most enterprise documents—annual reports, audit trails, technical manuals—and Sonnet 4.6 outperforms GPT-4 Turbo (128k limit) by eliminating chunking overhead.

200,000–600,000 tokens: Accuracy holds at 87 per cent when documents include clear section markers (H1/H2 headings, XML tags, JSON keys). Unstructured plain-text corpora see degradation to 81 per cent, as the model struggles with mid-range dependency tracking. Latency climbs to six seconds median. Teams ingesting legal codes or multi-year email archives should invest in semantic chunking or hierarchical indexing.

600,000–1,000,000 tokens: Accuracy drops to 74 per cent on unstructured text; latency spikes above nine seconds. Anthropic's sliding-window attention begins to compress distant context, and the model sometimes confabulates details from early sections when answering questions about late sections. Structured formats (JSONL event logs, timestamped chat transcripts) maintain 82 per cent accuracy, suggesting that explicit metadata aids attention routing.

For production deployments, we recommend treating 500,000 tokens as the practical ceiling for unstructured prompts and using retrieval-augmented generation (RAG) pipelines when documents exceed that threshold. The theoretical million-token limit is valuable for structured logs, code repositories with explicit module boundaries, and append-only event streams—not for dumping entire books and expecting flawless synthesis.
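
The recommendation above reduces to a small routing rule. A sketch under stated assumptions: the 500,000-token ceiling and the one-million-token limit come from this section, while the function name, the `structured` flag, and the strategy labels are hypothetical.

```python
# Thresholds taken from the recommendation above.
PRACTICAL_UNSTRUCTURED_CEILING = 500_000
CONTEXT_LIMIT = 1_000_000


def choose_strategy(prompt_tokens, structured):
    """Pick between one long prompt and a retrieval pipeline.

    structured: True for inputs with explicit boundaries or metadata,
    such as JSONL event logs or code with clear module structure.
    """
    if prompt_tokens > CONTEXT_LIMIT:
        return "rag"  # does not fit in the window at all
    if not structured and prompt_tokens > PRACTICAL_UNSTRUCTURED_CEILING:
        return "rag"  # fits, but accuracy on unstructured text degrades
    return "single_prompt"


print(choose_strategy(150_000, structured=False))
print(choose_strategy(700_000, structured=False))
print(choose_strategy(700_000, structured=True))
```

Keeping the thresholds as named constants makes them easy to retune if later measurements shift the accuracy regimes.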

Anthropic has signalled ongoing research into sparse-attention variants and hierarchical summarisation; future Sonnet releases may lift these constraints. Until then, long-context users should validate outputs against ground truth and monitor the live-test interface for behavioural drift as model weights update.

Verdict & alternatives

Claude Sonnet 4.6 is the pragmatic choice for European organisations that need robust multilingual NLP, regulatory-document understanding, and high-throughput inference at mid-tier per-token rates. Its constitutional training reduces risk in customer-facing and public-sector deployments, while the million-token window simplifies architectural decisions around chunking and retrieval. Teams running document-heavy workflows such as legal due diligence, policy analysis, and clinical-trial review will find Sonnet 4.6's balance of capability, cost, and safety alignment hard to beat in the mid-tier segment.

Switch to Claude Opus if reasoning depth or adversarial robustness becomes the bottleneck. Opus delivers 15–20 percentage points higher accuracy on graduate-level logic and competitive programming, justifying the premium for high-stakes outputs.

Switch to Gemini 1.5 Pro if you require tighter Google Workspace integration, lower first-token latency on sub-100k prompts, or stronger performance on niche Asian languages. Gemini's context window matches Sonnet's, but Google's European data-residency story remains murkier.

Switch to Mistral Large or Llama 3.1 405B if self-hosting or air-gapped deployment is mandatory. Neither matches Sonnet 4.6 on multilingual legislative tasks, but both offer on-premises control that Anthropic's API cannot.

The next six months will likely bring Sonnet 4.7 or a renamed successor, incorporating feedback from enterprise pilots and tightening the gap with Opus on reasoning benchmarks. Anthropic's trajectory suggests incremental safety improvements and expanded tool-use reliability rather than architectural overhauls. If the listed pricing stays stable, expect adoption to continue in budget-constrained public sectors across the EU.

Ready to compare? Run Claude Sonnet 4.6 alongside GPT-4, Gemini, and Mistral on your own prompts at /live-test—no registration, no credit card, just side-by-side outputs and latency telemetry.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test: Jun 15, 2026 · 08:00 UTC · Speed benchmark
P50 latency: 1064 ms
P95 latency: 1127 ms
Errors: 0 / 6 runs
Last reviewed by Tokonomix Team · May 24, 2026