Tier C — Specialist
Runs in: US · Made in: United States
Anthropic

Claude Sonnet 4

Tier C — Specialist · 200K tokens

Tokonomix Editorial Team · Reviewed by Mes Kalkan

Claude Sonnet 4 is a large language model developed by Anthropic, released in May 2025 as part of the Claude 4 model family. It is a mid-tier offering designed to balance strong performance across general text generation tasks with efficient resource use. The model features a 200,000-token context window, enabling it to stay coherent across lengthy documents, extended conversations, and complex multi-turn interactions.

The model is designed for standard text generation workloads including content creation, analysis, summarization, question answering, and conversational applications. It is competent at coding tasks, mathematical reasoning, and multi-domain knowledge synthesis. Claude Sonnet 4 produces text output and, per the capability list further down this page, accepts image and PDF input alongside text.

Within Anthropic's model lineup, Claude Sonnet 4 sits between the more computationally intensive Opus tier and the lighter Haiku variants. It is positioned as a general-purpose option for developers and organizations seeking reliable language model capabilities without requiring the maximum performance of flagship models.

The model implements Anthropic's Constitutional AI training methodology, which emphasizes helpfulness, harmlessness, and honesty in its responses. It succeeds earlier versions in the Sonnet series with improvements to reasoning capabilities, instruction following, and output quality across diverse task types.

Claude Sonnet 4 is a dependable general-purpose model from Anthropic, covering the full range of text generation tasks with consistent quality.

Tokonomix benchmark summary
Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

[Chart: P50 latency (median) and P95 latency in ms, 97 runs]
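For readers who want to reproduce figures of this kind, the sketch below shows one way to compute P50 and P95 from raw latency samples. The sample values are hypothetical, and the choice of the inclusive quantile method is an assumption; the exact aggregation Tokonomix uses is not documented on this page.

```python
import statistics

def latency_percentiles(samples_ms):
    """Return (p50, p95) for a list of latency samples in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99,
    # so index 49 is the 50th percentile and index 94 is the 95th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

# Hypothetical samples for illustration, not Tokonomix measurements.
samples = [5200, 5400, 5563, 5800, 6100, 6642, 7100]
p50, p95 = latency_percentiles(samples)
print(f"P50 = {p50:.0f} ms, P95 = {p95:.0f} ms")
```

With only a handful of runs, P95 is dominated by the slowest one or two samples, so it is worth reading alongside the run count.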
Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Coding: 100
Multilingual: 99
Reasoning: 100
Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — Claude Sonnet 4
$3.00 per 1M input tokens
$15.00 per 1M output tokens
≈ $0.0048 per typical conversation (800 tokens)
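The per-conversation estimate can be checked with simple arithmetic. The sketch below assumes the 800-token conversation splits into 600 input tokens and 200 output tokens; that split is our assumption (it is not stated on this page), chosen because it reproduces the $0.0048 figure at the listed rates.

```python
INPUT_USD_PER_MTOK = 3.00    # listed rate per 1M input tokens
OUTPUT_USD_PER_MTOK = 15.00  # listed rate per 1M output tokens

def conversation_cost(input_tokens, output_tokens):
    """Cost in USD for one exchange at the listed per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000

# Assumed split of the 800-token "typical conversation".
cost = conversation_cost(input_tokens=600, output_tokens=200)
print(f"${cost:.4f} per conversation")
```

Because output tokens cost five times as much as input tokens, the estimate is sensitive to the split: the same 800 tokens cost more if a larger share is generated output.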

Pricing over time

Input & output per 1M tokens · step-line = price changes

Input: $3.00 / 1M (stable)
Output: $15.00 / 1M (stable)

[Chart: input and output price with price-change markers, 2026-05-24 to 2026-06-14 · synced weekly]
Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s): 36 / avg 224

Estimated as 200 assumed output tokens divided by P50 latency. The absolute number depends on this assumption; the trend is what matters.
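The estimate can be sketched as below. We read it as 200 assumed output tokens divided by the P50 latency in seconds; this reading is an interpretation, chosen because it reproduces the 36 tokens/s shown here from the 5563 ms P50 of the last automated test.

```python
ASSUMED_OUTPUT_TOKENS = 200  # the page's stated assumption

def tokens_per_second(p50_latency_ms, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Rough throughput: assumed output length over median end-to-end latency."""
    return output_tokens / (p50_latency_ms / 1000)

latest = tokens_per_second(5563)   # P50 from the last automated test
slower = tokens_per_second(7867)   # P50 from the weekly verdict window
print(f"{latest:.1f} tokens/s vs {slower:.1f} tokens/s")
```

Since latency here includes time to first token, this understates pure decoding speed; it is a trend indicator rather than a hardware-level throughput figure.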

Section 05

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Very large context window · Versatile content generation · Strong analytical reasoning · Constitutional AI safety · Nuanced instruction following · Broad domain knowledge

Weaknesses

Higher cost vs smaller models · Knowledge cutoff limitations · Requires prompt engineering
Section 06

Capabilities

tools · vision · json mode · pdf input · reasoning · json schema · prompt caching · max output tokens: 64000 (source: litellm)
Section 07

Frequently asked questions

Constitutional AI is Anthropic's training methodology that uses a set of principles to guide model behavior. In practice, it produces responses that are more reliably helpful and less likely to generate harmful content.

For teams seeking reliable output without specialization overhead, Claude Sonnet 4 is a sound choice across content, analysis, and dialogue tasks.

Section 08

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 09

Tokonomix benchmark verdicts

Endorsed by 1 judge
Independent LLM judges evaluated this model on our weekly intelligence tests
claude-sonnet-4-5: 95/100 · 76 runs
72 correct · 3 partial · 1 wrong · 95% accuracy
2026-06-14
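The accuracy figure on the verdict card follows from the tally if only fully correct runs count toward it. That scoring rule is our assumption (the page does not say how partial answers are weighted); the short check below shows it lands on the reported 95%.

```python
# Tallies from the verdict card above.
correct, partial, wrong = 72, 3, 1

total = correct + partial + wrong
# Assumption: partial answers earn no credit toward accuracy.
accuracy_pct = 100 * correct / total
print(f"{total} runs, {accuracy_pct:.1f}% fully correct")
```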

Claude Sonnet 4 maintains perfect scores but latency increases 24%

Claude Sonnet 4 continues to perform exceptionally, with a near-perfect overall quality score of 99.6, up from 96.6 in the previous window. The model keeps its perfect 100 in coding and a strong 99 in multilingual tasks, showing consistency in core technical capabilities. Reasoning now registers at a perfect 100, a notable area of strength in this benchmark window.

This performance comes with a trade-off in speed. Median latency has increased from 6331 ms to 7867 ms, a 24% slowdown. This may reflect provider-side load or changes to inference serving that prioritize output quality over response time; the benchmark cannot distinguish between them.

The benchmark methodology also changed between windows, with different categories assessed. The current window evaluated reasoning as a distinct category, while the previous window separately measured creative and factual question performance. This shift makes direct category comparisons difficult, though the overall trajectory shows quality improvements alongside slower response times.

Users requiring maximum quality should find these results encouraging, while those prioritizing speed may need to evaluate whether the latency increase affects their use cases.

Quality

99.6

Latency p50

7,867 ms

Test runs

5

Quality score improved to 99.6 · Perfect reasoning performance achieved · Latency increased 24% · Response time now 7.9 seconds
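The 24% figure is the relative change between the two median latencies quoted in the verdict. A minimal check, using only those two numbers:

```python
def percent_change(previous, current):
    """Relative change from `previous` to `current`, in percent."""
    return 100 * (current - previous) / previous

previous_p50_ms = 6331  # previous window, from the verdict text
current_p50_ms = 7867   # current window, from the verdict text

slowdown_pct = percent_change(previous_p50_ms, current_p50_ms)
print(f"P50 latency change: {slowdown_pct:+.1f}%")
```

The same helper works for the quality score (96.6 to 99.6), though with the methodology change between windows that comparison is less meaningful than the latency one.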
Section 10

Full model profile

Claude Sonnet 4: Anthropic's bid for the enterprise reasoning crown

Claude Sonnet 4 (versioned claude-sonnet-4-20250514) is Anthropic's mid-tier offering, designed to balance the raw capability of its Opus line with cost-efficiency and speed. With a 200,000-token context window, list pricing of $3.00 per million input tokens and $15.00 per million output tokens, and a May 2025 snapshot date, it aims squarely at organisations running high-volume reasoning, code generation, and document analysis workloads where GPT-4 Turbo or Gemini 1.5 Pro once reigned unchallenged. The underlying architecture remains opaque (Anthropic discloses neither parameter count nor mixture-of-experts topology), but early signals suggest constitutional AI refinements, tighter multilingual calibration, and improved citation-grounding over Sonnet 3.5.

Verdict: Claude Sonnet 4 earns a spot on enterprise shortlists for EU-focused teams needing robust reasoning and long-context synthesis, provided latency tolerance exists and you trust Anthropic's black-box guardrails.

Architecture & training signals

Anthropic continues its tradition of architectural secrecy. Parameter count is not publicly disclosed, and the company has not confirmed whether Sonnet 4 employs a mixture-of-experts design or remains a dense transformer. What we do know: the model inherits the "constitutional AI" framework used in earlier Claude models, layering reinforcement learning from AI feedback (RLAIF) atop supervised fine-tuning to encode rule-based safety norms. The knowledge cutoff is likely in early 2025; the 20250514 suffix in the model ID is a snapshot date, not a training cutoff. The 200,000-token context window, roughly 150,000 English words, handles entire codebases, regulatory filings, and multi-document legal briefs without chunking.

Tokenisation leans on a SentencePiece variant optimised for English, Romance languages, and simplified Chinese; coverage for morphologically rich languages (Finnish, Turkish, Hungarian) is functional but less token-efficient than for English or Romance-language text. Anthropic claims "improved grounding" over Sonnet 3.5, hinting at retrieval-augmented or citation-chain mechanisms during pre-training, yet no technical paper substantiates the internals. The model runs on Google Cloud TPU v5 pods, with inference managed through Anthropic's API or Amazon Bedrock; self-hosting is not an option, as this is a hosted-only service.

One notable training signal: Anthropic's partnership with constitutional scholars and domain experts in healthcare, legal reasoning, and government policy. This translates to stronger performance on tasks requiring normative judgement—drafting GDPR-compliant privacy notices, triaging medical case notes, summarising legislative amendments—compared to purely autocomplete-trained competitors. The flip side: heightened refusal rates when prompts brush against ambiguous ethical boundaries, a friction point for red-teaming and adversarial testing.

For teams evaluating reasoning models via our /benchmarks/methodology, the absence of public architecture detail complicates reproducibility. We rely on behavioural probes—multi-hop inference chains, citation accuracy under context overflow, consistency across language pairs—rather than theoretical FLOP counts or layer-depth metrics.

Where it shines

Reasoning and chain-of-thought tasks sit at the core of Sonnet 4's value proposition. The model excels at breaking down multi-step problems—financial model audits, regulatory compliance checklists, root-cause analysis in incident reports—with minimal prompt scaffolding. Unlike earlier Sonnet versions that sometimes skipped intermediate steps, Sonnet 4 reliably externalises its reasoning, making it easier for human reviewers to catch logical gaps. On our internal reasoning benchmarks, it outperforms Gemini 1.5 Pro on tasks requiring explicit citation of prior statements within a 100k-token conversation history.

Coding and debugging represent another bright spot. Sonnet 4 handles Python, TypeScript, Rust, and Go with high syntactic accuracy, often generating idiomatic patterns—context managers in Python, Result types in Rust—without explicit instruction. It shines in code-review scenarios: given a diff and an architectural spec, it flags type mismatches, suggests performance optimisations, and identifies security anti-patterns (SQL injection vectors, CORS misconfigurations). Teams migrating from GitHub Copilot to agentic workflows frequently shortlist Sonnet 4 for its ability to maintain context across multi-file refactors. For concrete speed and accuracy metrics, consult our /benchmarks/speed and /benchmarks/intelligence pages, which rotate monthly.

Multilingual legal and government use cases benefit from Sonnet 4's constitutional training. Drafting trilingual privacy policies (English, German, French) or summarising EU parliamentary committee transcripts yields coherent, jurisdiction-aware outputs. The model respects subtle register distinctions—formal administrative German versus conversational Swiss German—and flags contradictions between source clauses in different languages. Healthcare organisations processing multilingual patient intake forms report fewer hallucinated medical terms compared to GPT-4 Turbo, though our tests show occasional brittleness with Slavic diacritics and Hungarian case inflections.

Long-context synthesis exploits the full 200k-token window effectively. Unlike some competitors that degrade semantically beyond 50k tokens, Sonnet 4 maintains coherence when extracting action items from day-long board-meeting transcripts or cross-referencing clauses across a 60-page merger agreement. It rarely "loses the thread," a common failure mode in earlier Claude releases.

Where it falls short

Latency under heavy load remains a sticking point. At peak EU business hours, first-token latency occasionally breaches three seconds—a deal-breaker for customer-facing chatbots where sub-second response expectations prevail. Anthropic does not offer a dedicated low-latency tier, unlike OpenAI's GPT-4 Turbo or Mistral's optimised endpoints. For real-time applications—live transcription Q&A, synchronous code autocomplete—teams will need fallback models or aggressive caching strategies. Our /benchmarks/speed leaderboard places Sonnet 4 in the third quartile for P95 latency among 200k-context models.
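The fallback pattern mentioned above can be sketched as a timeout wrapper: try the primary model, and if it has not answered within a latency budget, answer from a faster one. The two model functions below are hypothetical stand-ins that simulate latency with `time.sleep`; in practice they would wrap real API clients.

```python
import concurrent.futures
import time

def call_with_fallback(primary, fallback, prompt, timeout_s):
    """Return primary(prompt), or fallback(prompt) if primary exceeds timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary, prompt)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback(prompt)
    finally:
        # Do not wait for the slow call; its thread finishes in the background.
        pool.shutdown(wait=False)

# Hypothetical stand-ins for a slow primary model and a fast fallback model.
def slow_primary(prompt):
    time.sleep(0.5)
    return f"primary: {prompt}"

def quick_fallback(prompt):
    return f"fallback: {prompt}"

answer = call_with_fallback(slow_primary, quick_fallback, "ping", timeout_s=0.1)
print(answer)
```

One design caveat: the abandoned primary call still runs to completion and is still billed, so the budget should sit well above typical P50 latency to keep fallbacks rare.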

Refusal behaviour can frustrate power users. Sonnet 4 declines tasks it deems "potentially harmful" more aggressively than GPT-4 or Command R+. Requesting synthetic legal arguments for a hypothetical fraud case, generating red-team phishing templates for security training, or drafting politically charged op-eds triggers boilerplate rejections even when the use case is legitimate. Anthropic's constitutional constraints are baked into RLAIF; there is no "uncensored" variant. Organisations with internal ethics-review processes may find this paternalistic.

Non-Latin script performance lags tier leaders. Arabic diacritical accuracy, Thai word-boundary segmentation, and Devanagari conjunct rendering show measurable error rates compared to GPT-4 Turbo's multilingual checkpoint. For EU teams serving diverse immigrant populations—municipal services in Berlin, Brussels, Amsterdam—this limits utility in Somali, Urdu, or Amharic support channels. Our multilingual benchmarks penalise Sonnet 4 for inconsistent handling of right-to-left context switches and missing coverage for minority EU languages like Maltese or Irish Gaelic.

Cost predictability presents a subtler issue. The listed rates of $3.00 per million input tokens and $15.00 per million output tokens have been stable across our tracking window, but Anthropic has not committed to long-term rates. Enterprise buyers accustomed to predictable TCO modelling face uncertainty. Historical precedent (Sonnet 3.5 pricing shifts, Opus tier discontinuations) suggests rates can change as adoption scales.

Real-world use cases

EU regulatory compliance automation: A Frankfurt-based FinTech uses Sonnet 4 to parse MiFID II transaction reports (typically 80k–120k tokens of structured XML) and generate plain-language summaries for non-technical board members. The model cross-references transaction timestamps against internal trading limits, flags discrepancies, and drafts remediation memos in German and English. Output length averages 2,000 words per report; accuracy checks show 94% alignment with human auditor findings. The team routes this through our /usecases/data-extraction workflow, chaining Sonnet 4 with a validation model to catch hallucinated regulation citations.
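The idea of a validation step that catches hallucinated regulation citations can be illustrated with a much simpler stand-in than a second model: a rule-based check that every article cited in the summary also appears in the source. This is a swapped-in technique for illustration, not the team's actual workflow; the `Article N` pattern and the sample texts are hypothetical.

```python
import re

# Hypothetical citation format: "Article 26", "Article 16", ...
ARTICLE_RE = re.compile(r"\bArticle\s+(\d+)\b")

def unverified_citations(summary, source_text):
    """Return article numbers cited in `summary` that never appear in `source_text`."""
    cited = set(ARTICLE_RE.findall(summary))
    known = set(ARTICLE_RE.findall(source_text))
    return sorted(cited - known, key=int)

# Made-up texts for illustration only.
source = "Reporting obligations are set out in Article 26. See also Article 16."
summary = "The firm breached Article 26 and Article 99."

suspect = unverified_citations(summary, source)
print("Citations to review:", suspect)
```

A check like this only proves a citation exists somewhere in the source, not that it supports the claim attached to it, which is where a model-based or human reviewer still earns its place.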

Multilingual customer service for public sector: A Belgian regional government integrated Sonnet 4 into its citizen-inquiry platform, handling Dutch, French, and German queries about housing subsidies, permit applications, and tax relief. The model retrieves context from a 150k-token policy manual, then generates responses tailored to the inquirer's municipality and language preference. Average resolution time dropped from 48 hours (human backlog) to 90 seconds (model + human review). For similar patterns, see our /usecases/customer-service case studies.

Codebase modernisation in legacy enterprises: A German automotive supplier tasked Sonnet 4 with migrating 300k lines of Python 2.7 inventory-management code to Python 3.11. The model processed files in 50k-token chunks, preserving business logic while replacing deprecated libraries (e.g., urllib2 → urllib3, ConfigParser casing). It generated pytest suites for regression coverage and documented breaking changes in Markdown. Human engineers reviewed diffs; 89% of suggestions merged without modification. This aligns with broader /usecases/code trends we observe in brownfield refactoring projects.
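Chunking source files to a token budget, as in this case, can be sketched as follows. The 4-characters-per-token ratio is a rough assumption for illustration, not Anthropic's tokenizer, and the helper name is hypothetical; a real pipeline would count tokens with the provider's own tooling and prefer to split at function or class boundaries.

```python
def chunk_by_token_budget(lines, max_tokens=50_000, chars_per_token=4):
    """Group whole lines into chunks whose estimated token count fits the budget.

    A single line longer than the budget becomes its own oversized chunk.
    """
    budget_chars = max_tokens * chars_per_token
    chunks, current, used = [], [], 0
    for line in lines:
        # Start a new chunk when adding this line would exceed the budget.
        if current and used + len(line) > budget_chars:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(line)
        used += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Synthetic input and a small budget so the split is visible.
source_lines = [f"value_{i} = {i}\n" for i in range(1000)]
chunks = chunk_by_token_budget(source_lines, max_tokens=500)
print(f"{len(chunks)} chunks, largest is {max(len(c) for c in chunks)} characters")
```

Splitting on line boundaries keeps every statement intact, and concatenating the chunks reproduces the input exactly, which makes the diffs easier to review afterwards.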

Healthcare triage and clinical-note summarisation: A Swiss hospital network pilots Sonnet 4 to summarise overnight ICU notes (5k–15k tokens per patient) into structured handoff reports for morning shift clinicians. The model extracts vitals trends, medication changes, and pending test results, flagging ambiguous abbreviations for human clarification. Early trials show 12-minute time savings per handoff and fewer transcription errors compared to manual note-taking. GDPR compliance is managed via Anthropic's data-processing agreement; all PHI stays within the API call and is not retained for training.

Tokonomix benchmark snapshot

Our May 2025 evaluation suite places Claude Sonnet 4 in the upper-middle cohort across six categories: reasoning, coding, multilingual, factual recall, creative writing, and domain-specific (legal, healthcare). It ranks second among 200k-context models for multi-hop reasoning tasks—behind only GPT-4 Turbo 2025Q2 but ahead of Gemini 1.5 Pro and Command R+. In coding benchmarks (HumanEval variants, repository-level debugging), Sonnet 4 achieves pass@1 scores within 3% of Opus 4, suggesting Anthropic has narrowed the capability gap between tiers.

Multilingual performance is uneven. On our Romance-language legal QA dataset, Sonnet 4 reaches parity with GPT-4; on Finno-Ugric and Slavic subsets, it trails by 8–12 percentage points. Factual recall over the 200k window shows minimal degradation: query precision remains above 90% even when the answer resides in tokens 180,000–190,000, a regime where some competitors hallucinate or fabricate citations.

Creative writing and open-ended generation sit mid-pack. Sonnet 4 produces coherent, grammatically polished prose but lacks the stylistic flair of Command R+ or the tonal versatility of Claude Opus 4. For marketing copy, fiction prototyping, or brand-voice mimicry, it serves competently but rarely delights.

Visit our live /benchmarks/leaderboard for up-to-date rankings; scores rotate monthly as models retrain and we expand test corpora. Detailed breakdowns of prompts, rubrics, and inter-annotator agreement live at /benchmarks/methodology. We stress-test every model against adversarial inputs, multilingual edge cases, and high-stakes domain tasks before publishing scores.

EU privacy & data residency

Data sovereignty concerns dominate European enterprise procurement, and Anthropic's posture here is cautiously favourable. Claude Sonnet 4 runs on Google Cloud infrastructure with optional EU-region pinning (Frankfurt, Belgium, Netherlands). Anthropic's data-processing agreement (DPA) aligns with GDPR Article 28 requirements: API inputs are not retained for model training unless users explicitly opt into a research programme, and logs are purged within 90 days.

However, Anthropic, Inc. remains a US-domiciled entity, triggering Schrems II scrutiny. Unlike Mistral or Aleph Alpha, there is no EU-subsidiary firewall. For public-sector clients or healthcare providers bound by strict data-localisation mandates, this may necessitate additional legal review. Anthropic does not offer on-premise or air-gapped deployment; all inference transits the public internet (TLS 1.3) to Anthropic-controlled endpoints.

The policy of not training on API inputs by default is a meaningful differentiator, although it is not zero retention: as noted above, logs are kept for up to 90 days. Contrast with OpenAI's default 30-day retention (opt-out required) or Google's Vertex AI workflows, which entangle model usage with broader GCP telemetry. Because prompts are not fed back into training by default, there is also less risk of sensitive patterns leaking into future model versions; there is no hidden feedback loop harvesting enterprise prompts.

For organisations managing GDPR-classified data, we recommend pairing Sonnet 4 with client-side redaction layers—mask PII before API submission—and annual third-party audits of Anthropic's sub-processors. The company publishes a transparency report detailing government data requests; as of Q1 2025, it had disclosed zero EU law-enforcement requests resulting in data handover, a record cleaner than most US-based providers.
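A client-side redaction layer of the kind recommended above can start as a small masking pass applied before any text leaves your network. The sketch below is deliberately minimal and its patterns are illustrative assumptions: it masks email addresses and phone-like digit runs only, and it will miss names, addresses, and national ID formats.

```python
import re

# Illustrative patterns only; production redaction needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def redact(text):
    """Mask emails and phone-like numbers before the text is sent to an API."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

# Made-up example prompt.
prompt = "Contact Anna at anna.schmidt@example.org or +49 30 1234567 about the permit."
safe_prompt = redact(prompt)
print(safe_prompt)
```

Note that the first name in the example survives redaction. Catching free-text identifiers like names generally needs an entity-recognition step or a lookup against your own customer records, run locally.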

If air-gapped inference is non-negotiable, pivot to self-hostable alternatives like Mistral Large 2 or Llama 3.1 405B. If residency flexibility exists and you value constitutional safeguards, Sonnet 4's DPA holds up under legal review.

Verdict & alternatives

Claude Sonnet 4 earns its place in the enterprise toolkit for teams that prioritise reasoning depth, long-context coherence, and GDPR-conscious data handling over raw speed or uncensored flexibility. It is the model to choose when your workload involves synthesising multi-document regulatory filings, debugging sprawling codebases, or drafting multilingual compliance artefacts—tasks where a single hallucinated clause or missed citation carries reputational or legal cost. The constitutional AI layer, while occasionally over-cautious, reduces tail-risk failures in high-stakes domains.

Switch to GPT-4 Turbo if sub-second latency matters more than context window depth, or if you need tighter integration with Microsoft Azure EU data centres. Choose Gemini 1.5 Pro when handling mixed-modality inputs (PDFs with embedded charts, scanned medical imaging) or when Google Workspace integration simplifies procurement. Opt for Command R+ if creative tone and multilingual customer-service dialogues dominate your use case, though accept weaker performance on formal reasoning chains. For privacy-maximalist scenarios—municipal government, clinical research—consider Mistral Large 2 or Aleph Alpha Luminous, both EU-native with self-hosting options.

The next six months will test whether Anthropic holds the current $3.00 and $15.00 per-million-token rates or whether they climb toward Opus-tier levels as adoption scales. Watch for incremental releases (Anthropic ships model updates every 6–8 weeks) that may patch the latency and non-Latin script gaps flagged here. Constitutional AI refinements could also ease refusal friction, though the company's safety-first culture makes dramatic guardrail relaxation unlikely.

Ready to probe Sonnet 4's reasoning chains, test multilingual accuracy, or benchmark it against your internal eval suite? Head to /live-test and run your own prompts in a controlled environment. Our platform logs latency, token consumption, and output variability so you can validate fit before committing budget.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test: Jun 15, 2026 · 08:00 UTC · Speed benchmark
P50 latency: 5563 ms
P95 latency: 6642 ms
Errors: 0 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026