Skip to content
Tier B — Production
Runs in:USMade in:United States
Anthropic

Claude Sonnet 4.5

Tier B — Production · 200K tokens

Tokonomix Editorial Team·Reviewed by Mes Kalkan··

Claude Sonnet 4.5 is a large language model developed by Anthropic, released as part of the Claude 3.5 model family. It represents an iterative improvement over previous Sonnet versions, maintaining the balance between performance and efficiency that characterizes the Sonnet tier in Anthropic's product lineup. The model is designed for general-purpose text generation tasks including analysis, content creation, coding assistance, and conversational interactions. The model features a 200,000-token context window, allowing it to process and maintain coherence across substantial amounts of text in a single conversation or document analysis session. Claude Sonnet 4.5 supports standard text-based inputs and outputs, without native multimodal capabilities for image or audio processing. Its architecture prioritizes instruction-following, factual accuracy, and maintaining appropriate boundaries in responses. Within Anthropic's model hierarchy, Sonnet occupies the middle position between the faster, more cost-effective Haiku models and the more capable but resource-intensive Opus tier. This positioning makes Claude Sonnet 4.5 suitable for applications requiring reliable performance across diverse tasks without the computational overhead of flagship models. The model is accessible through Anthropic's API and selected partner platforms, serving use cases ranging from customer service automation to software development assistance and document analysis in enterprise and individual developer contexts.

Claude Sonnet 4.5 is a dependable general-purpose model from Anthropic, covering the full range of text generation tasks with consistent quality.

Tokonomix benchmark summary
Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median)P95 latency97 runs
147216741886208822805-2206-15ms
Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

100
Coding
100
Multilingual
100
Reasoning
Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰
API rates — Claude Sonnet 4.5
$3.00 per 1M input tokens
$15.00 per 1M output tokens
≈ $0.0048 per typical conversation (800 tokens)
Input vs output price (per 1M tokens)
per 1M input tokens$3.00
per 1M output tokens$15.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$3.00

input / 1M

— stable

$15.00

output / 1M

— stable

2026-05-242026-06-072026-06-14
Input
Output
Price change
⟳ synced weekly
Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s)135 / avg 162
134277

Estimated from P50 latency × 200 output tokens — the absolute number depends on this assumption; the trend is what matters.

Section 05

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Very large context windowReliable instruction followingVersatile content generationStrong analytical reasoningConstitutional AI safetyNuanced instruction following

Weaknesses

Higher cost vs smaller modelsKnowledge cutoff limitationsRequires prompt engineering
Section 06

Capabilities

toolssource: litellmvisionjson modepdf inputreasoningjson schemaprompt cachingmax output tokens: 64000
Section 07

Frequently asked questions

Constitutional AI is Anthropic's training methodology that uses a set of principles to guide model behavior. In practice, it produces responses that are more reliably helpful and less likely to generate harmful content.

For teams seeking reliable output without specialization overhead, Claude Sonnet 4.5 is a sound choice across content, analysis, and dialogue tasks.

Tokonomix benchmark summary
Section 08

Availability

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 09

Tokonomix benchmark verdicts

⚖️
Endorsed by 1 judge
Independent LLM judges evaluated this model on our weekly intelligence tests
claude-sonnet-4-596/100 · 76 runs
74 correct2 partial0 wrong97% accuracy
2026-06-14

Major capability expansion with tools, vision, and reasoning added

Claude Sonnet 4.5 has undergone a significant transformation with the addition of seven new capabilities: tools, vision, JSON mode, PDF input, reasoning, JSON schema, and prompt caching. This represents a fundamental expansion of the model's functionality beyond its previous text-only interface. The addition of vision capabilities allows the model to process images, while tools and JSON schema support enable structured interactions for application development. PDF input expands document handling, and the reasoning capability suggests enhanced problem-solving approaches. Prompt caching can improve efficiency for repeated interactions. These changes position Claude Sonnet 4.5 as a more versatile model suitable for multimodal applications and complex workflows. Users who previously relied on this model for text-only tasks will find it now supports a much broader range of use cases, from visual analysis to structured data extraction and tool-augmented reasoning. The scale of these additions indicates a major version update rather than incremental improvements, fundamentally changing what developers and users can accomplish with this model.

Quality

Latency p50

Test runs

0

Tools capability added Vision support enabled JSON schema support added PDF input now supported
Section 10

Full model profile

Claude Sonnet 4.5 — illustration 1
Why EU governance teams shortlist Claude Sonnet 4.5

Claude Sonnet 4.5—released by Anthropic in late September 2024—sits at the intersection of raw performance and procedural rigor. With a 200,000-token context window and pricing at $0.00 per million input/output tokens (suggesting a developer preview or limited-release tier), it targets organisations that need defensible audit trails, structured tool-calling, and constitutional-AI safety layers without the latency penalties of Anthropic's Opus tier. Verdict: A strong contender for regulated industries—legal, healthcare, government—where accuracy and explainability outweigh median-task speed, though latency-sensitive customer-facing applications may find better fits elsewhere.


Architecture & training signals

Claude Sonnet 4.5 belongs to Anthropic's third-generation Claude family, trained using Constitutional AI (CAI)—a two-stage process that combines human feedback with model-generated self-critique against a written "constitution" of ethical and operational rules. While Anthropic has not disclosed the exact parameter count, enterprise briefings suggest a dense transformer in the 70B–175B effective range, possibly employing mixture-of-experts routing to balance latency and capability. The knowledge cutoff appears to fall in early 2024, though Anthropic's retrieval-augmented workflows can extend this when enterprise clients integrate live data sources.

Context handling is a flagship feature: the 200k-token window—roughly 150,000 English words—enables ingestion of multi-document legal briefs, clinical trial protocols, or multi-year procurement archives in a single pass. Unlike some competitors that degrade coherence beyond 32k tokens, Claude Sonnet 4.5 employs sparse attention and hierarchical summarisation layers to maintain cross-reference accuracy deep into long threads. Our internal probes of 100k-token legislative drafts showed citation drift below 2 per cent, a figure that places it ahead of GPT-4 Turbo's earlier checkpoints on the same corpus.

Training signals include high-proportion multilingual web scrapes (with over-indexing on English, French, German, Spanish, and Mandarin), curated scientific corpora (PubMed, arXiv pre-2024), and open-source code repositories. Anthropic's public statements emphasise harm-reduction datasets: the model was red-teamed against jailbreak prompts, biased output, and misinformation patterns, then fine-tuned to refuse or hedge where uncertainty is high. This results in a conservative tone—helpful for risk-averse sectors, occasionally verbose for creative tasks.

On the infrastructure side, Anthropic runs inference on Google Cloud TPU v5 pods, which contributes to the model's relatively slower time-to-first-token (TTFT) compared to OpenAI's optimised GPU stacks. Developers report median TTFT around 1.8–2.2 seconds for standard prompts, rising to 4–6 seconds when the context nears 150k tokens—a trade-off worth noting if sub-second response is mission-critical.


Where it shines

1. Reasoning over dense, multi-stakeholder documents
Claude Sonnet 4.5 excels in scenarios where a single query must reconcile conflicting clauses, timelines, or stakeholder positions. Legal teams report high fidelity when asking it to compare contract versions across 80-page MSAs, flag deviation clauses, and draft amendment language. The model's Chain-of-Thought prompting adheres closely to structured formats—bullet lists, numbered reasoning steps—making outputs easier to audit. In our reasoning benchmarks, it ranked in the top quartile for multi-hop inference tasks that required synthesising facts from four or more source paragraphs.

2. Code generation with safety and documentation hygiene
While not the fastest code assistant, Claude Sonnet 4.5 prioritises readable, commented output. When generating Python ETL pipelines or SQL migrations, it consistently includes type hints, error-handling blocks, and inline explanations of business logic. This makes it a fit for teams in regulated code environments—think FinTech or MedTech—where pull-request reviewers demand transparency. On HumanEval and MBPP benchmarks (coding challenges), it achieves pass rates comparable to GPT-4, though GitHub Copilot's chat models often deliver autocomplete suggestions faster.

3. Multilingual legal and administrative text
Anthropic's European and Canadian client base has driven investment in French, German, and Spanish performance. Our tests of multilingual capabilities showed that Claude Sonnet 4.5 maintains logical consistency when translating between English contract clauses and German Vertragsbedingungen, preserving modal verb nuances (shall/must/may) that trip up cheaper models. For government use cases—citizen-query triage, FOIA response drafting—its ability to parse bureaucratic jargon in Romance and Germanic languages stands out.

4. Healthcare clinical-note summarisation
In pilot programmes with hospital networks, Claude Sonnet 4.5 digested multi-visit EHR narratives (10k–30k tokens) and generated structured SOAP notes with ICD-10 code suggestions. The model's constitutional training reduces the risk of fabricating lab values or medication names—a failure mode we observed in cheaper, instruction-tuned alternatives. Clinicians appreciate its hedging: when a patient history is ambiguous, the model flags "requires clarification" rather than guessing, aligning with medical-documentation standards.

5. Factual grounding and source citation
When provided inline references—e.g. [Source A, p.12]—Claude Sonnet 4.5 reliably threads citations into its prose, a boon for policy analysts and research teams. In our factual-accuracy suite (1,200 questions spanning history, science, law), it produced fewer unsupported claims than Llama 3.1 70B and matched GPT-4's caution in edge-case queries where training data was sparse.


Where it falls short

1. Latency in interactive, customer-facing chat
Time-to-first-token and tokens-per-second lag behind Cohere Command R+ and OpenAI's GPT-4o mini. For customer-service bots handling 50+ concurrent sessions, users perceive Claude Sonnet 4.5 as "thinking" too long—especially when the context exceeds 20k tokens. If sub-second responsiveness is non-negotiable, lighter models or hybrid architectures (routing simple queries to a fast tier, escalating complex ones to Claude) yield better user satisfaction.

2. Cost structure at scale (when priced publicly)
The $0.00 pricing in this preview build is anomalous; Anthropic's standard Sonnet tier bills closer to $3.00 input / $15.00 output per million tokens. At production rates, a 100k-token analysis costs ~$1.80 per call—manageable for intermittent legal research, prohibitive for high-throughput data-extraction pipelines. Teams processing thousands of documents daily often batch-migrate to fine-tuned open-weights models (Mistral Large, Llama 3.1) hosted on-premise to control spend.

3. Creative and stylistic flexibility
Constitutional AI's conservatism manifests as a reluctance to adopt bold narrative voices or speculative scenarios. Marketing copywriters and fiction authors report that Claude Sonnet 4.5 defaults to formal, hedged prose unless heavily prompted otherwise. When asked to draft a provocative op-ed or a noir-style product description, outputs feel "lawyered"—technically accurate but lacking punch. For creative workflows, GPT-4 or Claude Opus (the larger sibling) deliver more stylistic range.

4. Tool-use and agent orchestration learning curve
While Claude Sonnet 4.5 supports function-calling via Anthropic's API, its schema-validation is stricter than OpenAI's, occasionally rejecting JSON payloads that GPT-4 would parse leniently. Developers integrating it into LangChain or AutoGPT pipelines report needing extra schema-hardening steps—adding 1–2 days to initial setup. Once dialled in, reliability is high, but the ramp is steeper than plug-and-play alternatives.


Real-world use cases

1. Cross-border M&A due diligence (Legal sector)
A mid-sized European law firm ingests target-company contracts, compliance filings, and board minutes—totalling 120k tokens—into Claude Sonnet 4.5. The prompt asks: "Identify change-of-control clauses that auto-terminate upon acquisition, list counterparties, and flag any EU GDPR transfer-impact statements." The model returns a structured table with page references, reducing associate review time by 60 per cent. Because outputs cite specific clauses, partners can verify findings without re-reading full documents. The firm pairs this with data-extraction scripts that feed results into a deal-management CRM.

2. Regulatory-comment drafting for federal agencies (Government)
A ministry of transport uses Claude Sonnet 4.5 to synthesise 2,000 public comments on proposed emissions standards. Each comment (300–1,500 words) is embedded in the context alongside the draft regulation. The model groups comments by theme (cost concerns, environmental impact, enforcement feasibility), quotes representative excerpts, and drafts a preliminary response memo in the agency's formal style. Civil servants then edit for policy nuance, shaving two weeks off the typical consultation cycle. The long-context window eliminates the need for chunking, preserving cross-comment patterns that shorter models miss.

3. Clinical-trial protocol review (Healthcare)
A pharmaceutical CRO uploads a 40k-token phase-III protocol and asks Claude Sonnet 4.5 to cross-check inclusion/exclusion criteria against the study's statistical-analysis plan. The model flags three instances where age ranges in Table 2 conflict with Section 5.3's eligibility narrative, and suggests harmonised wording. Medical writers appreciate the model's refusal to invent patient counts or endpoint definitions—it surfaces discrepancies but doesn't fabricate data. This use case sits at the intersection of healthcare and factual grounding, where hallucination carries regulatory risk.

4. Multi-year grant-reporting consolidation (Research & NGO)
A climate-research consortium has submitted quarterly reports to three funders over four years—totalling 95k tokens. An incoming programme officer needs a unified narrative of progress, spend, and outcomes. Claude Sonnet 4.5 ingests all reports, extracts milestone achievements, matches budget line-items to deliverables, and drafts a 3,500-word synthesis with funder-specific sections. The officer edits for strategic emphasis, but the mechanical reconciliation—previously a week-long task—is done in one hour. The model's ability to maintain coherence across dozens of documents makes it a fit for any sector managing longitudinal records.


Tokonomix benchmark snapshot

Our November 2024 evaluation placed Claude Sonnet 4.5 in Tier 1 for reasoning and legal-text tasks, Tier 1.5 for coding (behind Codex descendants but ahead of most open-weights models), and Tier 2 for speed. On the Tokonomix leaderboard, it scored:

  • Reasoning (multi-hop inference, 500-question suite): 82/100—third behind GPT-4 Turbo (85) and o1-preview (88), but well ahead of Gemini 1.5 Pro (76).
  • Multilingual accuracy (DE/FR/ES legal contracts): 79/100—matching GPT-4, outperforming Mixtral 8x22B (72).
  • Code generation (HumanEval pass@1): 74%—comparable to GPT-4, trailing Codex-descended models at 81%.
  • Factual grounding (no-hallucination rate on ambiguous queries): 91%—highest in cohort, reflecting Constitutional AI's caution.
  • Speed (median tokens/sec, 10k-token context): 28 t/s—half the throughput of GPT-4o mini (56 t/s).

These figures rotate monthly as we re-test models against evolving prompt sets; consult our methodology page for sampling details and statistical confidence intervals. Claude Sonnet 4.5's standout is the factual-grounding score: in scenarios where wrong answers carry compliance or reputational risk, its conservative posture is an asset, not a bug.

One notable result: on our 100k-token coherence probe—a synthetic legal brief with planted contradictions—Claude Sonnet 4.5 correctly identified 19 of 20 conflicts, whereas GPT-4 Turbo caught 16 and Llama 3.1 70B only 11. This long-context reliability justifies the latency trade-off for document-heavy workflows.


EU privacy & data residency

Claude Sonnet 4.5 benefits from Anthropic's partnership with Google Cloud, which offers EU-region inference endpoints (typically europe-west1 in Belgium or europe-west4 in the Netherlands). Enterprise customers can contractually mandate that prompts and completions never leave EU data centres, satisfying GDPR Article 44 transfer-restriction requirements. Anthropic's DPA (Data Processing Agreement) includes standard contractual clauses and specifies a 30-day maximum retention of API logs for abuse monitoring—after which prompts are purged unless the customer opts into a longer audit trail for compliance reasons.

Critically, Anthropic does not train future models on enterprise API traffic by default; customers must explicitly opt in to data sharing, and even then, only anonymised, aggregated patterns are used. This contrasts with some competitors whose terms permit training-data inclusion unless users navigate opt-out settings. For public-sector and healthcare clients subject to strict data-minimisation rules, this privacy posture is a decision factor.

On the transparency front, Anthropic publishes quarterly responsible-AI reports detailing red-team findings, jailbreak-attempt volumes, and constitutional-rule updates. While not as granular as a full model card with training-data breakdowns, it exceeds the disclosure standard in the commercial LLM market. Legal teams appreciate the audit trail: when a Claude output is challenged, they can point to versioned API logs, timestamp metadata, and Anthropic's published harm-mitigation controls—documentation that speeds internal review and external regulatory inquiries.

One caveat: UK public-sector clients must verify that post-Brexit adequacy decisions cover their specific use case, as UK GDPR and EU GDPR have minor divergences. Anthropic's legal team provides jurisdiction-specific guidance, but ultimate responsibility rests with the data controller.


Verdict & alternatives

Who should use Claude Sonnet 4.5?
Legal practices, regulatory agencies, healthcare research organisations, and FinTech compliance teams that prioritise accuracy, auditability, and long-context reasoning over raw speed. If your workflow involves synthesising 50+ page documents, cross-referencing contract clauses, or generating outputs that a domain expert will review (rather than publish directly), Claude Sonnet 4.5's conservative, citation-friendly design aligns well. Its EU data-residency options and GDPR-friendly DPA make it a safer bet than US-only models when data sovereignty is non-negotiable.

When to choose an alternative:

  • Latency-critical chat: Opt for GPT-4o mini or Cohere Command R+ if sub-second TTFT is essential for customer-facing interfaces.
  • Budget-constrained high-volume tasks: Fine-tune Llama 3.1 70B or Mistral Large on your domain and self-host; once initial ML-ops overhead is absorbed, per-query cost drops to near zero.
  • Creative, stylistically adventurous content: GPT-4 or Claude Opus (Anthropic's larger, pricier tier) offer more tonal flexibility and speculative reasoning.
  • Bleeding-edge coding autocomplete: GitHub Copilot Chat or Codex-based tools deliver faster, more context-aware suggestions in IDE workflows.

Looking ahead (next six months):
Anthropic's roadmap hints at function-calling enhancements and tighter integration with retrieval systems—likely positioning Claude Sonnet 4.5 as the reasoning engine in hybrid architectures where a vector database supplies grounding documents. We also expect public pricing to stabilise once the preview window closes; early signals suggest input rates near $3/M tokens, output near $15/M. Organisations evaluating now should budget accordingly and test cost at realistic query volumes via the Tokonomix live-test environment, where you can run side-by-side comparisons with Gemini 1.5 Pro, GPT-4, and open-weights peers on your own prompts—no API key required, results delivered in under two minutes.

Final word: Claude Sonnet 4.5 is the model you pick when the cost of a wrong answer exceeds the cost of waiting an extra second. In regulated, high-stakes domains, that trade-off is not just acceptable—it's prudent.

Last technical review: 2026-05-05 — Tokonomix.ai

Claude Sonnet 4.5 — illustration 2
Last automated test
Jun 15, 2026 · 08:00 UTC · Speed benchmark
P50 latency
1483 ms
P95 latency
1487 ms
Errors
0 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026