Claude Haiku 4.5: match history

Every benchmark round played by Claude Haiku 4.5 in the Tokonomix arena: opponents, winners, jury results, and cost per round. Updated after each new match.

10 rounds played · Anthropic

Recent rounds (last 30 days)

Claude Opus 4.8, gpt-oss-20b, Mistral-7B-Instruct-v0.3 · 2026-07-07

Scenario: Office Lease Agreement — Riverside Tower · data extraction · hard

Lost · €0.001 cost

Antigravity Agent Preview, Claude Fable 5 · 2026-06-12

Scenario: Custom — My computer is not starting and i get a black screen, i use Windows. what is the · customer service · medium

Won · 3 of 3 jurors · €0.101 cost

"Response 3 is excellent because it leverages implied prior conversation (about fans/lights being on) to provide targeted reassurance, offers immediate actionable troubleshooting steps, and clearly out…"

Claude Opus 4.1, Claude Sonnet 4.5, Deep Research Preview (Apr-21-2026), Deep Research Max Preview (Apr-21-2026), DeepSeek v4 Pro · 2026-06-05

Scenario: Verkeerd artikel ontvangen · multilingual support · easy

Won · 3 of 3 jurors · €0.006 cost

"Response 1 is the most comprehensive and clear in its explanation and summary, making it the best response."

Council · Council A vs gpt-4-turbo · 2026-06-05

Scenario: Custom — Mijn website werkt niet, kan het zijn dat het probleem aan mijn printer ligt? · customer service · medium

Won · 2 of 3 jurors · €0.210 cost

"Response 1 is the winner because it provides a more comprehensive and detailed explanation of the potential issues and solutions, including specific examples and technical details, making it a more ac…"

Council · Council A vs Claude Opus 4.7 · 2026-06-05

Scenario: Router Will Not Connect After Firmware Update · customer service · medium

Lost · 0 of 3 jurors · €0.028 cost

"Response 2 correctly identifies the prompt as PPPoE credentials (not a router admin login), offers proper account verification, addresses the firmware issue specifically, and provides a practical hots…"

Gemini 2.5 Flash, Gemini Pro Latest, gpt-4.1, gpt-4o-2024-05-13, gpt-5.5-2026-04-23 · 2026-06-05

Scenario: Medical Report — Radiology Findings · data extraction · hard

Won · 2 of 5 jurors · €0.003 cost

"Response 5 (index 5) provides the most balanced and comprehensive customer service approach by delivering clear, actionable medical information from the report while appropriately maintaining boundari…"

Council · Council A vs Claude Opus 4.7 · 2026-06-05

Scenario: Deployment Failing After Plan Upgrade · customer service · medium

Lost · 0 of 1 jurors · €0.049 cost

"Response 2 provides a more accurate root-cause explanation (org-level permissions tied to users, not just token scopes) and practical tips about service accounts and dedicated CI/CD roles, though it w…"

Council · Council B vs gpt-5.4-2026-03-05, Gemini 2.5 Flash, Gemini Pro Latest, gpt-3.5-turbo, Claude Sonnet 4.6 · 2026-06-04

Scenario: Late delivery — refund request · customer service · medium

Lost · 0 of 1 jurors · €0.020 cost

"Response 3 is the most empathetic, transparent, and well-structured, giving a clear timeline while managing expectations and offering helpful alternatives without being pushy."

Claude Sonnet 4.6, DeepSeek v4 Pro · 2026-06-04

Scenario: Password reset email not arriving · customer service · easy

Lost · 0 of 2 jurors · €0.005 cost

"Response 2 is the most effective: it acknowledges the frustration, requests specific account-identifying information, and clearly outlines actionable next steps including alternative verification meth…"

DeepSeek v4 Pro, Gemini 2.5 Pro, gpt-5.2-chat-latest · 2026-06-04

Scenario: Late delivery — refund request · customer service · medium

Lost · 0 of 1 jurors · €0.004 cost

"Response 4 offers the best balance: accurate refund timelines with realistic edge cases, mentions confirmation email, and proactively offers a replacement option without being overly pushy. Response 1…"

Public rounds only; private user rounds are excluded.