
Claude Opus 4.7: Match History

Every benchmark round Claude Opus 4.7 has played in the Tokonomix Arena: opponents, winner, jury counts, and cost per round. Updated as new matches are played.

9 rounds played · Anthropic

Recent rounds (last 30 days)

Claude Fable 5, Claude Opus 4.6, Claude Opus 4.5, Claude Opus 4.8, Claude Sonnet 4.6 · 2026-06-09
Scenario: Custom - Help my computer is not starting, can the problem happen because i turn off my p · customer service · medium
Lost · 0 of 3 jurors · €0.253 cost

"Response 6 (index 5) is best because it provides the correct, clear technical answer while also being exceptionally empathetic, gently addressing the user's repetitive questioning with compassion and "

gpt-5.5, DeepSeek v3.2, Llama 3.3 70B Instruct, Llama 4 Scout, Nous Hermes 3 70B · 2026-06-06
Scenario: Custom - Mijn website doet het niet, kan het zijn dat het komt omdat mijn printer uit sta · customer service · medium
Lost · 0 of 2 jurors · €0.258 cost

"None of the responses address the user's specific concern about the printer, and several end abruptly mid-sentence. Response 2 is the winner because it provides the most practical, well-structured, an"

Claude Opus 4.5, Claude Opus 4.6 · 2026-06-05
Scenario: Te late levering - terugbetaling gevraagd · multilingual support · medium
Lost · 1 of 3 jurors · €0.048 cost

"Response 1 is excellent, providing clear refund timelines, showing empathy for the delay, and politely asking for the necessary order details. Response 3 cuts off mid-sentence, and Response 2 suggests"

Frontier · Frontier C vs Mistral-7B-Instruct-v0.3, Mistral-Nemo-Instruct-2407, Mistral-Small-3.2-24B-Instruct-2506, gpt-5, gpt-4, gpt-4-turbo · 2026-06-05
Scenario: Router Will Not Connect After Firmware Update · customer service · medium
Won · 3 of 3 jurors · €0.088 cost

"Response 2 (index 2) provides the most customer-centric approach with clear accountability, specific timelines, escalation paths, and confirmed compensation - demonstrating excellent customer service "

Frontier · Frontier C vs Llama 3.3 70B Instruct, Llama 4 Maverick, Llama 4 Scout, Qwen 3.6 Plus, Qwen 3.7 Max · 2026-06-05
Scenario: Custom - Mijn wordpress website werkt niet, kan het aan mijn email instellingen liggen? · customer service · medium
Lost · 0 of 3 jurors · €0.094 cost

"Response 2 (index 1) is the most complete and well-organized, covering SMTP plugins, lightweight testing tools, external deliverability tools, and debugging approaches with clear categorization and pr"

Frontier · Frontier B vs Qwen2.5-VL-72B-Instruct, Qwen 2.5 VL 72B Instruct, Meta-Llama-3_3-70B-Instruct, Llama 3.3 70B Instruct, Qwen 3.7 Max · 2026-06-05
Scenario: Parcel Marked Delivered but Not Received · customer service · easy
Won · 3 of 3 jurors · €0.049 cost

"Response 2 is more professional, actionable, and complete — it acknowledges the timing correction, offers clear options, verifies the shipping address, and files a parallel courier report. Response 1 "

Frontier · Frontier B vs Claude Haiku 4.5, DeepSeek v4 Pro, Llama 3.3 70B Instruct, Mistral-7B-Instruct-v0.3, Nano Banana · 2026-06-05
Scenario: Router Will Not Connect After Firmware Update · customer service · medium
Won · 3 of 3 jurors · €0.044 cost

"Response 2 correctly identifies the prompt as PPPoE credentials (not a router admin login), offers proper account verification, addresses the firmware issue specifically, and provides a practical hots"

Frontier · Frontier B vs Claude Haiku 4.5, Gemini 2.5 Pro, Gemini Pro Latest, gpt-5.1-chat-latest, Llama 3.3 70B Instruct · 2026-06-05
Scenario: Deployment Failing After Plan Upgrade · customer service · medium
Won · 1 of 1 jurors · €0.049 cost

"Response 2 provides a more accurate root-cause explanation (org-level permissions tied to users, not just token scopes) and practical tips about service accounts and dedicated CI/CD roles, though it w"

Claude Sonnet 4.6 · 2026-06-04
Scenario: Invoice - Blue Harbor Logistics · data extraction · easy
Won · €0.005 cost
Public rounds only; private user rounds are excluded.