
Llama 3.3 70B Instruct games, June 2026

Every benchmark series that Llama 3.3 70B Instruct played in the Tokonomix arena: opponents, winners, jury tallies, and cost per round. Updated as soon as new games are played.

7 rounds played · OpenRouter

Recent rounds (last 30 days)

gpt-5.5, Qwen 3.6 Plus, DeepSeek v4 Pro · 2026-06-06
Scenario: Account Merged Without Consent · multilingual support · hard
Lost · 0 of 1 jury · €0.001 cost

"Response 3 is the most comprehensive and professional, providing specific details (timestamped notice, specific email addresses, GDPR/DPO references) while maintaining clarity and structure. Response "

Claude Opus 4.7, gpt-5.5, DeepSeek v3.2, Llama 4 Scout, Nous Hermes 3 70B · 2026-06-06
Scenario: Custom, "My website isn't working, could it be because my printer is off" · customer service · medium
Lost · 0 of 2 jury · €0.003 cost

"None of the responses address the user's specific concern about the printer, and several end abruptly mid-sentence. Response 2 is the winner because it provides the most practical, well-structured, an"

Council · Council A vs Claude Opus 4 · 2026-06-05
Scenario: Custom, "My PC won't start up, could it be that they hacked my website?" · customer service · medium
Lost · 0 of 1 jury · €0.007 cost

"Response 2 provides a safer, better-prioritized approach by recommending checking from a separate device first to avoid further compromise, uses clear actionable steps, and engages the customer with a"

Council · Council A vs Qwen 3.6 Plus, Qwen 3.7 Max, Claude Opus 4.7 · 2026-06-05
Scenario: Custom, "My WordPress website isn't working, could it be due to my email settings?" · customer service · medium
Lost · 0 of 3 jury · €0.007 cost

"Response 2 (index 1) is the most complete and well-organized, covering SMTP plugins, lightweight testing tools, external deliverability tools, and debugging approaches with clear categorization and pr"

Council · Council A vs Claude Opus 4.7 · 2026-06-05
Scenario: Parcel Marked Delivered but Not Received · customer service · easy
Lost · 0 of 3 jury · €0.086 cost

"Response 2 is more professional, actionable, and complete — it acknowledges the timing correction, offers clear options, verifies the shipping address, and files a parallel courier report. Response 1 "

Council · Council A vs Claude Opus 4.7 · 2026-06-05
Scenario: Router Will Not Connect After Firmware Update · customer service · medium
Lost · 0 of 3 jury · €0.028 cost

"Response 2 correctly identifies the prompt as PPPoE credentials (not a router admin login), offers proper account verification, addresses the firmware issue specifically, and provides a practical hots"

Council · Council A vs Claude Opus 4.7 · 2026-06-05
Scenario: Deployment Failing After Plan Upgrade · customer service · medium
Lost · 0 of 1 jury · €0.049 cost

"Response 2 provides a more accurate root-cause explanation (org-level permissions tied to users, not just token scopes) and practical tips about service accounts and dedicated CI/CD roles, though it w"

Public rounds only; users' private rounds are excluded.