DeepSeek v4 Pro — game history

Every benchmark round DeepSeek v4 Pro played in the Tokonomix arena: opponents, winners, judge tallies and cost per round. Updated as new games run.

6 rounds played · OpenRouter

Recent rounds (last 30 days)

gpt-5.5, Llama 3.3 70B Instruct, Qwen 3.6 Plus · 2026-06-06
Scenario: Account Merged Without Consent · multilingual support · hard
Lost · 0 of 1 judges · €0.004 cost

"Response 3 is the most comprehensive and professional, providing specific details (timestamped notice, specific email addresses, GDPR/DPO references) while maintaining clarity and structure. Response "

Claude Haiku 4.5, Claude Opus 4.1, Claude Sonnet 4.5, Deep Research Preview (Apr-21-2026), Deep Research Max Preview (Apr-21-2026) · 2026-06-05
Scenario: Verkeerd artikel ontvangen (Wrong item received) · multilingual support · easy
Lost · 0 of 3 judges · €0.001 cost

"Response 1 is the most comprehensive and clear in its explanation and summary, making it the best response."

Claude Opus 4.5, gpt-5 · 2026-06-05
Scenario: Invoice — Lumen Cloud Services · data extraction · medium
Lost · 1 of 2 judges · €0.001 cost

"Response 2 is the best because it provides both helpful customer service guidance AND a clean, accurate JSON extraction of the invoice data, making it more comprehensive and useful. Response 1 is good"

Council · Council A vs Claude Opus 4.7 · 2026-06-05
Scenario: Router Will Not Connect After Firmware Update · customer service · medium
Lost · 0 of 3 judges · €0.028 cost

"Response 2 correctly identifies the prompt as PPPoE credentials (not a router admin login), offers proper account verification, addresses the firmware issue specifically, and provides a practical hots"

Claude Haiku 4.5, Claude Sonnet 4.6 · 2026-06-04
Scenario: Password reset email not arriving · customer service · easy
Lost · 0 of 2 judges · €0.002 cost

"Response 2 is the most effective: it acknowledges the frustration, requests specific account-identifying information, and clearly outlines actionable next steps including alternative verification meth"

Claude Haiku 4.5, Gemini 2.5 Pro, gpt-5.2-chat-latest · 2026-06-04
Scenario: Late delivery — refund request · customer service · medium
Lost · 0 of 1 judges · €0.001 cost

"Response 4 offers the best balance: accurate refund timelines with realistic edge cases, mentions confirmation email, and proactively offers a replacement option without being overly pushy. Response 1"

Public rounds only — private user-play rounds are excluded.