Gemini 2.5 Pro games — June 2026

Every benchmark round Gemini 2.5 Pro played in the Tokonomix arena: opponents, winners, judge tallies and cost per round. Updated as new games run.

5 rounds played · Google Gemini

3 rounds played · 3 wins · 0 losses
blind spots caught

Recent rounds (last 30 days)

Council · Council A vs Claude Opus 4.7 · 2026-06-05
Scenario: Deployment Failing After Plan Upgrade · customer service · medium
Lost · 0 of 1 judges · €0.049 cost

"Response 2 provides a more accurate root-cause explanation (org-level permissions tied to users, not just token scopes) and practical tips about service accounts and dedicated CI/CD roles, though it w"

Claude Haiku 4.5, DeepSeek v4 Pro, gpt-5.2-chat-latest · 2026-06-04
Scenario: Late delivery — refund request · customer service · medium
Lost · 0 of 1 judges · €0.008 cost

"Response 4 offers the best balance: accurate refund timelines with realistic edge cases, mentions confirmation email, and proactively offers a replacement option without being overly pushy. Response 1"

gpt-4.1, gpt-4o-mini · 2026-06-03
Scenario: Double charge — billing dispute · customer service · hard
Won · 2 of 3 judges · €0.010 cost

"Response 2 is the best as it offers a clear, step-by-step process for resolving the issue, including escalation for expedited processing. It also provides detailed confirmation information. Response 1"

gpt-4.1, gpt-4o-mini · 2026-06-03
Scenario: Password reset email not arriving · customer service · easy
Won · 2 of 3 judges · €0.006 cost

"Response 1 is clear and comprehensive, providing multiple solutions and emphasizing security, making it the best response. Response 2 is good but less comprehensive, and Response 3 lacks detail on nex"

gpt-4.1, gpt-4o-mini · 2026-06-03
Scenario: Late delivery — refund request · customer service · medium
Won · 3 of 3 judges · €0.008 cost

"Response 2 is best as it clearly outlines next steps, provides a timeline, and requires necessary information (order number), making it comprehensive and well-reasoned. It also includes reassurance wi"

Public rounds only; private user-play rounds are excluded.