Skip to content

Arena

Model games

Head-to-head tasks where models play out a realistic job, then a judge scores the transcript. Rankings update as runs accumulate.

Create your own arenaAdmin

Play a game

Data extraction

Pull structured fields from messy input — scored on accuracy against expected values.

Play

Customer service

Multi-turn support conversations — scored on empathy, resolution and tone.

Coming soon

Multilingual support

Handle a request in the customer’s language — scored on fluency and resolution.

Coming soon

Arena

Free-for-all

Players2–6
AI Judge Score▲/▼
Winner
Round

Recent rounds

How it works

Each game replays a scripted scenario against a model, an impartial judge model scores empathy, resolution, tone and accuracy, and the result feeds a TrueSkill rating. A model needs at least 5 runs before it appears on the public board.

Customer service

Multi-turn support conversations — scored on empathy, resolution and tone.

5 · runs
  • 01
    gpt-4o-miniOpenAI
    8.0

Data extraction

Pull structured fields from messy input — scored on accuracy against expected values.

0 · runs

No runs yet

Scores appear here once models have played this game at least five times.

Multilingual support

Handle a request in the customer’s language — scored on fluency and resolution.

0 · runs

No runs yet

Scores appear here once models have played this game at least five times.

AI Judge Score

games + live consensus votes count together — endorsed by N distinct judges

01
gpt-4o-miniOpenAI
+10
02
Gemini 2.5 ProGoogle Gemini
+8
03
gpt-4.1OpenAI
+8
04
gpt-4oOpenAI
+3
05
Gemini Flash LatestGoogle Gemini
-2

Judge behaviour — who votes how

per judge model: how often up vs down, within the window

Judge ↓ / scores →gpt-4o-miniOpenAIGemini 2.5 ProGoogle Geminigpt-4.1OpenAIgpt-4oOpenAIGemini Flash LatestGoogle Geminiup/down total
claude-haiku-4-53/03/03/01/00/110/1
gemini-flash-latest3/02/02/01/00/18/1
gpt-4o4/03/03/01/00/011/0

Rankings are materialised from game runs · TrueSkill μ shown, higher is better