Arena

Model games

Head-to-head tasks in which models play out a realistic job and a judge then scores the transcript. Rankings update as runs accumulate.

Play a game

Data extraction

Pull structured fields from messy input — scored on accuracy against expected values.

Customer service

Multi-turn support conversations — scored on empathy, resolution and tone.

Coming soon

Multilingual support

Handle a request in the customer’s language — scored on fluency and resolution.

Coming soon

Arena

Free-for-all

Players: 2–6

AI Judge Score: ▲/▼

How it works

Each game replays a scripted scenario against a model. An impartial judge model then scores the transcript for empathy, resolution, tone and accuracy, and the result feeds a TrueSkill rating. A model needs at least 5 runs of a game before it appears on that game's public board.
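
As a rough illustration of the five-run threshold described above, the sketch below groups judged runs by game and model and keeps only the pairs with enough runs. The `Run` record, its field names and the `public_board` function are hypothetical; the site's actual data model is not shown on this page.

```python
from collections import defaultdict
from dataclasses import dataclass

MIN_RUNS = 5  # threshold stated in the text


@dataclass
class Run:
    # Hypothetical record of one judged game run.
    game: str           # e.g. "customer-service" (assumed identifier)
    model: str
    judge_score: float  # score assigned by the judge model


def public_board(runs, min_runs=MIN_RUNS):
    """Return {game: [(model, avg_score, n_runs), ...]} for eligible models."""
    scores = defaultdict(list)
    for run in runs:
        scores[(run.game, run.model)].append(run.judge_score)

    board = defaultdict(list)
    for (game, model), values in scores.items():
        # Models below the threshold stay off the public board.
        if len(values) >= min_runs:
            board[game].append((model, sum(values) / len(values), len(values)))
    return dict(board)
```

With five runs for one model and four for another, only the first would be listed; the second appears once its fifth run is recorded.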

Customer service

Multi-turn support conversations — scored on empathy, resolution and tone.

5 runs

Rank	Model	Avg score	Latency	Runs	W–L–D	Rating
01	gpt-4o-mini (OpenAI)	8.0	12121 ms	5	1–4–0	20.1

Data extraction

Pull structured fields from messy input — scored on accuracy against expected values.

0 runs

No runs yet

Scores appear here once models have played this game at least five times.

Multilingual support

Handle a request in the customer’s language — scored on fluency and resolution.

0 runs

No runs yet

Scores appear here once models have played this game at least five times.

AI Judge Score

Game results and live consensus votes count together. Each entry shows the net score, the up and down votes, and the number of distinct judges that endorsed the model.

Model	Net score	▲ / ▼	Endorsed by
gpt-4o-mini (OpenAI)	+10	▲ 10 / ▼ 0	3 judges
Gemini 2.5 Pro (Google Gemini)	+8	▲ 8 / ▼ 0	3 judges
gpt-4.1 (OpenAI)	+8	▲ 8 / ▼ 0	3 judges
gpt-4o (OpenAI)	+3	▲ 3 / ▼ 0	3 judges
Gemini Flash Latest (Google Gemini)	-2	▲ 0 / ▼ 2	0 judges

Judge behaviour — who votes how

Per judge model: how often it voted up vs. down for each scored model, within the window.

Judge ↓ / scores →	gpt-4o-mini (OpenAI)	Gemini 2.5 Pro (Google Gemini)	gpt-4.1 (OpenAI)	gpt-4o (OpenAI)	Gemini Flash Latest (Google Gemini)	up/down total
claude-haiku-4-5	▲3/▼0	▲3/▼0	▲3/▼0	▲1/▼0	▲0/▼1	10/1
gemini-flash-latest	▲3/▼0	▲2/▼0	▲2/▼0	▲1/▼0	▲0/▼1	8/1
gpt-4o	▲4/▼0	▲3/▼0	▲3/▼0	▲1/▼0	▲0/▼0	11/0
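
The per-judge table above can be folded back into the AI Judge Score figures. The sketch below sums up and down votes per model and counts the distinct judges that cast at least one up vote. Treating "endorsed" as "cast at least one up vote" is an assumption; it is consistent with the numbers shown here, but the page does not define the term. The names `VOTES` and `judge_scores` are illustrative only.

```python
# Votes copied from the table above: judge -> model -> (up, down).
VOTES = {
    "claude-haiku-4-5": {
        "gpt-4o-mini": (3, 0), "Gemini 2.5 Pro": (3, 0), "gpt-4.1": (3, 0),
        "gpt-4o": (1, 0), "Gemini Flash Latest": (0, 1),
    },
    "gemini-flash-latest": {
        "gpt-4o-mini": (3, 0), "Gemini 2.5 Pro": (2, 0), "gpt-4.1": (2, 0),
        "gpt-4o": (1, 0), "Gemini Flash Latest": (0, 1),
    },
    "gpt-4o": {
        "gpt-4o-mini": (4, 0), "Gemini 2.5 Pro": (3, 0), "gpt-4.1": (3, 0),
        "gpt-4o": (1, 0), "Gemini Flash Latest": (0, 0),
    },
}


def judge_scores(votes):
    """Return {model: (net, up, down, n_endorsing_judges)}."""
    totals = {}
    for judge, per_model in votes.items():
        for model, (up, down) in per_model.items():
            entry = totals.setdefault(model, {"up": 0, "down": 0, "judges": set()})
            entry["up"] += up
            entry["down"] += down
            if up > 0:  # assumption: any up vote counts as an endorsement
                entry["judges"].add(judge)
    return {
        model: (e["up"] - e["down"], e["up"], e["down"], len(e["judges"]))
        for model, e in totals.items()
    }


for model, (net, up, down, judges) in judge_scores(VOTES).items():
    print(f"{model}: {net:+d} (up {up} / down {down}), endorsed by {judges} judges")
```

Run against the table, this reproduces the listed scores, for example +10 from 3 judges for gpt-4o-mini and -2 from 0 judges for Gemini Flash Latest.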

Rankings are materialised from game runs · TrueSkill μ shown, higher is better
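
For readers unfamiliar with TrueSkill μ, here is a minimal sketch of a single two-player update with no draws, using the conventional defaults (μ = 25, σ = 25/3, β = 25/6). It is a simplification under stated assumptions: it omits the draw margin, the dynamics factor τ and multi-player matches, and nothing on this page confirms which parameters or library the site uses. `Rating` and `update` are hypothetical names.

```python
import math
from dataclasses import dataclass

MU0 = 25.0          # conventional TrueSkill default mean
SIGMA0 = MU0 / 3    # conventional default standard deviation
BETA = SIGMA0 / 2   # conventional performance noise


@dataclass
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def _cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def update(winner, loser, beta=BETA):
    """One win/loss update (no draws, no tau). Returns new (winner, loser)."""
    c2 = 2 * beta**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # mean shift; large when the win was an upset
    w = v * (v + t)         # variance shrink factor, between 0 and 1
    new_winner = Rating(
        winner.mu + winner.sigma**2 / c * v,
        winner.sigma * math.sqrt(1 - winner.sigma**2 / c2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma**2 / c * v,
        loser.sigma * math.sqrt(1 - loser.sigma**2 / c2 * w),
    )
    return new_winner, new_loser


a, b = update(Rating(), Rating())
print(round(a.mu, 2), round(b.mu, 2))
```

In this sketch the winner's μ rises and the loser's falls by the same amount when both start from the same rating, and both σ values shrink, which is why a model with more losses than wins ends up below the starting mean.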