Consensus results · live

AI agents put our council to the test

Every council answer can be rated on whether it actually helped — by the agents and people who use it. Real aggregates only: agent and human ratings kept strictly separate, no individual calls, no identities.

8.1/10

average score AI agents gave the council

Computed live from council calls rated by the agents and people who use them. Real counts, not a value claim.

Period: 2025-06-28 to 2026-06-27

These tables are the ratings of live council answers, split by who gave them and broken out per day, week and month.

How agents rated the council

AI agents that call the council rate each answer on whether the second opinion helped: it caught a blind spot, confirmed their approach, added nothing, or was wrong. These are the agents' own ratings, kept separate from people's.

Per day

Period | Caught a blind spot | Confirmed the approach | Added nothing | Was wrong
2026-06-27 | 64% | 36% | 0% | 0%
2026-06-26 | 60% | 40% | 0% | 0%
2026-06-25 | 63% | 38% | 0% | 0%
2026-06-24 | 100% | 0% | 0% | 0%
2026-06-22 | 100% | 0% | 0% | 0%
2026-06-21 | 71% | 29% | 0% | 0%
2026-06-20 | 100% | 0% | 0% | 0%
2026-06-19 | 44% | 56% | 0% | 0%
2026-06-18 | 64% | 36% | 0% | 0%

Per week

Period | Caught a blind spot | Confirmed the approach | Added nothing | Was wrong
2026-W26 | 63% | 37% | 0% | 0%
2026-W25 | 66% | 34% | 0% | 0%

Per month

Period | Caught a blind spot | Confirmed the approach | Added nothing | Was wrong
2026-06 | 64% | 36% | 0% | 0%
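
The day, week and month breakdowns above can be sketched as a bucketing step followed by a percentage share per outcome. This is a minimal sketch, not the site's actual pipeline. It assumes ISO 8601 week numbering for labels such as 2026-W26 and assumes shares are rounded half up; the names `period_keys` and `shares` and the sample votes are hypothetical.

```python
from collections import Counter
from datetime import date
from decimal import Decimal, ROUND_HALF_UP

# Outcome labels taken from the table headers above.
OUTCOMES = (
    "Caught a blind spot",
    "Confirmed the approach",
    "Added nothing",
    "Was wrong",
)


def period_keys(day: date) -> tuple:
    """Return (day, week, month) labels. ISO week numbering is an assumption."""
    iso_year, iso_week, _ = day.isocalendar()
    return (
        day.isoformat(),
        f"{iso_year}-W{iso_week:02d}",
        f"{day.year}-{day.month:02d}",
    )


def shares(votes: list) -> dict:
    """Percentage share of each outcome, rounded half up (assumed rounding rule)."""
    counts = Counter(votes)
    total = len(votes)
    if total == 0:
        return {outcome: 0 for outcome in OUTCOMES}
    result = {}
    for outcome in OUTCOMES:
        pct = Decimal(counts[outcome] * 100) / Decimal(total)
        result[outcome] = int(pct.quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    return result


# Hypothetical sample: 5 of 8 votes for one outcome, 3 of 8 for another.
sample = ["Caught a blind spot"] * 5 + ["Confirmed the approach"] * 3
print(period_keys(date(2026, 6, 27)))
print(shares(sample))
```

With half-up rounding, 5 of 8 and 3 of 8 become 63% and 38%, so a row can add up to 101%. Whether the live tables round this way is not stated on the page.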

Ratings by people

Customer ratings are coming in. We will publish them here once a period has enough ratings to keep them anonymous. So far, the agents calling the council are the loudest signal.

Per-model performance in our council

These are per-model performance figures from our council scoring — separate from the ratings above. This is our own scoring across live calls, not an absolute model benchmark.

Model | Hit rate | Council score (0–10) | Blind-spot catches
Claude Opus 4.8 | 93% | 9.6 | 10%
Claude Sonnet 4.6 | 93% | 9.7 | 27%
Qwen 3.7 Max | 92% | 9.4 | 49%
gpt-5.4 | 89% | 9.6 | 4%
gpt-4o-mini | 88% | 9.4 | 55%
Gemini 2.5 Flash | 84% | 9.2 | 13%
Claude Haiku 4.5 | 80% | 9.0 | 6%
Claude Sonnet 4.5 | 76% | 9.2 | 4%
Gemini 2.5 Pro | 58% | 8.3 | 8%
gpt-4o | 56% | 7.0 | 2%
DeepSeek v3.2 | 48% | 7.6 | 7%
Llama 4 Maverick | 45% | 7.7 | 14%
DeepSeek v4 Pro | 43% | 5.0 | 8%
gpt-4o-2024-08-06 | 34% | 5.0 | 4%

Our own council scoring across real live calls — not an absolute model benchmark. Call volume and task mix differ per model, so figures are not directly comparable between models; models with too few calls are not shown. Model names are trademarks of their respective owners; their use here does not imply affiliation or endorsement.

Council line-ups — usefulness by ratings

Which council compositions (proposers + judge) people and agents rated most useful, ranked by a net-usefulness score derived from the votes. Agent ratings and people's ratings are kept separate.

People's ratings

Too little data to rank groups yet.

Agent ratings

Composition | Net usefulness | Breakdown
anthropic/claude-opus-4-8 + google/gemini-2.5-pro + openai/gpt-5.4 + openrouter/deepseek/deepseek-v3.2 + openrouter/meta-llama/llama-4-maverick · ⚖ openai/gpt-4o | +1.00 | Caught a blind spot 67% · Confirmed the approach 33% · Resolved a disagreement 0% · Added nothing 0% · Was wrong 0%
anthropic/claude-opus-4-8 + google/gemini-2.5-pro · ⚖ gpt-4.1 | +1.00 | Caught a blind spot 73% · Confirmed the approach 27% · Resolved a disagreement 0% · Added nothing 0% · Was wrong 0%

Judge sets — usefulness by ratings

Which judge compositions people and agents rated most useful, by the same net-usefulness score. Separate from the council line-ups above.

People's ratings

Too little data to rank groups yet.

Agent ratings

Composition | Net usefulness | Breakdown
gpt-4.1 | +1.00 | Caught a blind spot 69% · Confirmed the approach 31% · Resolved a disagreement 0% · Added nothing 0% · Was wrong 0%
openai/gpt-4o | +0.98 | Caught a blind spot 52% · Confirmed the approach 43% · Resolved a disagreement 4% · Added nothing 0% · Was wrong 0%

Net usefulness is derived from the votes (positives minus negatives over the total) and is shown with the vote count and the full breakdown so it is auditable. It is a starting formula, not a final score.
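
The formula described above (positives minus negatives over the total) could look like the sketch below. The page does not say which outcomes count as positive or negative, so the grouping here is an assumption, as are the function name `net_usefulness` and the sample vote counts, which are not taken from the tables.

```python
from typing import Optional

# Assumed grouping: the page does not state which outcomes are positive or negative.
POSITIVE = {"Caught a blind spot", "Confirmed the approach", "Resolved a disagreement"}
NEGATIVE = {"Added nothing", "Was wrong"}


def net_usefulness(counts: dict) -> Optional[float]:
    """(positives - negatives) / total, or None when there are no votes."""
    total = sum(counts.values())
    if total == 0:
        return None
    positives = sum(n for outcome, n in counts.items() if outcome in POSITIVE)
    negatives = sum(n for outcome, n in counts.items() if outcome in NEGATIVE)
    return (positives - negatives) / total


# Hypothetical vote counts for two line-ups.
lineups = {
    "line-up A": {"Caught a blind spot": 8, "Confirmed the approach": 4},
    "line-up B": {"Caught a blind spot": 5, "Confirmed the approach": 4, "Was wrong": 1},
}

# Rank line-ups from most to least useful by the score.
ranked = sorted(lineups, key=lambda name: net_usefulness(lineups[name]), reverse=True)
for name in ranked:
    print(f"{name}: {net_usefulness(lineups[name]):+.2f}")
```

Under this grouping the score ranges from -1.00 (all votes negative) to +1.00 (all votes positive), which matches the signed format used in the tables.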

We show real numbers only — counts of how live council answers were rated, never a value claim the data doesn't carry. Small cells are suppressed so no single rating can be singled out.
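
Small-cell suppression of this kind can be sketched as a minimum-count rule. The cutoff below (`MIN_RATINGS = 5`) is purely illustrative, since the page does not publish the actual threshold; the function name `publishable` and the sample counts are also hypothetical.

```python
from typing import Optional

MIN_RATINGS = 5  # hypothetical cutoff; the real threshold is not stated on this page


def publishable(counts_by_period: dict, min_ratings: int = MIN_RATINGS) -> dict:
    """Keep a period's count only if it meets the minimum; otherwise mark it None."""
    result = {}
    for period, count in counts_by_period.items():
        value: Optional[int] = count if count >= min_ratings else None
        result[period] = value
    return result


# Hypothetical rating counts per day.
sample_counts = {"2026-06-26": 12, "2026-06-27": 3}
print(publishable(sample_counts))
```

A period marked `None` would simply be left out of the published table, so a single rating never appears on its own.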