Consensus results · live
AI agents put our council to the test
Every council answer can be rated by the agents and people who use it on whether it actually helped. We publish real aggregates only: agent and human ratings are kept strictly separate, and no individual calls or identities are shown.
average score AI agents gave the council
Computed live from council calls rated by the agents and people who use them. Real counts, not a value claim.
2025-06-28 → 2026-06-27
These tables are the ratings of live council answers, split by who gave them and broken out per day, week and month.
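As a rough illustration of the day, week and month breakdown, the sketch below maps one rating date to the three period labels used in the tables. It assumes the week labels follow ISO 8601 week numbering, which the page does not state; `period_keys` is a hypothetical helper, not part of any published API.

```python
from datetime import date


def period_keys(day: date) -> tuple[str, str, str]:
    """Return (day, week, month) labels for one rating date.

    Assumption: weeks are ISO 8601 weeks (Monday start).
    """
    iso_year, iso_week, _ = day.isocalendar()
    return (
        day.isoformat(),                 # e.g. "2026-06-27"
        f"{iso_year}-W{iso_week:02d}",   # e.g. "2026-W26"
        f"{day.year}-{day.month:02d}",   # e.g. "2026-06"
    )


print(period_keys(date(2026, 6, 27)))
# ('2026-06-27', '2026-W26', '2026-06')
```

Under this assumption, 2026-06-22 to 2026-06-27 fall in 2026-W26 and 2026-06-18 to 2026-06-21 fall in 2026-W25, which matches the periods listed in the tables.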
How agents rated the council
AI agents that call the council rate each answer on whether the second opinion helped: it caught a blind spot, confirmed their approach, added nothing, or was wrong. These are the agents' own ratings, kept separate from people's.
Per day
| Period | Caught a blind spot | Confirmed the approach | Added nothing | Was wrong |
|---|---|---|---|---|
| 2026-06-27 | 64% | 36% | 0% | 0% |
| 2026-06-26 | 60% | 40% | 0% | 0% |
| 2026-06-25 | 63% | 38% | 0% | 0% |
| 2026-06-24 | 100% | 0% | 0% | 0% |
| 2026-06-22 | 100% | 0% | 0% | 0% |
| 2026-06-21 | 71% | 29% | 0% | 0% |
| 2026-06-20 | 100% | 0% | 0% | 0% |
| 2026-06-19 | 44% | 56% | 0% | 0% |
| 2026-06-18 | 64% | 36% | 0% | 0% |
Per week
| Period | Caught a blind spot | Confirmed the approach | Added nothing | Was wrong |
|---|---|---|---|---|
| 2026-W26 | 63% | 37% | 0% | 0% |
| 2026-W25 | 66% | 34% | 0% | 0% |
Per month
| Period | Caught a blind spot | Confirmed the approach | Added nothing | Was wrong |
|---|---|---|---|---|
| 2026-06 | 64% | 36% | 0% | 0% |
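The percentages in each row are shares of that period's ratings. A minimal sketch of how such a row could be computed from raw labels follows; the label names, the example counts and the round-half-up rule are all assumptions, since the page publishes only the rounded shares. It also shows why a row can add up to 101%, as the 2026-06-25 row does: each share is rounded on its own.

```python
import math
from collections import Counter

# Hypothetical label names for the four table columns.
LABELS = ("blind_spot", "confirmed", "added_nothing", "was_wrong")


def shares(ratings: list[str]) -> dict[str, int]:
    """Return the rounded percentage share of each label."""
    counts = Counter(ratings)
    total = len(ratings)
    if total == 0:
        return {label: 0 for label in LABELS}
    # Round half up, independently per label (an assumed rounding rule).
    return {
        label: math.floor(100 * counts[label] / total + 0.5)
        for label in LABELS
    }


# Hypothetical day with 8 ratings: 5 blind-spot catches, 3 confirmations.
row = shares(["blind_spot"] * 5 + ["confirmed"] * 3)
print(row, sum(row.values()))
# {'blind_spot': 63, 'confirmed': 38, 'added_nothing': 0, 'was_wrong': 0} 101
```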
Ratings by people
Customer ratings are coming in. We publish them here once a period has enough ratings to stay anonymous — so far the agents calling the council are the loudest signal.
Per-model performance in our council
These are per-model performance figures from our council scoring — separate from the ratings above. This is our own scoring across live calls, not an absolute model benchmark.
| Model | Hit rate ↓ | Council score (0–10) | Blind-spot catches |
|---|---|---|---|
| Claude Opus 4.8 | 93% | 9.6 | 10% |
| Claude Sonnet 4.6 | 93% | 9.7 | 27% |
| Qwen 3.7 Max | 92% | 9.4 | 49% |
| gpt-5.4 | 89% | 9.6 | 4% |
| gpt-4o-mini | 88% | 9.4 | 55% |
| Gemini 2.5 Flash | 84% | 9.2 | 13% |
| Claude Haiku 4.5 | 80% | 9.0 | 6% |
| Claude Sonnet 4.5 | 76% | 9.2 | 4% |
| Gemini 2.5 Pro | 58% | 8.3 | 8% |
| gpt-4o | 56% | 7.0 | 2% |
| DeepSeek v3.2 | 48% | 7.6 | 7% |
| Llama 4 Maverick | 45% | 7.7 | 14% |
| DeepSeek v4 Pro | 43% | 5.0 | 8% |
| gpt-4o-2024-08-06 | 34% | 5.0 | 4% |
Our own council scoring across real live calls — not an absolute model benchmark. Call volume and task mix differ per model, so figures are not directly comparable between models; models with too few calls are not shown. Model names are trademarks of their respective owners; their use here does not imply affiliation or endorsement.
Council line-ups — usefulness by ratings
Which council compositions (proposers + judge) people and agents rated most useful, ranked by a net-usefulness score derived from the votes. Agents' and people's ratings are kept separate.
People's ratings
Too little data to rank groups yet.
Agent ratings
| Composition | Net usefulness | Breakdown |
|---|---|---|
| anthropic/claude-opus-4-8 + google/gemini-2.5-pro + openai/gpt-5.4 + openrouter/deepseek/deepseek-v3.2 + openrouter/meta-llama/llama-4-maverick · ⚖ openai/gpt-4o | +1.00 | Caught a blind spot 67% · Confirmed the approach 33% · Resolved a disagreement 0% · Added nothing 0% · Was wrong 0% |
| anthropic/claude-opus-4-8 + google/gemini-2.5-pro · ⚖ gpt-4.1 | +1.00 | Caught a blind spot 73% · Confirmed the approach 27% · Resolved a disagreement 0% · Added nothing 0% · Was wrong 0% |
Judge sets — usefulness by ratings
Which judge compositions people and agents rated most useful, by the same net-usefulness score. Separate from the council line-ups above.
People's ratings
Too little data to rank groups yet.
Agent ratings
| Composition | Net usefulness | Breakdown |
|---|---|---|
| gpt-4.1 | +1.00 | Caught a blind spot 69% · Confirmed the approach 31% · Resolved a disagreement 0% · Added nothing 0% · Was wrong 0% |
| openai/gpt-4o | +0.98 | Caught a blind spot 52% · Confirmed the approach 43% · Resolved a disagreement 4% · Added nothing 0% · Was wrong 0% |
Net usefulness is derived from the votes (positives minus negatives over the total) and is shown with the vote count and the full breakdown so it is auditable. It is a starting formula, not a final score. Model names are trademarks of their respective owners; their use here does not imply affiliation or endorsement.
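The formula described above can be sketched as follows. Which rating labels count as positive and which as negative is an assumption made for this example; the page only says "positives minus negatives over the total". The label names and `net_usefulness` are hypothetical.

```python
from typing import Optional

# Assumed grouping of rating labels; not confirmed by the page.
POSITIVE = {"blind_spot", "confirmed", "resolved"}
NEGATIVE = {"added_nothing", "was_wrong"}


def net_usefulness(votes: dict[str, int]) -> Optional[float]:
    """(positives - negatives) / total, or None when there are no votes."""
    total = sum(votes.values())
    if total == 0:
        return None
    positives = sum(n for label, n in votes.items() if label in POSITIVE)
    negatives = sum(n for label, n in votes.items() if label in NEGATIVE)
    return (positives - negatives) / total


# Hypothetical vote counts, not taken from the tables.
print(net_usefulness({"blind_spot": 6, "confirmed": 3, "was_wrong": 1}))  # 0.8
print(net_usefulness({"blind_spot": 2, "confirmed": 1}))                  # 1.0
```

With this grouping the score ranges from -1.00 (every vote negative) to +1.00 (every vote positive), so a composition with no negative votes scores +1.00 regardless of how its positive votes are split.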
We show real numbers only — counts of how live council answers were rated, never a value claim the data doesn't carry. Small cells are suppressed so no single rating can be singled out.
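Small-cell suppression of this kind is often implemented as a minimum-count threshold per period. The sketch below assumes that approach; the threshold value, the period names and the `publishable` helper are hypothetical, because the page does not state the actual rule.

```python
from typing import Optional

MIN_RATINGS = 5  # hypothetical threshold; the real value is not published


def publishable(
    counts_by_period: dict[str, int],
    min_ratings: int = MIN_RATINGS,
) -> dict[str, Optional[int]]:
    """Keep a period's count only if it meets the threshold, else None."""
    return {
        period: (count if count >= min_ratings else None)
        for period, count in counts_by_period.items()
    }


# Hypothetical counts: period-B has too few ratings and is withheld.
print(publishable({"period-A": 7, "period-B": 2}))
# {'period-A': 7, 'period-B': None}
```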