Skip to content
Use cases/Voice & conversational

Which AI model feels most human in dialogue?

Voice and conversational AI is the workload that surfaces every weakness a model has the fastest. Tone drift, latency, broken memory, persona collapse, the small disfluencies that make a human-sounding agent suddenly feel robotic — all of them register inside the first minute of a real conversation. This guide breaks down the dimensions that decide which model carries a voice product, then names the five we would put on a phone call today.

Voice agent workspace — concept image
Voice is the unforgiving channel — every second of latency shows.

Why dialogue is the workload models fail on most visibly

Text gives a model time. A user sends a message, the model reads, thinks, writes, the user reads, considers, replies. Slow reasoning is invisible inside that cadence. Voice removes the buffer. A pause longer than a second reads as confusion; a pause longer than two reads as failure. Whoever picks the model for a voice product is picking on a latency budget every other workload would treat as aggressive.

The architecture choice that follows is whether to run an audio-native model end to end or to stack a chain — speech to text, then language model, then text to speech. The audio-native route is unbeaten on latency and on paralinguistic awareness: the model can tell when the user is hesitating, can interrupt and be interrupted, can adopt a register the prompt did not name. The stacked route is easier to debug, cheaper to scale, and gives you full control of voice selection and brand sound.

Persona consistency matters more here than almost anywhere else. In text, a one-line tone shift between turns goes unnoticed; in voice it lands as a different person taking over the call. Models that drift between turns are unfit for voice work even when they would be fine for chat. Test for it explicitly — twenty turns at minimum, with deliberately distracting user inputs along the way.

Five constraints define the work: end-to-end latency, persona stability across turns, audio quality where relevant, multilingual speech coverage, and tool-calling discipline mid-conversation. A voice agent that handles all five gracefully feels like a person; one that drops any single dimension feels like a chatbot reading aloud.

Voice pipeline architecture — concept image
Audio-native versus stacked STT-LLM-TTS — the architecture is the choice.

The five dimensions that decide which model wins

These are the axes our scorecard weights for any model that ships in a voice product. Their relative importance shifts with whether you are building a phone-line agent or a long-form companion app — but every contender clears a minimum bar on all five.

  1. 01 — End-to-end latency

    Does the user hear a reply inside a heartbeat?

    The clock starts the moment the user stops speaking and ends the moment they hear the first audible word back. Audio-native models can hit that budget; stacked pipelines have to optimise every layer. Measure on the network you will deploy on, not the vendor's demo region.

  2. 02 — Persona stability across turns

    Does turn twenty sound like turn one?

    Drift is the single failure mode that breaks the illusion of a person on the other end. Models that snap back to their default voice as the prompt loses salience are unusable for any voice product with a brand identity. Always stress-test with adversarial users that try to change the persona mid-call.

  3. 03 — Audio quality and paralinguistic awareness

    Does it hear how the user said it, not just what?

    Frustration, hesitation, sarcasm, urgency — humans carry meaning in tone that pure-text models cannot perceive. Audio-native models read these signals and adapt; stacked pipelines lose them entirely at the STT step. The right architecture depends on whether your product needs that nuance.

  4. 04 — Multilingual speech coverage

    Does it handle code-switching mid-sentence?

    Real voice traffic includes accents, dialects, and users who switch language inside a single utterance. The model has to follow without losing the thread. Test on recordings from your actual customer base, not the vendor's pronunciation benchmark.

  5. 05 — Tool-calling mid-conversation

    Can it look something up without breaking flow?

    Voice agents need to query CRMs, check inventory, book appointments. The hard part is doing it naturally — filling the wait with a spoken acknowledgment, recovering gracefully when the tool fails. Models tuned for chat tool-use often emit awkward filler that breaks immersion.

Tokonomix top 5 picks for voice and dialogue today

These are the five we would put on a live channel today. A voice product almost never ships with a single model; the architecture that works is layered — an audio-native model on the spoken layer for latency and paralinguistic awareness, and a stronger text model underneath doing the planning, tool-calling and knowledge work the audio layer hands off to it.

#1 · Audio-native realtimeTier A

Claude Sonnet 4.6

via Anthropic

Audio in, audio out, low-latency end to end. The right pick for telephony, browser voice agents and any application where the user expects an interruption to land within a heartbeat. Native handling of paralinguistic cues — pauses, tone, urgency — that text-plus-TTS pipelines cannot match.

Input / 1M tokens
$3.00
Output / 1M tokens
$15.00
Context
1M
Full benchmark profile →
#2 · Best dialogue tone (text + TTS)Tier A

Gemini 2.5 Pro

via Google Gemini

The model to put behind a text-first voice agent that streams to a TTS layer. Sonnet 4.6 holds persona across long sessions better than most peers and reliably matches the register you describe in the prompt. Cheaper than audio-native models and easier to swap out as TTS quality keeps improving.

Input / 1M tokens
$1.25
Output / 1M tokens
$10.00
Context
1.048576M
Full benchmark profile →
#3 · Long-context memoryTier A

Claude Haiku 4.5

via Anthropic

A million-token context turns the entire session — and arbitrarily large history — into something the model can attend to without truncation. Right pick for companion apps, coaching agents and any voice product that benefits from remembering what the user said in the call last week.

Input / 1M tokens
$1.00
Output / 1M tokens
$5.00
Context
200K
Full benchmark profile →
#4 · Snappy back-and-forth

Meta-Llama-3_3-70B-Instruct

via OVH AI Endpoints (GRA)

Short turns, fast first-token, low cost. Right pick when the conversation is structured — booking, lookup, status check — and the latency budget is the constraint. Pair with a strong system prompt and the same TTS layer you use for Sonnet escalations.

Input / 1M tokens
$0.6700
Output / 1M tokens
$0.6700
Context
Full benchmark profile →

Output price per million tokens

For voice the output cost dominates — most of the tokens are the spoken reply. The chart below shows the text-tier list price for the models above with published rates; audio-native models price separately, on audio minutes rather than tokens, and need a different model than the one shown here.

Price per 1M output tokens, USD. Audio-native models (gpt-realtime) bill on audio-minute rates and are excluded from this comparison. Source: live provider pricing tracked by Tokonomix.
Voice analytics dashboard — concept image
Measure session-end satisfaction, not first-turn accuracy.

A field guide: which model for which voice pattern

The mapping below is the one we would use to advise a team building a new voice product. Treat it as a default, not a verdict — a single weekend of testing on real recordings will beat any general recommendation.

Pattern A

Real-time phone-line agent

Inbound support calls, outbound sales, booking lines. Latency wins everything. gpt-realtime end to end, with Sonnet 4.6 as the planner the realtime model defers to when the conversation goes off-script.

Pattern B

Browser voice agent with brand voice

In-product assistant where the voice is part of the identity. Stacked pipeline — Sonnet 4.6 driving the conversation, a chosen TTS engine producing the audio. Trade some latency for full control of how the agent sounds.

Pattern C

Long-form companion or coach

Sessions that run for an hour or more and benefit from cross-session memory. Gemini 2.5 Pro for the context window; persist conversation history per user and feed it back into the system prompt every session.

Pattern D

Self-hosted voice agent

Healthcare, finance, regulated industries where recordings cannot leave a specific jurisdiction. Self-host Llama 3.3 70B plus Whisper for STT and an open-weight TTS engine. Slower to iterate, full control of the data.

Voice agent operations setup — concept image
A voice agent designed in text always disappoints in production.

Benchmark on your own calls before you commit

You will not learn what you need from a vendor demo or a static prompt set. Record twenty real conversations — users you actually have, scenarios you actually run — and replay each one through every candidate end to end. Synthetic transcripts will not surface the failure modes that matter; the awkward pauses, the hostile users, the cross-talk all live in real audio.

Listen, do not just read the transcript. Did the first word land before the user gave up? Did the agent still sound like itself by minute ten? Did it pick up on the frustration in the third turn or talk past it? Did the tool call land naturally inside the flow of the call, or did it leave a hole the user noticed? Pick whichever model your own ear trusts at the end of the playback, not the one a benchmark prefers.

Open the live test tool →

Related use cases