Skip to main content

Command Palette

Search for a command to run...

For Voice Agents, Latency Beats Intelligence

The phone network steals 300ms before your model even wakes up. Here's how voice agents survive the budget that's left

Updated
6 min read
For Voice Agents, Latency Beats Intelligence
I
The hard parts of building production voice AI

"I've spent the last 3 years building conversational AI, from rule-based bots to production voice agents handling live phone calls."

Here's the most expensive lesson I've learned building agents: a model that's 800ms smarter but 800ms slower will lose a phone call. Every time.

On text, nobody notices a thoughtful pause. On a voice call, 800ms of silence is a UX failure - the caller assumes the line dropped, talks over the agent, or hangs up. The entire evolution of agents, from chatbots to the real-time voice systems I work on now, bends toward this one tension. And almost everything hard about voice agents is, underneath, a fight to buy back milliseconds.

Let me trace how we got here, because the history explains the constraint.

The text era: latency was free

The first wave of "agents" weren't agents at all they were chatbots with decision trees. Then large language models arrived, and we got something genuinely new: a system that could read a question, reason about it, and answer in natural language. We bolted on retrieval (RAG) so the model could ground answers in real documents, and then tool-calling so it could act, look up an order, book a slot, hit an API.

That progression - chatbot → RAG → tool-using agent is the story everyone knows. What gets glossed over is that the whole time, latency was a free resource. A web chat agent can take two, three, four seconds to respond and the user just sees a typing indicator. So we spent that budget lavishly: bigger models, multi-step reasoning, retrieve-then-rerank-then-generate pipelines, agents that call three tools before saying a word. Cleverness was cheap because waiting was invisible.

That assumption is so baked into how we build agents that most teams don't notice they're relying on it. Then they try to put their agent on a phone, and the physics change.

Voice changes the physics

A spoken conversation has a metronome. Humans expect a response within roughly 200-500ms of finishing a sentence. Cross a second of silence and the interaction feels broken; even if the eventual answer is perfect.

Now stack reality on top of that. On a real phone call, before your model has processed a single token, you've already spent something like 300ms just moving audio around: the caller's voice traveling up through the carrier network, jitter buffering, codec encode/decode, the same trip back down for your reply. That's overhead you don't control and can't remove. It comes straight out of your budget.

So the math is brutal. Say your conversational target is 700ms mouth-to-ear. Subtract ~300ms of telephony round-trip. You have ~400ms left to: detect the caller actually stopped talking, run speech-to-text, run the LLM to first token, and start speech synthesis. The smart, four-tool, retrieve-and-rerank pipeline you built for chat doesn't fit. It was never going to fit.

This is why I say latency beats intelligence. Not because intelligence doesn't matter, but because below a latency floor, no amount of intelligence is perceived at all. The caller has already started talking over you.

What you actually build to survive it

Once you internalize the budget, the engineering stops being about the model and starts being about shaving milliseconds and hiding the ones you can't shave. A few of the scars I've collected:

Turn detection is the whole game, and fixed thresholds are a trap. The naive approach: wait for N milliseconds of silence, then assume the caller is done. Set N too low and you interrupt people mid-sentence. Set it too high and every response feels sluggish. The catch is there's no single right N - a fast talker and someone slowly reading a credit-card number off a receipt need completely different patience. I moved to adaptive endpointing: track the caller's own pause rhythm and adjust the threshold to them, within a floor and ceiling. The agent waits longer for the person who needs it and pounces on the person who doesn't.

Barge-in has to know the difference between an interruption and "uh-huh." Callers backchannel constantly "yeah," "okay," "mm-hm" while the agent is still talking. Cut the agent off every time and the conversation becomes unusable. So you can't just stop on any detected speech; you filter on minimum duration and word count, ignoring the single-word acknowledgements that are actually a signal to keep going. Getting this right cut my false interruptions roughly in half versus raw voice-activity detection.

Hide latency you can't remove. This is the dirty secret. The single biggest perceived-speed win wasn't making anything faster; it was starting to generate the response before I was certain the caller had finished, on the bet that they had. When the bet is right (it usually is), the reply is already underway the instant they stop. You're spending compute to buy the illusion of speed, and on voice the illusion is the product.

Stabilize your prompt so the model's cache works for you. Time-to-first-token is dominated by how much of your prompt the model has to chew through cold. Keep the big, stable part of the system prompt identity, instructions, and knowledge byte-for-byte identical across calls, and route requests so they land on a warm cache. The first call of the day pays full price; the rest hit the cache and shave real milliseconds off every single turn.

Filter the model's own mistakes before they reach the speaker. Models occasionally leak malformed tool-call tokens into their text output fragments like <functions.hangup. On a screen that's an ugly glitch. Spoken aloud by a text-to-speech engine, it's a baffling noise in the caller's ear. So there's a filter sitting between the model and the voice, watching for the leading <, buffering just enough to catch garbage, and passing clean text straight through. Nobody designs this on day one. You add it after the first time a caller hears your agent say "less-than functions dot hangup."

Where it's going

Voice-native models - ones that ingest and emit audio directly, collapsing the speech-to-text and text-to-speech hops out of the loop are the obvious next step, and they'll reclaim a big chunk of the budget I've been fighting for. But they don't repeal the physics. The phone network still adds its 300ms. Humans still expect an answer in half a second. The floor moves; it doesn't disappear.

So I'll keep saying it to every team about to put their brilliant agent on a phone line: make it dumber and faster before you make it smarter and slower. On voice, the caller forgives a simple answer. They never forgive the silence.