Why Latency Is the Only Metric That Matters in Voice AI

Human conversation has a rhythm: the gap between turns is typically 200–400ms. When a voice AI breaks that rhythm, even slightly, the interaction feels off. Users talk over the agent, repeat themselves, or simply hang up.
Latency isn't a performance metric. It's a UX metric.
The three layers that add up
1. Speech-to-text
Transcription latency is the time from when a user stops speaking to when the model sees text. Streaming ASR (real-time partial transcripts) is the difference between waiting for a full sentence and responding mid-thought. We run streaming transcription by default.
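A minimal sketch of what "responding mid-thought" means in code. The function names and the simulated feed are illustrative, not a real ASR API: a real engine pushes partial transcripts over a websocket, and the consumer acts on a partial as soon as it looks substantial rather than waiting for the final transcript.

```python
from typing import Iterator, Tuple

def fake_streaming_asr(words: list) -> Iterator[Tuple[str, bool]]:
    """Simulate a streaming ASR feed yielding (partial_transcript, is_final)
    pairs. A real engine would push these over a websocket as the user speaks."""
    partial = []
    for i, word in enumerate(words):
        partial.append(word)
        yield " ".join(partial), i == len(words) - 1

def first_actionable_partial(stream: Iterator[Tuple[str, bool]],
                             min_words: int = 3) -> str:
    """Hand text downstream as soon as a partial looks substantial,
    instead of waiting for the final transcript."""
    for text, is_final in stream:
        if is_final or len(text.split()) >= min_words:
            return text
    return ""
```

With batch transcription, the pipeline would see nothing until the full utterance ended; here it can start working three words in.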
2. LLM inference
The first token matters more than the last. A model that starts generating a response in 80ms but takes 2 seconds to finish feels faster than one that starts at 400ms and finishes in 1 second. Time-to-first-token is what drives perceived responsiveness.
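Time-to-first-token is easy to measure: clock how long a streaming response takes to yield its first item. The stand-in stream below is hypothetical (real streaming LLM APIs differ), but the measurement pattern is the same.

```python
import time
from typing import Iterable, Iterator

def fake_llm_stream(n_tokens: int = 10, ttft_s: float = 0.05,
                    per_token_s: float = 0.0) -> Iterator[str]:
    """Stand-in for a streaming LLM API: pause before the first token
    (the TTFT), then emit the remaining tokens."""
    time.sleep(ttft_s)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)
        yield f"tok{i}"

def time_to_first_token(stream: Iterable[str]) -> float:
    """Elapsed seconds until the stream yields its first item."""
    start = time.perf_counter()
    next(iter(stream))
    return time.perf_counter() - start
```

Tracking this number, rather than total generation time, is what keeps the perceived-responsiveness comparison above honest.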
3. Text-to-speech
TTS has the same property — first audio chunk latency matters more than total generation time. We buffer aggressively and start playing audio the moment the first sentence is ready.
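One way to implement "start playing the moment the first sentence is ready" is to regroup the LLM's token stream at sentence boundaries before handing it to TTS. This is a sketch under that assumption, using simple punctuation splitting; production systems typically do smarter segmentation.

```python
import re
from typing import Iterable, Iterator

def sentence_chunks(token_stream: Iterable[str]) -> Iterator[str]:
    """Group streamed LLM tokens into complete sentences so TTS can
    synthesize, and the player can play, the first sentence while
    later ones are still being generated."""
    buf = ""
    for tok in token_stream:
        buf += tok
        # Flush on sentence-ending punctuation followed by whitespace.
        while (m := re.search(r"[.!?]\s+", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()
```

Each yielded sentence can be sent to TTS immediately, so first-audio latency is bounded by the first sentence, not the whole reply.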
Where most latency comes from
In practice, 60–70% of end-to-end latency is network round trips. Running inference close to users — or on-device for common phrases — is a larger lever than model optimization.
The rest is coordination overhead: buffering, chunk assembly, audio playback scheduling.
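The split described above can be made concrete with a simple latency budget. The numbers below are illustrative only, chosen to mirror the rough 60–70% network share; they are not measurements.

```python
def latency_budget(components: dict) -> dict:
    """Sum per-stage latencies and report the network's share of the total."""
    total = sum(components.values())
    return {"total_ms": total,
            "network_share": components["network_rtt"] / total}

# Illustrative numbers only, not measurements.
budget = latency_budget({
    "asr_first_partial": 90,   # streaming ASR partial latency
    "llm_ttft": 100,           # time-to-first-token
    "tts_first_chunk": 70,     # first audio chunk
    "coordination": 50,        # buffering, chunk assembly, playback scheduling
    "network_rtt": 550,        # round trips between all the hops
})
```

On a budget like this, shaving the network line (edge inference, on-device phrases) moves the total far more than trimming any single model stage.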
What we're targeting
Our internal target is under 600ms end-to-end at p50 and under 900ms at p95. We're not there yet for every language: Indian language models in particular are heavier, and we're working on dedicated endpoints to close that gap.
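Checking those targets just means computing percentiles over measured end-to-end latencies. A nearest-rank implementation is enough for a dashboard; the sample values below are invented to show a run that meets the p50 target but misses p95.

```python
def percentile(samples, p: float) -> float:
    """Nearest-rank percentile over a list of latency samples."""
    xs = sorted(samples)
    idx = round(p / 100 * (len(xs) - 1))
    return xs[idx]

# Invented end-to-end latencies (ms) for ten calls.
latencies_ms = [480, 520, 540, 560, 590, 610, 640, 700, 850, 950]
p50 = percentile(latencies_ms, 50)  # 590 ms: under the 600ms target
p95 = percentile(latencies_ms, 95)  # 950 ms: over the 900ms target
```

The asymmetry is typical: medians look fine while tail calls, often the heavier language models, blow the p95 budget.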
The goal isn't to be faster than a human. It's to be fast enough that users stop noticing the AI.


