
Turn Detection for Voice Agents: VAD, Endpointing, and Model-Based Detection

Turn detection is how a voice agent decides when a user has finished speaking so it can begin processing and responding. It is the trigger for your entire response pipeline. Get it right and conversations feel natural. Get it wrong and users notice immediately.

Turn detection is one of those things nobody talks about until it breaks. A voice agent that cuts you off mid-sentence feels rude. One that waits forever after you stop speaking feels broken. And yet the engineering behind that tiny window — the moment between when you finish a sentence and when an agent starts responding — determines whether a conversation feels natural or painful.

This guide covers what turn detection is, how different detection strategies work, and how the choice you make directly affects latency and conversation quality. Whether you're building a customer support bot, a voice assistant, or an outbound calling agent, understanding turn detection is foundational. There is no good voice agent without it.

LiveKit supports every major approach to turn detection, from simple VAD-only setups to model-based prediction and realtime model turn-taking. You choose the tradeoff that fits your use case.


What is turn detection?

In a conversation between two people, turn-taking is intuitive. Silence, intonation, context, and body language all signal when someone is done speaking. Wire up a raw STT → LLM → TTS pipeline and none of that comes with it. You have to build it.

Turn detection is the process of identifying the boundary between a user's spoken utterance and the silence that follows. It sounds straightforward. In practice, it's one of the trickier parts of voice agent design. Short pauses, filler words, mid-sentence hesitations, and trailing breath sounds can all look the same to a basic detector.

Getting this right matters because of how voice agent pipelines work. Speech comes in, transcription happens, an LLM generates a response, and a TTS system speaks it back. That pipeline can only start once the agent knows you're done talking. Delay that trigger, and latency goes up. Fire it too early and you cut the user off.

How does turn detection work?

The simplest approach is silence detection. The agent monitors incoming audio, waits for a pause above a set duration threshold, and treats that pause as the signal to respond. It's fast to implement and works well for simple, command-style interactions. But it struggles when users pause mid-thought, speak in noisy environments, or have conversational speech patterns that include natural pauses.
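As a rough sketch of the silence-detection approach, here is a minimal Python detector. It assumes audio arrives as fixed-size frames with a precomputed energy value; the frame length, timeout, and `energy_floor` constant are all illustrative, not tuned production settings.

```python
class SilenceEndpointer:
    """Naive turn detector: end the turn after `threshold_s` of quiet audio.

    Assumes frames arrive at a fixed `frame_s` rate with a precomputed
    energy value; `energy_floor` is a hypothetical calibration constant.
    """

    def __init__(self, threshold_s: float = 0.8, frame_s: float = 0.02,
                 energy_floor: float = 0.01):
        self.threshold_frames = round(threshold_s / frame_s)
        self.energy_floor = energy_floor
        self._quiet_frames = 0
        self._heard_speech = False

    def feed(self, frame_energy: float) -> bool:
        """Return True once the user's turn is considered complete."""
        if frame_energy > self.energy_floor:
            self._heard_speech = True
            self._quiet_frames = 0      # any speech resets the silence clock
        elif self._heard_speech:
            self._quiet_frames += 1
        return self._heard_speech and self._quiet_frames >= self.threshold_frames
```

Note the failure mode the article describes is visible in the code: a user who pauses mid-thought for longer than `threshold_s` gets cut off, and every response waits out the full timeout.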

More advanced approaches use Voice Activity Detection (VAD), which classifies each incoming audio frame as speech or non-speech in real time, rather than simply waiting for silence above a threshold. This makes VAD more robust against background noise and short pauses, but it still operates purely at the audio level. It has no understanding of context or whether a sentence is semantically complete.
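To show how per-frame classification differs from raw silence timing, here is a sketch of the gating logic that typically sits on top of a learned VAD model (Silero VAD is one example). The speech probability per frame would come from the model; the hysteresis thresholds and hangover length below are illustrative placeholders.

```python
class VADGate:
    """Speech/non-speech gate with hysteresis, as a stand-in for the logic
    wrapped around a learned VAD model. `prob` is the model's speech
    probability for one audio frame; thresholds here are illustrative.
    """

    def __init__(self, on_thresh: float = 0.6, off_thresh: float = 0.35,
                 min_silence_frames: int = 15):
        self.on_thresh = on_thresh
        self.off_thresh = off_thresh
        self.min_silence_frames = min_silence_frames
        self.speaking = False
        self._silence_run = 0

    def feed(self, prob: float):
        """Return 'start' or 'end' on a transition, else None."""
        if not self.speaking:
            if prob >= self.on_thresh:
                self.speaking = True
                self._silence_run = 0
                return "start"
        else:
            if prob <= self.off_thresh:
                self._silence_run += 1
                if self._silence_run >= self.min_silence_frames:
                    self.speaking = False
                    return "end"
            else:
                self._silence_run = 0   # brief low-probability dips don't end the turn
        return None
```

The two thresholds plus the hangover counter are what make VAD more robust than a plain energy gate: short dips in speech probability, like breaths or soft consonants, don't flip the state.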

A better method uses the transcription stream itself as a signal, or uses a dedicated model to predict when a user is done speaking based on the meaning of what they said, not just the sound of silence.

What's the difference between VAD and endpointing?

These two terms get used interchangeably. They're not the same thing.

Voice Activity Detection (VAD) operates at the audio level. It classifies incoming audio frames as speech or silence in real time. VAD runs continuously, even before any transcription begins, and is typically the first layer in a turn detection system.

Endpointing operates at the transcription level. Your STT model returns a transcript, and the endpointing logic watches that stream for signals that the utterance is complete. Some STT providers expose an explicit end-of-utterance event. Others leave it to you to detect based on trailing silence or punctuation patterns in the transcript.
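A minimal sketch of transcript-level endpointing might look like the following. The event shape (`text`, `is_final`, `type`) is hypothetical, not any specific provider's schema: when the provider emits an explicit end-of-utterance event, trust it; otherwise fall back to terminal punctuation on a final transcript.

```python
import re

TERMINAL = re.compile(r"[.!?]\s*$")

def is_endpoint(event: dict) -> bool:
    """Decide end-of-utterance from a hypothetical STT event shape:
    {'text': str, 'is_final': bool} or {'type': 'end_of_utterance'}.
    """
    if event.get("type") == "end_of_utterance":   # explicit provider signal
        return True
    # Fallback heuristic: a final transcript ending in terminal punctuation.
    return bool(event.get("is_final")) and bool(TERMINAL.search(event.get("text", "")))
```

Punctuation is a weaker signal than an explicit event, since it depends on the STT model's punctuation quality, but it illustrates why endpointing can fire before a long silence accumulates.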

Model-based turn detection goes further. A classification model reads the partial transcript in real time and predicts whether the user is done speaking based on semantic completeness, not just silence duration. This approach can trigger before the trailing silence even begins.

The practical difference comes down to latency and accuracy. VAD triggers on silence, which means you're always waiting for a pause. Endpointing can be faster because it doesn't require silence, just a strong enough signal from the transcript. Model-based detection can be the most accurate of the three, but it requires more compute and careful tuning.

How does turn detection affect latency?

End-to-end latency in a voice agent is the time between when a user stops speaking and when the agent starts responding. Turn detection is the first domino.

A silence timeout set to 800ms adds 800ms of dead air to every single response before the pipeline even starts. Over a 10-minute conversation, that adds up fast. Users don't frame it as "turn detection latency." They just say the agent feels slow.
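A back-of-envelope latency budget makes the point concrete. All numbers below are hypothetical round figures, not benchmarks of any particular stack:

```python
# Illustrative response-latency budget (all numbers hypothetical, in ms).
budget_ms = {
    "turn_detection_wait": 800,   # silence timeout before the pipeline starts
    "stt_final_transcript": 150,
    "llm_first_token": 400,
    "tts_first_audio": 200,
}
total_ms = sum(budget_ms.values())                              # 1550 ms perceived
# Swapping the 800 ms wait for a hypothetical 250 ms model-based trigger:
faster_ms = total_ms - budget_ms["turn_detection_wait"] + 250   # 1000 ms perceived
```

In this sketch, turn detection is the single largest line item, which is why it's often the cheapest place to claw back perceived latency.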

The goal is to start the STT-to-LLM-to-TTS pipeline as early as possible without cutting the user off. LiveKit supports multiple turn detection strategies so you can tune this for your specific use case, whether that's a customer service agent that needs to be patient with thoughtful answers, or a quick-command assistant where speed is the priority.

See the Turn Detection docs and the Realtime Voice AI guide for specifics on how to configure each approach.

Interruptions and barge-in behavior

Natural conversation isn't just about knowing when to speak. It's knowing when to stop.

Barge-in is the ability for a user to interrupt the agent while it's speaking. Most users expect this to work. If an agent is reading out a long confirmation message and the user says "stop," they don't want to wait for the agent to finish its turn.

Handling barge-in correctly means keeping your turn detection layer active even while the agent is playing back audio. When user speech is detected during output, the agent should cancel the current TTS stream and hand control back to the STT pipeline immediately.
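The cancellation pattern can be sketched with asyncio. This is a simulation, not the LiveKit implementation: `play_tts` stands in for streaming audio playback, and the timed sleep stands in for the VAD firing mid-playback.

```python
import asyncio

async def play_tts(chunks, played):
    """Simulated TTS playback: appends chunks with pacing so it can be
    cancelled mid-stream when a barge-in is detected."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.01)   # stand-in for audio frame pacing

async def agent_turn(chunks, barge_in_after: float) -> list:
    played: list = []
    playback = asyncio.create_task(play_tts(chunks, played))
    await asyncio.sleep(barge_in_after)   # stand-in: VAD detects user speech
    playback.cancel()                     # cancel the TTS stream immediately
    try:
        await playback
    except asyncio.CancelledError:
        pass
    return played   # control now hands back to the STT pipeline
```

The key design point is that playback runs as a cancellable task while the detection layer stays live, so the interrupt path is a single `cancel()` rather than waiting for the turn to finish.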

This is harder than it sounds. Echo cancellation has to be handled on the client side, where the speaker is. Devices with built-in echo cancellation can filter out agent audio before it reaches the pipeline. Devices without it need a fallback like push-to-talk. LiveKit handles barge-in detection as part of the real-time audio pipeline and works with client-side echo cancellation to make this as seamless as possible.

Implementation approaches

There are four main strategies for turn detection in production voice agents. Each has tradeoffs.

VAD-only

Use a VAD model to classify audio frames as speech or non-speech in real time. Once speech activity drops and a silence threshold is exceeded, the turn is considered complete. Works well for clean audio environments and simple command interactions. Adds latency proportional to the silence threshold you configure.

A good fit for phone bots with predictable speech patterns and low-resource environments.

STT endpointing

Use the end-of-utterance signal from your STT provider. Many providers expose this as an explicit event when they determine the user has stopped speaking. Faster than waiting for full silence, but dependent on the quality and tuning of the provider's own model.

A solid default for most production agents using a managed STT service.

Model-based detection

A separate classification model reads the partial transcript in real time and predicts turn completion based on semantic meaning. Can trigger before trailing silence occurs, which is the main latency advantage. Requires tuning to avoid two failure modes: false positives on genuinely incomplete sentences, and overreacting to mid-sentence pauses and natural gaps in speech.

Works well for latency-sensitive applications where shaving off response time is a priority.
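One way model-based detection is commonly combined with silence timing is to let the classifier's completion probability shrink or stretch the silence timeout, rather than replace it outright. The mapping below is a hedged sketch: the linear interpolation and the bounds are illustrative, and `p_complete` would come from a real semantic classifier.

```python
def silence_timeout_ms(p_complete: float,
                       min_ms: int = 150, max_ms: int = 1200) -> int:
    """Map a turn-completion probability from a hypothetical semantic
    classifier onto a dynamic silence timeout: likely-complete utterances
    get a short wait, likely-incomplete ones a long one. The linear
    interpolation and bounds are illustrative, not a tuned policy."""
    p = min(max(p_complete, 0.0), 1.0)
    return round(max_ms - p * (max_ms - min_ms))
```

Under this policy a confident "Yes, cancel my order." endpoints almost immediately, while a trailing "and then I went to" keeps the mic open, addressing both failure modes described above.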

Realtime model turn-taking

Some realtime multimodal models handle turn-taking natively. The model itself manages when to respond without a separate detection layer. This reduces the number of moving parts but also reduces your direct control over the behavior.

A natural fit for applications built entirely on realtime speech-to-speech models.


Frequently asked questions

What is turn detection in a voice agent?

Turn detection is how a voice agent identifies when a user has finished speaking so it can begin processing and responding. It is the trigger point for the entire response pipeline.

What is the best turn detection method for production voice agents?

STT endpointing is the best default for most production agents. Model-based detection offers lower latency for speed-sensitive applications. VAD-only is simple to set up but tends to add more latency.

What is Silero VAD?

Silero VAD is an open-source voice activity detection model that classifies audio frames as speech or silence in real time. It is one of the most widely used VAD models and is supported natively in the LiveKit Agents SDK.

What is barge-in?

Barge-in is the ability for a user to interrupt the agent while it is speaking. A well-built voice agent keeps turn detection active during audio playback and cancels the current response when the user speaks.

How does turn detection affect conversation quality?

Poor turn detection causes two failure modes: agents that interrupt users by triggering too early, and agents that feel slow by waiting too long. Getting this right is one of the highest-leverage improvements you can make to overall conversation quality.

What is the difference between end-of-turn detection and VAD?

VAD detects whether audio contains speech or silence. End-of-turn detection uses that signal, plus potentially transcription data and semantic context, to decide whether the user's full utterance is complete. VAD is an input to turn detection, not the same thing as turn detection.

How does turn detection connect to latency in the STT-to-LLM-to-TTS pipeline?

Turn detection fires the starting gun on the whole pipeline. Every millisecond of unnecessary wait before that trigger adds directly to the end-to-end response time the user experiences. Turn detection is often faster to improve than model latency.


Ready to build your first voice agent? Run the Voice AI Quickstart and see turn detection in action with a working agent in minutes. Visit the Voice Agents overview to understand the full picture before you start building.