A LiveKit voice agent gives you many related but distinct turn-taking controls: turn detection modes, endpointing delays, interruption modes, VAD thresholds, and your choice of models. They all shape how natural a conversation feels, and this guide helps you choose the right configuration for your use case.
This is a configuration guide, not a primer. It assumes you already know what turn detection, VAD, endpointing, and barge-in are. If any of those terms are new, this earlier post and the LiveKit turn detection docs are both good introductions. The focus here is how to configure these features, including new features such as LiveKit's audio-based turn detector, adaptive interruption handling, and STT-native endpointing models.
Pipeline architecture
These features are stages in a single pipeline that user audio flows through, and each is configurable. The pipeline exists to answer three questions:
- Is there speech right now? This is what VAD determines.
- Has the user finished speaking? If so, the agent can safely respond. Turn detection makes this call, and endpointing configures the time to wait before committing the turn.
- Is the user interrupting the agent? When speech arrives while the agent is talking, the agent has to decide whether to stop and yield the floor. This is interruption handling.
VAD, where present, runs first on every frame. Which of the next two questions the pipeline asks depends on whether the agent is already speaking: turn detection applies while the agent is silent, and interruption handling applies while it is talking.
The remaining pieces (noise cancellation, preemptive generation) exist to make those two decisions better, faster, or cleaner. Refer to the Turns overview and Turn-taking tuning documentation for more information.
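The routing between the three questions can be sketched in a few lines. This is a deliberately simplified illustration, not the framework's implementation; the `Question` enum and `next_question` function are hypothetical names introduced only for this example.

```python
from enum import Enum


class Question(Enum):
    """The decision the pipeline has to make for the current audio (hypothetical names)."""

    NONE = "no speech, nothing to decide"
    END_OF_TURN = "has the user finished speaking?"
    INTERRUPTION = "is the user interrupting the agent?"


def next_question(user_speech_detected: bool, agent_speaking: bool) -> Question:
    """Route to turn detection or interruption handling based on agent state."""
    if not user_speech_detected:
        # VAD found no user speech, so neither downstream question applies.
        return Question.NONE
    if agent_speaking:
        # Speech arrived while the agent is talking: possible barge-in.
        return Question.INTERRUPTION
    # Speech arrived while the agent is silent: watch for the end of the turn.
    return Question.END_OF_TURN


if __name__ == "__main__":
    for speech, speaking in [(False, False), (True, False), (True, True)]:
        print(speech, speaking, "->", next_question(speech, speaking).name)
```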
VAD configuration
VAD is the first stage in the pipeline. The framework auto-provisions a VAD for you through the AgentSession, so you don't normally specify one. For advanced cases you can pass a vad argument yourself, but be aware that the new LiveKit turn detector imposes its own timing requirements when active (such as a minimum silence window), which can override the VAD settings you provide.
As you browse the docs and examples, you'll see VAD specified in different ways:
Bundled inference VAD (default, recommended)
AgentSession auto-provisions inference.VAD(model="silero") for you, a Silero model bundled with the Agents SDK. You can override it by passing a vad argument, which accepts these options, though most deployments should leave them unchanged.
Silero VAD plugin
The standalone Silero VAD plugin is the older, separately installed option, and is marked deprecated in the source code. You may see this listed in older examples and it still works, with fully documented configuration, but it should not be used when creating new agents.
ai-coustics VAD adapter
If you already run ai-coustics noise cancellation, its plugin ships a built-in VAD adapter. The ai-coustics VAD adapter is not required in order to use ai-coustics noise cancellation. Note that the LiveKit turn detector is not designed to run with this VAD, though it's fine for other configurations (for example, VAD-only or STT endpointing).
Alternative VADs
You can also bring your own: any object that implements the SDK's VAD interface can be passed as vad, so you can wrap a different VAD model if you have one. To turn VAD off entirely (for example, with a realtime model that does its own detection), pass vad=None in Python or vad: null in Node.js.
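To make the roles of an activation threshold and a silence window concrete, here is a minimal, framework-independent sketch of how a VAD might turn per-frame speech probabilities into speech segments. It does not implement the SDK's VAD interface; `speech_segments`, its parameter names, and the default values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def speech_segments(
    probs: Sequence[float],
    activation_threshold: float = 0.5,  # illustrative value, not an SDK default
    min_silence_frames: int = 3,        # illustrative silence window, in frames
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of speech, with end exclusive.

    A segment opens on the first frame at or above the threshold and closes
    only after `min_silence_frames` consecutive frames below it.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    silence = 0
    for i, p in enumerate(probs):
        if p >= activation_threshold:
            if start is None:
                start = i
            silence = 0  # any speech frame resets the silence counter
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # Close the segment just after the last speech frame.
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        # Audio ended mid-segment; trim any trailing silence.
        segments.append((start, len(probs) - silence))
    return segments


if __name__ == "__main__":
    frames = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.9]
    print(speech_segments(frames))
```

A short dip below the threshold (frame 3 in the example) does not end the segment, which is the behavior a longer silence window buys you at the cost of a slower end-of-speech signal.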
Turn detection configuration
Turn detection decides when the user has finished their turn so the agent knows when to respond. This section covers the modes you can choose and how to configure each one; for the concepts behind it, see the Turns overview and the earlier post.
Choosing a mode
The turn_detection option accepts either a turn-detector model object or one of several string modes. Decide this first, since it determines which other settings apply.
| Mode | How it decides when a turn ends | Use it when |
|---|---|---|
| TurnDetector() (audio model) | Semantic and acoustic prediction on top of VAD | The default, and the best choice for most pipeline agents. |
| "stt" | Your STT provider's native endpointing model | You're using an STT with strong built-in turn detection (Deepgram Flux, AssemblyAI). |
| "realtime_llm" | The realtime model's own server-side detection | You're on a realtime LLM (OpenAI Realtime, Gemini Live) and want its built-in detection. |
| "vad" | Pure speech-start and speech-stop cues | A language the turn detector doesn't cover, or simple command-style interactions that don't need semantic detection. |
| "manual" | You commit turns explicitly in code | Push-to-talk, or any flow where you control turn boundaries directly. |
The rest of this section covers each mode in turn.
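As a rough aid, the table can be read as a decision function. The sketch below is one possible reading, not an official rule: `choose_turn_detection` and its flags are hypothetical, the priority order is an assumption, and the string "turn_detector" is only a placeholder for passing a TurnDetector() object.

```python
def choose_turn_detection(
    *,
    push_to_talk: bool = False,
    realtime_llm: bool = False,
    stt_has_native_endpointing: bool = False,
    language_supported: bool = True,
) -> str:
    """Suggest a turn detection mode from a few yes/no facts about the agent."""
    if push_to_talk:
        return "manual"          # you control turn boundaries in code
    if realtime_llm:
        return "realtime_llm"    # use the model's built-in detection
    if stt_has_native_endpointing:
        return "stt"             # e.g. an STT with strong built-in endpointing
    if not language_supported:
        return "vad"             # turn detector doesn't cover the language
    return "turn_detector"       # placeholder for the default TurnDetector()


if __name__ == "__main__":
    print(choose_turn_detection())
    print(choose_turn_detection(language_supported=False))
    print(choose_turn_detection(push_to_talk=True, realtime_llm=True))
```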
LiveKit turn detector
This is the default turn detector, and is recommended for most developers.
The TurnDetector is an audio model that reasons over the user's speech directly, combining the words with acoustic cues like intonation and rhythm. Because it doesn't wait for a transcript, it decides faster and avoids the mid-turn-pause mistakes of transcript-only approaches.
For how the model works and how it was evaluated, see the turn detector docs and the deep-dive blog post. (An older text-based detector also exists but is deprecated and slated for removal in Agents SDK 2.0, so use the audio model for anything new.)
You configure it through two surfaces: the detector model itself, and the endpointing timing layer that controls how long to wait after detection before committing the turn. Most agents leave these at their defaults. For more detail, see the EndpointingOptions reference.
| Option | Description |
|---|---|
| endpointing.min_delay | Minimum wait after detected silence before the turn closes. |
| endpointing.max_delay | Maximum time the agent waits before forcing the turn closed, so it never hangs forever. |
| endpointing.mode (default "fixed") | "fixed" always uses the configured delays; "dynamic" adapts within the min_delay/max_delay range based on the user's pause patterns, so fast talkers get snappier responses and slower talkers get more patience. |
| Per-language thresholds | The detector supports 14 languages and picks a per-language confidence threshold from the STT-reported language. Override it per language with a custom unlikely_threshold. |
| Model version (v1 / v1-mini) | The full model on LiveKit Inference versus a lighter model that runs locally. The framework selects one automatically by environment: v1 on LiveKit Cloud, v1-mini for self-hosted agents. |
A typical configuration looks like this:
Python

```python
from livekit.agents import AgentSession, TurnHandlingOptions, inference

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=inference.TurnDetector(),  # the default if omitted
        endpointing={"mode": "dynamic", "min_delay": 0.3, "max_delay": 2.5},  # secs
    ),
    # ... stt, tts, llm, etc.
)
```

Node.js

```javascript
import { inference, voice } from '@livekit/agents';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: new inference.TurnDetector(), // the default if omitted
    endpointing: { mode: 'dynamic', minDelay: 300, maxDelay: 2500 }, // ms
  },
  // ... stt, tts, llm, etc.
});
```
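To build intuition for the difference between "fixed" and "dynamic" endpointing, the sketch below models a delay that adapts to observed pauses but stays inside the min_delay/max_delay range. The adaptation rule here (an exponential moving average of pause lengths) is purely an assumption for illustration; the framework's actual algorithm is not described in this guide and may differ. `EndpointingSketch` and `smoothing` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointingSketch:
    """Toy model of an endpointing delay, in seconds (not the SDK implementation)."""

    min_delay: float = 0.3
    max_delay: float = 2.5
    mode: str = "fixed"       # "fixed" or "dynamic"
    smoothing: float = 0.5    # hypothetical weight given to the newest pause
    _avg_pause: Optional[float] = None

    def observe_pause(self, pause: float) -> None:
        """Fold one observed mid-speech pause into the running average."""
        if self._avg_pause is None:
            self._avg_pause = pause
        else:
            self._avg_pause = (
                self.smoothing * pause + (1 - self.smoothing) * self._avg_pause
            )

    def current_delay(self) -> float:
        """Delay to wait before committing the turn."""
        if self.mode == "fixed" or self._avg_pause is None:
            return self.min_delay
        # Dynamic mode: follow the user's pauses, clamped to the configured range.
        return min(max(self._avg_pause, self.min_delay), self.max_delay)


if __name__ == "__main__":
    ep = EndpointingSketch(mode="dynamic")
    for pause in (1.0, 2.0, 10.0):
        ep.observe_pause(pause)
        print(f"after a {pause:.1f}s pause: wait {ep.current_delay():.2f}s")
```

Whatever the real adaptation rule is, the clamp is the important part: the delay never drops below min_delay and never exceeds max_delay.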
STT endpointing (turn_detection="stt")
Setting turn_detection="stt" hands the end-of-turn decision to your STT provider's own endpointing model instead of running the LiveKit turn detector. Use it when your STT has strong built-in turn detection, such as Deepgram Flux or AssemblyAI Universal-3 Pro. Even in this mode you still need a VAD to handle interruptions; if you don't provide one, the default is auto-provisioned.
It's worth understanding how LiveKit's endpointing options interact with STT mode. endpointing.min_delay acts as a floor measured from when the user last stopped speaking, so it can add latency on top of the provider's own decision; endpointing.max_delay has no effect. In practice, set min_delay to 0, leave endpointing in its default fixed mode (avoid dynamic, which can let the floor drift above 0 over a session), and let the provider's own settings control the timing, such as AssemblyAI's min_turn_silence or Deepgram Flux's eot_timeout_ms.
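One way to picture the min_delay floor in STT mode is as a max() over two timestamps. This is a simplified reading of the behavior described above, not the framework's code; `commit_time` and its argument names are hypothetical, and the timestamps are made-up values in seconds.

```python
def commit_time(
    provider_eot_at: float,
    speech_ended_at: float,
    min_delay: float = 0.0,
) -> float:
    """Earliest moment the turn can be committed in STT mode (seconds).

    The provider's end-of-turn signal is honored, but never before
    `min_delay` has elapsed since the user last stopped speaking.
    """
    return max(provider_eot_at, speech_ended_at + min_delay)


if __name__ == "__main__":
    # User stops at t=10.0s; the provider signals end of turn at t=10.1s.
    for floor in (0.0, 0.5):
        t = commit_time(provider_eot_at=10.1, speech_ended_at=10.0, min_delay=floor)
        print(f"min_delay={floor}s -> commit at t={t}s")
```

With min_delay at 0 the provider's decision passes straight through, which is why that setting is recommended in this mode.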
The configuration options for STT mode:
| Option | Description |
|---|---|
| turn_detection="stt" | Hands end-of-turn detection to the STT provider instead of the LiveKit turn detector. |
| endpointing.min_delay | Set to 0 in STT mode (see above). |
| STT provider parameters | The provider's own end-of-turn timing. With LiveKit Inference, pass them via extra_kwargs (Python) or modelOptions (Node.js); with the standalone STT plugin classes, pass them as named constructor arguments. See the individual model docs (for example, Deepgram Flux or AssemblyAI) for the available parameters and provider-specific recommendations. |
Python

```python
from livekit.agents import AgentSession, TurnHandlingOptions, inference

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection="stt",
        endpointing={"min_delay": 0},
    ),
    stt=inference.STT(
        model="assemblyai/u3-rt-pro",
        extra_kwargs={"min_turn_silence": 100, "max_turn_silence": 1000},
    ),
    # ... vad (auto-provisioned), llm, tts, etc.
)
```

Node.js

```javascript
import { inference, voice } from '@livekit/agents';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: 'stt',
    endpointing: { minDelay: 0 },
  },
  stt: new inference.STT({
    model: 'assemblyai/u3-rt-pro',
    modelOptions: { min_turn_silence: 100, max_turn_silence: 1000 },
  }),
  // ... vad (auto-provisioned), llm, tts, etc.
});
```
Realtime LLM (turn_detection="realtime_llm")
Realtime models consume and produce speech directly, and most of them do their own server-side turn detection as part of that. You can either use the model's built-in detection or hand turn detection to LiveKit's turn detector instead. LiveKit recommends using the model's built-in detection where possible.
Use the model's built-in detection
Turn detection settings should be applied to the model itself, not on the AgentSession. There's no standard set of realtime configuration parameters, and capabilities vary from provider to provider, so consult your provider's documentation for the specifics (for example, Gemini Live or OpenAI Realtime).
| Option | Description |
|---|---|
| turn_detection="realtime_llm" | Hands end-of-turn detection to the realtime provider (the default for realtime models). |
| Realtime model turn detection parameters | Each provider has its own turn detection configuration. |
Note: when the model handles detection, most of LiveKit's interruption options are ignored, so they should not be specified.
The example below configures the OpenAI Realtime API's server VAD:
Python

```python
from livekit.agents import AgentSession, TurnHandlingOptions
from livekit.plugins.openai import realtime
from openai.types.beta.realtime.session import TurnDetection

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        # the default for realtime models; shown for clarity
        turn_detection="realtime_llm",
    ),
    llm=realtime.RealtimeModel(
        turn_detection=TurnDetection(
            type="server_vad",
            threshold=0.7,
            prefix_padding_ms=300,
            silence_duration_ms=400,
        ),
    ),
    # ... tts, etc.
)
```

Node.js

```javascript
import { voice } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';

const session = new voice.AgentSession({
  turnHandling: {
    // the default for realtime models; shown for clarity
    turnDetection: 'realtime_llm',
  },
  llm: new openai.realtime.RealtimeModel({
    turnDetection: {
      type: 'server_vad',
      threshold: 0.7,
      prefix_padding_ms: 300,
      silence_duration_ms: 400,
    },
  }),
  // ... tts, etc.
});
```
Use the LiveKit turn detector
If your realtime model lets you disable its internal turn detection, you can use LiveKit's audio-based turn detector instead (described earlier in this guide). As noted in the turn detector docs, it works on audio directly, so it does not need a separate STT, unlike the older text-based detector. The mechanism for disabling the model's own detection is provider-specific; for the OpenAI Realtime API, for example, you set turn_detection=None.
Python

```python
from livekit.agents import AgentSession, TurnHandlingOptions, inference
from livekit.plugins import openai

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=inference.TurnDetector(),
    ),
    llm=openai.realtime.RealtimeModel(
        voice="alloy",
        turn_detection=None,  # hand turn detection to the LiveKit audio model
    ),
)
```

Node.js

```javascript
import { inference, voice } from '@livekit/agents';
import * as openai from '@livekit/agents-plugin-openai';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: new inference.TurnDetector(),
  },
  llm: new openai.realtime.RealtimeModel({
    voice: 'alloy',
    turnDetection: null, // hand turn detection to the LiveKit audio model
  }),
});
```
VAD only (turn_detection="vad")
Use this mode when your language isn't among the LiveKit turn detector's supported languages, or for simple, command-style interactions where short, predictable utterances do not need semantic end-of-turn detection. Outside those cases it's usually not the right choice: VAD-only detection relies on silence alone to decide the user has finished, so a mid-sentence pause can end the turn early.
To understand more about which VAD is used, see the earlier section on VAD configuration.
Python

```python
from livekit.agents import AgentSession, TurnHandlingOptions

session = AgentSession(
    turn_handling=TurnHandlingOptions(turn_detection="vad"),
    # VAD is auto-provisioned; ... stt, tts, llm, etc.
)
```

Node.js

```javascript
import { voice } from '@livekit/agents';

const session = new voice.AgentSession({
  turnHandling: { turnDetection: 'vad' },
  // VAD is auto-provisioned; ... stt, tts, llm, etc.
});
```
VAD configuration options
In this mode, endpointing.min_delay acts as a floor on top of the VAD's own silence window, so the effective wait after the user stops is max(VAD silence, min_delay). The other endpointing options have no effect.
The VAD-specific configuration for the bundled inference VAD is provided in the API reference.
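The max(VAD silence, min_delay) rule is easy to check with a few numbers. The helper below is only a worked illustration; `effective_wait` is a hypothetical name and the values are made up rather than SDK defaults.

```python
def effective_wait(vad_silence: float, min_delay: float) -> float:
    """Seconds of silence before the turn is committed in VAD-only mode."""
    return max(vad_silence, min_delay)


if __name__ == "__main__":
    # Hypothetical 0.55 s VAD silence window combined with different floors.
    for floor in (0.0, 0.3, 1.0):
        print(f"min_delay={floor:.1f}s -> wait {effective_wait(0.55, floor):.2f}s")
```

A min_delay shorter than the VAD's silence window changes nothing; only a longer one extends the wait.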
Manual (turn_detection="manual")
Use manual turn detection only when you need full control over when turns start and end, for example a push-to-talk workflow. You're responsible for committing and clearing turns and for interrupting the agent, so it requires extra code on your side. The manual turn control docs include a complete example to build from.
Interruption configuration
Interruption handling decides what the agent does when the user starts speaking over it. Handling this well makes an agent feel natural, so it's worth configuring carefully.
Enabling interruption handling
By default, the agent can be interrupted when the user speaks, but you can disable interruptions entirely if needed. When enabled, there are two modes: "adaptive" and "vad".
| Option | Description |
|---|---|
| enabled (default True) | Whether the agent can be interrupted at all. Set to False to make it uninterruptible. |
| mode: "adaptive" | Adaptive mode uses an audio model that distinguishes real barge-ins from backchannels using acoustic cues. Recommended, and the default where available. |
| mode: "vad" | Any detected speech (VAD) interrupts the agent. Simple and always available, but feels less natural than adaptive mode. |
Caveats:
- If you are using a realtime model running its own server-side detection, you should not disable interruption handling.
- "adaptive" is the default only on LiveKit Cloud, and only when other conditions are met, such as a pipeline (non-realtime) LLM and an STT that supports aligned transcripts. Otherwise (for example, when self-hosting) you should use "vad".
Configuring interruption handling
Interruptible agents have a few more options to fine-tune behavior. A false interruption is when VAD hears speech and stops the agent but no transcript materializes (a cough, or background noise); by default the agent waits out a silence window and then resumes where it left off.
For further detail on these configuration options, see the documentation:
| Option | Applies to | Description |
|---|---|---|
| min_duration | VAD | Minimum speech duration to register as an interruption, filtering brief sounds. |
| false_interruption_timeout | VAD | Silence to wait after an interruption before classifying it as false. |
| resume_false_interruption (default True) | VAD | Whether to resume the agent's speech after a false interruption. |
| backchannel_boundary | Adaptive only | A cooldown window at each turn edge so genuine corrections and late-arriving transcripts aren't discarded as backchannels. See turn boundary cooldown. |
| min_words (requires STT) | VAD and Adaptive | Minimum word count before interrupting. Set above 0 to require actual content. |
| discard_audio_if_uninterruptible (default True) | Uninterruptible agents | Drop buffered user audio while the agent is speaking and can't be interrupted. |
Caveat: adaptive mode is designed to avoid false interruptions in the first place, but when it misclassifies a sound as a barge-in, some of the options marked as VAD above still apply.
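The interplay between min_duration, min_words, false_interruption_timeout, and resume_false_interruption can be sketched as a small decision function. This is a simplified model of the pause-and-resume behavior described in this section, not the SDK's implementation; `InterruptionSketch`, the outcome labels, and the numeric defaults are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class InterruptionSketch:
    """Toy model of VAD-mode interruption handling (not the SDK implementation)."""

    min_duration: float = 0.5                # seconds of speech needed to register (illustrative)
    min_words: int = 0                       # words required before yielding (0 = no word check)
    false_interruption_timeout: float = 2.0  # seconds of silence to wait (illustrative)
    resume_false_interruption: bool = True

    def outcome(
        self,
        speech_duration: float,
        transcript_words: int,
        silence_waited: float,
    ) -> str:
        """Classify what the agent does after user audio overlaps its speech."""
        if speech_duration < self.min_duration:
            return "keep_speaking"  # too brief to count as an interruption
        if transcript_words > 0 and transcript_words >= self.min_words:
            return "yield"          # a transcript materialized: real interruption
        if silence_waited < self.false_interruption_timeout:
            return "paused"         # still waiting to see whether words arrive
        # No transcript within the timeout: treat it as a false interruption.
        return "resume" if self.resume_false_interruption else "stay_silent"


if __name__ == "__main__":
    handler = InterruptionSketch()
    print(handler.outcome(0.2, 0, 0.0))  # a short click or cough
    print(handler.outcome(0.8, 4, 0.0))  # the user says a few words
    print(handler.outcome(0.8, 0, 2.5))  # noise, then silence past the timeout
```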
A typical adaptive configuration looks like this:
Python

```python
from livekit.agents import AgentSession, TurnHandlingOptions, inference

session = AgentSession(
    turn_handling=TurnHandlingOptions(
        turn_detection=inference.TurnDetector(),
        interruption={
            "enabled": True,  # the default; can be omitted
            "mode": "adaptive",
        },
    ),
    # ... stt, llm, tts, vad
)
```

Node.js

```javascript
import { inference, voice } from '@livekit/agents';

const session = new voice.AgentSession({
  turnHandling: {
    turnDetection: new inference.TurnDetector(),
    interruption: {
      enabled: true, // the default; can be omitted
      mode: 'adaptive',
    },
  },
  // ... stt, llm, tts, vad
});
```
User turn limits
Interruptions are user-initiated; user turn limits are the mirror image, letting the agent cut in on the user. User turn limits are disabled by default, but you can configure them to stop a user from monologuing and monopolizing the conversation.
| Option | Description |
|---|---|
| user_turn_limit | Specify this key to enable user turn limits. Counts accumulate across consecutive user turns and reset only when the agent starts speaking. When a limit is crossed, the framework calls the agent's customizable on_user_turn_exceeded hook. |
| max_words (default off) | Maximum accumulated word count before the agent cuts in. |
| max_duration (default off) | Maximum accumulated speaking time before the agent cuts in. Python uses seconds, Node.js uses milliseconds. |
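Python

```python
from livekit.agents import AgentSession

session = AgentSession(
    turn_handling={"user_turn_limit": {"max_words": 100, "max_duration": 30.0}},  # secs
)
```

Node.js

```javascript
import { voice } from '@livekit/agents';

const session = new voice.AgentSession({
  turnHandling: { userTurnLimit: { maxWords: 100, maxDuration: 30_000 } }, // ms
});
```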
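The accumulate-and-reset behavior of user turn limits can be modeled with a small counter. The sketch below is a standalone illustration of that bookkeeping, not the framework's code: `UserTurnLimitSketch` and its method names are hypothetical, and treating a limit as crossed only when the total strictly exceeds it is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class UserTurnLimitSketch:
    """Toy accumulator for user turn limits (not the SDK implementation)."""

    on_exceeded: Callable[[int, float], None]  # stand-in for the agent's hook
    max_words: Optional[int] = None            # None means the limit is off
    max_duration: Optional[float] = None       # seconds; None means off
    words: int = 0
    duration: float = 0.0

    def on_user_turn(self, text: str, seconds: float) -> None:
        """Add one committed user turn to the running totals."""
        self.words += len(text.split())
        self.duration += seconds
        over_words = self.max_words is not None and self.words > self.max_words
        over_time = self.max_duration is not None and self.duration > self.max_duration
        if over_words or over_time:
            self.on_exceeded(self.words, self.duration)

    def on_agent_started_speaking(self) -> None:
        """Totals reset only when the agent takes the floor."""
        self.words = 0
        self.duration = 0.0


if __name__ == "__main__":
    limit = UserTurnLimitSketch(
        on_exceeded=lambda w, d: print(f"limit crossed: {w} words, {d:.1f}s"),
        max_words=5,
    )
    limit.on_user_turn("one two three", 1.0)   # 3 words so far, under the limit
    limit.on_user_turn("four five six", 1.0)   # 6 words accumulated, hook fires
    limit.on_agent_started_speaking()          # totals reset
```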
Other related configuration
The following settings are related to turn detection and interruption handling but don't control them directly. For more on any of these, follow the associated docs link.
| Setting | Description | Docs |
|---|---|---|
| Noise cancellation | Cleaning the input audio before VAD and STT improves both turn detection and transcription quality; see also our noise cancellation blog. | More information |
| Preemptive generation | The LLM (and optionally TTS) starts generating before the turn is confirmed to shave latency, cancelling if the turn continues. On by default. | More information |
| Aligned transcripts | Per-word timestamps from the STT, required for adaptive interruption handling. All LiveKit Inference STT models provide them. | More information |
| STT language | The language the STT reports selects the LiveKit turn detector's per-language threshold; pin it to force a specific language. | More information |
| Diarization | Speaker labels for multi-party rooms, available on Deepgram and AssemblyAI and used with MultiSpeakerAdapter. | More information |
| VAD activation_threshold | When using STT endpointing, align it with the STT's own VAD threshold so barge-in and turn-end behavior stay consistent. | More information |
Common issues and resolutions
If your issue isn't covered by the Turn-taking tuning troubleshooting table in our docs, some additional issues and resolutions are given below.
With turn_detection set to stt, the agent pauses several seconds before responding, even after a simple 'hello'
If you are using turn_detection="stt", be sure to set endpointing.min_delay to 0. As detailed earlier in this article, a non-zero min_delay acts as a floor on top of the STT provider's own end-of-turn decision and adds latency. Alternatively, switch to the LiveKit turn detector to see if you get better results.
Can I use adaptive interruption handling on self-deployed (non-Cloud) agents?
No. Adaptive interruption handling runs on LiveKit Cloud's inference infrastructure: it's available to agents deployed to LiveKit Cloud, but it isn't available on self-hosted agents. The framework will automatically fall back to "vad" interruption handling, but be sure to set interruption.mode="vad" to avoid warnings in your logs.
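The fallback can be summarized as a small resolution function. This is a sketch of the rules as stated in this guide, not the framework's logic: `resolve_interruption_mode` and its flags are hypothetical, and the real availability check may involve conditions beyond the three modeled here.

```python
from typing import Optional


def resolve_interruption_mode(
    requested: Optional[str] = None,
    *,
    on_livekit_cloud: bool,
    realtime_llm: bool,
    stt_aligned_transcripts: bool,
) -> str:
    """Return the interruption mode that would effectively apply."""
    adaptive_available = (
        on_livekit_cloud and not realtime_llm and stt_aligned_transcripts
    )
    if requested == "vad":
        return "vad"       # an explicit choice always wins
    if adaptive_available:
        return "adaptive"  # the default where available
    return "vad"           # fallback, e.g. for self-hosted agents


if __name__ == "__main__":
    print(resolve_interruption_mode(
        on_livekit_cloud=True, realtime_llm=False, stt_aligned_transcripts=True))
    print(resolve_interruption_mode(
        "adaptive", on_livekit_cloud=False, realtime_llm=False, stt_aligned_transcripts=True))
```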
Can I use the v1 turn detector model on self-deployed (non-Cloud) agents?
No. The v1 (full) model is only available for agents deployed to LiveKit Cloud (it's built into the hosting pricing model, so you won't see it billed as separate usage). Self-deployed, non-Cloud agents instead run v1-mini, a lightweight model that executes locally in a shared CPU process. The framework selects the version automatically by environment.
Does LiveKit's audio-based turn detector work on 8 kHz telephony audio?
Yes. The audio-based turn detector resamples input to 16 kHz internally, so narrowband PSTN audio is supported, though it's worth validating on your own call traffic. Noise cancellation runs before the detector, so if you use Krisp, be sure to use its telephony-tuned model for phone calls.
The agent repeats the same phrase twice within a single turn
On noisy lines this is usually false-interruption pause-and-resume, not preemptive generation. VAD detects speech mid-response (a cough, line noise) but no transcript materializes, so the framework pauses, waits out false_interruption_timeout, then resumes, replaying part of the utterance. Switch interruption.mode to "adaptive" so stray noise doesn't trigger the pause, raise interruption.min_duration / min_words to filter brief sounds, and add telephony-tuned noise cancellation.
On a realtime model, my turn detection and interruption settings seem to be ignored
When a realtime model runs its own server-side turn detection, most of LiveKit's InterruptionOptions are ignored; tune detection on the model instead, and don't set interruption.enabled=False while server-side detection is active. Alternatively, set the model's turn_detection=None and use LiveKit's turn detector, so that the LiveKit settings apply.
A turn-taking problem is hard to reproduce by feel. How do I debug it?
Replay the session in agent observability and inspect the trace. It shows exactly when turns were committed and interruptions fired, which is usually faster than examining logs.