Understand and Improve Agent Latency
Details the potential sources of latency in your voice AI solution, and which you should address as a priority
One of the most frequent questions we receive is, “How can I improve the latency of my voice agent?” Unfortunately, there is no simple or short answer.
First, not everyone means the same thing when asking this question. Some people measure the time until the agent’s first response, including the time required for the agent to initialize and join the room. Others measure latency only at specific points in a pre-scripted conversation, missing the nuances of real user interactions.
Second, there are many variables to consider — from network latency, to model choice, to geographical location — all of which affect overall latency.
Third, latency is often (but not always) a trade-off. Any improvement in latency typically comes at the expense of another feature such as reasoning or accuracy.
The purpose of this article is not to discuss every potential source of latency in detail, but to outline the different sources and point you to resources for further information.
Latency improvement playbook
TL;DR The most impactful things you can do to improve your latency are:
- Monitor Performance. Use Agent Observability to identify which stage of the pipeline (STT, LLM, TTS, tools, network) is the dominant contributor. More info.
- Agent-model co-location. Host your agent in the same region as your models. If you use SIP, ensure your trunk is also geographically close. More info.
- Evaluate faster models. Try a smaller or more recently released model in the stage that dominates. Even swapping one stage can meaningfully reduce total latency. More info.
- Practice tooling hygiene. Limit max_tool_steps, consolidate external API calls, and use a "thinking" sound so users aren't waiting in silence. More info.
Architectural Considerations
How you architect your solution can have a large impact on agent latency.
Geographical location of agents relative to models
The most impactful step you can take is to host your agents in a region close to your AI model stack, including STT, LLM, and TTS.
For agent hosting, LiveKit Cloud supports multiple regions. If your required region is not supported, you have the flexibility to self-host agents wherever needed.
For model hosting, you can access models through LiveKit Inference or LiveKit Plugins.
LiveKit Inference is the recommended way to interface with models, and provides certain regionally hosted deployments (for example, US-based deployments, or Deepgram STT hosted in Mumbai). Check the documentation for your chosen model to identify any region-specific routing options. If you have regional requirements that cannot be accommodated by LiveKit Inference, LiveKit Plugins allow integration with third-party providers, many of whom offer regional hosting or inference residency.
Level of impact: Very high
More information: The docs detail supported regions for agents hosted in LiveKit Cloud and explain how to deploy multiple agents across different regions within the same project. Instructions are available for self-hosting agents in your preferred cloud provider or on your own hardware. Model hosting is discussed further in the regional deployments checklist. For India-specific guidance, see this guide.
Geographical location of users relative to the LiveKit room
In most cases, the distance between clients and the LiveKit server contributes only a small portion of total agent latency. A longer client ↔ LiveKit path increases round-trip time (RTT), but LiveKit uses WebRTC for media transport, which is optimized for real-time audio and video and performs better than generic TCP connections.
If users are very far from the room region or on poor last-mile networks, you may see higher RTT on this leg. For large, global deployments, a multi-region architecture may be appropriate to place pipeline resources closer to users.
Level of impact: Low to Medium, depending on your solution.
More information: See our transport docs for details on the WebRTC layer. For large global solutions, refer to the multi-region agent deployment docs.
Pipeline vs. Realtime
Realtime (“speech-to-speech”) models handle speech and reasoning in a single integrated path. This contrasts with the pipeline model, which processes each stage consecutively (VAD / Turn Detection → STT → LLM → TTS). Which type of model you choose will depend on your requirements, with realtime models capable of creating more realistic-sounding agents at the cost of less flexibility and less reliable tool calling.
Realtime models can also offer lower latency because they require fewer round trips and model calls. However, they are not guaranteed to be faster in every case; a well-tuned pipeline can be highly competitive.
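To make the distinction concrete, here is a sketch of the two setups using the Python Agents SDK. The specific plugins and model names are illustrative only; substitute your own and verify against your SDK version:

```python
from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai, silero

# Pipeline: each stage is a separate, consecutive model call
# (VAD / turn detection -> STT -> LLM -> TTS).
pipeline_session = AgentSession(
    vad=silero.VAD.load(),
    stt=deepgram.STT(),
    llm=openai.LLM(model="gpt-4o-mini"),
    tts=cartesia.TTS(),
)

# Realtime: a single integrated speech-to-speech model handles the turn.
realtime_session = AgentSession(
    llm=openai.realtime.RealtimeModel(),
)
```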
Level of impact: None to Medium, depending on your pipeline.
More information: See our realtime model docs.
Half-cascade architecture
You can use a realtime model with a separate TTS configuration, known as a “half-cascade” architecture. This provides realtime speech comprehension while maintaining control over speech output. The trade-off is that some latency advantages of realtime are lost, as the model now produces text output that must be passed into a separate TTS service.
Level of impact: Low to Medium, depending on the model
More information: See our documentation on using a separate TTS with a realtime model
Virtual avatar
Adding a virtual avatar introduces extra latency on top of the voice pipeline, because the avatar visuals must stay in sync with the room audio. The avatar worker handles the audio in a separate process, renders the video (including lip-sync and facial expressions), and publishes both audio and video tracks to the LiveKit room. The additional latency therefore comes from transferring audio to the avatar and from rendering each video frame in sync with the audio stream. To keep this overhead as small as possible, the avatar joins the LiveKit room directly as a secondary participant and publishes media there, avoiding extra hops back to the agent. Your choice of avatar provider (model) also affects latency, so experiment with different providers as part of your development.
In exchange, of course, you get a far more engaging experience for the end user.
Level of impact: Low to Medium, depending on the avatar.
More information: We have a number of avatar models in our docs, though latency metrics and comparisons between them are not available.
Model Considerations (LLM, STT, TTS, Realtime)
Model choice
Model choice, whether for a pipeline architecture (STT, LLM, TTS) or a realtime architecture, directly affects overall latency.
Larger or more capable models usually take longer per step than smaller or distilled ones, but provide better capabilities. Finding the right balance between latency and capability requires experimentation. As new models are released, re-evaluating your model choices is one of the most effective ways to stay competitive on latency.
- For STT, your choice of provider and model affects transcription speed and final transcript delay.
- For the LLM, time to first token (TTFT) and token throughput vary significantly by model.
- For TTS, the chosen voice model and provider affect time to first byte (TTFB) and how quickly audio streams.
Choosing a faster or smaller model in any of the three stages can reduce that stage’s contribution to total latency, often at the cost of quality or capability. You can use Agent Observability to determine which stage dominates so you know where to optimize.
Another consideration is whether your chosen model supports streaming; streaming models offer lower latency at some cost in accuracy.
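As a back-of-envelope exercise, total perceived latency in a pipeline is roughly the sum of the per-stage contributions, which is why swapping out a single dominant stage can pay off. The stage numbers below are purely hypothetical; substitute your own measurements from Agent Observability:

```python
# Rough additive model of the time from "user stops speaking" to the first
# audible agent audio. All values are in milliseconds and illustrative only.
def perceived_latency_ms(eou_delay, stt_final_delay, llm_ttft, tts_ttfb):
    return eou_delay + stt_final_delay + llm_ttft + tts_ttfb

baseline = perceived_latency_ms(500, 150, 400, 250)    # -> 1300 ms
faster_llm = perceived_latency_ms(500, 150, 200, 250)  # swap one stage -> 1100 ms
```

Halving the TTFT of the dominant stage here removes 200 ms from every turn, which is why identifying the dominant stage comes before any tuning.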
Level of impact: High
More information: Always consult the model docs for a list of the latest supported models. We announce new model support in our developer community and our regular developer newsletter. Any recommendations on the best models to use in specific circumstances would quickly go out of date, so none are provided here. Many developers use our homepage agent (https://livekit.io) as a yardstick, and a separate guide is available that discusses how to match the homepage agent's latency.
STT: Preemptive generation
You can cause the LLM to start generating a response as soon as a partial transcript of the user’s speech is available. In a best-case scenario, this reduces latency because by the time the user has finished speaking, the response is ready to go.
The caveat is that if the system needs to regenerate the reply following the final transcript (for example, if the preempted response was incorrect), latency will not be improved. Instead, this will waste LLM tokens and potentially add complexity to your agent if you have custom logic that runs on each turn.
You can experiment with this setting in your application, and use Agent Observability to measure the length of a user turn in a typical call to see whether preemptive generation is beneficial or detrimental.
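As a sketch, preemptive generation is enabled with a single flag on the session. This assumes the Python Agents SDK; check your version's reference for exact availability:

```python
from livekit.agents import AgentSession

session = AgentSession(
    preemptive_generation=True,  # start LLM inference on partial transcripts
    # stt=..., llm=..., tts=..., vad=...,
)
```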
Level of impact: Dependent on circumstances
More information: Our docs page provides more detail about preemptive generation
STT: Model-specific optimizations
Your choice of STT model and its geographic location relative to your agent have the greatest effect on transcription latency.
The options available for each model vary, but in general most models expose settings that trade accuracy or features for lower latency. For example, you can configure endpointing (end-of-utterance silence, max delay) so the final transcript is emitted sooner.
Level of impact: Low
More information: Please refer to the individual STT provider's documentation for each model.
LLM: Model-specific optimizations
Your choice of LLM and its physical proximity to your agent are the primary factors in LLM-stage latency.
Considering the number and variety of LLMs LiveKit supports, there are too many optional parameters to list here. Be mindful that options which constrain the LLM, such as max_completion_tokens or reasoning_level, will likely reduce latency at the cost of capability.
Level of impact: Low to Medium
More information: Please refer to the individual LLM provider's documentation for each model.
LLM: Function tool and MCP use
If your agent calls function tools, this happens before the reply is generated and can add large, variable latency. Since agents can invoke multiple function tools per turn and tools are executed sequentially, a single turn can greatly increase perceived latency.
Consider:
- Limiting the maximum number of tool executions with the max_tool_steps setting
- Using Agent Observability to monitor the number and duration of function tool calls during an agent turn
- Playing a “Thinking” sound during tool execution, and notifying the user prior to making the call, so they are not kept waiting without feedback
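The first and third suggestions can be sketched as follows. This assumes the Python Agents SDK's AgentSession and BackgroundAudioPlayer; verify the names and built-in clips against your SDK version:

```python
from livekit.agents import (
    AgentSession,
    AudioConfig,
    BackgroundAudioPlayer,
    BuiltinAudioClip,
)

session = AgentSession(
    max_tool_steps=3,  # cap sequential tool executions per turn
    # stt=..., llm=..., tts=...,
)

# Play a subtle sound while tools run, so users are not left in silence.
background_audio = BackgroundAudioPlayer(
    thinking_sound=AudioConfig(BuiltinAudioClip.KEYBOARD_TYPING, volume=0.8),
)
# Later, inside your entrypoint:
# await background_audio.start(room=ctx.room, agent_session=session)
```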
Level of impact: Potentially high, depending on your implementation
More information: Function tools are detailed in our documentation. max_tool_steps is covered under tool configuration, and thinking sounds are documented in the external data section.
LLM: External API calls
Making external API calls is required for most, if not all, production agents, for example looking up a value in a database or setting a calendar appointment.
You will make external calls as part of a function tool, but we mention them separately here because the time it takes for each call to complete has a direct and cumulative impact on tool execution time, and therefore on agent latency for that turn.
If possible, consolidate external calls to avoid multiple round trips.
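One common consolidation is to overlap independent external calls rather than awaiting them one after another. The fetch functions below are hypothetical stand-ins for your own services:

```python
import asyncio

# Hypothetical stand-ins for real external API calls.
async def fetch_account(user_id: str) -> dict:
    await asyncio.sleep(0.1)  # simulated network round trip
    return {"tier": "pro"}

async def fetch_calendar(user_id: str) -> dict:
    await asyncio.sleep(0.1)
    return {"free_slots": 3}

async def tool_body(user_id: str) -> dict:
    # Awaiting these sequentially would cost ~200 ms; gather overlaps the
    # round trips so the tool completes in ~100 ms.
    account, calendar = await asyncio.gather(
        fetch_account(user_id), fetch_calendar(user_id)
    )
    return {**account, **calendar}
```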
Level of impact: Potentially high, depending on your implementation
More information: Our documentation on external services lists several providers of external services that work well with LiveKit Agents; these are already optimized so you can avoid “rolling your own.”
LLM: Prompt and context size
Larger prompts and longer conversation history mean more input tokens for the model to process before it can produce the first output token. The model must encode and attend over the entire context: system instructions, prior turns, the latest user message, and any tool results. As a result, time to first token (TTFT) typically grows as context length increases. For voice agents, keeping system instructions concise and trimming or summarizing older turns can reduce TTFT.
Since this only affects long-running conversations, you will need to consider whether the overhead of reducing chat context is worth the performance trade-off for your use case.
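As a generic illustration (over plain role/content dicts, not the SDK's ChatContext API), trimming can be as simple as keeping the system prompt plus the most recent turns:

```python
def trim_context(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Keep the system prompt plus only the `keep_last` most recent messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]

history = [{"role": "system", "content": "You are a concise voice assistant."}]
history += [{"role": "user", "content": f"question {i}"} for i in range(20)]
trimmed = trim_context(history)  # 1 system message + the last 6 turns
```

Summarizing older turns instead of dropping them preserves more context at the cost of an extra LLM call.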
Level of impact: Likely low, but depends on your use case.
More information: The documentation for pipeline hooks discusses how and where you can update chat context via update_chat_ctx.
TTS: Model-specific optimizations
TTS model choice and geographic location matter most here. Configuration options for a given model generally have only a small impact on agent latency.
Level of impact: Low
More information: Please refer to the individual TTS provider's documentation for each model.
Voice Activity Detection (VAD) / Turn Detection Considerations
Prewarm the VAD
The framework runs each agent job in its own process. To accelerate agent start-up time, you should preload the model files associated with the VAD before it is assigned any jobs.
Prewarming the VAD is recommended for all production deployments.
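A minimal prewarm sketch, following the pattern from the LiveKit docs (assuming the Python SDK with the Silero plugin):

```python
from livekit import agents
from livekit.plugins import silero

def prewarm(proc: agents.JobProcess):
    # Load the Silero VAD weights once per process, before any job arrives.
    proc.userdata["vad"] = silero.VAD.load()

async def entrypoint(ctx: agents.JobContext):
    vad = ctx.proc.userdata["vad"]  # reuse the prewarmed instance
    ...

if __name__ == "__main__":
    agents.cli.run_app(
        agents.WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm)
    )
```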
Level of impact: Table stakes
More information: This is documented as part of the Agent server options and the Silero VAD plugin docs.
Configuration (endpointing)
The VAD and turn detector plugin expose several parameters that affect when the agent decides the user has finished speaking. These can directly impact both end-of-utterance delay and time to first response.
In general, tuning these parameters to make turn detection more aggressive will reduce latency, but will also make the conversation feel less natural for the end user.
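A hedged configuration sketch, assuming the Python Agents SDK's endpointing parameters (verify names and defaults against your version; the values below are illustrative only):

```python
from livekit.agents import AgentSession
from livekit.plugins import silero

# More aggressive endpointing emits the end-of-turn signal sooner, at the
# risk of cutting users off mid-thought.
session = AgentSession(
    vad=silero.VAD.load(min_silence_duration=0.4),
    min_endpointing_delay=0.3,  # wait at least this long after speech stops
    max_endpointing_delay=4.0,  # never wait longer than this
    # stt=..., llm=..., tts=...,
)
```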
Level of impact: Depends on your use case.
More information: See the following pages for VAD configuration docs and turn detector parameters.
Realtime: Turn detection
Note that realtime models offer built-in turn detection with their own configurable parameters, depending on the chosen model.
It is possible, though not recommended, to use LiveKit’s turn detection model instead of the realtime model’s built-in turn detection, but doing so introduces additional dependencies and likely adds latency.
Level of impact: Depends on your use case.
More information: See the docs for realtime turn detection and the individual realtime plugins for additional information.
SIP / Telephony
There are two ways to integrate your agent with the telephone network: LiveKit Phone Numbers and integration with a third-party trunk provider such as Twilio, Telnyx, etc.
For context, a call passes from your trunk provider through the LiveKit SIP server to your LiveKit server before reaching your agent.
When integrating with a trunk provider, the physical location of that trunk and how it affects RTT should be considered.
For example, if you buy a number associated with a UK trunk, your end-to-end call will pass through that trunk, the LiveKit SIP server, and your LiveKit Server. In general, you should not have to worry about this since our network topology routes packets through the most sensible and least-costly path. However, if you host your agent in the US, you add a transatlantic hop and will see increased latency.
Consult your SIP trunking provider for more information about regional routing.
At the time of writing, LiveKit Phone Numbers support US numbers only. Check the LiveKit Phone Numbers docs for the latest on regional availability.
Level of impact: Unless you intentionally configure your solution to introduce latency (as in the example above), the impact of SIP on overall latency is generally limited to RTT, plus a small allowance for jitter and audio transcoding.
Other Considerations
Network conditions and jitter
LiveKit’s agent framework is built on top of our globally distributed WebRTC infrastructure, designed to handle efficient media transmission at scale. We do all we can to mitigate network conditions and deliver a low-latency, performant media stream between your clients and agents.
In general, there is very little you can adjust on the network side to tweak latency, and the network will accommodate your solution. Network topology discussions can get detailed, especially among users who self-host LiveKit, but that is outside the scope of this article.
Level of impact: Best in class
More information: There have been a few blogs published over the years about our WebRTC infrastructure. This one, which discusses how we built our global mesh network, gives a good overview of the high level.
Noise cancellation
LiveKit Cloud offers enhanced noise cancellation, which uses advanced models to remove background noise. Noise cancellation runs with negligible impact on audio latency or quality.
Level of impact: Negligible
More information: See our enhanced noise cancellation docs.
Self-hosted Agent Latency caused by burstable instance types
On AWS, agents running on burstable instance types, such as t3 or t4g, can experience severe latency and timeout errors even when CPU usage appears low. This is not a bug; see the following article for more detail: Troubleshooting latency and timeout errors with turn detection on AWS.
Level of impact: None on LiveKit Cloud
Agent Startup and Session Initialization
While we do not consider agent start-up time as part of agent latency, it nevertheless contributes to user perception of overall agent performance, so it is worth mentioning here.
Agent cold starts
Projects on the Build plan are subject to “agent cold starts,” meaning that once all active agent sessions end, any subsequent agent join will experience a start-up time of several seconds.
This limitation is not present on paid plans.
Level of impact: High for initial start-up on our free plan. N/A for paid plans.
More information: See our docs for quotas and limits as well as managing agent deployments
Instant connect
On certain client platforms, it is possible to configure microphone capture and buffer user audio prior to the agent connecting.
Level of impact: Can potentially improve the initial user experience.
More information: More information in our audio docs.
CreateRoom() delays
Calls to CreateRoom() perform cross-region synchronization, which can cause delays.
Level of impact: Negligible by following guidance
More information: To avoid delays calling CreateRoom(), follow the guidance in this article.
Monitor Performance
Agent Observability provides transcripts, traces, logs, and audio recordings in a single place so you can see how each session behaved and where time was spent. Session traces break the run into spans for every stage of the voice pipeline (user turns, STT–LLM–TTS steps, tool calls, etc.), and in the details panel for each span you can see metrics / metadata emitted by the SDK (token counts, durations, speech identifiers, etc.). Given all this information, you can see how long each stage took and correlate latency with specific turns or tool calls.
Measuring latency
The LiveKit Agents SDK emits structured metrics that you can use to measure latency, accessible to your agent through the metrics_collected event. The data hooks docs describe which metrics are available for each stage of your agent pipeline or realtime model; these can be used to trace the sources of latency throughout a conversation. We recommend storing these logs externally and checking for regressions as you tweak your agent prompts or logic, or using them as a comparison point as you introduce newer, more capable models.
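As a sketch, assuming the Python Agents SDK, you can log every metrics event and aggregate usage per session (`session` here is your already-created AgentSession instance):

```python
from livekit.agents import MetricsCollectedEvent, metrics

usage = metrics.UsageCollector()

@session.on("metrics_collected")
def _on_metrics(ev: MetricsCollectedEvent):
    metrics.log_metrics(ev.metrics)  # per-stage timings: EOU delay, TTFT, TTFB...
    usage.collect(ev.metrics)        # aggregate usage across the session

# At shutdown, usage.get_summary() returns totals for the whole session,
# which you can ship to your own storage for regression tracking.
```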
In summary
Voice agents can support many use cases and application types. By understanding the full range of latency sources covered here, how much each contributes to your overall latency, and what you can do to address and monitor them, you can address the biggest contributors first, then tune or optimize the rest as needed.
There is no single silver bullet, but tackling the biggest sources of latency in your solution without sacrificing user experience is often the most sensible approach.