
Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained

A voice agent is built on three core components. A speech-to-text (STT) model transcribes audio, a large language model (LLM) generates a response, and a text-to-speech (TTS) model speaks it back. The pipeline architecture you choose determines how fast, how natural, and how production-ready your agent is.

Building a voice agent is not just about picking models. It's about understanding how audio flows through a system in real time, where latency accumulates, and how each architectural decision shapes the experience for the person on the other end of the call.

This guide walks through the core architecture of a voice agent, covering the components involved, the different pipeline patterns teams use in production, how turn detection works, and what to think about when it's time to scale. Whether you're designing your first voice agent or rethinking an existing one, this is a practical reference for making informed architectural decisions.

What are the core components of a voice agent?

Every voice agent, regardless of complexity, is built on three foundational components. These form the STT → LLM → TTS pipeline.

Speech-to-text (STT)

STT converts audio input from the user into text that the LLM can process. The choice of STT model affects:

  • Transcription accuracy across accents, noise levels, and domain-specific vocabulary.
  • Streaming vs. batch transcription — streaming returns partial results as the user speaks, which cuts time-to-first-token on the LLM significantly.
  • Word error rate (WER) and how it degrades in real-world conditions.

Popular options include Deepgram, AssemblyAI, Whisper (OpenAI), and provider-specific models. The best choice depends on your latency budget and accuracy requirements.
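Word error rate itself is straightforward to compute once you have a reference transcript: it is the word-level edit distance (substitutions + deletions + insertions) divided by the reference length. A minimal sketch in pure Python, assuming a non-empty reference:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance. Reference must be non-empty."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(word_error_rate("book a flight to new york", "book flight to newark"))  # 0.5
```

A WER of 0.5 on a short utterance like this is exactly the kind of real-world degradation (dropped function words, place names mangled) worth measuring on your own domain audio rather than trusting published benchmarks.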

Large language model (LLM)

The LLM is the reasoning layer. It receives the transcribed text, applies context from the conversation history and any system prompt, and generates a response.

For voice agents, the critical LLM considerations are:

  • Time-to-first-token (TTFT) — how quickly the model starts generating output. This is what users actually feel as "response time."
  • Streaming output — the LLM should stream tokens so TTS can start speaking before the full response is generated.
  • Context window management — long conversations require thoughtful pruning of conversation history to stay within limits without losing context.
  • Tool calling — agents that take actions (look up records, place orders, transfer calls) rely on structured tool calling support from the LLM.
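Context window management can be as simple as dropping the oldest turns while always preserving the system prompt. A rough sketch, using a character-count approximation in place of a real tokenizer; the `role`/`content` message shape follows the common chat-API convention rather than any specific SDK:

```python
def prune_history(messages, max_tokens=4000):
    """Keep the system prompt plus the most recent turns that fit the budget.
    Token counts are approximated as len(content) // 4 -- swap in a real
    tokenizer for production use."""
    def approx_tokens(m):
        return max(1, len(m["content"]) // 4)

    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(approx_tokens(m) for m in system)
    kept = []
    for m in reversed(turns):          # walk newest -> oldest
        cost = approx_tokens(m)
        if cost > budget:
            break                      # oldest turns beyond the budget are dropped
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))
```

More sophisticated strategies summarize dropped turns instead of discarding them, but the budget-walk structure stays the same.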

Text-to-speech (TTS)

TTS converts the LLM's text output into audio that gets played back to the user. The variables here are voice quality, latency, and how quickly the model can begin streaming audio after receiving the first few tokens.

The most important TTS consideration for voice agents is streaming synthesis — the ability to start generating audio from partial text rather than waiting for the full sentence. This keeps the pipeline moving and reduces the perceived wait time.
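A common way to feed a streaming TTS engine is to cut the LLM's token stream at sentence boundaries, so synthesis of the first sentence starts while later sentences are still being generated. A simplified sketch; real systems also handle abbreviations, numbers, and minimum chunk lengths:

```python
def sentence_chunks(token_stream):
    """Group a stream of LLM tokens into sentence-sized chunks so TTS can
    begin synthesizing before the full response exists."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush whenever the buffer ends at a sentence boundary.
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():                 # trailing partial sentence
        yield buffer.strip()

tokens = ["Sure", ", I can", " help.", " What", " city?"]
print(list(sentence_chunks(tokens)))  # ['Sure, I can help.', 'What city?']
```

In a live pipeline each yielded chunk would be handed to the TTS provider immediately, rather than collected into a list.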

How does a voice agent pipeline actually work?

The three components above can be connected in different ways. The architecture you choose has a direct impact on latency and the naturalness of the conversation.

Sequential pipeline

In the simplest architecture, the voice agent waits for the user to finish speaking, transcribes the full utterance, sends it to the LLM, waits for the full LLM response, then passes it to TTS and plays it back.

[Diagram: sequential pipeline — user audio → STT → LLM → TTS → playback, one stage at a time]

This is easy to build and reason about, but latency stacks at every stage. In practice, a sequential pipeline often produces 2–4 seconds of response delay, which makes conversation feel unnatural.
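The stacking effect is easy to see with stand-in stages: because nothing overlaps, per-stage delays add directly. The stage callables below are placeholders that just sleep, not real provider clients:

```python
import time

def sequential_turn(audio, stt, llm, tts):
    """Run one conversational turn with no overlap between stages."""
    start = time.monotonic()
    transcript = stt(audio)   # blocks until the full transcript exists
    reply = llm(transcript)   # blocks until the full LLM response exists
    speech = tts(reply)       # blocks until all audio is synthesized
    return speech, time.monotonic() - start

# Stand-in stages that each take 100 ms: the delays add, they never overlap.
slow = lambda label, delay: (lambda x: (time.sleep(delay), f"{label}({x})")[1])
speech, elapsed = sequential_turn(
    "hello.wav", slow("stt", 0.1), slow("llm", 0.1), slow("tts", 0.1))
print(speech, elapsed)  # stages never overlap, so ~0.3 s total
```

With realistic per-stage latencies (hundreds of milliseconds each, plus full-utterance waits), this is where the 2–4 second delays come from.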

Streaming pipeline

A streaming pipeline overlaps the stages. STT streams partial transcripts to the LLM, the LLM streams tokens to TTS, and TTS begins generating audio before the user has even finished the full thought.

[Diagram: streaming pipeline — STT, LLM, and TTS stages overlapping in time]

This is the standard architecture for production voice agents today. With streaming across all three stages, it's possible to get end-to-end latency under 1 second.
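In code, a streaming pipeline is naturally expressed as a chain of async generators, where each stage consumes the previous stage's partial output as soon as it appears. The toy stages below just transform strings; in a real agent they would wrap provider SDK streams:

```python
import asyncio

async def as_stream(items):
    for item in items:          # stand-in for incoming audio chunks
        yield item

async def stt(audio):           # toy streaming STT: audio chunk -> partial text
    async for chunk in audio:
        yield chunk.upper()

async def llm(partials):        # toy streaming LLM: partial text -> token
    async for text in partials:
        yield f"reply-to:{text}"

async def tts(tokens):          # toy streaming TTS: token -> audio frame
    async for tok in tokens:
        yield f"audio({tok})"

async def main():
    # Each stage starts work on the first chunk before later chunks exist.
    return [f async for f in tts(llm(stt(as_stream(["hi", "there"]))))]

result = asyncio.run(main())
print(result)  # ['audio(reply-to:HI)', 'audio(reply-to:THERE)']
```

The key property is that the first audio frame is produced after one trip through the chain, not after all input has been consumed.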

Realtime speech-to-speech

Some models (like OpenAI Realtime API and Gemini Live) skip the STT → LLM → TTS chain entirely. They accept raw audio input and return raw audio output, handling understanding and generation inside a single multimodal model.

[Diagram: realtime speech-to-speech — a single multimodal model taking audio in and producing audio out]

This approach can get latency under 500ms and produces more natural prosody since the model is operating directly on audio. The trade-off is less control over the individual components and, depending on the provider, higher cost per token.

What is turn detection and why does it matter?

Turn detection is the mechanism that decides when the user has finished speaking and the agent should respond. It's one of the most underestimated parts of voice agent architecture, and it has an outsized effect on how natural a conversation feels.

Voice activity detection (VAD)

VAD detects the presence or absence of speech in an audio stream. It's fast and runs locally, which makes it a good default. However, VAD alone doesn't understand the difference between a natural pause mid-sentence and an actual end-of-turn.

A common approach uses a lightweight neural VAD model, such as Silero VAD, which runs accurately and efficiently on a CPU.
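The core of energy-based VAD fits in a few lines: compute the RMS energy of each PCM frame and compare against a threshold. This is a toy illustration — neural VADs like Silero replace the fixed threshold with a small model — but it shows the shape of the decision:

```python
def is_speech(frame, threshold=500.0):
    """Toy energy-based VAD: flag a 16-bit PCM frame (list of int samples)
    as speech when its RMS energy exceeds a threshold. The threshold here
    is an illustrative value, not a calibrated one."""
    if not frame:
        return False
    rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
    return rms > threshold

print(is_speech([0] * 160))            # silent frame -> False
print(is_speech([2000, -2000] * 80))   # loud frame -> True
```

The weakness described above is visible here: a thoughtful mid-sentence pause produces the same low-energy frames as a finished turn.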

STT endpointing

Some STT providers expose an endpointing signal — a confidence score indicating that the user has finished their thought. This is more semantically aware than raw VAD because it has the benefit of partial transcription context.

Model-based turn detection

A better approach uses an LLM or smaller classifier to determine whether the user is done speaking based on the meaning of the transcript so far. This handles cases like "I want to book a flight to... uh... New York" where VAD might prematurely trigger.
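As a stand-in for a trained end-of-turn classifier, the sketch below combines silence duration with a crude check for syntactic completeness; the filler-word list and silence thresholds are made-up illustrative values, not tuned parameters:

```python
# Words that usually mean the speaker isn't finished yet (illustrative list).
FILLERS = {"uh", "um", "to", "and", "the", "a", "so"}

def likely_end_of_turn(transcript: str, silence_ms: int) -> bool:
    """Toy stand-in for model-based turn detection: combine silence duration
    with whether the transcript looks syntactically complete. A production
    system would use a trained classifier in place of the filler check."""
    words = transcript.lower().rstrip(".!?").split()
    if not words:
        return False
    if words[-1] in FILLERS:           # "...a flight to" -> keep waiting
        return silence_ms > 2000       # only give up after a long pause
    return silence_ms > 500            # complete-sounding phrase + short pause

print(likely_end_of_turn("I want to book a flight to", 800))   # False
print(likely_end_of_turn("book it", 600))                      # True
```

Even this crude heuristic avoids the "flight to... uh... New York" failure mode: a dangling preposition buys the user extra time before the agent jumps in.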

Interruption handling

Turn detection is also what enables the agent to stop speaking when the user interrupts. A well-architected voice agent detects the user speaking mid-response and immediately cancels the current TTS playback to listen instead. Without this, agents feel robotic and rigid.
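With asyncio, interruption handling maps naturally onto task cancellation: playback runs as a task, and detecting user speech cancels it. A minimal sketch — production code would typically re-raise the cancellation after cleanup rather than swallow it:

```python
import asyncio

async def speak(text):
    """Stand-in for TTS playback; each sleep simulates one frame of audio."""
    try:
        for word in text.split():
            await asyncio.sleep(0.05)      # one "frame" of playback
        return "finished"
    except asyncio.CancelledError:
        return "interrupted"               # barge-in: stop speaking immediately

async def handle_turn():
    playback = asyncio.create_task(speak("let me read your account details"))
    await asyncio.sleep(0.08)              # user starts talking mid-response
    playback.cancel()                      # cancel TTS, go back to listening
    return await playback

result = asyncio.run(handle_turn())
print(result)  # interrupted -- playback stopped mid-sentence
```

The important architectural point is that cancellation must propagate all the way to the audio output; cancelling the LLM stream while buffered TTS audio keeps playing still feels like being talked over.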

What architecture is best for low-latency voice agents?

Latency in a voice agent accumulates across four main sources.

  1. Audio transport — how long it takes audio to travel from the user's device to your pipeline.
  2. STT processing — transcription time, especially for the first partial result.
  3. LLM time-to-first-token — how quickly the model starts responding.
  4. TTS time-to-first-audio — how quickly synthesis starts after receiving text.

A practical breakdown for a streaming pipeline:

| Stage | Target latency | Notes |
| --- | --- | --- |
| Audio transport (WebRTC) | < 50 ms | Use a global, low-latency media network |
| STT (first partial result) | 100–200 ms | Streaming STT |
| LLM time-to-first-token | 200–400 ms | Depends on model size and infrastructure |
| TTS time-to-first-audio | 100–300 ms | Streaming synthesis required |
| Total (perceived) | < 1 second | Goal for natural conversation feel |

The biggest lever you can pull is co-locating your inference infrastructure with your media infrastructure. Round-trips between data centers add up fast. LiveKit's inference gateway routes to the closest available model endpoint, which keeps latency predictable across regions.
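A quick sanity check against the table: sum illustrative per-stage midpoints and compare with the 1-second target. The stage values below are example midpoints from the ranges above, not measurements:

```python
def latency_budget(stages_ms, target_ms=1000):
    """Sum per-stage latencies and check them against the perceived-latency
    target. Returns (total_ms, within_budget)."""
    total = sum(stages_ms.values())
    return total, total <= target_ms

stages = {
    "audio_transport": 50,      # WebRTC, well-routed
    "stt_first_partial": 150,   # streaming STT
    "llm_ttft": 300,            # time-to-first-token
    "tts_first_audio": 200,     # streaming synthesis
}
print(latency_budget(stages))   # (700, True) -- under the 1 s goal
```

Tracking a budget like this per deployment region makes regressions obvious: a single stage drifting 300 ms (say, cross-region inference) blows the whole target.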

Scaling and deployment patterns

Session state and scaling

Each agent session maintains state, including conversation history, active tool calls, audio buffers, and connection context. LiveKit agents are stateful by design. Each session is bound to a single worker process for its full duration to preserve timing and streaming integrity.

  • Stateful, per-session workers — each session runs within a dedicated worker process, isolating state and preventing interference between sessions.
  • Horizontal scaling via worker pools — scaling is achieved by running more worker processes and distributing sessions across them using load-based scheduling (factoring in CPU, memory, and network conditions).
  • On LiveKit Cloud, this is handled automatically. For self-hosted deployments, configuring autoscaling against CPU utilization with a scale-up threshold below your load_threshold keeps new sessions from being rejected under load.
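The scale-up rule can be stated in a few lines. The threshold values here are illustrative, not LiveKit defaults; `load_threshold` refers to the point at which a worker stops accepting new sessions:

```python
def should_scale_up(cpu_utilization, scale_up_threshold=0.60, load_threshold=0.75):
    """Trigger scale-out before workers hit the load_threshold at which
    they stop accepting new sessions (both thresholds are example values)."""
    # Scaling up only after workers are already rejecting sessions is too late.
    assert scale_up_threshold < load_threshold, "scale up before sessions are rejected"
    return cpu_utilization >= scale_up_threshold

print(should_scale_up(0.70))   # True  -- add workers now
print(should_scale_up(0.50))   # False -- headroom remains
```

The gap between the two thresholds buys time for new workers to boot while existing ones can still absorb incoming sessions.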

Hosted vs. self-hosted

For teams that want to skip infrastructure management, Hosted Voice Agents handles agent process orchestration, scaling, and session routing. For teams with strict data residency or custom infrastructure requirements, the open-source LiveKit Agents SDK supports full self-hosting.

Concurrency and resource planning

Voice agent sessions are resource-intensive compared to typical API calls. Each active session holds:

  • An open WebRTC connection (audio streams).
  • An active STT stream.
  • An LLM context window in memory or cache.
  • An active TTS stream.

Planning for concurrency means accounting for all four. A session that's "just waiting" still holds a connection and a context window.
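A back-of-the-envelope capacity estimate makes the "just waiting" point concrete. Every number below is an illustrative assumption, not a LiveKit default: each session is assumed to hold roughly 48 MB (audio buffers, LLM context cache, stream state) and about 10% of a core:

```python
def sessions_per_worker(worker_mem_mb=4096, overhead_mb=512,
                        session_mb=48, cpu_cores=4, session_cpu=0.10):
    """Estimate concurrent sessions per worker process from memory and CPU,
    taking the tighter of the two constraints. All parameters are
    illustrative assumptions to be replaced with measured values."""
    by_memory = (worker_mem_mb - overhead_mb) // session_mb
    by_cpu = int(round(cpu_cores / session_cpu))
    return min(by_memory, by_cpu)   # the binding constraint wins

print(sessions_per_worker())        # 40 -- CPU is the binding constraint here
```

The useful habit is measuring real per-session cost under load and re-running the arithmetic, since STT/TTS stream handling and codec work shift the CPU number substantially between providers.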

Observability and monitoring

Voice agents are harder to debug than text-based systems because the key signals are distributed across audio, transcripts, model calls, and timing data. A session that "sounds wrong" requires correlating events across all of these layers simultaneously.

Effective observability for voice agents should capture:

  • Session-level audio recordings (with appropriate consent handling).
  • Turn-by-turn transcripts linked to timestamps.
  • LLM input/output traces with latency breakdowns per stage.
  • Tool call logs with arguments and return values.
  • Error events tied to specific moments in the session.

With all of this in one place, it's possible to replay a session and understand exactly where something went wrong. LiveKit's observability tooling is designed around this workflow. See the Observability documentation for setup details.

Frequently asked questions

What is voice agent architecture?

Voice agent architecture refers to the design of the system that processes audio input, generates intelligent responses, and returns audio output in real time. The core pipeline connects speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS) components.

What is the STT, LLM, TTS pipeline?

The STT, LLM, TTS pipeline is the most common pattern for building voice agents. Audio from the user is transcribed by an STT model, passed to an LLM for reasoning and response generation, then converted back to speech by a TTS model. In a streaming pipeline, these stages overlap to reduce total response latency.

What is realtime speech-to-speech AI?

Realtime speech-to-speech refers to architectures where a single multimodal model accepts raw audio input and returns raw audio output, bypassing the separate STT, LLM, and TTS components. This can reduce latency and produce more natural speech, at the cost of some architectural flexibility.

What's the difference between sequential and streaming voice pipelines?

A sequential pipeline waits for each stage to fully complete before passing output to the next. A streaming pipeline runs stages in parallel, passing partial output downstream as it becomes available. Streaming pipelines are significantly faster and are the standard for production voice agents.

How does turn detection work in a voice agent?

Turn detection determines when the user has finished speaking so the agent knows when to respond. It can be implemented using voice activity detection (VAD), STT endpointing signals, or model-based classifiers. Good turn detection also handles interruptions, where the user speaks while the agent is still responding.

How do you scale a voice agent to handle many concurrent sessions?

Scaling voice agents requires managing WebRTC connections, STT streams, LLM context, and TTS streams concurrently. LiveKit agents are stateful by design — each session stays on the same worker process for its full duration. Scaling is achieved by running a pool of worker processes and distributing sessions across them based on load. On LiveKit Cloud, this scheduling is automatic. For self-hosted deployments, horizontal scaling is handled by adding worker processes with autoscaling configured against CPU utilization.

What latency should I target for a voice agent?

For a conversation that feels natural, target under 1 second of end-to-end latency from when the user stops speaking to when the agent starts responding. This requires streaming STT, a fast LLM (low time-to-first-token), and streaming TTS synthesis.