The sequential pipeline is the foundational architecture behind every voice agent. Here's how it works, why streaming changes everything, and how to build one yourself.
There's a gap between when you stop speaking and when a voice agent responds. Most people notice it without being able to name it.
Under 1 second, the conversation feels natural. Under 500 milliseconds, it feels like talking to a person. Over 2 seconds, something feels broken, even if the words are perfect.
That gap, and everything that happens inside it, is the engineering story of voice AI. And at the center of that story is the sequential pipeline, the architecture pattern that powers virtually every production voice agent today.
What Is the Sequential Pipeline?
The sequential pipeline chains specialized processing stages in a fixed order. Each stage has one job. The output of one stage becomes the input of the next. Think of it as an assembly line where raw audio enters on one end, and a spoken response comes out the other.
For voice agents, the canonical pipeline looks like this:
Audio In → VAD → STT → LLM → TTS → Audio Out
Each stage is independently testable, swappable, and optimizable. That modularity is what makes the pattern so practical for production systems.
The Five Stages, Explained
1. Voice Activity Detection (VAD)
Its job is to figure out when the user is actually speaking versus background noise, silence, or a TV in the next room.
VAD runs continuously on the incoming audio stream. It gates everything downstream. If VAD doesn't fire, the rest of the pipeline stays idle. Good VAD adds roughly 10 to 50ms of latency. Bad VAD either clips the beginning of sentences (too aggressive) or sends noise to the transcription model (too passive).
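To make the gating idea concrete, here's a toy energy-threshold VAD sketch. A production VAD like Silero uses a neural model; this is only an illustration of how VAD sits in front of the pipeline and filters the audio stream, and the `threshold` and `hang_frames` parameters are invented for the example.

```python
def vad_gate(frames, threshold=0.01, hang_frames=3):
    """Toy energy-based VAD: yield only frames judged to be speech.

    A short "hangover" keeps the gate open through brief pauses so
    word boundaries aren't clipped. All thresholds are illustrative.
    """
    silent_run = 0
    speaking = False
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            speaking = True
            silent_run = 0
        elif speaking:
            silent_run += 1
            if silent_run > hang_frames:  # hangover expired: end of speech
                speaking = False
        if speaking:
            yield frame  # only these frames reach STT
```

Set the hangover too low and you get the "clips the beginning of sentences" failure mode; set the threshold too low and noise leaks through to transcription.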
VAD also handles turn detection, deciding when the user has finished their thought. This is harder than it sounds. A pause might mean the user is thinking, or it might mean they're done. Some systems use semantic models on top of VAD to make smarter turn-taking decisions.
2. Speech-to-Text (STT)
Its job is to convert the user's spoken audio into text.
STT takes the audio segments identified by VAD and produces a transcript. Modern STT models like Deepgram Nova and OpenAI Whisper support streaming transcription, emitting partial results (words or phrases) before the user finishes speaking. This is critical for keeping latency low.
Typical STT latency is around 200ms for a complete utterance, but partial transcripts can start arriving in under 100ms with streaming.
The accuracy of your STT model has outsized downstream impact. If the transcription is wrong, the LLM reasons about the wrong input, and the response will be wrong no matter how good everything else is.
3. Large Language Model (LLM)
Its job is to understand what the user said, decide what to do about it, and generate a text response.
This is the brain of the pipeline. The LLM receives the transcript (plus conversation history, system instructions, and available tools) and produces a response. It might answer directly, call an external API, look up a database, or decide to transfer the conversation to another agent.
LLM inference is typically the slowest stage, often 300 to 800ms for the first token, depending on the model and prompt complexity. Streaming token output is essential here. The TTS stage doesn't need the full response before it starts working.
Conversation history is managed through a ChatContext object (chat_ctx) that accumulates turns across the session. Every user message and agent response is appended so the LLM always has full context. When handing off to another agent, you explicitly pass this object to the new agent to preserve the conversation. Without it, each agent starts with a fresh context and the user would have to re-explain what they already said.
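The handoff behavior is easy to see with a stripped-down stand-in for the context object (this is not LiveKit's actual `ChatContext` class, just a sketch of the accumulation-and-handoff idea):

```python
class SimpleChatContext:
    """Minimal stand-in for a chat context: an append-only turn list."""

    def __init__(self, messages=None):
        self.messages = list(messages or [])

    def append(self, role, text):
        self.messages.append({"role": role, "content": text})


class ToyAgent:
    def __init__(self, chat_ctx=None):
        # An agent constructed without a context starts from scratch
        self.chat_ctx = chat_ctx or SimpleChatContext()


agent_a = ToyAgent()
agent_a.chat_ctx.append("user", "I need to reschedule my appointment.")
agent_a.chat_ctx.append("assistant", "Sure, which date works for you?")

# Handoff WITH context: agent B sees the full history
agent_b = ToyAgent(chat_ctx=agent_a.chat_ctx)

# Handoff WITHOUT context: agent C knows nothing about the conversation
agent_c = ToyAgent()
```

The same principle applies in the real framework: pass the existing context object to the next agent, or the user repeats themselves.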
4. Text-to-Speech (TTS)
Its job is to convert the LLM's text response into natural-sounding audio.
TTS takes the generated text and synthesizes speech. Modern TTS models produce remarkably natural output, with control over voice, speed, and emotional tone. Like STT, production TTS operates in streaming mode, generating audio chunks as text tokens arrive rather than waiting for the complete response.
Typical TTS latency is around 100 to 200ms for the first audio chunk with streaming.
5. Audio Transport
Its job is to get the synthesized audio back to the user with minimal delay.
The final stage delivers audio over the network. For web and mobile apps, WebRTC provides the lowest latency transport. For phone calls, SIP trunking connects to the telephone network. Either way, the transport layer needs to handle jitter, packet loss, and network variability without introducing perceptible delays.
Naive vs. Streaming: Why It Matters
The sequential pipeline has an obvious problem. If each stage waits for the previous one to fully complete before starting, latency adds up fast.
| Approach | How It Works | Typical Total Latency |
|---|---|---|
| Naive (blocking) | Each stage runs to completion before the next one starts | 1000 to 2000ms+ |
| Streaming | Stages overlap. STT emits partial text, LLM streams tokens, TTS synthesizes chunks in parallel | 400 to 800ms |
Streaming transforms the total latency from roughly VAD + STT + LLM + TTS to something much closer to max(VAD, STT, LLM, TTS). That's the difference between "feels broken" and "feels like a real conversation."
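The arithmetic is worth spelling out. Using illustrative per-stage first-result latencies (assumed figures consistent with the ranges quoted above, not measurements):

```python
# Assumed per-stage latencies to first result, in milliseconds
stages = {"vad": 30, "stt": 200, "llm": 500, "tts": 150}

# Naive (blocking): stages run back-to-back, so latencies add
naive_ms = sum(stages.values())

# Ideal streaming: stages overlap, so the slowest stage dominates
streaming_ms = max(stages.values())

print(naive_ms, streaming_ms)  # 880 500
```

Real pipelines land between the two bounds, since overlap is never perfect, but the shape of the improvement is exactly this.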
Here's what streaming looks like in practice:
- STT starts emitting partial transcripts before the user finishes speaking
- The LLM begins generating tokens as soon as it has enough context from the transcript
- TTS starts synthesizing audio from the first sentence while the LLM is still generating the rest
- Audio playback begins while later TTS chunks are still being rendered
Every stage boundary becomes a streaming interface rather than a blocking handoff. This is what separates demo-quality voice agents from production-quality ones.
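The "streaming interface at every boundary" idea maps naturally onto chained generators. This toy sketch (string transforms standing in for real STT, LLM, and TTS models) shows how the first audio chunk can be produced before the last input chunk is ever read:

```python
def stt_stream(audio_chunks):
    """Emit partial transcripts as audio arrives (stand-in for streaming STT)."""
    for chunk in audio_chunks:
        yield f"partial:{chunk}"


def llm_stream(transcript_parts):
    """Begin emitting tokens as soon as transcript text is available."""
    for part in transcript_parts:
        yield part.upper()  # stand-in for token generation


def tts_stream(tokens):
    """Synthesize audio chunk-by-chunk as tokens arrive."""
    for token in tokens:
        yield f"audio({token})"


# Generators chain lazily: pulling one item from the end of the chain
# pulls exactly one item through every stage boundary.
pipeline = tts_stream(llm_stream(stt_stream(["hi", "there"])))
first_audio = next(pipeline)  # produced before "there" is processed
```

Each `yield` boundary here plays the role a streaming API boundary plays in a real pipeline: downstream work starts on partial results instead of waiting for completion.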
Barge-in and Interruption Handling
One of the trickiest problems in voice pipeline design is what happens when the user interrupts mid-response, a behavior known as barge-in. In a naive pipeline, the agent just keeps talking. In a production pipeline, the system needs to detect the interruption, stop TTS playback immediately, flush any queued audio, and restart the pipeline from STT.
LiveKit's framework handles this automatically. When the VAD detects speech while the agent is talking, it fires an interruption event that cancels the active TTS playback and triggers a new STT pass. From the user's perspective, the agent stops and listens, which is exactly what a natural conversation feels like.
The tricky edge cases are:
- Filler sounds like "mm-hmm" or "yeah" mid-response shouldn't always trigger a full interruption
- Accidental barge-in from background noise or the agent's own audio leaking back into the microphone (echo) can falsely trigger interruption
- Mid-tool-call interruptions: if the user barges in while the LLM is partway through a tool call, you need to decide whether to cancel the operation or complete it silently
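The first two edge cases can be folded into a simple interruption policy. This is a toy decision function, not LiveKit's implementation, and the filler list and duration threshold are illustrative:

```python
FILLERS = {"mm-hmm", "yeah", "uh-huh", "right", "okay"}


def should_interrupt(partial_transcript, speech_duration_ms,
                     min_duration_ms=250):
    """Decide whether detected user speech should cancel agent playback.

    Very short bursts are treated as noise or echo leakage, and bare
    backchannel fillers are ignored. Thresholds are illustrative.
    """
    if speech_duration_ms < min_duration_ms:
        return False  # likely background noise or echo
    if partial_transcript.strip().lower() in FILLERS:
        return False  # backchannel acknowledgment, keep talking
    return True
```

Real systems layer more signal on top (echo cancellation, semantic turn models), but the core shape is the same: not every VAD firing should stop the agent.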
For tool calls that perform irreversible actions (like a database write), call run_ctx.disallow_interruptions() at the start of the tool to prevent user speech from cancelling the operation. To wait for the agent to finish speaking before the tool continues, use await context.wait_for_playout().
Cascaded Pipeline vs. Speech-to-Speech: Which Architecture to Choose
The sequential pipeline (also called the "cascaded" architecture) isn't the only option. Speech-to-speech (S2S) models like GPT-4o Realtime and Gemini 2.5 Flash handle audio in and audio out natively, without the text intermediary.
| Consideration | Cascaded Pipeline (STT → LLM → TTS) | Speech-to-Speech (S2S) |
|---|---|---|
| Latency | 300 to 600ms with streaming | 200 to 300ms (fewer stages) |
| Control | Full visibility at every stage. Swap any component independently | Single model, less granular control |
| Debugging | Read the transcript, inspect LLM reasoning, audit each stage | Audio in, audio out. Harder to trace failures |
| Tool calling | Mature and reliable through the LLM's text interface | Improving but still less robust for complex tool use |
| Emotional context | Lost in the STT step (tone, prosody, emphasis) | Preserved. The model "hears" emotion directly |
| Compliance | Full text audit trail at every stage | Requires additional tooling for auditability |
| Provider flexibility | Mix and match best-in-class STT, LLM, and TTS providers | Locked to one model provider |
For most production deployments in 2026, the cascaded pipeline remains the default. It gives you transparency, debuggability, and the flexibility to swap any component without touching the rest. S2S is gaining traction for latency-sensitive conversational flows, but cascaded still wins on control and compliance.
Many teams are exploring hybrid approaches that use S2S for simple, fast exchanges and fall back to the cascaded pipeline for complex reasoning or tool-heavy interactions.
When to Use the Sequential Pipeline
- You're building a voice agent for the first time. This is the entry point. Every other multi-agent pattern (Supervisor, Handoff, ReAct) builds on top of it.
- You need to swap components independently. Want to try a different TTS voice? Switch STT providers? Upgrade your LLM? Each stage is a clean interface.
- Debuggability matters. When something goes wrong, you can isolate the problem to a specific stage. Was the transcription bad? Did the LLM hallucinate? Was the TTS garbled?
- You're in a regulated industry. Finance, healthcare, and legal deployments need full audit trails. The cascaded pipeline gives you text transcripts at every stage.
- You need tool calling reliability. Text-based LLM tool calling is more mature and predictable than audio-native alternatives.
When to Consider Alternatives
- Latency below 300ms is critical and you can accept less control. S2S models get you there with fewer stages.
- Emotional tone preservation matters more than debuggability. S2S models hear and reproduce prosody that gets lost in the STT step.
- The workflow isn't linear. If your agent needs to route between specialists, fan out to multiple data sources, or coordinate parallel tasks, you'll layer additional patterns (like Supervisor or Handoff) on top of the pipeline.
Building a Sequential Pipeline Voice Agent with LiveKit
LiveKit's Agents Framework implements the sequential pipeline through AgentSession. The framework handles streaming at every stage boundary out of the box, so you get production-grade latency without building the plumbing yourself.
Here's what a minimal voice agent looks like:
```python
from livekit import agents, rtc
from livekit.agents import AgentServer, AgentSession, Agent, TurnHandlingOptions, room_io
from livekit.plugins import noise_cancellation, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel


class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant. Be concise and conversational."
        )


server = AgentServer()


@server.rtc_session(agent_name="my-agent")
async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        vad=silero.VAD.load(),        # Stage 1: Voice Activity Detection
        stt="deepgram/nova-3:multi",  # Stage 2: Speech-to-Text (LiveKit Inference)
        llm="openai/gpt-4.1-mini",    # Stage 3: Language Model (LiveKit Inference)
        tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",  # Stage 4: Text-to-Speech (LiveKit Inference)
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
        ),
    )

    await session.start(
        agent=VoiceAgent(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=lambda params: noise_cancellation.BVCTelephony()
                if params.participant.kind == rtc.ParticipantKind.PARTICIPANT_KIND_SIP
                else noise_cancellation.BVC(),
            ),
        ),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(server)
```
That's a working voice agent with the full sequential pipeline. Each component (VAD, STT, LLM, TTS) is a plugin you can swap with a single line change.
Swapping Providers
Want to try a different model? Change one string. All providers are available through LiveKit Inference — no separate API keys required. For all available model strings, see the LLM, STT, and TTS model pages.
```python
# Switch to a larger LLM — change one string
session = AgentSession(
    vad=silero.VAD.load(),
    stt="deepgram/nova-3:multi",
    llm="openai/gpt-4.1",  # Swapped from gpt-4.1-mini to full GPT-4.1
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
)
```
This is the modularity principle in action. The pipeline stages communicate through clean interfaces, so each component is independently replaceable. No model lock-in.
Adding Tool Calling
The pipeline becomes more powerful when the LLM stage can call external tools. LiveKit uses the @function_tool decorator to expose tools to the LLM:
```python
from livekit.agents import Agent, function_tool, RunContext


class AppointmentAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You help users book appointments.
            Use check_availability to find open slots,
            then book_appointment to confirm."""
        )

    @function_tool()
    async def check_availability(self, context: RunContext, date: str):
        """Check available appointment slots for a given date.

        Args:
            date: The date to check in YYYY-MM-DD format.
        """
        slots = await calendar_api.get_slots(date)
        return {"available_slots": slots}

    @function_tool()
    async def book_appointment(self, context: RunContext, date: str, time: str):
        """Book an appointment at the specified date and time.

        Args:
            date: The date in YYYY-MM-DD format.
            time: The time in HH:MM format.
        """
        context.disallow_interruptions()  # Prevent cancellation mid-write
        result = await calendar_api.book(date, time)
        return {"confirmation": result.id}
```
The sequential pipeline stays the same. Audio flows through VAD → STT → LLM → TTS. But now the LLM stage can pause, call a tool, observe the result, and continue generating its response. The framework handles this transparently.
Optimizing Latency
LiveKit's pipeline streams at every stage boundary by default. But there are additional optimizations you can make:
Use a lightweight model for fast responses. For simple queries, a smaller model like GPT-4.1 mini or a fine-tuned compact model responds faster than a full-size model.
```python
# Faster LLM for simpler tasks
session = AgentSession(
    vad=silero.VAD.load(),
    stt="deepgram/nova-3:multi",
    llm="openai/gpt-4.1-mini",  # Faster inference
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
)
```
Reduce pipeline hops with LiveKit Inference. Instead of making separate API calls to Deepgram, OpenAI, and Cartesia from your agent server, LiveKit Inference routes all three through a single optimized path built into LiveKit Cloud, cutting down on round-trip overhead between pipeline stages.
Tune VAD sensitivity. Aggressive endpointing (shorter silence threshold) starts the pipeline sooner but risks cutting off the user mid-thought. Conservative endpointing adds latency but catches complete utterances. Find the right balance for your use case.
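The endpointing tradeoff can be simulated with a toy model. Given a timeline of speech segments, this sketch (parameters and model entirely illustrative; real endpointers also use semantic signals) shows when a silence-based endpointer would fire:

```python
def turn_end_offset(speech_segments_ms, silence_threshold_ms):
    """Return the time (ms) when a silence-based endpointer fires.

    speech_segments_ms: sorted list of (start, end) speech segments.
    If any gap between segments reaches the threshold, the endpointer
    fires during that pause and clips the rest of the turn.
    """
    for (_, end1), (start2, _) in zip(speech_segments_ms,
                                      speech_segments_ms[1:]):
        if start2 - end1 >= silence_threshold_ms:
            return end1 + silence_threshold_ms  # fired mid-thought
    return speech_segments_ms[-1][1] + silence_threshold_ms


# A user speaks for 1s, pauses 400ms to think, then finishes at 2s
timeline = [(0, 1000), (1400, 2000)]
aggressive = turn_end_offset(timeline, 300)    # fires in the pause: clips
conservative = turn_end_offset(timeline, 700)  # waits: correct but slower
```

The aggressive setting responds 1.4 seconds earlier but cuts the user off; the conservative one catches the whole utterance at the cost of added latency. That is the balance to tune.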
The Pipeline as Foundation
The sequential pipeline isn't just one pattern among many. It's the foundation that every other voice agent architecture builds on.
- The Supervisor pattern coordinates multiple agents, but each agent still uses a sequential pipeline internally for voice I/O.
- The Handoff pattern transfers between agents, but the audio processing at each agent follows the same VAD → STT → LLM → TTS flow.
- The ReAct pattern describes how the LLM stage reasons and calls tools, but it operates within the pipeline.
- The Human-in-the-Loop pattern pauses the pipeline at decision points, but the voice interaction before and after the pause is still a sequential pipeline.
Understanding this pattern is prerequisite to understanding everything else in voice agent architecture.
Error Handling and Graceful Degradation
Each stage in the pipeline is an independent failure point. STT can return garbled text on a bad audio signal. The LLM can time out under load. TTS can fail mid-synthesis. In production, you need a strategy for each.
The most important principle is to fail gracefully and keep the conversation alive. A few patterns that work well:
- STT failures happen when transcription confidence is low or the result is empty. Rather than passing bad input downstream, re-prompt the user with something like "Sorry, I didn't catch that, could you say that again?"
- LLM timeouts require explicit timeout thresholds and a fallback response ready. Don't let the user sit in silence for 5 seconds wondering if the call dropped.
- TTS errors can cause dead air. If speech synthesis fails, consider falling back to a simpler TTS provider or a pre-recorded fallback clip.
- Tool call failures can be handled by using LiveKit's ToolError to return structured errors to the LLM, which can then reason about recovery and communicate the issue naturally to the user
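A generic timeout-plus-fallback wrapper covers the LLM and TTS cases above. This is an illustrative sketch using plain asyncio, not a LiveKit API; the timeout value and fallback are placeholders you would tune per stage:

```python
import asyncio


async def with_fallback(primary, fallback, timeout_s=2.0):
    """Run a pipeline stage with a timeout and a fallback path.

    If the primary provider times out or raises, the fallback runs
    instead, so the user never sits in dead air. Illustrative only.
    """
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except Exception:  # includes asyncio.TimeoutError
        return await fallback()


async def demo():
    async def flaky_tts():
        raise RuntimeError("synthesis failed mid-stream")

    async def canned_clip():
        return "Sorry, one moment please."

    return await with_fallback(flaky_tts, canned_clip)


result = asyncio.run(demo())  # "Sorry, one moment please."
```

The same wrapper shape works for any stage: a secondary STT provider, a smaller fallback LLM, or a pre-recorded TTS clip.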
The modularity of the sequential pipeline is an advantage here. Each stage can be monitored independently with Agent Observability and given its own retry logic.
Key Takeaways
- The sequential pipeline (VAD → STT → LLM → TTS) is the standard architecture for production voice agents
- Streaming at every stage boundary is what makes the difference between a 2-second response and a 400ms response
- The cascaded pipeline gives you modularity, debuggability, and provider flexibility that S2S models can't match yet
- Every advanced multi-agent pattern (Supervisor, Handoff, ReAct, HITL) builds on top of this foundation
- LiveKit's Agents Framework implements the full streaming pipeline with swappable components and no model lock-in
Getting Started
If you'd prefer to explore without code first, Agent Builder lets you configure and preview a voice pipeline in your browser. The fastest way to see the full pipeline in action with code is LiveKit's Agent Playground, where you can test different STT, LLM, and TTS combinations and hear the latency differences yourself.
When you're ready to build, start with the Agents Quickstart to get a working voice agent running in minutes. From there, you can swap components, add tools, and layer on more advanced patterns as your use case demands. When ready to go live, deploy to LiveKit Cloud with one click.
Give it a try, and let us know what you're building.