The sequential pipeline is the foundational architecture behind every voice agent. Here's how it works, why streaming changes everything, and how to build one yourself.
There's a gap between when you stop speaking and when a voice agent responds. Most people notice it without being able to name it.
Under 1 second, the conversation feels natural. Under 500 milliseconds, it feels like talking to a person. Over 2 seconds, something feels broken, even if the words are perfect.
That gap, and everything that happens inside it, is the engineering story of voice AI. And at the center of that story is the sequential pipeline, the architecture pattern that powers virtually every production voice agent today.
What Is the Sequential Pipeline?
The sequential pipeline chains specialized processing stages in a fixed order. Each stage has one job. The output of one stage becomes the input of the next. Think of it as an assembly line where raw audio enters on one end, and a spoken response comes out the other.
For voice agents, the canonical pipeline looks like this:
Audio In → VAD → STT → LLM → TTS → Audio Out
Each stage is independently testable, swappable, and optimizable. That modularity is what makes the pattern so practical for production systems.
The Five Stages, Explained
1. Voice Activity Detection (VAD)
Its job is to figure out when the user is actually speaking versus background noise, silence, or a TV in the next room.
VAD runs continuously on the incoming audio stream. It gates everything downstream. If VAD doesn't fire, the rest of the pipeline stays idle. Good VAD adds roughly 10 to 50ms of latency. Bad VAD either clips the beginning of sentences (too aggressive) or sends noise to the transcription model (too passive).
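To make the gating idea concrete, here's a toy energy-threshold VAD sketch. A production VAD like Silero uses a neural model; this is only an illustration of how VAD sits in front of the pipeline and filters the audio stream, and the `threshold` and `hang_frames` parameters are invented for the example.

```python
def vad_gate(frames, threshold=0.01, hang_frames=3):
    """Toy energy-based VAD: yield only frames judged to be speech.

    A short "hangover" keeps the gate open through brief pauses so
    word boundaries aren't clipped. All thresholds are illustrative.
    """
    silent_run = 0
    speaking = False
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        if energy >= threshold:
            speaking = True
            silent_run = 0
        elif speaking:
            silent_run += 1
            if silent_run > hang_frames:  # hangover expired: end of speech
                speaking = False
        if speaking:
            yield frame  # only these frames reach STT
```

Set the hangover too low and you get the "clips the beginning of sentences" failure mode; set the threshold too low and noise leaks through to transcription.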
VAD also handles turn detection, deciding when the user has finished their thought. This is harder than it sounds. A pause might mean the user is thinking, or it might mean they're done. Some systems use semantic models on top of VAD to make smarter turn-taking decisions.
2. Speech-to-Text (STT)
Its job is to convert the user's spoken audio into text.
STT takes the audio segments identified by VAD and produces a transcript. Modern STT models like Deepgram Nova and OpenAI Whisper support streaming transcription, emitting partial results (words or phrases) before the user finishes speaking. This is critical for keeping latency low.
Typical STT latency is around 200ms for a complete utterance, but partial transcripts can start arriving in under 100ms with streaming.
The accuracy of your STT model has outsized downstream impact. If the transcription is wrong, the LLM reasons about the wrong input, and the response will be wrong no matter how good everything else is.
3. Large Language Model (LLM)
Its job is to understand what the user said, decide what to do about it, and generate a text response.
This is the brain of the pipeline. The LLM receives the transcript (plus conversation history, system instructions, and available tools) and produces a response. It might answer directly, call an external API, look up a database, or decide to transfer the conversation to another agent.
LLM inference is typically the slowest stage, often 300 to 800ms for the first token, depending on the model and prompt complexity. Streaming token output is essential here. The TTS stage doesn't need the full response before it starts working.
Conversation history is managed through a ChatContext object (chat_ctx) that accumulates turns across the session. Every user message and agent response is appended so the LLM always has full context. When handing off to another agent, you explicitly pass this object to the new agent to preserve the conversation. Without it, each agent starts with a fresh context and the user would have to re-explain what they already said.
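The handoff behavior is easy to see with a stripped-down stand-in for the context object (this is not LiveKit's actual `ChatContext` class, just a sketch of the accumulation-and-handoff idea):

```python
class SimpleChatContext:
    """Minimal stand-in for a chat context: an append-only turn list."""

    def __init__(self, messages=None):
        self.messages = list(messages or [])

    def append(self, role, text):
        self.messages.append({"role": role, "content": text})


class ToyAgent:
    def __init__(self, chat_ctx=None):
        # An agent constructed without a context starts from scratch
        self.chat_ctx = chat_ctx or SimpleChatContext()


agent_a = ToyAgent()
agent_a.chat_ctx.append("user", "I need to reschedule my appointment.")
agent_a.chat_ctx.append("assistant", "Sure, which date works for you?")

# Handoff WITH context: agent B sees the full history
agent_b = ToyAgent(chat_ctx=agent_a.chat_ctx)

# Handoff WITHOUT context: agent C knows nothing about the conversation
agent_c = ToyAgent()
```

The same principle applies in the real framework: pass the existing context object to the next agent, or the user repeats themselves.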
4. Text-to-Speech (TTS)
Its job is to convert the LLM's text response into natural-sounding audio.
TTS takes the generated text and synthesizes speech. Modern TTS models produce remarkably natural output, with control over voice, speed, and emotional tone. Like STT, production TTS operates in streaming mode, generating audio chunks as text tokens arrive rather than waiting for the complete response.
Typical TTS latency is around 100 to 200ms for the first audio chunk with streaming.
5. Audio Transport
Its job is to get the synthesized audio back to the user with minimal delay.
The final stage delivers audio over the network. For web and mobile apps, WebRTC provides the lowest latency transport. For phone calls, SIP trunking connects to the telephone network. Either way, the transport layer needs to handle jitter, packet loss, and network variability without introducing perceptible delays.
Naive vs. Streaming: Why It Matters
The sequential pipeline has an obvious problem. If each stage waits for the previous one to fully complete before starting, latency adds up fast.
| Approach | How It Works | Typical Total Latency |
|---|---|---|
| Naive (blocking) | Each stage runs to completion before the next one starts | 1000 to 2000ms+ |
| Streaming | Stages overlap. STT emits partial text, LLM streams tokens, TTS synthesizes chunks in parallel | 400 to 800ms |
Streaming transforms the total latency from roughly VAD + STT + LLM + TTS to something much closer to max(VAD, STT, LLM, TTS). That's the difference between "feels broken" and "feels like a real conversation."
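The arithmetic is worth spelling out. Using illustrative per-stage first-result latencies (assumed figures consistent with the ranges quoted above, not measurements):

```python
# Assumed per-stage latencies to first result, in milliseconds
stages = {"vad": 30, "stt": 200, "llm": 500, "tts": 150}

# Naive (blocking): stages run back-to-back, so latencies add
naive_ms = sum(stages.values())

# Ideal streaming: stages overlap, so the slowest stage dominates
streaming_ms = max(stages.values())

print(naive_ms, streaming_ms)  # 880 500
```

Real pipelines land between the two bounds, since overlap is never perfect, but the shape of the improvement is exactly this.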
Here's what streaming looks like in practice:
- STT starts emitting partial transcripts before the user finishes speaking
- The LLM begins generating tokens as soon as it has enough context from the transcript
- TTS starts synthesizing audio from the first sentence while the LLM is still generating the rest
- Audio playback begins while later TTS chunks are still being rendered
Every stage boundary becomes a streaming interface rather than a blocking handoff. This is what separates demo-quality voice agents from production-quality ones.
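The "streaming interface at every boundary" idea maps naturally onto chained generators. This toy sketch (string transforms standing in for real STT, LLM, and TTS models) shows how the first audio chunk can be produced before the last input chunk is ever read:

```python
def stt_stream(audio_chunks):
    """Emit partial transcripts as audio arrives (stand-in for streaming STT)."""
    for chunk in audio_chunks:
        yield f"partial:{chunk}"


def llm_stream(transcript_parts):
    """Begin emitting tokens as soon as transcript text is available."""
    for part in transcript_parts:
        yield part.upper()  # stand-in for token generation


def tts_stream(tokens):
    """Synthesize audio chunk-by-chunk as tokens arrive."""
    for token in tokens:
        yield f"audio({token})"


# Generators chain lazily: pulling one item from the end of the chain
# pulls exactly one item through every stage boundary.
pipeline = tts_stream(llm_stream(stt_stream(["hi", "there"])))
first_audio = next(pipeline)  # produced before "there" is processed
```

Each `yield` boundary here plays the role a streaming API boundary plays in a real pipeline: downstream work starts on partial results instead of waiting for completion.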
Barge-in and Interruption Handling
One of the trickiest problems in voice pipeline design is what happens when the user interrupts mid-response, a behavior known as barge-in. In a naive pipeline, the agent just keeps talking. In a production pipeline, the system needs to detect the interruption, stop TTS playback immediately, flush any queued audio, and restart the pipeline from STT.
LiveKit's framework handles this automatically. When the VAD detects speech while the agent is talking, it fires an interruption event that cancels the active TTS playback and triggers a new STT pass. From the user's perspective, the agent stops and listens, which is exactly what a natural conversation feels like.
The tricky edge cases are:
- Filler sounds like "mm-hmm" or "yeah" mid-response shouldn't always trigger a full interruption
- Accidental barge-in from background noise or the agent's own audio leaking back into the microphone (echo) can falsely trigger interruption
- Mid-tool-call interruptions: if the user barges in while the LLM is partway through a tool call, you need to decide whether to cancel the operation or complete it silently
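The first two edge cases can be folded into a simple interruption policy. This is a toy decision function, not LiveKit's implementation, and the filler list and duration threshold are illustrative:

```python
FILLERS = {"mm-hmm", "yeah", "uh-huh", "right", "okay"}


def should_interrupt(partial_transcript, speech_duration_ms,
                     min_duration_ms=250):
    """Decide whether detected user speech should cancel agent playback.

    Very short bursts are treated as noise or echo leakage, and bare
    backchannel fillers are ignored. Thresholds are illustrative.
    """
    if speech_duration_ms < min_duration_ms:
        return False  # likely background noise or echo
    if partial_transcript.strip().lower() in FILLERS:
        return False  # backchannel acknowledgment, keep talking
    return True
```

Real systems layer more signal on top (echo cancellation, semantic turn models), but the core shape is the same: not every VAD firing should stop the agent.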
For tool calls that perform irreversible actions (like a database write), call run_ctx.disallow_interruptions() at the start of the tool to prevent user speech from cancelling the operation. To wait for the agent to finish speaking before the tool continues, use await context.wait_for_playout().
Cascaded Pipeline vs. Speech-to-Speech: Which Architecture to Choose
The sequential pipeline (also called the "cascaded" architecture) isn't the only option. Speech-to-speech (S2S) models like GPT-4o Realtime and Gemini 2.5 Flash handle audio in and audio out natively, without the text intermediary.
| Consideration | Cascaded Pipeline (STT → LLM → TTS) | Speech-to-Speech (S2S) |
|---|---|---|
| Latency | 300 to 600ms with streaming | 200 to 300ms (fewer stages) |
| Control | Full visibility at every stage. Swap any component independently | Single model, less granular control |
| Debugging | Read the transcript, inspect LLM reasoning, audit each stage | Audio in, audio out. Harder to trace failures |
| Tool calling | Mature and reliable through the LLM's text interface | Improving but still less robust for complex tool use |
| Emotional context | Lost in the STT step (tone, prosody, emphasis) | Preserved. The model "hears" emotion directly |
| Compliance | Full text audit trail at every stage | Requires additional tooling for auditability |
| Provider flexibility | Mix and match best-in-class STT, LLM, and TTS providers | Locked to one model provider |
For most production deployments in 2026, the cascaded pipeline remains the default. It gives you transparency, debuggability, and the flexibility to swap any component without touching the rest. S2S is gaining traction for latency-sensitive conversational flows, but cascaded still wins on control and compliance.
Many teams are exploring hybrid approaches that use S2S for simple, fast exchanges and fall back to the cascaded pipeline for complex reasoning or tool-heavy interactions.
When to Use the Sequential Pipeline
- You're building a voice agent for the first time. This is the entry point. Every other multi-agent pattern (Supervisor, Handoff, ReAct) builds on top of it.
- You need to swap components independently. Want to try a different TTS voice? Switch STT providers? Upgrade your LLM? Each stage is a clean interface.
- Debuggability matters. When something goes wrong, you can isolate the problem to a specific stage. Was the transcription bad? Did the LLM hallucinate? Was the TTS garbled?
- You're in a regulated industry. Finance, healthcare, and legal deployments need full audit trails. The cascaded pipeline gives you text transcripts at every stage.
- You need tool calling reliability. Text-based LLM tool calling is more mature and predictable than audio-native alternatives.
When to Consider Alternatives
- Latency below 300ms is critical and you can accept less control. S2S models get you there with fewer stages.
- Emotional tone preservation matters more than debuggability. S2S models hear and reproduce prosody that gets lost in the STT step.
- The workflow isn't linear. If your agent needs to route between specialists, fan out to multiple data sources, or coordinate parallel tasks, you'll layer additional patterns (like Supervisor or Handoff) on top of the pipeline.
Building a Sequential Pipeline Voice Agent with LiveKit
LiveKit's Agents Framework implements the sequential pipeline through AgentSession. The framework handles streaming at every stage boundary out of the box, so you get production-grade latency without building the plumbing yourself.
Here's what a minimal voice agent looks like:
```python
from livekit import agents, rtc
from livekit.agents import AgentServer, AgentSession, Agent, TurnHandlingOptions, room_io
from livekit.plugins import noise_cancellation, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel


class VoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice assistant. Be concise and conversational."
        )


server = AgentServer()


@server.rtc_session(agent_name="my-agent")
async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        vad=silero.VAD.load(),        # Stage 1: Voice Activity Detection
        stt="deepgram/nova-3:multi",  # Stage 2: Speech-to-Text (LiveKit Inference)
        llm="openai/gpt-4.1-mini",    # Stage 3: Language Model (LiveKit Inference)
        tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",  # Stage 4: Text-to-Speech (LiveKit Inference)
        turn_handling=TurnHandlingOptions(
            turn_detection=MultilingualModel(),
        ),
    )

    await session.start(
        agent=VoiceAgent(),
        room=ctx.room,
        room_options=room_io.RoomOptions(
            audio_input=room_io.AudioInputOptions(
                noise_cancellation=lambda params: noise_cancellation.BVCTelephony()
                if params.participant.kind == rtc.ParticipantKind.PARTICIPANT_KIND_SIP
                else noise_cancellation.BVC(),
            ),
        ),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(server)
```
That's a working voice agent with the full sequential pipeline. Each component (VAD, STT, LLM, TTS) is a plugin you can swap with a single line change.
Swapping Providers
Want to try a different model? Change one string. All providers are available through LiveKit Inference — no separate API keys required. For all available model strings, see the LLM, STT, and TTS model pages.
```python
# Switch to a larger LLM — change one string
session = AgentSession(
    vad=silero.VAD.load(),
    stt="deepgram/nova-3:multi",
    llm="openai/gpt-4.1",  # Swapped from gpt-4.1-mini to full GPT-4.1
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
)
```
This is the modularity principle in action. The pipeline stages communicate through clean interfaces, so each component is independently replaceable. No model lock-in.
Adding Tool Calling
The pipeline becomes more powerful when the LLM stage can call external tools. LiveKit uses the @function_tool decorator to expose tools to the LLM:
```python
from livekit.agents import Agent, function_tool, RunContext


class AppointmentAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions="""You help users book appointments.
            Use check_availability to find open slots,
            then book_appointment to confirm."""
        )

    @function_tool()
    async def check_availability(self, context: RunContext, date: str):
        """Check available appointment slots for a given date.

        Args:
            date: The date to check in YYYY-MM-DD format.
        """
        slots = await calendar_api.get_slots(date)
        return {"available_slots": slots}

    @function_tool()
    async def book_appointment(self, context: RunContext, date: str, time: str):
        """Book an appointment at the specified date and time.

        Args:
            date: The date in YYYY-MM-DD format.
            time: The time in HH:MM format.
        """
        context.disallow_interruptions()  # Prevent cancellation mid-write
        result = await calendar_api.book(date, time)
        return {"confirmation": result.id}
```
The sequential pipeline stays the same. Audio flows through VAD → STT → LLM → TTS. But now the LLM stage can pause, call a tool, observe the result, and continue generating its response. The framework handles this transparently.
Optimizing Latency
LiveKit's pipeline streams at every stage boundary by default. But there are additional optimizations you can make:
Use a lightweight model for fast responses. For simple queries, a smaller model like GPT-4.1 mini or a fine-tuned compact model responds faster than a full-size model.
```python
# Faster LLM for simpler tasks
session = AgentSession(
    vad=silero.VAD.load(),
    stt="deepgram/nova-3:multi",
    llm="openai/gpt-4.1-mini",  # Faster inference
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    turn_handling=TurnHandlingOptions(
        turn_detection=MultilingualModel(),
    ),
)
```
Reduce pipeline hops with LiveKit Inference. Instead of making separate API calls to Deepgram, OpenAI, and Cartesia from your agent server, LiveKit Inference routes all three through a single optimized path built into LiveKit Cloud, cutting down on round-trip overhead between pipeline stages.
Tune VAD sensitivity. Aggressive endpointing (shorter silence threshold) starts the pipeline sooner but risks cutting off the user mid-thought. Conservative endpointing adds latency but catches complete utterances. Find the right balance for your use case.
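The endpointing tradeoff can be simulated with a toy model. Given a timeline of speech segments, this sketch (parameters and model entirely illustrative; real endpointers also use semantic signals) shows when a silence-based endpointer would fire:

```python
def turn_end_offset(speech_segments_ms, silence_threshold_ms):
    """Return the time (ms) when a silence-based endpointer fires.

    speech_segments_ms: sorted list of (start, end) speech segments.
    If any gap between segments reaches the threshold, the endpointer
    fires during that pause and clips the rest of the turn.
    """
    for (_, end1), (start2, _) in zip(speech_segments_ms,
                                      speech_segments_ms[1:]):
        if start2 - end1 >= silence_threshold_ms:
            return end1 + silence_threshold_ms  # fired mid-thought
    return speech_segments_ms[-1][1] + silence_threshold_ms


# A user speaks for 1s, pauses 400ms to think, then finishes at 2s
timeline = [(0, 1000), (1400, 2000)]
aggressive = turn_end_offset(timeline, 300)    # fires in the pause: clips
conservative = turn_end_offset(timeline, 700)  # waits: correct but slower
```

The aggressive setting responds 1.4 seconds earlier but cuts the user off; the conservative one catches the whole utterance at the cost of added latency. That is the balance to tune.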
The Pipeline as Foundation
The sequential pipeline isn't just one pattern among many. It's the foundation that every other voice agent architecture builds on.
- The Supervisor pattern coordinates multiple agents, but each agent still uses a sequential pipeline internally for voice I/O.
- The Handoff pattern transfers between agents, but the audio processing at each agent follows the same VAD → STT → LLM → TTS flow.
- The ReAct pattern describes how the LLM stage reasons and calls tools, but it operates within the pipeline.
- The Human-in-the-Loop pattern pauses the pipeline at decision points, but the voice interaction before and after the pause is still a sequential pipeline.
Understanding this pattern is prerequisite to understanding everything else in voice agent architecture.
Error Handling and Graceful Degradation
Each stage in the pipeline is an independent failure point. STT can return garbled text on a bad audio signal. The LLM can time out under load. TTS can fail mid-synthesis. In production, you need a strategy for each.
The most important principle is to fail gracefully and keep the conversation alive. A few patterns that work well:
- STT failures happen when transcription confidence is low or the result is empty. Rather than passing bad input downstream, re-prompt the user with something like "Sorry, I didn't catch that, could you say that again?"
- LLM timeouts require explicit timeout thresholds and a fallback response ready. Don't let the user sit in silence for 5 seconds wondering if the call dropped.
- TTS errors can cause dead air. If speech synthesis fails, consider falling back to a simpler TTS provider or a pre-recorded fallback clip.
- Tool call failures can be handled by using LiveKit's ToolError to return structured errors to the LLM, which can then reason about recovery and communicate the issue naturally to the user
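A generic timeout-plus-fallback wrapper covers the LLM and TTS cases above. This is an illustrative sketch using plain asyncio, not a LiveKit API; the timeout value and fallback are placeholders you would tune per stage:

```python
import asyncio


async def with_fallback(primary, fallback, timeout_s=2.0):
    """Run a pipeline stage with a timeout and a fallback path.

    If the primary provider times out or raises, the fallback runs
    instead, so the user never sits in dead air. Illustrative only.
    """
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except Exception:  # includes asyncio.TimeoutError
        return await fallback()


async def demo():
    async def flaky_tts():
        raise RuntimeError("synthesis failed mid-stream")

    async def canned_clip():
        return "Sorry, one moment please."

    return await with_fallback(flaky_tts, canned_clip)


result = asyncio.run(demo())  # "Sorry, one moment please."
```

The same wrapper shape works for any stage: a secondary STT provider, a smaller fallback LLM, or a pre-recorded TTS clip.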
The modularity of the sequential pipeline is an advantage here. Each stage can be monitored independently with Agent Observability and given its own retry logic.
Key Takeaways
- The sequential pipeline (VAD → STT → LLM → TTS) is the standard architecture for production voice agents
- Streaming at every stage boundary is what makes the difference between a 2-second response and a 400ms response
- The cascaded pipeline gives you modularity, debuggability, and provider flexibility that S2S models can't match yet
- Every advanced multi-agent pattern (Supervisor, Handoff, ReAct, HITL) builds on top of this foundation
- LiveKit's Agents Framework implements the full streaming pipeline with swappable components and no model lock-in
Getting Started
If you'd prefer to explore without code first, Agent Builder lets you configure and preview a voice pipeline in your browser. The fastest way to see the full pipeline in action with code is LiveKit's Agent Playground, where you can test different STT, LLM, and TTS combinations and hear the latency differences yourself.
When you're ready to build, start with the Agents Quickstart to get a working voice agent running in minutes. From there, you can swap components, add tools, and layer on more advanced patterns as your use case demands. When ready to go live, deploy to LiveKit Cloud with one click.
Give it a try, and let us know what you're building.