Pipeline vs. Realtime - Which is the better Voice Agent Architecture?

When building an AI voice agent today, you face a fundamental choice early on:

  • You can use a realtime model, sometimes referred to as a speech-to-speech (S2S) or audio-to-audio model, where a single multimodal model takes in audio, reasons, and speaks back, all in one step.
  • Or you can build an STT–LLM–TTS pipeline (also called a cascade architecture), the current industry standard, chaining together three specialist models: one to transcribe speech to text, one to reason and generate a response, and one to speak it back.

Both approaches are in active production use, with well-supported tooling and passionate advocates behind them, but neither is universally the right answer. The choice depends on what you're building, who your users are, and where your agent will run.

The Two Architectures at a Glance

Realtime / Speech-to-Speech (S2S)

In a realtime model, a single multimodal model handles the entire conversational turn: it ingests raw audio, reasons over it, and streams audio back, all within one model call. Because audio goes in and comes out without being converted to text in between, these models can pick up on things that don't survive transcription, like tone, pacing, hesitation, and emotional coloring.

STT–LLM–TTS Pipeline

In a pipeline architecture, three specialist models run in sequence:

  1. STT (Speech-to-Text / ASR): Transcribes the user's audio to text.
  2. LLM (Large Language Model): Processes the transcript, reasons, accesses tools, and generates a text response.
  3. TTS (Text-to-Speech): Converts the response text into natural-sounding audio.
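
The three stages above can be sketched as a simple sequential loop. The `stt`, `llm`, and `tts` functions here are hypothetical stand-ins for real provider calls, just to make the data flow concrete:

```python
# Minimal sketch of one pipeline turn. The three stage functions are
# hypothetical stand-ins for real STT/LLM/TTS provider calls.

def stt(audio: bytes) -> str:
    # A real implementation would call an ASR provider here.
    return audio.decode("utf-8")  # pretend the audio "is" its transcript

def llm(transcript: str) -> str:
    # A real implementation would call a language model here.
    return f"You said: {transcript}"

def tts(text: str) -> bytes:
    # A real implementation would call a speech-synthesis provider here.
    return text.encode("utf-8")

def pipeline_turn(audio_in: bytes) -> bytes:
    """Run one conversational turn through the STT -> LLM -> TTS chain."""
    transcript = stt(audio_in)
    response_text = llm(transcript)
    return tts(response_text)
```

Text sits between every stage, which is exactly what makes this architecture easy to log, inspect, and reconfigure.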

Latency

This is the trade-off that gets the most attention, and for good reason. Realtime models have a genuine structural edge here.

Realtime models avoid serializing audio into text and back, and there's no hand-off between separate, specialized models.

Pipelines have a harder time. Even if each individual component is fast, latency compounds across the stack: STT + LLM time-to-first-token + TTS time-to-first-audio + network overhead. Most unoptimized pipeline deployments will be slower than a realtime model, and users will notice the difference.

Modern pipeline implementations don't wait for each stage to finish before starting the next. Instead, they stream partial STT transcripts to the LLM while the user is still speaking and feed LLM tokens into the TTS engine as they arrive; this streaming overlap between stages is what makes competitive latency possible. Realtime wins this one structurally, but a pipeline can get close. It just takes more engineering to get there.
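
The effect of streaming overlap can be captured in a toy latency model: sequentially, the full duration of every stage lands on the critical path, while with overlap only each stage's time-to-first-output does. The millisecond figures below are illustrative placeholders, not benchmarks:

```python
# Toy latency model: in a strictly sequential pipeline, time-to-first-audio
# is the sum of full stage latencies; with streaming, each stage starts as
# soon as the previous one emits its first partial result.

def sequential_ttfa(stt_total, llm_total, tts_total):
    """Each stage waits for the previous one to finish completely."""
    return stt_total + llm_total + tts_total

def streaming_ttfa(stt_first_partial, llm_first_token, tts_first_audio):
    """Each stage begins on the previous stage's first partial output, so
    only each stage's time-to-first-output sits on the critical path."""
    return stt_first_partial + llm_first_token + tts_first_audio

seq = sequential_ttfa(stt_total=800, llm_total=1200, tts_total=600)  # ms
stream = streaming_ttfa(stt_first_partial=200, llm_first_token=300,
                        tts_first_audio=150)                         # ms

assert stream < seq  # streaming overlap shortens the critical path
```

With these placeholder numbers, overlap cuts time-to-first-audio from 2600 ms to 650 ms, which is why streaming between stages is table stakes for any production pipeline.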

Tool calling

Most production voice agents need to do more than just talk. They look up accounts, check order status, book appointments, or trigger workflows. How each architecture handles function calling has a real impact on the user experience.

In a pipeline, tool calling happens at the LLM layer using standard text-based function calling, the same mature mechanism you'd use in a chat application. You get structured error handling, parallel tool calls, and complete control over what happens while a tool executes.
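
The text-based mechanism looks roughly like this sketch: the LLM emits a structured tool call, and your code parses and dispatches it. The tool name, arguments, and the hand-written "LLM output" string are all illustrative:

```python
import json

# Sketch of text-based function calling at the LLM layer of a pipeline.
# The LLM emits a structured call; application code parses and dispatches
# it. The tool and the sample output string are illustrative.

TOOLS = {
    "check_order_status": lambda order_id: {"order_id": order_id,
                                            "status": "shipped"},
}

def dispatch(tool_call_json: str):
    """Parse a structured tool call emitted by the LLM and invoke it."""
    call = json.loads(tool_call_json)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# What an LLM's structured tool-call output might look like:
result = dispatch(
    '{"name": "check_order_status", "arguments": {"order_id": "A123"}}'
)
```

Because the call is plain structured text, you can validate arguments, retry on parse errors, or run calls in parallel before anything is spoken back to the user.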

Realtime models also support tool calling, but the experience varies by provider and model version. Some models block and wait in silence for the tool result, while newer versions support non-blocking calls where the model can keep speaking.

Beyond the mechanics, there's a reliability difference: pipeline LLMs make tool calling decisions from clean text using a function calling interface refined over years, while realtime models make those decisions from audio input, a newer path where the model is simultaneously understanding speech, reasoning, and generating audio. In practice, pipelines tend to be more accurate at deciding when to call a tool and invoking it correctly.

Turn detection

Deciding when a user has finished speaking and the agent should respond is one of the trickiest problems in voice AI, and how you approach it varies significantly depending on your architecture.

Realtime models rely on their own built-in mechanisms for end-of-turn detection. These can work well but offer limited customization, and you're largely dependent on what the model provider exposes.

Pipelines give you more flexibility. You can choose your own turn detection model, tune its sensitivity, and combine it with Voice Activity Detection (VAD) to reliably determine when the user has finished speaking (LiveKit's turn detector is one such model, which also supports adaptive interruption handling). Accurate turn detection is key to making the agent interaction feel natural, and having full control over it is one of the pipeline's strongest practical advantages.
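
The VAD half of that equation can be sketched as a simple silence-debounce rule: the turn ends once speech has stopped for longer than a configurable threshold. Learned turn detectors add semantic cues on top of this; the frame values here are illustrative:

```python
# Sketch of VAD-based end-of-turn detection: the user's turn ends once
# speech stops for longer than a configurable silence threshold. Real
# systems combine this with semantic cues from a turn-detection model.

def end_of_turn_index(vad_frames, silence_threshold=5):
    """vad_frames: per-frame booleans (True = speech detected).
    Returns the frame index at which the turn is considered over,
    or None if the user may still be speaking."""
    silence_run = 0
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            silence_run = 0  # any speech resets the silence counter
        else:
            silence_run += 1
            if silence_run >= silence_threshold:
                return i
    return None
```

Tuning `silence_threshold` is exactly the kind of knob a pipeline exposes and a realtime model typically doesn't: too low and the agent interrupts mid-thought, too high and it feels sluggish.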

Voice quality and conversational feel

This is where realtime models have their most interesting advantage, and it's one that's harder to quantify.

When audio is transcribed to text, a lot of information disappears. The model sees the words, but not how they were said: no tone of voice, no emotional coloring, no sense of whether someone is hesitating or rushing. A realtime model hears all of that and can respond in kind, whereas a pipeline model is working from a transcript.

Modern TTS engines are becoming increasingly capable, producing speech with natural prosody, breathing, laughter, and emotional inflection. Pipelines can sound great, but they're working with less information about how the user spoke, whereas realtime models have the capacity to deliver a more emotionally aware experience.

Control, modularity, and debugging

This is where pipelines have the clearest advantage, and where realtime models show their biggest practical limitations in production.

A pipeline is transparent by design. Text sits between every stage, which means you can log exactly what was transcribed, what the LLM produced, and what was synthesized. When something goes wrong (a misheard word, an off-target response) you can diagnose the exact issue with mature tooling.

Pipelines are also easy to reconfigure. You can change your STT provider without touching your LLM prompt, or you can swap your TTS voice without rebuilding anything else. LiveKit's Inference was built precisely for this, offering a unified API across providers so that swapping a voice, LLM, or STT engine is a single configuration change.
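
The modularity idea can be illustrated with a common interface per stage, so swapping a provider is a configuration change rather than a code change. This is a sketch of the pattern only, not the actual LiveKit Inference API; the provider classes are hypothetical:

```python
# Illustration of pipeline modularity (not the actual LiveKit Inference
# API): each stage is addressed through a common interface, so swapping
# a provider is a configuration change. Both STT classes are hypothetical.

class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class UpperSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8").upper()

STT_PROVIDERS = {"echo": EchoSTT, "upper": UpperSTT}

def build_stt(config: dict):
    """Select the STT engine from configuration alone."""
    return STT_PROVIDERS[config["stt"]]()

# Swapping STT engines touches only the config dict:
stt_a = build_stt({"stt": "echo"})
stt_b = build_stt({"stt": "upper"})
```

The LLM and TTS stages follow the same pattern, which is why a pipeline lets you change one layer without rebuilding the others.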

Note: Although outside the scope of this article, LiveKit Inference also supports some realtime models, making it possible to easily switch between architectures.

Cost

Realtime APIs are typically priced per second of audio in and audio out, which means costs scale directly with conversation length. This can make spending hard to predict, especially as system prompts and conversation history grow.

Pipelines let you optimize at every layer independently, controlling the trade-off between cost and capability even within a single solution. For example, you can use a lighter, cheaper LLM for simple queries and route complex ones to a more capable model, or you can pick a cost-efficient STT provider for high-volume transcription and a premium TTS engine only where voice quality matters.
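
The LLM-routing idea from the example above might look like the following sketch; the complexity heuristic, model names, and per-token prices are all illustrative placeholders:

```python
# Sketch of per-layer cost optimization: route simple queries to a cheap
# model and complex ones to a premium one. Heuristic, model names, and
# prices are illustrative placeholders.

PRICES_PER_1K_TOKENS = {"light": 0.0002, "premium": 0.005}

def route(query: str) -> str:
    """Naive complexity heuristic: long or multi-clause questions
    go to the premium model, everything else stays on the cheap one."""
    if len(query.split()) > 20 or ("?" in query and "," in query):
        return "premium"
    return "light"
```

Even a crude router like this keeps the expensive model off the critical path for the bulk of traffic, which is a lever realtime pricing simply doesn't offer.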

Telephony

This is a factor that often gets overlooked, and it catches teams off guard.

Traditional phone networks carry audio at 8kHz using codecs like G.711, but realtime models are trained on high-quality web audio, typically 16–48kHz over WebRTC. For phone-based deployments such as call centers, IVR replacement, or outbound dialing, pipelines with telephony-optimized STT are usually the more reliable choice.
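
The bandwidth gap is easy to see in sample counts: one second of narrowband telephony audio carries half (or less) of the samples a realtime model was trained on. Naive decimation stands in here for a real resampler, which would low-pass filter first to avoid aliasing:

```python
# Toy illustration of the telephony bandwidth gap: 8 kHz narrowband audio
# has half the samples of 16 kHz web audio. Decimation stands in for a
# real resampler (which would low-pass filter first to avoid aliasing).

def downsample_16k_to_8k(samples_16k):
    """Keep every other sample: 16 kHz -> 8 kHz."""
    return samples_16k[::2]

one_second_16k = list(range(16000))  # one second of 16 kHz samples
one_second_8k = downsample_16k_to_8k(one_second_16k)
```

A model that has only ever seen wideband input can degrade noticeably on this narrower signal, which is why telephony-optimized STT models, trained on 8 kHz G.711-style audio, tend to hold up better on phone calls.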

Compliance and regulated industries

For healthcare, financial services, legal, and government use cases, solutions are constrained by compliance, security, and regulatory requirements.

Pipelines offer granular control, allowing you to pick components that run in specific geographic regions. You can intercept and redact PII at the text layer before it reaches your LLM or gets logged. Major pipeline component providers have HIPAA, GDPR, SOC 2, and ISO 27001 certifications available, and you can audit exactly what was said, transcribed, and generated at every step.
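
Text-layer redaction is possible precisely because a transcript exists between stages. A minimal sketch, with deliberately simplified patterns that are nowhere near a production-grade PII detector:

```python
import re

# Sketch of text-layer PII redaction in a pipeline: because a transcript
# exists between stages, patterns can be scrubbed before the text reaches
# the LLM or any log. These patterns are simplified examples only.

PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),    # US SSN shape
    (re.compile(r"\b\d{16}\b"), "[CARD]"),              # bare 16-digit card
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def redact(transcript: str) -> str:
    """Replace matched PII spans before the text leaves the STT stage."""
    for pattern, token in PII_PATTERNS:
        transcript = pattern.sub(token, transcript)
    return transcript
```

With a realtime model there is no equivalent interception point: the audio has already reached the provider before you ever see a transcript.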

Realtime models are largely hosted by a handful of large providers in centralized US-based infrastructure. Audio goes in and audio comes back out, which makes it harder to enforce content filtering, PII redaction, or detailed audit logging.

The best of both worlds

You don't have to choose one architecture entirely. LiveKit's Agents framework supports hybrid configurations that let you mix a realtime model with standalone pipeline components, giving you the strengths of both approaches where they matter most.

Realtime + separate STT. If your application needs a reliable, realtime transcript, you can run a dedicated STT model alongside the realtime model. The realtime model handles reasoning and audio generation, while the STT provides accurate, timely transcriptions. This is especially useful in regulated industries or any scenario where a complete, timestamped transcript is a hard requirement.

Realtime + separate TTS. Sometimes called the "half-cascade" approach, this configuration uses the realtime model for audio input, preserving its ability to hear emotion, hesitation, and tone, but outputs text instead of audio. That text is then routed through a dedicated TTS provider, giving you full control over voice output including brand voices, voice cloning, or scripted speech.
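
The half-cascade data flow can be sketched in a few lines; both stage functions are hypothetical stand-ins for a real S2S model in text-output mode and a real TTS provider:

```python
# Sketch of the "half-cascade" hybrid: a realtime model consumes audio
# directly (keeping prosodic cues) but emits text, which a dedicated TTS
# stage then voices. Both stage functions are hypothetical stand-ins.

def realtime_model_text_out(audio: bytes) -> str:
    # A real S2S model would reason over the raw audio here, hearing
    # tone and hesitation, and stream back text instead of audio.
    return f"(heard {len(audio)} bytes of audio) How can I help?"

def dedicated_tts(text: str) -> bytes:
    # A real TTS provider call would go here (brand voice, cloning, etc.).
    return text.encode("utf-8")

def half_cascade_turn(audio_in: bytes) -> bytes:
    return dedicated_tts(realtime_model_text_out(audio_in))
```

The key property is that the text hand-off between the two stages restores a pipeline-style interception point for logging, redaction, or voice control, while audio understanding stays end-to-end.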

Summary table

| | STT–LLM–TTS Pipeline | Realtime (S2S) |
| --- | --- | --- |
| Getting started | ⚠️ More components to orchestrate | ✅ Simpler initial integration |
| Latency | ⚠️ Natural conversation level with tuning | ✅ Natural conversation level |
| Turn detection | ✅ Full context-aware support | ⚠️ Built-in only; limited customization |
| Voice naturalness | ⚠️ Can be excellent, but requires additional configuration | ✅ Prosodic awareness |
| Modularity / Debugging | ✅ Fully modular and inspectable | ⚠️ Opaque; limited LLM and voice choice |
| Tool calling | ✅ Mature text-based function calling | ⚠️ Supported, but varies by provider; can be less reliable |
| Customization | ✅ Highly configurable | ⚠️ Limited to what the model supports |
| Cost | ✅ Optimize each layer independently | ⚠️ Difficult to optimize |
| Compliance | ✅ Full control over data flow | ⚠️ Centralized; data residency varies by provider |

So, which is better?

Neither architecture is universally the right answer. Realtime models excel at naturalness and emotional awareness, pipelines offer control, modularity, and compliance tooling, and hybrid approaches let you mix the two. The right choice depends on what you're building, who your users are, and where your agent runs.

What matters most is understanding the trade-offs. An architecture that's perfect for an empathetic consumer chatbot over WebRTC is a poor fit for a compliance-heavy telephony deployment, and vice versa. The best voice agents in production today aren't built on the "best" architecture; they're built on the one whose strengths align with the problem at hand.

Want to discuss further?

Join the conversation at community.livekit.io to share what you're working on, ask questions, and learn from teams shipping voice agents in production.