Why You Shouldn’t Build Real-Time Voice Agents Directly on Model APIs

Every major model provider now offers a real-time or streaming API. OpenAI has the Realtime API. Google has the Gemini Live API. And the pitch is appealing. Connect directly, send audio in, get audio out. Simple.

But "simple" breaks down fast once you start building for real users. The model handles the AI part. Everything else? That's on you. And "everything else" is where voice agents actually succeed or fail.

This guide walks through what you're really signing up for when you build directly on model APIs, and what changes when you use a voice agent framework instead.

The Model API Does Less Than You Think

When you connect to a model's real-time API, you get access to the model. That's it. The API handles inference. It doesn't handle the dozen other things that make a voice conversation actually work.

Here's what you're responsible for when going direct.

  • Audio transport. Most real-time model APIs use WebSockets, which run over TCP. TCP prioritizes guaranteed delivery over speed. That's great for chat messages, but for live voice, it means a single lost packet stalls every packet behind it (head-of-line blocking) instead of being dropped and gracefully concealed. WebRTC, the protocol behind LiveKit, Google Meet, Zoom, and every major calling app, uses UDP and is built specifically for real-time media. It handles packet loss, jitter, and variable network conditions without stalling.
  • Echo cancellation. When your agent speaks through a user's device speakers and the microphone picks that audio back up, you get a feedback loop. The agent hears itself, interprets its own output as user speech, and the conversation derails. WebRTC has acoustic echo cancellation (AEC) built into the browser and native SDKs. Without it, you're building your own signal processing pipeline or dealing with agents that constantly interrupt themselves.
  • Turn detection and interruption handling. Knowing when a user has finished speaking and when they're interrupting the agent mid-sentence is one of the hardest problems in voice AI. Raw model APIs give you voice activity detection (VAD) at best. But good turn-taking requires more than silence detection. It needs context-aware end-of-utterance models that understand when a pause is the user thinking versus when they're done talking. Get this wrong and your agent either cuts people off or waits awkwardly after every sentence.
  • Client SDKs. You need to capture audio from the user's microphone, play back the agent's response, handle network changes, manage permissions, and do all of this across browsers, iOS, Android, and whatever else your users are on. Model APIs don't ship client SDKs for real-time media. You're either building your own or stitching together libraries that weren't designed to work together.
  • Scaling and infrastructure. A voice session isn't a stateless API call. It's a persistent, bidirectional connection that consumes resources for the entire duration of the conversation. Load balancing these sessions is fundamentally different from load balancing HTTP requests. You need session-aware routing, warm connection pools, and infrastructure that can scale concurrent sessions without dropping audio.
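The transport point is easy to see in a toy simulation. This is a deliberately simplified sketch, not how any real stack is implemented: it models 20 ms audio frames where frame 3 is lost and retransmitted 200 ms later, then compares in-order (TCP-style) delivery against a jitter buffer that simply drops the late frame (UDP/WebRTC-style).

```python
# Sketch: why TCP-style in-order delivery hurts live audio.
# Simplified model: 20 ms frames; frame 3 is lost and arrives late at 260 ms.

FRAME_MS = 20

def tcp_playout(arrivals):
    """In-order delivery: a late frame delays every frame behind it."""
    playable, ready = [], 0.0
    for seq, t in sorted(arrivals):
        ready = max(ready, t)          # can't play seq until it arrives...
        playable.append((seq, ready))  # ...and everything after it waits too
    return playable

def udp_playout(arrivals, jitter_ms=60):
    """Drop frames that miss the jitter-buffer deadline; never stall."""
    out = []
    for seq, t in sorted(arrivals):
        deadline = seq * FRAME_MS + jitter_ms
        if t <= deadline:
            out.append((seq, max(t, seq * FRAME_MS)))
        # else: frame is dropped and loss concealment fills the 20 ms gap
    return out

# Frames 0-5 sent every 20 ms; frame 3's retransmit lands at 260 ms.
arrivals = [(0, 0), (1, 20), (2, 40), (3, 260), (4, 80), (5, 100)]

tcp = tcp_playout(arrivals)  # frames 4 and 5 stall until 260 ms
udp = udp_playout(arrivals)  # frame 3 is dropped; 4 and 5 play on time
```

In the TCP case, one retransmit pushes every later frame out to 260 ms, which the listener hears as a stall. In the UDP case, one 20 ms frame is concealed and the conversation never stops.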

What Actually Happens When Teams Go Direct

The pattern plays out the same way almost every time.

A team starts with a model's real-time API because it feels like the fastest path. The first demo works. Audio goes in, the model responds, and everyone is impressed. Then they start testing with real users.

The agent interrupts people. Background noise triggers false responses. Latency spikes when the network fluctuates. Echo cancellation doesn't exist, so the agent talks over itself on speakerphone. Turn detection falls apart in languages where conversational pacing is different from English.

One team recently compared building with LiveKit versus using a model provider's SDK directly. Within an hour they had a working demo on LiveKit. The same model, the same provider, but the experience was noticeably different. Interruption handling was significantly better. The audio quality was cleaner. The conversation felt more natural.

The reason wasn't the model. The model was identical in both setups. The difference was everything around the model. How audio was transported, how turns were detected, how interruptions were handled, and how echo was cancelled.

Another team that had built their own WebRTC stack from scratch reported 3+ second delays between user speech and agent response. Their noise cancellation struggled in multi-person environments. Turn detection processed speech prematurely when users hesitated, cutting them off mid-thought. They'd spent months maintaining home-built components that required dedicated, experienced engineers just to keep running.

The Hidden Cost of "Flexibility"

The appeal of going direct is control. You own every layer. You can tune everything.

But that control comes with maintenance. When a model provider updates their API, you update your integration. When you want to test a different STT provider, you're rewriting your audio pipeline. When you need to add a second model for a different language, you're managing another set of API keys, billing relationships, and concurrency limits.

One company evaluating their infrastructure found that their current implementation was tightly coupled to a single model SDK. Every time they wanted to test a new model, especially from a different provider, there was significant developer overhead and a real risk of introducing bugs. Swapping models should be a configuration change, not a rewrite.
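What "a configuration change, not a rewrite" means in code: put a thin factory between your app and the vendor SDKs. This is an illustrative sketch with stub factories (the names and return shapes are hypothetical, not real SDK calls); frameworks like LiveKit Agents do the same thing via plugins.

```python
# Sketch: decoupling application code from provider SDKs.
# The factories below are illustrative stubs, not real vendor API calls.

from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class LLMConfig:
    provider: str
    model: str

# Each factory would wrap one vendor SDK behind the same call signature.
def make_openai(cfg: LLMConfig) -> Callable[[str], str]:
    return lambda prompt: f"[openai:{cfg.model}] {prompt}"

def make_google(cfg: LLMConfig) -> Callable[[str], str]:
    return lambda prompt: f"[google:{cfg.model}] {prompt}"

REGISTRY: Dict[str, Callable[[LLMConfig], Callable[[str], str]]] = {
    "openai": make_openai,
    "google": make_google,
}

def build_llm(cfg: LLMConfig) -> Callable[[str], str]:
    return REGISTRY[cfg.provider](cfg)

# Swapping providers is now a one-line config edit, not a rewrite:
llm = build_llm(LLMConfig(provider="google", model="gemini-2.0-flash"))
reply = llm("hello")
```

Application code only ever calls `llm(prompt)`; which vendor answers is decided by data, so A/B-testing a new model touches configuration, not the pipeline.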

The voice AI model landscape shifts every few months. New models, new providers, new pricing. If your architecture ties you to one provider's SDK, you're rebuilding every time you want to move. And you will want to move.

What a Voice Agent Framework Actually Gives You

A voice agent framework sits between your application logic and the model APIs. It handles the hard parts of real-time voice so you can focus on what your agent actually does.

Here's what that looks like in practice.

  • WebRTC transport out of the box. Your audio travels over UDP with adaptive bitrate, congestion control, and jitter buffering. Network conditions change constantly on mobile devices and real-world connections. WebRTC was built for this. WebSockets were not.
  • Built-in echo cancellation. The framework handles AEC at the SDK level, across browsers and native platforms. Your agent won't hear itself. Your users won't hear echoes. This isn't a nice-to-have. It's the difference between a usable product and a broken one.
  • Production-grade turn detection. Multilingual end-of-utterance models that understand conversational patterns, not just silence thresholds. Users can pause to think without being cut off. They can interrupt without waiting for the agent to finish its entire response.
  • Client SDKs for every platform. Drop-in SDKs for web, iOS, Android, Flutter, React Native, Unity, and more. Audio capture, playback, permissions, reconnection logic, and network handling are all built in and tested across devices.
  • Model-agnostic architecture. Swap your STT, LLM, or TTS provider with a configuration change. Test Gemini against GPT-4o against Claude without touching your application code. Run different models for different languages or use cases. Your infrastructure stays the same.
  • Session management at scale. Persistent connections are routed and load-balanced correctly. Warm agent pools reduce cold-start latency. Autoscaling handles traffic spikes. You don't need to build session-aware infrastructure from scratch.

When Going Direct Makes Sense

There are cases where connecting directly to a model API is the right call.

If you're building a quick prototype to test whether a model's voice capabilities meet your needs, going direct gets you there fast. If your use case is a single-user, single-session demo that won't see production traffic, the infrastructure overhead doesn't apply.

But the moment you need real users on real devices over real networks, the gap between a model API and a production voice experience becomes clear. That gap is everything we've covered. Transport, echo cancellation, turn detection, client SDKs, scaling, and model flexibility.

The Bottom Line

Model APIs give you access to the model. A voice agent framework, like LiveKit Agents, gives you everything you need to put that model in front of real users and have it work.

The model is the brain. But a brain without a nervous system has no ears, no mouth, and no ability to coordinate turns in real time. It isn't a very good conversational partner. Building all of that yourself is possible. Teams do it. But they also spend months on infrastructure that has nothing to do with what makes their agent unique.

The question isn't whether you can build it yourself. It's whether you should.

If you want to see the difference firsthand, spin up a voice agent with LiveKit's Agents quickstart and compare it to your current setup. The model is the same. Everything else is what changes.