Why WebRTC beats WebSockets for realtime voice AI

When developers start building voice AI agents, the first architectural decision is transport: how does audio get between the user and the agent? Many reach for WebSockets because they're familiar, well-documented, and already part of most web stacks. It seems like a reasonable choice — open a socket, stream audio bytes in both directions, done.

It works in a demo. It falls apart in production.

The gap between "audio is flowing" and "this feels like a real conversation" is enormous, and it's almost entirely a transport problem. WebSockets weren't designed for realtime media. WebRTC was. That distinction matters far more than most developers expect when they start building.

What WebSockets actually give you

WebSockets provide a persistent, full-duplex TCP connection between a client and server. They're great for chat, notifications, and streaming structured data. For those use cases, they're the right tool.

But when you push raw audio over a WebSocket, you inherit every property of TCP — including the ones that actively work against realtime conversation.

TCP guarantees ordered, reliable delivery. Every packet arrives, and it arrives in sequence. If a packet is lost in transit, TCP pauses the stream and retransmits it before delivering anything that came after. This is called head-of-line blocking, and for audio, it's devastating.

Consider what happens when a single packet is lost during a conversation. With TCP, the receiver stalls — possibly for hundreds of milliseconds — waiting for the retransmission. The audio that arrived perfectly fine after the lost packet sits in a buffer, unplayed, until the gap is filled. The user hears silence, then a burst of buffered audio. The conversational rhythm breaks.
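The stall described above falls directly out of TCP's in-order delivery rule. Here's a minimal sketch that applies that rule to arrival times — the numbers (20 ms frames, one lost packet, a 200 ms retransmission delay) are illustrative, not measurements of any particular network:

```python
FRAME_MS = 20          # one audio frame per packet
RETX_DELAY_MS = 200    # illustrative time to detect the loss and retransmit
LOST_SEQ = 5           # sequence number of the dropped packet

def arrival_time(seq: int) -> int:
    """When each packet actually reaches the receiver."""
    sent = seq * FRAME_MS
    return sent + RETX_DELAY_MS if seq == LOST_SEQ else sent

def tcp_deliver_time(seq: int) -> int:
    """TCP delivers in order: a packet waits for every earlier one,
    including the retransmitted packet 5 (head-of-line blocking)."""
    return max(arrival_time(s) for s in range(seq + 1))

def udp_deliver_time(seq: int) -> int:
    """UDP/RTP delivers on arrival; a late packet is simply skipped."""
    return arrival_time(seq)
```

Packet 6 arrives at 120 ms but TCP can't hand it to the application until 300 ms, when the retransmission of packet 5 finally lands — a 180 ms burst of silence for audio that was sitting in the buffer the whole time. Over UDP, the listener hears one 20 ms gap and the stream keeps moving.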

In a text chat, a 200ms delay is invisible. In a voice conversation, it's the difference between a natural exchange and an awkward one.

WebSockets have no concept of media timing. Audio frames need to arrive at precise intervals for smooth playback. WebSockets deliver bytes — there's no jitter buffer, no playout timing, no mechanism to handle frames that arrive too early or too late. You have to build all of that yourself, and building it well is a multi-year engineering effort.

There's no built-in congestion control for media. TCP's congestion control algorithm is designed for bulk data transfer: it fills the pipe, detects loss, and backs off. This sawtooth pattern is fine for downloading files but terrible for realtime audio, where you need a steady, predictable bitrate. When the network degrades, TCP's response is to buffer more data and retry harder — exactly the wrong strategy for a live conversation where a dropped frame is better than a late one.

TCP windowing works against you. TCP uses a sliding window to control how much unacknowledged data can be in flight. When packets are lost, the window shrinks, throttling throughput right when you need consistent delivery. After the loss clears, the window doesn't snap back — it grows conservatively through slow start and congestion avoidance, taking multiple round trips to recover. On high-latency paths (like cross-region connections), this ramp-up is especially painful because each round trip takes longer. The result is bursts of underdelivery followed by slow recovery — exactly the kind of inconsistent throughput that turns a smooth voice conversation into a stuttering one.
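The slow recovery is easy to quantify with a simplified Reno-style model — halve the window on loss, then grow by one segment per round trip. This ignores fast recovery and slow start details, but it captures the multi-RTT ramp described above:

```python
def rtts_to_recover(cwnd_before_loss: int) -> int:
    """Round trips for the congestion window to climb back to its
    pre-loss size (simplified Reno: halve on loss, +1 segment/RTT)."""
    ssthresh = cwnd_before_loss // 2
    cwnd, rtts = ssthresh, 0
    while cwnd < cwnd_before_loss:
        cwnd += 1          # congestion avoidance: linear growth
        rtts += 1
    return rtts
```

A 40-segment window needs 20 round trips to recover from a single loss. On a 200 ms cross-region path, that's four seconds of reduced throughput from one dropped packet.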

What WebRTC was built to do

WebRTC was purpose-built for the problem of moving media between people in realtime. It addresses every shortcoming above with design decisions that specifically optimize for conversation.

UDP-based transport with loss tolerance. WebRTC sends media over UDP using RTP (Real-time Transport Protocol). When a packet is lost, the stream keeps flowing. A missing 20ms audio frame is nearly imperceptible to a listener; a 200ms stall while TCP retransmits is not. WebRTC trades perfect reliability for consistent timing, which is exactly the right trade-off for voice.

Built-in jitter buffers. Network jitter — variation in packet arrival times — is unavoidable on the internet. WebRTC clients include adaptive jitter buffers that absorb this variation, smoothing out playback so the listener hears a continuous stream even when packets arrive unevenly. With WebSockets, you're on your own.
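To make the idea concrete, here's a toy jitter buffer that picks a playout delay from observed arrival statistics. Real WebRTC implementations (like NetEq in libwebrtc) are adaptive and far more sophisticated — this is only the core concept:

```python
import statistics

class JitterBuffer:
    """Toy jitter buffer: hold packets long enough that even a
    slow-but-typical packet still plays on time."""

    def __init__(self) -> None:
        self.delays: list[int] = []   # observed network delays, ms

    def observe(self, send_ms: int, recv_ms: int) -> None:
        """Record the one-way delay of each received packet."""
        self.delays.append(recv_ms - send_ms)

    def target_delay_ms(self) -> float:
        """Playout delay = mean delay + a margin scaled by jitter."""
        mean = statistics.mean(self.delays)
        jitter = statistics.pstdev(self.delays)
        return mean + 2 * jitter

    def playout_time(self, send_ms: int) -> float:
        """Schedule playback at a fixed offset from the send time,
        restoring even spacing despite uneven arrivals."""
        return send_ms + self.target_delay_ms()
```

The key trade-off is visible in `target_delay_ms`: more jitter means a deeper buffer and therefore more latency. WebRTC's buffers shrink this delay when the network is stable and grow it when arrivals get erratic.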

Media-aware congestion control. WebRTC implements congestion control algorithms (like Google Congestion Control, GCC) that are specifically designed for realtime media. Instead of TCP's aggressive fill-and-backoff pattern, GCC measures one-way delay variation to detect congestion before packet loss occurs. When bandwidth drops, WebRTC can reduce bitrate smoothly — scaling down audio quality or switching to a lower video resolution — rather than stalling the stream.
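The core signal GCC watches can be sketched in a few lines: if packets are arriving farther apart than they were sent, queues along the path are filling. This is heavily simplified — real GCC uses a trendline filter and adaptive thresholds — but the detection principle is the same:

```python
def delay_gradient(send_ms: list[float], recv_ms: list[float]) -> float:
    """Average change in one-way delay between consecutive packets.
    Positive: queues are filling (congestion building before any loss);
    near zero: the path is keeping up."""
    deltas = [
        (recv_ms[i] - recv_ms[i - 1]) - (send_ms[i] - send_ms[i - 1])
        for i in range(1, len(send_ms))
    ]
    return sum(deltas) / len(deltas)

def classify(send_ms: list[float], recv_ms: list[float],
             threshold_ms: float = 1.0) -> str:
    """Map the gradient to a bandwidth decision (simplified)."""
    g = delay_gradient(send_ms, recv_ms)
    if g > threshold_ms:
        return "overuse"    # reduce bitrate before packets are lost
    if g < -threshold_ms:
        return "underuse"   # queues draining; bitrate can grow
    return "normal"
```

Note what this enables: the sender can back off while every packet is still arriving. TCP, by contrast, only learns about congestion after packets have already been dropped.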

Codec negotiation and adaptation. WebRTC handles codec selection, sample rates, and channel configuration as part of the connection setup. Both sides agree on the most efficient encoding. When network conditions change, the codec parameters can adapt. With WebSockets, you're streaming raw or pre-encoded bytes with no negotiation layer.

Noise cancellation and echo suppression. WebRTC clients include acoustic echo cancellation (AEC), automatic gain control (AGC), and noise suppression built into the media pipeline. These run before audio enters the network, which means the agent receives clean audio regardless of the user's environment. With WebSockets, you either skip these entirely or implement them separately.

NAT traversal. Most users are behind NATs and firewalls. WebRTC includes ICE (Interactive Connectivity Establishment), STUN, and TURN to reliably establish connections through these obstacles. WebSockets sidestep NAT issues by running over standard HTTPS ports — but WebRTC closes that gap too: when UDP is blocked, it can fall back to TURN over TCP on port 443 while keeping all of its media-optimized behavior on top.

The compounding effect

Any one of these differences might seem manageable in isolation. You could build a jitter buffer. You could implement your own congestion detection. You could add echo cancellation as a preprocessing step.

But these systems interact. The jitter buffer feeds into playout timing. Congestion control affects codec bitrate decisions. Echo cancellation needs to track what audio was recently played to the speaker. In WebRTC, these components are co-designed to work together across every platform. In a WebSocket-based stack, you're integrating them piecemeal and debugging their interactions across browsers, mobile platforms, and network conditions.

This is years of engineering — and it's already solved.

Why an SFU matters for voice AI

WebRTC defines how media gets between endpoints. But voice AI agents aren't simple two-party calls. An SFU (Selective Forwarding Unit) sits at the center, routing media between participants without decoding or re-encoding it.

Think of it like travel. Peer-to-peer WebRTC is driving local roads — it's fine when you're going a short distance with one or two people. But when you need to connect participants across cities, or handle dozens of concurrent sessions, local roads don't scale. An SFU is the airport. Everyone connects to a central hub that efficiently routes them where they need to go. You don't drive from New York to London — you fly. And when you need global coverage, you don't build one massive airport. You build hubs in each region and connect them with fast, reliable links. That's exactly how a distributed SFU works.

For AI voice agents, the SFU architecture provides several critical advantages.

The agent connects once. Instead of establishing a direct peer-to-peer WebRTC connection with each user, the agent connects to the SFU. The SFU handles the fan-out. This means agent infrastructure doesn't scale with the number of concurrent connections per room — the SFU absorbs that complexity.

Heterogeneous network handling. Users connect from different networks — some on fiber, some on cellular, some on congested Wi-Fi. An SFU can send different quality levels to different subscribers using simulcast, adapting to each user's bandwidth without affecting others. With a direct connection or a WebSocket relay, you're stuck sending the same stream to everyone.

Selective forwarding instead of transcoding. An MCU (Multipoint Conferencing Unit) decodes all incoming streams, mixes them, and re-encodes a single output. This is CPU-intensive and adds latency. An SFU just forwards packets — no decode, no re-encode. For voice AI, where every millisecond of latency affects the feel of the conversation, this matters.

Observability at the routing layer. Because all media flows through the SFU, you get connection quality metrics, packet loss rates, jitter statistics, and latency measurements for every participant without any client-side instrumentation. This telemetry is invaluable for debugging agent behavior in production.

Multi-region SFUs and global voice AI

Voice AI agents serve users globally. A user in Singapore talking to an agent whose infrastructure is in US-East will experience 250+ ms of round-trip network latency before the agent even starts processing. For a voice conversation, that's unacceptable — it makes every exchange feel laggy, regardless of how fast the STT-LLM-TTS pipeline runs.

A distributed SFU architecture solves this by deploying nodes across regions. The user connects to the nearest SFU node, getting the lowest possible network latency for their media connection. The SFU handles routing between regions internally, with optimized server-to-server links that are far more reliable and lower-latency than consumer internet paths.

This is difficult to replicate with WebSockets. You'd need to build your own geo-aware routing, deploy relay servers in each region, manage session affinity, and handle failover — essentially re-creating the SFU's routing layer but without the media-optimized transport underneath.

With LiveKit, multi-region deployment is a configuration option, not an architecture project. Nodes report their stats, a region-aware selector routes new sessions to the closest available node, and connections drain gracefully during scale-down. The same architecture that handles two users in the same city handles a globally distributed voice AI deployment.

Video inference: the same argument, amplified

Everything above applies to audio. For video — used in multimodal agents that can see the user's camera or share visual content — the case is even stronger.

Video is one to two orders of magnitude more bandwidth-intensive than audio. A 720p video stream at 1.5 Mbps is roughly 50x the bandwidth of a typical 32 kbps Opus audio stream. The consequences of head-of-line blocking, poor congestion control, and missing jitter buffers are amplified proportionally.

WebRTC's simulcast support becomes essential here. A publisher sends multiple resolution layers (for example, 720p, 360p, and 180p), and the SFU selects the appropriate layer for each subscriber based on their available bandwidth and the size they're rendering the video at. This adaptive stream behavior is automatic — neither the publisher nor the subscriber needs to manage it.
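The forwarding decision reduces to: pick the highest layer that fits both the subscriber's bandwidth and their render size. The layer table and selection rule below are illustrative — a production SFU also weighs frame rate, temporal layers, and recent loss — but the shape of the logic is this:

```python
# Hypothetical simulcast layers, ordered highest quality first.
LAYERS = [
    {"name": "720p", "bitrate_kbps": 1500, "height": 720},
    {"name": "360p", "bitrate_kbps": 500,  "height": 360},
    {"name": "180p", "bitrate_kbps": 150,  "height": 180},
]

def select_layer(available_kbps: int, render_height_px: int) -> str:
    """Highest layer that fits the subscriber's estimated bandwidth
    and isn't larger than what they're actually rendering."""
    for layer in LAYERS:
        if (layer["bitrate_kbps"] <= available_kbps
                and layer["height"] <= render_height_px):
            return layer["name"]
    return LAYERS[-1]["name"]   # floor: always forward the lowest layer
```

Each subscriber gets this decision independently — a fiber user receives 720p while a congested mobile user in the same room receives 180p, with no transcoding and no effect on anyone else's stream.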

For vision-capable AI agents that need to process video frames, the SFU can forward the appropriate quality level to the agent's processing pipeline without affecting the stream quality delivered to human participants. You get efficient resource usage on the agent side and high-quality delivery to users, simultaneously.

When WebSockets make sense

WebSockets aren't wrong for everything. They're the right choice for:

  • Signaling. WebRTC itself uses WebSockets (or HTTP) for the signaling layer — exchanging session descriptions and ICE candidates before the media connection is established. LiveKit uses WebSockets for this purpose.
  • Text-based AI interactions. Chat, streaming LLM responses, structured data exchange — all well-suited to WebSockets.
  • Non-realtime audio. If you're uploading a recording for batch transcription, WebSockets or plain HTTP are fine. There's no realtime constraint.

The distinction is simple: if the audio needs to feel like a live conversation, use WebRTC. If it's data transfer that happens to contain audio bytes, WebSockets work.

The practical takeaway

Building a production voice AI agent on WebSockets means re-solving problems that WebRTC solved years ago — and solving them across every browser, mobile platform, and network condition your users will encounter. It's not impossible, but it's a massive engineering investment that diverts effort from the thing that actually differentiates your product: the agent itself.

WebRTC gives you low-latency, loss-tolerant audio transport, automatic adaptation to network conditions, echo cancellation, noise suppression, and NAT traversal — out of the box. An SFU on top of it adds efficient routing, per-subscriber quality adaptation, and built-in observability. Distribute that SFU across regions and you have a global voice AI platform.

That's the stack LiveKit is built on, and it's the same stack available to every developer building on it. The infrastructure for realtime voice is a solved problem. Use it, and spend your engineering time on the agent.


A note on cross-region latency. Earlier we mentioned 250+ ms round-trip between Singapore and US-East. Here's where that comes from: the great-circle distance is roughly 15,000 km. Light in fiber travels at about two-thirds the speed of light in a vacuum — approximately 200,000 km/s — giving a one-way propagation time of ~75 ms, or ~150 ms round-trip. Real-world internet paths aren't straight lines: routing hops, peering exchanges, and submarine cable routes add overhead that typically pushes the actual RTT to 230–280 ms. That's pure network latency before any processing begins — and in a voice conversation, it's added to every turn of the STT-LLM-TTS pipeline.
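The arithmetic in that note is worth having on hand when budgeting latency for any region pair. A small helper, using the same assumptions (fiber at ~200,000 km/s, with a multiplier for non-great-circle routing):

```python
FIBER_KM_PER_S = 200_000   # light in fiber: ~2/3 of c in vacuum

def propagation_rtt_ms(distance_km: float,
                       route_overhead: float = 1.0) -> float:
    """Round-trip propagation time over fiber. route_overhead > 1
    models routing hops, peering, and submarine cable detours."""
    one_way_s = (distance_km * route_overhead) / FIBER_KM_PER_S
    return 2 * one_way_s * 1000   # seconds -> milliseconds

# Straight-line Singapore <-> US-East (~15,000 km):
#   propagation_rtt_ms(15_000)       -> 150.0 ms (physical floor)
# With ~60% real-world path overhead:
#   propagation_rtt_ms(15_000, 1.6)  -> 240.0 ms
```

The physical floor of 150 ms is unavoidable for that pair — which is exactly why a multi-region SFU, which moves the user's first hop to a nearby node, is the only way to get media latency down.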