Latency Optimized Inference: Gemma 4 on LiveKit

The clearest high-value uses of AI in business today fall into two categories. The first is asynchronous work, like coding, research, and chat assistants. The second is real-time, human-like voice agents that can take complex actions on behalf of users.

Frontier models keep getting better at the first category, but that progress is working against the second. As labs optimize for deeper reasoning, each generation carries higher latency and cost. The voice agents these models power are getting slower and less natural for users, and more expensive for the organizations deploying them.

We’re filling that gap with an LLM deployment optimized specifically for the requests we see from production voice agents: long system prompts, lots of tools, and a very low latency budget. Our target was simple: take the fastest model that passes realistic voice agent evals, give it generous context limits, and reserve enough GPU capacity to keep first-sentence latency extremely low.

That work is now live as Gemma 4 31B on LiveKit Inference: a model that’s both fast enough for real-time conversation and smart enough to run complex agents.

The fastest Gemma 4 anywhere

For a voice agent, every millisecond before the first token matters. Here’s how Gemma 4 31B on LiveKit Inference compares on time-to-first-token (TTFT) against the models most commonly used for voice AI today:

Time to first token — measured on identical requests. Lower is faster.

Gemma 4 31B · LiveKit	192 ms
Gemini 2.5 Flash	911 ms
GPT-5.5	966 ms
GPT-4.1	1006 ms
Gemini 3.0 Flash	1095 ms
Gemma 4 31B · OpenRouter	1876 ms

Some of that speed is the model itself: at 31B parameters, Gemma is small enough to serve with full-precision weights while staying fast, and its reasoning is controllable, so it doesn’t spend tokens on long hidden reasoning chains before producing output.

The rest comes from serving that’s optimized end to end for voice AI. Gemma runs on GPUs behind SGLang, with speculative decoding enabled for better token throughput. The bigger choice is how much work we put on each GPU: we run with more headroom than a throughput-maximized deployment would, because queueing delay is very noticeable to users who are speaking to the agents. A warm request starts returning tokens in around 100 ms.
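To see why headroom matters so much, a textbook single-server queue is a useful toy model. The sketch below uses the M/M/1 formula for mean queueing delay. Real GPU serving batches requests, so this is only an illustration of the trend, not a description of the actual deployment. The 100 ms service time and the utilization levels are assumed values chosen for the example.

```python
def mean_queue_wait_ms(service_ms: float, utilization: float) -> float:
    """Mean time a request waits before service starts, under an M/M/1 model.

    W_q = rho / (mu * (1 - rho)) with mu = 1 / service_ms,
    which simplifies to service_ms * rho / (1 - rho).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_ms * utilization / (1.0 - utilization)


# Hypothetical 100 ms service time at increasing load (illustrative only).
for rho in (0.3, 0.5, 0.7, 0.9):
    wait = mean_queue_wait_ms(100.0, rho)
    print(f"utilization {rho:.0%}: mean queue wait ~{wait:.0f} ms")
```

In this model the wait grows slowly at low utilization and very steeply as the server approaches saturation, which is the intuition behind trading some throughput for lower latency.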

We’ll keep tuning the stack, but latency stays at the top of our list: fewer requests per GPU, less queueing, faster warm starts. The tradeoff is that cost is higher than some other deployments of Gemma, but at $1.20 per 1M output tokens, it’s still extremely affordable.

Time-to-first-sentence is the real picture

TTFT is a good start, but it’s an incomplete measure of voice latency because it ignores generation speed. A voice agent doesn’t speak in tokens; it speaks in sentences. Speech synthesis needs the first complete sentence before the agent can say anything, so the latency users actually feel is time-to-first-sentence (TTFS): how long after the user stops talking before the agent starts speaking. Past a few hundred milliseconds, the conversation stops feeling natural. Users talk over the agent, repeat themselves, or hang up.
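As a minimal sketch of how TTFS could be measured from a streamed response, the function below scans timestamped text chunks for the first sentence boundary. The chunk format, the function name, and the punctuation-plus-whitespace rule are assumptions for illustration. A production sentence splitter would also handle abbreviations and other edge cases.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed boundary rule: '.', '!' or '?' followed by whitespace.
# The lookahead avoids splitting decimals such as "3.5".
_SENTENCE_END = re.compile(r"[.!?](?=\s)")


def time_to_first_sentence(
    chunks: Iterable[Tuple[float, str]], request_start: float
) -> Optional[float]:
    """Return seconds from request_start until a full sentence is available.

    chunks yields (arrival_time_seconds, text) pairs in arrival order.
    """
    buffer = ""
    last_arrival: Optional[float] = None
    for arrived_at, text in chunks:
        buffer += text
        last_arrival = arrived_at
        if _SENTENCE_END.search(buffer):
            return arrived_at - request_start
    # Stream ended without a detected boundary: treat the whole reply
    # as the first sentence, if there was any text at all.
    if buffer.strip() and last_arrival is not None:
        return last_arrival - request_start
    return None


# Hypothetical stream: the first token lands at 0.20 s, but the sentence
# is only confirmed once the chunk after the period arrives at 0.36 s.
stream = [(0.20, "Sure"), (0.25, ", I can"), (0.31, " help."), (0.36, " What")]
print(time_to_first_sentence(stream, request_start=0.0))
```

Note that the first token and the first sentence can be far apart in time, which is exactly the gap TTFT alone does not capture.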

TTFS gives us two knobs to tune, not one: TTFT and tokens per second. A deployment with a fast first token but slow generation still leaves users waiting for the sentence to finish. We optimize both, keeping queueing delay low so the first token arrives quickly, and using speculative decoding to raise throughput so the rest of the sentence completes right behind it.
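The two knobs can be combined into a back-of-the-envelope estimate: the first token arrives at TTFT, and the rest of the first sentence streams at the generation rate. The sketch below is a simplification that ignores sentence-detection lookahead and speech synthesis time, and every number in it is an assumed value, not a measurement.

```python
def estimate_ttfs_ms(
    ttft_ms: float, first_sentence_tokens: int, tokens_per_second: float
) -> float:
    """Estimate time-to-first-sentence from TTFT and generation speed.

    The first token is counted in ttft_ms; the remaining tokens of the
    first sentence are assumed to stream at a constant rate.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    remaining_tokens = max(first_sentence_tokens - 1, 0)
    return ttft_ms + 1000.0 * remaining_tokens / tokens_per_second


# Same hypothetical TTFT and 16-token sentence, two generation speeds.
slow = estimate_ttfs_ms(ttft_ms=200.0, first_sentence_tokens=16, tokens_per_second=50.0)
fast = estimate_ttfs_ms(ttft_ms=200.0, first_sentence_tokens=16, tokens_per_second=150.0)
print(f"50 tok/s: {slow:.0f} ms, 150 tok/s: {fast:.0f} ms")
```

With an identical first token, tripling the generation rate in this example cuts the estimated wait from 500 ms to 300 ms.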

Most latency benchmarks don’t measure this under realistic conditions. Voice looks lightweight from the outside: a person talks, the agent answers. Those small, naive exchanges are what most benchmarks use, and they can make almost any hosted model look fast. In reality, most of the request is loaded before the conversation even begins: policy, persona, escalation rules, examples, tool instructions, retrieved business data, and the tool schemas themselves. In production, the prompt often outweighs the conversation history. That’s why we give generous context limits: instructions and retrieved context should never get cut to make the model look fast. A deployment has to stay fast with the real request shape, not just with a short benchmark prompt.

So we measured time-to-first-sentence across full conversations against a production-shaped agent (details on the eval below):

Response speed: time-to-first-sentence latency across a conversation. Lower is faster.

Gemma 4 31B · LiveKit	354 ms
Gemini 2.5 Flash	1034 ms
GPT-4.1	1088 ms
Gemini 3.0 Flash	1267 ms
GPT-5.5	1404 ms
Gemma 4 31B · OpenRouter	4120 ms

Even with large front-loaded prompts and dozens of tool schemas in play, Gemma 4 31B on LiveKit starts speaking more than 2x faster than the next-closest model (354 ms versus 1034 ms for Gemini 2.5 Flash).

Is it smart enough?

Latency only matters if the agent is accomplishing its goals. For a voice agent, two capabilities dominate: following the instructions in a large system prompt, and using tools accurately. The model has to preserve business rules, choose the right tools from a big catalog, call them with correct arguments, handle interruptions and corrections, and keep working across branching conversations.

On industry benchmarks for exactly these skills, Gemma 4 punches well above its weight class. On IFBench, Artificial Analysis’ independent measure of precise instruction following, Gemma 4 31B scores 75.6%, essentially tied with frontier models like GPT-5.5 (75.9%), and nearly double the models most voice agents run on today: GPT-4.1 scores 43% and Gemini 2.5 Flash 39%. On τ²-bench, an agentic tool-use benchmark that simulates real-world customer service scenarios, it clearly outperforms the models in its speed class; only frontier reasoning models score higher, and they pay for that capability in latency.

Instruction following and agentic tool use — higher is better.

IFBench

GPT-5.5	75.9%
Gemma 4 31B	75.6%
Gemini 3 Flash	55%
GPT-4.1	43%
Gemini 2.5 Flash	39%

τ²-bench

GPT-5.5	93.9%
Gemma 4 31B	76.9%
GPT-4.1	54.7%
Gemini 2.5 Flash	41.6%

IFBench scores are independently measured by Artificial Analysis. No published τ²-bench score for Gemini 3 Flash.

But benchmarks aren’t everything. What matters in the end is task completion: can the agent finish the job it was given? Single-prompt evals can’t tell you that, so we tested with 100 different simulated users running full conversations. The main eval was a hotel reception agent with a large system prompt, roughly 40 tools, and tasks that branch over multiple turns. The model had to keep the instructions in context, choose the right tools, call them correctly, and recover when the conversation moved around.

Task completion — share of guest requests resolved end-to-end without escalation, across 100 simulated conversations.

GPT-5.5	92%
Gemma 4 31B · LiveKit	88%
Gemma 4 31B · OpenRouter	88%
Gemini 3.0 Flash	74%
GPT-4.1	73%
Gemini 2.5 Flash	68%

Served with full-precision weights, Gemma 4 31B lands in the top tier on task completion, within a few points of a frontier model and well ahead of the other fast options. Combined with the latency numbers above, that’s the balance of accuracy and speed that makes it our recommended choice for voice.

Smaller models do come with rough edges. Gemma, for example, occasionally writes what looks like a tool call into a plain text message instead of emitting a clean structured call. Because LiveKit Inference controls the serving path, we can address issues like schema adherence directly at the inference serving layer, so tool calls arrive well-formed before the response ever reaches your agent, with no perceptible impact on latency. The tool-calling failures we saw in early simulations disappeared after we added these fixes.
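To make the failure mode concrete, here is one hypothetical way a serving layer could recover a tool call that was written into plain text. This is not LiveKit's implementation, which is not described here. The JSON shape with "name" and "arguments" keys, the function name, and the tool catalog are all assumptions for illustration.

```python
import json
from typing import Any, Dict, Optional, Set


def extract_tool_call(text: str, known_tools: Set[str]) -> Optional[Dict[str, Any]]:
    """Find the first JSON object in text that looks like a call to a known tool.

    Assumed shape: {"name": <tool name>, "arguments": {...}}.
    Returns a normalized dict, or None if no valid call is found.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            candidate, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict):
            name = candidate.get("name")
            arguments = candidate.get("arguments")
            if isinstance(name, str) and name in known_tools and isinstance(arguments, dict):
                return {"name": name, "arguments": arguments}
        # Not a usable call at this brace; try the next one.
        start = text.find("{", start + 1)
    return None


# Hypothetical model output that leaked a tool call into a text message.
message = 'Let me check. {"name": "book_room", "arguments": {"nights": 2}}'
print(extract_tool_call(message, known_tools={"book_room"}))
```

Checking the name against the tool catalog keeps ordinary JSON in a reply from being mistaken for a call. A fuller version would also validate the arguments against the tool's schema.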

We’ve open-sourced the Hotel Receptionist agent so you can see for yourself how it works and how we built the scenarios that tested it.

In production: Stellar Cafe

Stellar Cafe, a voice-driven NPC game built on AstroBeam AI, switched its conversational NPCs from Gemini 2.5 Flash to Gemma 4 31B on LiveKit Inference.

The switch helped improve response quality and boost consistency while reducing their end-to-end response time by about 30%. By their account, they now have the fastest-responding voice AI NPCs in a video game today.

Moving to Gemma 4 31B not only gives us better reasoning, it allows us to be much faster on TTFT (time to first token) and more consistent in response times.

To learn more about their migration, see their blog post, “Stellar Cafe at Warp Speed.”

Pricing and availability

To start using Gemma 4 31B, visit the LiveKit hosted model guide. You use the same LiveKit API key and billing relationship as the rest of your agent stack; there is no separate provider account to manage.

	Price per 1M tokens
Output	$1.20
Input (uncached)	$0.40
Input (cached)	$0.20
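
For a rough sense of per-turn cost, the prices above can be applied to a request's token counts. The sketch below is simple arithmetic on the listed rates. The token counts in the example are assumed values for a prompt-heavy voice turn, not measurements.

```python
# Prices in USD per 1M tokens, taken from the table above.
PRICE_PER_MILLION = {
    "output": 1.20,
    "input_uncached": 0.40,
    "input_cached": 0.20,
}


def request_cost_usd(
    uncached_input_tokens: int, cached_input_tokens: int, output_tokens: int
) -> float:
    """Cost of one request in USD at the listed per-token rates."""
    total = (
        uncached_input_tokens * PRICE_PER_MILLION["input_uncached"]
        + cached_input_tokens * PRICE_PER_MILLION["input_cached"]
        + output_tokens * PRICE_PER_MILLION["output"]
    )
    return total / 1_000_000


# Hypothetical turn: 8,000 cached prompt tokens, 500 new input tokens,
# and a 100-token spoken reply.
cost = request_cost_usd(uncached_input_tokens=500, cached_input_tokens=8000, output_tokens=100)
print(f"~${cost:.5f} per turn")
```

In this example the turn costs about $0.0019, and most of that comes from the cached prompt rather than the reply, which reflects how prompt-heavy voice requests tend to be.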

We believe this Gemma deployment represents the best set of tradeoffs between latency, cost, and capability available today. Try it out. We can’t wait for you to feel the difference.

06.17.2026