LiveKit Inference

Gemma 4 31B on LiveKit Inference

Answer quality on par with GPT-4.1 at 5.2× lower latency and ~6× lower cost.


83% cheaper vs GPT-4.1
814ms faster vs GPT-4.1
88% task completion on the hotel_receptionist eval

Model choice

A faster, cheaper default for voice agents

Gemma 4 31B is smart enough, fast enough, and affordable enough for real-time production traffic. It runs voice-agent tasks at a fraction of the latency and cost of proprietary defaults.

| Model | Time to first token | Relative cost | Capability |
| Gemma 4 31B (LiveKit Inference) | 192ms | 1× (baseline) | 88/100 |
| GPT-4.1 mini (OpenAI) | 802ms | ~1.2× | 69/100 |
| Gemini 2.5 Flash (Google) | 911ms | ~1.4× | 64/100 |
| GPT-4.1 (OpenAI) | 1,006ms | ~6× | 73/100 |

Relative cost compares blended list price (3:1 input:output) on LiveKit Inference. For real-time voice agents, Gemma 4 31B is the best tradeoff: the highest capability at ~6× lower cost and far lower latency than GPT-4.1.
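
As a rough illustration of how a 3:1 blended price and the headline figures relate, here is a minimal sketch. The per-token prices are hypothetical placeholders chosen only to make the arithmetic visible; they are not actual list prices. The latency values are the ones from the table above.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Weighted average price for a given input:output token mix (default 3:1)."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# HYPOTHETICAL prices in USD per million tokens, for illustration only.
baseline_blended = blended_price(input_price=0.10, output_price=0.40)
other_blended = blended_price(input_price=0.60, output_price=2.40)

# Relative cost is the ratio of blended prices against the baseline model.
relative_cost = other_blended / baseline_blended

# "N× lower cost" expressed as a percentage saving: 1 - 1/N.
savings = 1 - 1 / relative_cost

# Latency figures taken from the table (time to first token, in ms).
gemma_ttft_ms = 192
gpt41_ttft_ms = 1006
latency_gap_ms = gpt41_ttft_ms - gemma_ttft_ms
latency_ratio = gpt41_ttft_ms / gemma_ttft_ms

print(f"relative cost: {relative_cost:.1f}x, savings: {savings:.0%}")
print(f"latency gap: {latency_gap_ms}ms, ratio: {latency_ratio:.1f}x")
```

With a relative cost of about 6×, the saving works out to roughly 83%, which is how the "~6× lower cost" and "83% cheaper" figures describe the same gap.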
Capability

Faster, cheaper… and smarter, too

Measured on a reference-agent evaluation with task-based judging. Gemma 4 31B clears the production bar on every voice-agent task we test.

Overall task completion: 88

Instruction following: 98
Tool-call accuracy: 100
Multi-turn coherence: 100
Grounding / faithfulness: 100
Conciseness: 96

Hosting path

Gemma runs better on LiveKit Inference

Other inference platforms optimize for throughput and accept higher latency. We do the opposite: we optimize for low latency and accept lower throughput. Voice can't wait.

| Route | Time to first sentence | Tokens / sec |
| LiveKit Inference (dedicated GPUs, co-located) | 354ms | 158 |
| OpenRouter * (best available route at test time) | 1,876ms | 33 |

* OpenRouter may route requests across multiple third-party providers. Latency and availability can vary by selected route or provider; figures reflect the best available Gemma 4 31B route at test time.
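
For readers who want to reproduce the two columns in the table above against their own endpoint, the following is a minimal sketch of one way to measure them from a token stream. The page does not define either metric precisely, so the sentence-boundary rule (first `.`, `!`, or `?` followed by whitespace or end of text) and the throughput definition (chunks divided by total elapsed time) are assumptions, and `first_sentence_and_rate` is a hypothetical helper, not part of any LiveKit API.

```python
import re
import time
from typing import Iterable, Optional, Tuple

# ASSUMPTION: a sentence ends at ., ! or ? followed by whitespace or end of text.
# This is a simplification; it would split on decimals such as "3." mid-stream.
SENTENCE_END = re.compile(r"[.!?](\s|$)")


def first_sentence_and_rate(stream: Iterable[str]) -> Tuple[Optional[float], float]:
    """Return (seconds until the first complete sentence, chunks per second)."""
    start = time.perf_counter()
    first_sentence_at: Optional[float] = None
    text = ""
    chunk_count = 0
    for chunk in stream:
        chunk_count += 1
        text += chunk
        # Record the first moment the accumulated text contains a full sentence.
        if first_sentence_at is None and SENTENCE_END.search(text):
            first_sentence_at = time.perf_counter() - start
    elapsed = time.perf_counter() - start
    rate = chunk_count / elapsed if elapsed > 0 else 0.0
    return first_sentence_at, rate
```

Time to first sentence matters for voice because speech synthesis can typically start once a complete sentence is available, so it reflects both the initial delay and the generation speed that follows.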
Methodology

How we measured every number

Every figure on this page comes from the same reference agent, harnesses, and latency definition — measured the way a real voice call behaves.

Latency metric

Time to first token: how quickly the model starts generating each response, measured across every turn of every scenario.

Latency harness

The same reference-agent conversations that produce the capability scores; every turn of every scenario contributes a latency sample.

Capability harness

A reference-agent evaluation with task-based judging across instruction following, tool calls, and multi-turn coherence.

Providers tested

Gemma 4 31B on LiveKit Inference and OpenRouter; GPT-4.1, GPT-4.1 mini, and Gemini 2.5 Flash for the model-choice comparison.
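
The time-to-first-token definition above can be sketched as a small timing wrapper around any streaming reply. This is an illustrative sketch rather than the harness used for the numbers on this page: `time_to_first_token` and `fake_stream` are hypothetical names, and summarizing samples with the median is an assumption, since the page does not say which aggregate it reports.

```python
import statistics
import time
from typing import Iterable, Iterator, List, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrived, full response text)."""
    start = time.perf_counter()
    first_token_at = None
    parts: List[str] = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        parts.append(token)
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return first_token_at - start, "".join(parts)


def fake_stream(delay_s: float, tokens: List[str]) -> Iterator[str]:
    """HYPOTHETICAL stand-in for a model's streaming reply."""
    time.sleep(delay_s)  # simulates the wait before generation starts
    for token in tokens:
        yield token


# One latency sample per turn, collected across every turn of a scenario.
turns = [fake_stream(0.02, ["Sure", ",", " one", " moment", "."]) for _ in range(3)]
samples = [time_to_first_token(turn)[0] for turn in turns]

# ASSUMPTION: summarize with the median; other aggregates are equally plausible.
print(f"median TTFT: {statistics.median(samples) * 1000:.0f}ms")
```

In a real agent, the stream would be the model's streaming response for each turn, and the same loop would run over every turn of every scenario.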

Get started

See the difference for yourself

One line in your LiveKit Agents session points your agent at Gemma 4 31B.

# Gemma 4 31B, served on LiveKit's GPUs
from livekit.agents import AgentSession

session = AgentSession(
    llm="google/gemma-4-31b-it",
)