LiveKit Inference

Gemma 4 31B on LiveKit Inference

Same answer quality at 5.2× lower latency and 6× lower cost than GPT-4.1.

83% cheaper vs GPT-4.1

814 ms faster vs GPT-4.1

88% task completion on the hotel_receptionist eval

Model choice

A faster, cheaper default for voice agents

Gemma 4 31B is smart enough, fast enough, and affordable enough for real-time production traffic. It runs voice-agent tasks at a fraction of the latency and cost of proprietary defaults.

Model                              Time to first token   Relative cost   Capability
Gemma 4 31B (LiveKit Inference)    192 ms                1×              88/100
GPT-4.1 mini (OpenAI)              802 ms                ~1.2×           69/100
Gemini 2.5 Flash (Google)          911 ms                ~1.4×           64/100
GPT-4.1 (OpenAI)                   1,006 ms              ~6×             73/100

Relative cost compares blended list price (3:1 input:output) on LiveKit Inference. For real-time voice agents, Gemma 4 31B is the best tradeoff: the highest capability at ~6× lower cost and far lower latency than GPT-4.1.
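The arithmetic behind a blended price and the headline ratios can be sketched as follows. This is a minimal illustration, assuming "3:1 input:output" means a token-weighted average of three input tokens per output token; that reading is an assumption, and the per-token prices are hypothetical placeholders rather than real list prices. The latency values are the ones in the table above.

```python
def blended_price(input_price: float, output_price: float,
                  input_parts: int = 3, output_parts: int = 1) -> float:
    """Token-weighted average price for a given input:output mix (assumed definition)."""
    total_parts = input_parts + output_parts
    return (input_parts * input_price + output_parts * output_price) / total_parts


# Hypothetical prices in USD per million tokens, for illustration only.
baseline = blended_price(input_price=1.0, output_price=4.0)
other = blended_price(input_price=6.0, output_price=24.0)
relative_cost = other / baseline  # how many times more expensive than the baseline

# Headline figures derived from the table's time-to-first-token values.
gemma_ttft_ms, gpt41_ttft_ms = 192, 1006
ms_faster = gpt41_ttft_ms - gemma_ttft_ms      # absolute latency gap in ms
latency_ratio = gpt41_ttft_ms / gemma_ttft_ms  # "N× lower latency"
percent_cheaper = (1 - 1 / 6) * 100            # a 6× cost ratio as a percentage saving

print(f"{relative_cost:.1f}x cost, {ms_faster} ms faster, "
      f"{latency_ratio:.1f}x lower latency, {percent_cheaper:.0f}% cheaper")
```

With the table's numbers, the gap and ratios line up with the 814 ms, 5.2× and 83% figures quoted at the top of the page.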

Capability

Faster, cheaper… and smarter, too

Measured on a reference-agent evaluation with task-based judging. Gemma 4 31B clears the production bar on every voice-agent task we test.

Overall task completion

Instruction following

98 Tool-call accuracy

100 Multi-turn coherence

100 Grounding / faithfulness

100 Conciseness

Hosting path

Gemma runs better on LiveKit Inference

Other inference platforms optimize for throughput and accept higher latency. We do the opposite: we optimize for low latency and accept lower throughput. Voice can't wait.

Route                                               Time to first sentence   Tokens / sec
LiveKit Inference (dedicated GPUs, co-located)      354 ms                   158
OpenRouter * (best available route at test time)    1,876 ms

* OpenRouter may route requests across multiple third-party providers. Latency and availability can vary by selected route or provider; figures reflect the best available Gemma 4 31B route at test time.
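The page does not spell out how time to first sentence is computed. One plausible reading, sketched here as an assumption, is the elapsed time from starting to consume the streamed response until the accumulated text contains sentence-ending punctuation. The function name, the boundary rule, and the injectable clock are all illustrative choices, not LiveKit's harness.

```python
import re
import time
from typing import Callable, Iterable, Optional

# Assumed boundary rule: '.', '!' or '?' followed by whitespace or the end of
# the text received so far. This is naive: a chunk ending in "3." (from "3.5")
# would count as a sentence end.
_SENTENCE_END = re.compile(r"[.!?](\s|$)")


def time_to_first_sentence(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from the start of iteration until the text holds a full sentence.

    Returns None if the stream ends before any sentence boundary appears.
    """
    start = clock()
    text = ""
    for chunk in chunks:
        text += chunk
        if _SENTENCE_END.search(text):
            return clock() - start
    return None


# Usage with a made-up list of chunks standing in for a streamed reply.
elapsed = time_to_first_sentence(["Good", " evening", ".", " How can I help?"])
```

For a voice agent this is the point at which text-to-speech could begin speaking a complete sentence, which is why it can differ noticeably from time to first token.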

Methodology

How we measured every number

Every figure on this page comes from the same reference agent, harnesses, and latency definition — measured the way a real voice call behaves.

Latency metric

Time to first token: how quickly the model starts generating each response, measured across every turn of every scenario.

Latency harness

The same reference-agent conversations that produce the capability scores; every turn of every scenario contributes a latency sample.

Capability harness

A reference-agent evaluation with task-based judging across instruction following, tool calls, and multi-turn coherence.

Providers tested

Gemma 4 31B on LiveKit Inference and OpenRouter; GPT-4.1, GPT-4.1 mini, and Gemini 2.5 Flash for the model-choice comparison.
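The latency method described above (one time-to-first-token sample per turn, pooled across every scenario) can be sketched roughly like this. It is a minimal illustration under stated assumptions: `fake_turn` is a hypothetical stand-in for a model's streamed response, the scenario names and delays are made up, and the choice of mean and median as summary statistics is an assumption, since the page does not say which aggregate it reports.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator, Optional


def time_to_first_token(
    stream: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from starting to consume the stream until the first non-empty token."""
    start = clock()
    for token in stream:
        if token:  # ignore empty chunks
            return clock() - start
    return None


def summarize_ms(samples_ms: list[float]) -> dict[str, float]:
    """Pool per-turn samples (milliseconds) into a few summary statistics."""
    if not samples_ms:
        raise ValueError("no latency samples")
    return {
        "count": len(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
    }


def fake_turn(delay_s: float) -> Iterator[str]:
    """Hypothetical streamed response: waits, then yields two tokens."""
    time.sleep(delay_s)  # simulated delay before the first token
    yield "Hello"
    yield " there."


# Made-up scenarios: each entry is the simulated first-token delay of one turn.
scenarios = {"scenario_a": [0.02, 0.03], "scenario_b": [0.01]}

samples_ms: list[float] = []
for turn_delays in scenarios.values():
    for delay_s in turn_delays:
        ttft = time_to_first_token(fake_turn(delay_s))
        if ttft is not None:
            samples_ms.append(ttft * 1000.0)  # every turn contributes one sample

summary = summarize_ms(samples_ms)
print(summary)
```

Pooling at the turn level, rather than averaging per scenario first, means longer scenarios contribute more samples; whether the page's figures weight scenarios that way is not stated.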

Get started

See the difference for yourself

One line in your LiveKit Agents session points your agent at Gemma 4 31B.

# Gemma 4 31B, served on LiveKit's GPUs
from livekit.agents import AgentSession

session = AgentSession(
    llm="google/gemma-4-31b-it",
)
