
Build Your First AI Voice Agent in Python: Complete Tutorial

A Python voice agent connects a speech-to-text model, a large language model, and a text-to-speech model in a streaming pipeline. LiveKit Agents provides the framework and real-time audio transport layer. This tutorial covers setup, implementation, local testing, and deployment.

In about 30 minutes, you'll have a working voice agent that listens to users, understands their questions, and responds in real time with natural speech.

What you'll create: A voice agent that answers questions in real time using speech-to-text, an LLM, and text-to-speech.

Prerequisites: Basic Python knowledge. You don't need prior experience with audio processing or WebRTC.

Note: LiveKit Agents also supports TypeScript. The TypeScript implementation is covered in a separate tutorial.

What you'll learn:

  • How the voice agent pipeline works (STT → LLM → TTS)
  • How to connect each component for a real-time conversation
  • How to test, debug, and deploy your agent

This guide is written by the LiveKit team, who maintain one of the most widely used open-source real-time communication stacks and power production voice agents across telephony, support, and AI platforms.

What is the STT-LLM-TTS pipeline?

Most voice agents follow the STT-LLM-TTS pipeline:

  1. Speech-to-Text (STT) converts the user's audio into text.
  2. Large Language Model (LLM) generates an intelligent response.
  3. Text-to-Speech (TTS) converts the response back into audio.

This modular approach gives you maximum flexibility. Swap providers, fine-tune each component, and debug issues at each stage independently.

Note: This pipeline goes by many names. You may see it referred to as:

  • Cascaded pipeline: emphasizes the sequential, chained nature of the components
  • Voice pipeline: generic term used broadly across the industry
  • Conversational AI pipeline: common in enterprise and contact center contexts
  • ASR → LLM → TTS pipeline: uses ASR (Automatic Speech Recognition) instead of STT
  • Listen-Think-Speak pipeline: a more human-readable framing
  • Turn-based voice pipeline: highlights the conversational turn-taking model
  • Classic voice agent architecture: used to distinguish it from newer speech-to-speech models

All of these refer to the same fundamental architecture: audio in → text → LLM → text → audio out.
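In code, one conversational turn is just three calls chained together. The sketch below uses hypothetical stand-in functions (`transcribe`, `generate`, `synthesize`) in place of real provider clients, purely to show the data flow:

```python
def transcribe(audio: bytes) -> str:
    """STT stand-in: a real client would stream audio to Deepgram, etc."""
    return "what's the weather today"

def generate(prompt: str) -> str:
    """LLM stand-in: a real client would call GPT-4.1 mini, Gemini, etc."""
    return f"You asked: {prompt}. It looks sunny."

def synthesize(text: str) -> bytes:
    """TTS stand-in: a real client would return synthesized audio frames."""
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    # audio in -> text -> LLM -> text -> audio out
    user_text = transcribe(audio_in)
    reply_text = generate(user_text)
    return synthesize(reply_text)

print(handle_turn(b"\x00\x01"))
```

In a real agent each stage streams rather than returning a complete value, which is what the next section covers.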

Speech-to-speech: the emerging alternative

A newer approach skips the text intermediary entirely. Speech-to-speech models (like OpenAI's Realtime API and Google's Gemini Live) process audio in and generate audio out directly.

Advantages:

  • Lower latency (fewer processing steps)
  • Preserves paralinguistic cues (tone, emphasis, emotion)
  • More natural conversational dynamics

Trade-offs:

  • Less visibility into what the model "heard" or "said"
  • Harder to debug and evaluate
  • Fewer provider options (still early days)
  • Higher cost per minute in most cases

For production voice agents today, the cascaded pipeline remains the most practical choice. It's battle-tested, observable, and gives you fine-grained control. But speech-to-speech is maturing fast, and LiveKit Agents supports both approaches.

Why real-time matters

Traditional HTTP request/response patterns introduce latency at every step. Send audio, wait for transcription, send text, wait for response, send response, wait for audio. Each round trip adds 100–300ms.

WebRTC changes this. Instead of request/response, you maintain persistent connections that stream data in both directions simultaneously. Audio flows in while responses flow out. The result: total latency drops from seconds to hundreds of milliseconds.

Architecture overview: user audio → STT → LLM → TTS → agent audio, with every stage streaming into the next.

Each component streams its output to the next. The LLM starts generating before the user finishes speaking. TTS starts synthesizing before the LLM finishes responding. This parallelization is what makes voice agents feel responsive.
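A minimal asyncio sketch of that overlap: the TTS consumer starts on the first LLM token instead of waiting for the full response. The token stream and `speak` function here are illustrative stand-ins, not LiveKit APIs:

```python
import asyncio

async def llm_tokens():
    # Stand-in for a streaming LLM response.
    for token in ["The", " weather", " is", " sunny."]:
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield token

async def speak(token: str, spoken: list):
    # Stand-in for incremental TTS synthesis.
    spoken.append(token)

async def respond() -> str:
    spoken = []
    async for token in llm_tokens():
        # TTS begins here, before the LLM has finished the sentence.
        await speak(token, spoken)
    return "".join(spoken)

print(asyncio.run(respond()))  # The weather is sunny.
```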

How do you choose your model stack?

Your choice of providers affects latency, quality, and cost. LiveKit Inference simplifies this by giving you access to all major providers through a single API key, with no need to manage multiple accounts or credentials.

Note: The models below are popular options, not an exhaustive list. LiveKit Agents supports many more providers. Check the documentation for the full list.

STT options

| Provider | Model | Latency | Accuracy | Cost |
| --- | --- | --- | --- | --- |
| Deepgram | Nova-3 | ~250ms | Excellent | $$$ |
| Deepgram | Nova-2 | ~250ms | Great | $$ |
| AssemblyAI | Universal-Streaming | ~300ms | Great | $ |
| Cartesia | Ink Whisper | ~80ms | Good | $ |

LLM options

| Provider | Model | Latency | Quality | Cost |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4o | ~200ms | Excellent | $$$ |
| OpenAI | GPT-4.1 mini | ~200ms | Great | $ |
| Google | Gemini 2.5 Flash | ~350ms | Great | $ |
| DeepSeek | DeepSeek V3 | ~300ms | Great | $ |

TTS options

| Provider | Model | Latency | Voice Quality | Cost |
| --- | --- | --- | --- | --- |
| Cartesia | Sonic 3 | ~90ms | Excellent | $$ |
| ElevenLabs | Flash v2.5 | ~150ms | Excellent | $$$ |
| Deepgram | Aura-2 | ~100ms | Good | $$ |
| Inworld | TTS 1.5 | ~200ms | Good | $ |

Setting up your development environment

Let's get your environment ready.

Install LiveKit Agents

```bash
uv init --bare
uv add "livekit-agents[silero,turn-detector]~=1.4" python-dotenv
```

Note: When using LiveKit Inference (recommended), you don't need separate plugin packages for STT, LLM, or TTS. LiveKit routes those requests for you. You only need silero for VAD and turn-detector for the multilingual turn detection model.

API keys

With LiveKit Inference, you only need one account:

LiveKit Inference provides access to STT, LLM, and TTS models through a single API key. No need to create separate accounts with Deepgram, OpenAI, or Cartesia. LiveKit handles all provider credentials and billing.

Environment variables

Create a .env.local file:

```
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
```

Get your credentials from the LiveKit Cloud dashboard.

Building the voice agent

Time to write code. Create a new file called agent.py in your project directory. The steps below build up this file incrementally. The Complete implementation section at the end shows the finished file.

Step 1: Initialize the agent

Start with the agent entry point. Add this to agent.py:

```python
from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentServer, AgentSession, Agent, room_io
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv(".env.local")


class Assistant(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice AI assistant."
        )


server = AgentServer()


@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    pass  # We'll add components here


if __name__ == "__main__":
    agents.cli.run_app(server)
```

This defines your Assistant class (where agent instructions and personality live) and creates an AgentServer. The @server.rtc_session decorator registers the entry function that runs whenever a new room session starts.

The agent_name value is how LiveKit identifies which agent to start when a session is requested. It also switches the agent from automatic dispatch to explicit dispatch, so your agent only starts when explicitly requested rather than joining every new room automatically. This gives you precise control and is the right behavior for both development and production.

Step 2: Configure speech-to-text

Update your my_agent function in agent.py to add Deepgram for transcription:

```python
@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt="deepgram/nova-3:multi",  # Deepgram Nova-3 via LiveKit Inference
    )
```

Key options:

  • "deepgram/nova-3:multi" routes to Deepgram Nova-3 via LiveKit Inference with multilingual support.
  • No separate Deepgram API key needed. LiveKit Inference handles provider credentials.

Step 3: Configure the LLM

Agent instructions live in the Agent class. Update your Assistant class with a system prompt, then add llm to your AgentSession:

```python
class Assistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful voice assistant. Keep responses concise, "
                "ideally under 2 sentences. Be conversational and friendly."
            )
        )


@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt="deepgram/nova-3:multi",
        llm="openai/gpt-4.1-mini",  # Great quality at low cost
    )
```

Prompt tips for voice:

  • Keep responses short. Long responses feel unnatural in conversation.
  • Avoid lists and formatting. They don't translate to speech.
  • Write for the ear, not the eye.

Step 4: Configure text-to-speech

Add tts to the AgentSession in my_agent:

```python
session = AgentSession(
    stt="deepgram/nova-3:multi",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
)
```

The TTS string uses the format provider/model:voice-id. Cartesia offers many voices. Browse the options and swap the voice UUID in the string.
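To make the format concrete, here's how such a string decomposes into its parts (illustrative parsing only; LiveKit handles this internally):

```python
def parse_tts_string(spec: str) -> dict:
    """Split a 'provider/model[:voice-id]' spec into its components."""
    provider_model, _, voice = spec.partition(":")
    provider, _, model = provider_model.partition("/")
    return {"provider": provider, "model": model, "voice": voice or None}

print(parse_tts_string("cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc"))
# {'provider': 'cartesia', 'model': 'sonic-3',
#  'voice': '9626c31c-bec5-4cca-baa8-f8ba9e84c8bc'}
```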

Warning: If no voice UUID is provided, Cartesia will fall back to its default voice. This may produce inconsistent results across sessions and regions, as the default can change without notice. For production agents, always specify an explicit voice UUID to keep your agent's voice consistent.

Step 5: Add VAD and turn detection

Finalize your AgentSession with VAD and turn detection:

```python
session = AgentSession(
    stt="deepgram/nova-3:multi",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),
)
```

The AgentSession handles:

  • Voice Activity Detection (VAD): Silero detects when the user starts and stops speaking.
  • Turn detection: The multilingual model determines when the user has finished their turn.
  • Interruption handling: Stops speaking when the user interrupts.

Step 6: Start the agent

After the AgentSession, start the assistant and trigger the opening greeting:

```python
await session.start(
    room=ctx.room,
    agent=Assistant(),
)

await session.generate_reply(
    instructions="Greet the user and offer your assistance."
)
```

session.start() connects the session to the room and attaches your Assistant agent. generate_reply() triggers the initial greeting so the agent speaks first rather than waiting for the user.

Complete implementation

Here's the full agent.py with everything in place:

```python
from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentServer, AgentSession, Agent, room_io
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv(".env.local")


class Assistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful voice assistant. Keep responses concise, "
                "ideally under 2 sentences. Be conversational and friendly."
            )
        )


server = AgentServer()


@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt="deepgram/nova-3:multi",
        llm="openai/gpt-4.1-mini",
        tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
        vad=silero.VAD.load(),
        turn_detection=MultilingualModel(),
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )


if __name__ == "__main__":
    agents.cli.run_app(server)
```

Download model files

Now that agent.py exists, download the required local model files for VAD and turn detection:

```bash
uv run agent.py download-files
```

This only needs to be done once. The files are cached locally for all future runs.

How do you test a voice agent locally?

Let's verify everything works.

Local testing

Console mode

First, verify your agent works:

```bash
uv run agent.py console
```

Console mode lets you interact with your agent directly in the terminal. Type messages or speak using your microphone. It's perfect for testing your LLM configuration and system prompt without deploying or opening a browser. Try both: start talking to test the full voice pipeline, or press Ctrl+T to enter text mode and type a message.

Test with LiveKit Playground

The LiveKit Agents Playground provides a browser-based interface to test your agent in development. No deployment required. Your agent runs locally while the Playground connects to it through LiveKit Cloud.

Step 1: Start your agent locally

Make sure your agent is running in dev mode:

```bash
uv run agent.py dev
```

You should see registered worker in the console output. This means your local agent has connected to LiveKit Cloud and is ready to receive requests.

Step 2: Open the Playground

Go to agents-playground.livekit.io in your browser.

Step 3: Connect to your LiveKit project

Select your LiveKit Cloud project to connect automatically, or enter your connection info manually.

Step 4: Connect and test

  1. Enter "my-agent" in the Agent name field.
  2. Click Connect in the Playground.
  3. Grant microphone permissions when prompted.
  4. Speak naturally. Say "Hello" or ask a question.
  5. You should hear a response within 1 second.

When you connect, LiveKit Cloud routes the session to your locally running agent. You'll see activity in your terminal as the agent processes the conversation.

Step 5: Monitor the conversation

The Playground shows:

  • Audio visualizer: See when you're speaking vs. when the agent is speaking.
  • Transcript: Real-time text of both sides of the conversation.
  • Connection status: Verify you're connected to the right room.

Troubleshooting:

  • No response? Check that your agent is running and shows registered worker.
  • Can't hear audio? Verify your browser's audio output settings.
  • Permission denied? Grant microphone access in your browser settings.

Measuring latency

Good voice agents respond in under 1 second. LiveKit Agents emits metrics you can capture in code:

```python
@session.on("metrics_collected")
def on_metrics(metrics):
    print(f"EOU: {metrics.eou_delay}ms")
    print(f"LLM TTFT: {metrics.llm_ttft}ms")
    print(f"TTS TTFB: {metrics.tts_ttfb}ms")
```

Total latency ≈ eou_delay + llm_ttft + tts_ttfb

Note: LiveKit Cloud users: the Agent Observability dashboard shows latency breakdowns with synchronized audio playback. Click anywhere in the timeline to see exact timing.

Target latencies:

  • STT: <200ms
  • LLM time to first token (TTFT): <300ms
  • TTS time to first audio byte (TTFB): <300ms
  • Total: <800ms
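A small helper can check a measured turn against those targets; the field names mirror the metrics snippet above, and the 800ms budget is the total listed here:

```python
def within_budget(eou_delay: float, llm_ttft: float, tts_ttfb: float,
                  total_budget_ms: float = 800) -> bool:
    """Return True if end-to-end response latency fits the target budget."""
    total = eou_delay + llm_ttft + tts_ttfb
    return total < total_budget_ms

# ~200ms EOU + ~250ms LLM TTFT + ~150ms TTS TTFB = 600ms: under budget
print(within_budget(200, 250, 150))  # True
print(within_budget(500, 400, 300))  # False
```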

Debugging common issues

No response from agent:

  • Verify microphone permissions.
  • Confirm your agent is running and shows registered worker in the terminal.

Audio cutting out:

  • Check your VAD sensitivity settings.
  • Make sure you have a stable network connection.

High latency (>3s):

  • Switch to streaming mode for all components.
  • Check if you're using batch instead of streaming STT.
  • Verify your network latency to provider APIs.

STT accuracy problems:

  • Confirm the correct language is set.
  • Check audio quality (sample rate, noise).
  • Try a different STT model.

LLM timeout errors:

  • Reduce max tokens in your requests.
  • Check the OpenAI status page for outages.
  • Use LiveKit's built-in FallbackAdapter to automatically route to backup providers when the primary LLM fails.

Deploying to production

Your agent works locally. Let's deploy it.

Hosting options

LiveKit Cloud is the recommended way to deploy your agent. It handles infrastructure, scaling, and observability so you can focus on building. Deploy directly from your repo with a single command.

First, install the LiveKit CLI (lk) if you haven't already:

```bash
# macOS
brew install livekit-cli

# Linux
curl -sSL https://get.livekit.io/cli | bash

# Windows
winget install LiveKit.LiveKitCLI
```

Then deploy your agent:

```bash
lk agent create
```

Self-hosting:

If you need full infrastructure control or have specific compliance requirements, self-hosting is fully supported. Popular options include:

  • Fly.io: Great for global distribution.
  • Railway: Simple container deployments.
  • AWS/GCP: Full control, more setup.

Environment variables

Set your production environment variables in your hosting platform.

Note: Never commit API keys to your repository.

Scaling considerations

  • Each agent worker process handles one conversation at a time.
  • Scale horizontally by running multiple instances.
  • LiveKit Cloud auto-scales based on demand.

Cost estimates (with LiveKit Inference)

Using the recommended stack at $0.0502/min:

| Usage level | Monthly cost | Notes |
| --- | --- | --- |
| 1,000 min | ~$95 | Ship plan ($50) + inference (~$45) |
| 5,000 min | ~$295 | Ship plan ($50) + inference (~$245) |
| 50,000 min | ~$2,950 | Scale plan ($500) + inference (~$2,450) |

Based on Deepgram Nova-3 + GPT-4.1 mini + Cartesia Sonic 3 via LiveKit Inference. See LiveKit Pricing for details.
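As a rough sanity check, monthly cost is a flat plan fee plus metered inference. The ~$0.045/min inference-only rate below is back-solved from the 1,000-minute row, so treat both numbers as approximations rather than quoted pricing:

```python
def monthly_cost(minutes: int, inference_rate_per_min: float,
                 plan_fee: float) -> float:
    """Rough estimate: flat plan fee plus metered inference usage."""
    return plan_fee + minutes * inference_rate_per_min

# Ship plan ($50) + 1,000 min at ~$0.045/min inference ≈ $95
print(round(monthly_cost(1_000, 0.045, 50)))  # 95
```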

Common challenges and solutions

Challenge: high latency

Cause: Non-streaming components or network issues.

Solution:

  • Enable streaming for STT, LLM, and TTS.
  • Use providers with low-latency APIs.
  • Co-locate your agent with your AI models to minimize latency.

Challenge: poor audio quality

Cause: VAD misconfiguration or echo issues.

Solution:

  • Tune VAD sensitivity thresholds.
  • Enable echo cancellation in your WebRTC configuration.
  • Test with headphones to isolate issues.

Challenge: context loss in long conversations

Cause: Exceeding LLM context window.

Solution:

  • Implement conversation summarization.
  • Use a sliding window for recent messages.
  • Store key facts in a separate context.
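One minimal way to implement the sliding-window idea; this is framework-agnostic, assuming only that history is stored as role/content dicts, as most LLM SDKs do:

```python
def trim_history(messages: list, max_turns: int = 10) -> list:
    """Keep the system prompt plus only the most recent conversation turns."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns * 2:]  # one user + one assistant msg per turn

history = [{"role": "system", "content": "You are a helpful voice assistant."}]
for i in range(30):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=10)
print(len(trimmed))  # 21: system prompt + last 10 user/assistant pairs
```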

Challenge: high costs

Cause: Inefficient provider usage.

Solution:

  • Cache common responses to avoid redundant LLM calls for repeated inputs.
  • Use smaller models for simple queries.
  • Implement usage monitoring and alerts.
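A sketch of the caching idea for exact repeated inputs, with light normalization so trivial variations still hit. A production agent would likely key on embeddings rather than exact strings, and `cached_reply` here is a hypothetical stand-in for a real LLM call:

```python
from functools import lru_cache

call_count = 0  # counts actual LLM calls; cache hits skip the function body

@lru_cache(maxsize=256)
def cached_reply(normalized_query: str) -> str:
    global call_count
    call_count += 1
    return f"(LLM answer for: {normalized_query})"  # stand-in for a real call

def reply(query: str) -> str:
    # Normalize so "What are your hours?" and "what are your hours? " share a key.
    return cached_reply(query.strip().lower())

reply("What are your hours?")
reply("what are your hours? ")  # cache hit after normalization
print(call_count)  # 1
```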

Next steps

You've built a working voice agent. Here's where to go next.

Build a frontend for your agent: Connect your agent to a web UI using the agent-starter-react starter app, a ready-to-use React frontend built for LiveKit Agents. For more frontend options across web, mobile, and telephony, see the Agent Frontends documentation.

Connect to phone calls: Give your agent a real phone number with LiveKit Phone Numbers. Purchase a number directly from LiveKit Cloud with no third-party SIP trunk required, and your agent can start handling inbound calls immediately.

Add function calling: Let your agent take actions like booking appointments, querying databases, or controlling smart devices. LiveKit Agents supports function calling out of the box.

Support multiple languages: Configure language detection in STT and use multilingual TTS voices for global users.

Create custom voices: Clone a voice with ElevenLabs or train a custom Cartesia voice for brand consistency.

Add sentiment analysis: Detect user frustration and adjust your agent's tone dynamically.


FAQ

How much does it cost to run a voice agent?

Using LiveKit Inference with the recommended stack, expect $0.0502/min of conversation:

  • Agent session: $0.0100/min
  • WebRTC connection: $0.0010/min
  • LLM (GPT-4.1 mini): $0.0015/min
  • STT (Deepgram Nova-3): $0.0077/min
  • TTS (Cartesia Sonic 3): $0.0300/min
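Those components sum to the headline rate:

```python
# Per-minute components from the list above, in USD per minute
components = {
    "agent_session": 0.0100,
    "webrtc_connection": 0.0010,
    "llm_gpt_4_1_mini": 0.0015,
    "stt_deepgram_nova_3": 0.0077,
    "tts_cartesia_sonic_3": 0.0300,
}
per_minute = round(sum(components.values()), 4)
print(per_minute)  # 0.0502
```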

On the Build plan (free): 1,000 agent minutes are included monthly, plus $2.50 in inference credits. Enough for roughly 50 minutes of conversation with the recommended stack. Once your credits run out, you can bring your own API keys for STT, LLM, and TTS providers, or upgrade to the Ship plan. See LiveKit Pricing for the full breakdown.

What's the minimum latency possible?

With optimized streaming and the right providers, you can achieve 200–500ms total latency from end of user speech to start of agent response. This feels natural and conversational. WebRTC's persistent connections eliminate the round-trip overhead that makes HTTP-based solutions feel sluggish.

Can I use open source models?

Yes. For STT, Whisper can be self-hosted using the OpenAI plugin with a StreamAdapter and VAD. For LLMs, Llama and Mistral are well-supported through Ollama (local), Together AI, Fireworks, and OpenRouter, all of which have official LiveKit Agents plugins. For TTS, you can implement a custom TTS node to integrate any self-hosted engine. Trade-offs: more infrastructure complexity and potentially higher latency, but lower per-minute costs at scale.

How do I handle multiple languages?

Configure Deepgram with language="multi" for automatic language detection, or set the language explicitly based on user preference. For TTS, use multilingual voices from ElevenLabs or Cartesia. Your LLM prompt should instruct the model to respond in the detected language.

What's the difference between voice agents and chatbots?

Chatbots handle text asynchronously. Users type, wait, read. Voice agents handle speech in real-time. Users talk naturally, expect immediate responses, and can interrupt. This requires streaming at every layer, sophisticated turn-taking, and sub-second latency. The technical bar is significantly higher, but the result feels like a real conversation.

Get started

You now have everything you need to build production-ready voice agents. Start with the code in this tutorial, customize the system prompt for your use case, and deploy.

Sign up for LiveKit Cloud to get started for free, or check out the full documentation for advanced features.

Questions? Join the LiveKit Discourse. The community is active and helpful.