A Python voice agent connects a speech-to-text model, a large language model, and a text-to-speech model in a streaming pipeline. LiveKit Agents provides the framework and real-time audio transport layer. This tutorial covers setup, implementation, local testing, and deployment.
You'll build a working voice agent in about 30 minutes: one that listens to users, understands their questions, and responds in real time with natural speech.
What you'll create: A voice agent that answers questions in real time using speech-to-text, an LLM, and text-to-speech.
Prerequisites: Basic Python knowledge. You don't need prior experience with audio processing or WebRTC.
Note: LiveKit Agents also supports TypeScript. The TypeScript implementation is covered in a separate tutorial.
What you'll learn:
- How the voice agent pipeline works (STT → LLM → TTS)
- How to connect each component for a real-time conversation
- How to test, debug, and deploy your agent
This guide is written by the LiveKit team, who maintain one of the most widely used open-source real-time communication stacks and power production voice agents across telephony, support, and AI platforms.
What is the STT-LLM-TTS pipeline?
Most voice agents follow the STT-LLM-TTS pipeline:
- Speech-to-Text (STT) converts the user's audio into text.
- Large Language Model (LLM) generates an intelligent response.
- Text-to-Speech (TTS) converts the response back into audio.
This modular approach gives you maximum flexibility. Swap providers, fine-tune each component, and debug issues at each stage independently.
Note: This pipeline goes by many names. You may see it referred to as:
- Cascaded pipeline: emphasizes the sequential, chained nature of the components
- Voice pipeline: generic term used broadly across the industry
- Conversational AI pipeline: common in enterprise and contact center contexts
- ASR → LLM → TTS pipeline: uses ASR (Automatic Speech Recognition) instead of STT
- Listen-Think-Speak pipeline: a more human-readable framing
- Turn-based voice pipeline: highlights the conversational turn-taking model
- Classic voice agent architecture: used to distinguish it from newer speech-to-speech models
All of these refer to the same fundamental architecture: audio in → text → LLM → text → audio out.
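In code terms, the cascade is just three functions composed in sequence. Here's a minimal Python sketch of one conversational turn; `transcribe`, `generate`, and `synthesize` are illustrative stand-ins, not LiveKit APIs:

```python
def transcribe(audio: bytes) -> str:
    """STT stand-in: audio in, text out."""
    return "what's the weather today?"

def generate(prompt: str) -> str:
    """LLM stand-in: text in, text out."""
    return f"You asked: {prompt}"

def synthesize(text: str) -> bytes:
    """TTS stand-in: text in, audio out."""
    return text.encode("utf-8")

def handle_turn(audio: bytes) -> bytes:
    text_in = transcribe(audio)    # audio -> text
    text_out = generate(text_in)   # text -> text
    return synthesize(text_out)    # text -> audio

reply_audio = handle_turn(b"...")
```

In a real agent each stage streams rather than returning a complete value, but the shape of the data flow is exactly this.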
Speech-to-speech: the emerging alternative
A newer approach skips the text intermediary entirely. Speech-to-speech models (like OpenAI's Realtime API and Google's Gemini Live) process audio in and generate audio out directly.
Advantages:
- Lower latency (fewer processing steps)
- Preserves paralinguistic cues (tone, emphasis, emotion)
- More natural conversational dynamics
Trade-offs:
- Less visibility into what the model "heard" or "said"
- Harder to debug and evaluate
- Fewer provider options (still early days)
- Higher cost per minute in most cases
For production voice agents today, the cascaded pipeline remains the most practical choice. It's battle-tested, observable, and gives you fine-grained control. But speech-to-speech is maturing fast, and LiveKit Agents supports both approaches.
Why real-time matters
Traditional HTTP request/response patterns introduce latency at every step. Send audio, wait for transcription, send text, wait for response, send response, wait for audio. Each round trip adds 100–300ms.
WebRTC changes this. Instead of request/response, you maintain persistent connections that stream data in both directions simultaneously. Audio flows in while responses flow out. The result: total latency drops from seconds to hundreds of milliseconds.
Architecture overview:
Each component streams its output to the next. The LLM starts generating before the user finishes speaking. TTS starts synthesizing before the LLM finishes responding. This parallelization is what makes voice agents feel responsive.
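The streaming handoff described above can be sketched with async generators. The generators below are toy stand-ins, not LiveKit internals; the point is that the TTS stage consumes tokens as they arrive instead of waiting for the full LLM response:

```python
import asyncio

async def llm_tokens(prompt: str):
    # Stand-in LLM stream: yields tokens one at a time.
    for token in ["Sunny", " and", " warm", "."]:
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield token

async def tts_chunks(tokens):
    # Stand-in TTS stream: starts on the first token,
    # before the LLM has finished generating.
    async for token in tokens:
        yield token.encode("utf-8")

async def main():
    audio = [chunk async for chunk in tts_chunks(llm_tokens("weather?"))]
    return b"".join(audio)

print(asyncio.run(main()))  # b'Sunny and warm.'
```

Because every stage is a stream, end-to-end latency is dominated by time-to-first-chunk at each stage rather than total processing time.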
How do you choose your model stack?
Your choice of providers affects latency, quality, and cost. LiveKit Inference simplifies this by giving you access to all major providers through a single API key, with no need to manage multiple accounts or credentials.
Note: The models below are popular options, not an exhaustive list. LiveKit Agents supports many more providers. Check the documentation for the full list.
STT options
| Provider | Model | Latency | Accuracy | Cost |
|---|---|---|---|---|
| Deepgram | Nova-3 | ~250ms | Excellent | $$$ |
| Deepgram | Nova-2 | ~250ms | Great | $$ |
| AssemblyAI | Universal-Streaming | ~300ms | Great | $ |
| Cartesia | Ink Whisper | ~80ms | Good | $ |
LLM options
| Provider | Model | Latency | Quality | Cost |
|---|---|---|---|---|
| OpenAI | GPT-4o | ~200ms | Excellent | $$$ |
| OpenAI | GPT-4.1 mini | ~200ms | Great | $ |
| Google | Gemini 2.5 Flash | ~350ms | Great | $ |
| DeepSeek | DeepSeek V3 | ~300ms | Great | $ |
TTS options
| Provider | Model | Latency | Voice Quality | Cost |
|---|---|---|---|---|
| Cartesia | Sonic 3 | ~90ms | Excellent | $$ |
| ElevenLabs | Flash v2.5 | ~150ms | Excellent | $$$ |
| Deepgram | Aura-2 | ~100ms | Good | $$ |
| Inworld | TTS 1.5 | ~200ms | Good | $ |
Setting up your development environment
Let's get your environment ready.
Install LiveKit Agents
```shell
uv init --bare
uv add "livekit-agents[silero,turn-detector]~=1.4" python-dotenv
```
Note: When using LiveKit Inference (recommended), you don't need separate plugin packages for STT, LLM, or TTS. LiveKit routes those requests for you. You only need `silero` for VAD and `turn-detector` for the multilingual turn detection model.
API keys
With LiveKit Inference, you only need one account:
- LiveKit Cloud (free tier available)
LiveKit Inference provides access to STT, LLM, and TTS models through a single API key. No need to create separate accounts with Deepgram, OpenAI, or Cartesia. LiveKit handles all provider credentials and billing.
Environment variables
Create a .env.local file:
```
LIVEKIT_URL=wss://your-project.livekit.cloud
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret
```
Get your credentials from the LiveKit Cloud dashboard.
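A quick sanity check can save a confusing debugging session later. This is an optional, hypothetical helper (not part of LiveKit Agents) that fails fast if a credential is missing from the environment after `load_dotenv` runs:

```python
# Hypothetical sanity check for the three LiveKit credentials.
REQUIRED = ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")

def missing_keys(env: dict) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [k for k in REQUIRED if not env.get(k)]

missing_keys({"LIVEKIT_URL": "wss://example.livekit.cloud"})
# ['LIVEKIT_API_KEY', 'LIVEKIT_API_SECRET']
```

Call it with `os.environ` at startup and raise if the list is non-empty.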
Building the voice agent
Time to write code. Create a new file called agent.py in your project directory. The steps below build up this file incrementally. The Complete implementation section at the end shows the finished file.
Step 1: Initialize the agent
Start with the agent entry point. Add this to agent.py:
```python
from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentServer, AgentSession, Agent, room_io
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv(".env.local")

class Assistant(Agent):
    def __init__(self):
        super().__init__(
            instructions="You are a helpful voice AI assistant."
        )

server = AgentServer()

@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    pass  # We'll add components here

if __name__ == "__main__":
    agents.cli.run_app(server)
```
This defines your Assistant class (where agent instructions and personality live) and creates an AgentServer. The @server.rtc_session decorator registers the entry function that runs whenever a new room session starts.
The agent_name value is how LiveKit identifies which agent to start when a session is requested. It also switches the agent from automatic dispatch to explicit dispatch, so your agent only starts when explicitly requested rather than joining every new room automatically. This gives you precise control and is the right behavior for both development and production.
Step 2: Configure speech-to-text
Update your my_agent function in agent.py to add Deepgram for transcription:
```python
@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt="deepgram/nova-3:multi",  # Deepgram Nova-3 via LiveKit Inference
    )
```
Key options:
- `"deepgram/nova-3:multi"` routes to Deepgram Nova-3 via LiveKit Inference with multilingual support.
- No separate Deepgram API key needed. LiveKit Inference handles provider credentials.
Step 3: Configure the LLM
Agent instructions live in the Agent class. Update your Assistant class with a system prompt, then add llm to your AgentSession:
```python
class Assistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful voice assistant. Keep responses concise, ideally under 2 sentences. Be conversational and friendly."
            )
        )

@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt="deepgram/nova-3:multi",
        llm="openai/gpt-4.1-mini",  # Great quality at low cost
    )
```
Prompt tips for voice:
- Keep responses short. Long responses feel unnatural in conversation.
- Avoid lists and formatting. They don't translate to speech.
- Write for the ear, not the eye.
Step 4: Configure text-to-speech
Add tts to the AgentSession in my_agent:
```python
session = AgentSession(
    stt="deepgram/nova-3:multi",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
)
```
The TTS string uses the format `provider/model:voice-id`. Cartesia offers many voices. Browse the options and swap the voice UUID in the string.
Warning: If no voice UUID is provided, Cartesia will fall back to its default voice. This may produce inconsistent results across sessions and regions, as the default can change without notice. For production agents, always specify an explicit voice UUID to keep your agent's voice consistent.
Step 5: Add VAD and turn detection
Finalize your AgentSession with VAD and turn detection:
```python
session = AgentSession(
    stt="deepgram/nova-3:multi",
    llm="openai/gpt-4.1-mini",
    tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
    vad=silero.VAD.load(),
    turn_detection=MultilingualModel(),
)
```
The AgentSession handles:
- Voice Activity Detection (VAD): Silero detects when the user starts and stops speaking.
- Turn detection: The multilingual model determines when the user has finished their turn.
- Interruption handling: Stops speaking when the user interrupts.
Step 6: Start the agent
After the AgentSession, start the assistant and trigger the opening greeting:
```python
await session.start(
    room=ctx.room,
    agent=Assistant(),
)

await session.generate_reply(
    instructions="Greet the user and offer your assistance."
)
```
session.start() connects the session to the room and attaches your Assistant agent. generate_reply() triggers the initial greeting so the agent speaks first rather than waiting for the user.
Complete implementation
Here's the full agent.py with everything in place:
```python
from dotenv import load_dotenv
from livekit import agents
from livekit.agents import AgentServer, AgentSession, Agent, room_io
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

load_dotenv(".env.local")

class Assistant(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are a helpful voice assistant. Keep responses concise, ideally under 2 sentences. Be conversational and friendly."
            )
        )

server = AgentServer()

@server.rtc_session(agent_name="my-agent")
async def my_agent(ctx: agents.JobContext):
    session = AgentSession(
        stt="deepgram/nova-3:multi",
        llm="openai/gpt-4.1-mini",
        tts="cartesia/sonic-3:9626c31c-bec5-4cca-baa8-f8ba9e84c8bc",
        vad=silero.VAD.load(),
        turn_detection=MultilingualModel(),
    )

    await session.start(
        room=ctx.room,
        agent=Assistant(),
    )

    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )

if __name__ == "__main__":
    agents.cli.run_app(server)
```
Download model files
Now that agent.py exists, download the required local model files for VAD and turn detection:
```shell
uv run agent.py download-files
```
This only needs to be done once. The files are cached locally for all future runs.
How do you test a voice agent locally?
Let's verify everything works.
Local testing
Console mode
First, verify your agent works:
```shell
uv run agent.py console
```
Console mode lets you interact with your agent directly in the terminal. Type messages or speak using your microphone. It's perfect for testing your LLM configuration and system prompt without deploying or opening a browser. Try both: start talking to test the full voice pipeline, or press Ctrl+T to enter text mode and type a message.
Test with LiveKit Playground
The LiveKit Agents Playground provides a browser-based interface to test your agent in development. No deployment required. Your agent runs locally while the Playground connects to it through LiveKit Cloud.
Step 1: Start your agent locally
Make sure your agent is running in dev mode:
```shell
uv run agent.py dev
```
You should see registered worker in the console output. This means your local agent has connected to LiveKit Cloud and is ready to receive requests.
Step 2: Open the Playground
Go to agents-playground.livekit.io in your browser.
Step 3: Connect to your LiveKit project
Select your LiveKit Cloud project to connect automatically, or enter your connection info manually:
- LiveKit Cloud: Sign in and select your project.
- Manual: Enter your WebSocket URL and a generated token.
Step 4: Connect and test
- Enter "my-agent" in the Agent name field.
- Click Connect in the Playground.
- Grant microphone permissions when prompted.
- Speak naturally. Say "Hello" or ask a question.
- You should hear a response within 1 second.
When you connect, LiveKit Cloud routes the session to your locally running agent. You'll see activity in your terminal as the agent processes the conversation.
Step 5: Monitor the conversation
The Playground shows:
- Audio visualizer: See when you're speaking vs. when the agent is speaking.
- Transcript: Real-time text of both sides of the conversation.
- Connection status: Verify you're connected to the right room.
Troubleshooting:
- No response? Check that your agent is running and shows `registered worker`.
- Can't hear audio? Verify your browser's audio output settings.
- Permission denied? Grant microphone access in your browser settings.
Measuring latency
Good voice agents respond in under 1 second. LiveKit Agents emits metrics you can capture in code:
```python
@session.on("metrics_collected")
def on_metrics(metrics):
    print(f"EOU: {metrics.eou_delay}ms")
    print(f"LLM TTFT: {metrics.llm_ttft}ms")
    print(f"TTS TTFB: {metrics.tts_ttfb}ms")
```
Total latency ≈ eou_delay + llm_ttft + tts_ttfb
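As a back-of-the-envelope check, the formula is simple addition, so you can budget each stage independently. The numbers here are illustrative round targets, not measurements:

```python
def total_latency(eou_delay: float, llm_ttft: float, tts_ttfb: float) -> float:
    """Approximate user-perceived latency in ms: end-of-utterance
    detection + LLM time-to-first-token + TTS time-to-first-byte."""
    return eou_delay + llm_ttft + tts_ttfb

budget = total_latency(eou_delay=200, llm_ttft=300, tts_ttfb=300)
print(budget)  # 800.0 ms -- right at the edge of the sub-second target
```

If one stage blows its budget, the others have to make up the difference, which is why a slow LLM often forces you onto a faster STT or TTS provider.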
Note: LiveKit Cloud users: the Agent Observability dashboard shows latency breakdowns with synchronized audio playback. Click anywhere in the timeline to see exact timing.
Target latencies:
- STT: <200ms
- LLM time to first token (TTFT): <300ms
- TTS time to first audio byte (TTFB): <300ms
- Total: <800ms
Debugging common issues
No response from agent:
- Verify microphone permissions.
- Confirm your agent is running and shows `registered worker` in the terminal.
Audio cutting out:
- Check your VAD sensitivity settings.
- Make sure you have a stable network connection.
High latency (>3s):
- Switch to streaming mode for all components.
- Check if you're using batch instead of streaming STT.
- Verify your network latency to provider APIs.
STT accuracy problems:
- Confirm the correct language is set.
- Check audio quality (sample rate, noise).
- Try a different STT model.
LLM timeout errors:
- Reduce max tokens in your requests.
- Check the OpenAI status page for outages.
- Use LiveKit's built-in FallbackAdapter to automatically route to backup providers when the primary LLM fails.
Deploying to production
Your agent works locally. Let's deploy it.
Hosting options
LiveKit Cloud is the recommended way to deploy your agent. It handles infrastructure, scaling, and observability so you can focus on building. Deploy directly from your repo with a single command.
First, install the LiveKit CLI (lk) if you haven't already:
```shell
# macOS
brew install livekit-cli

# Linux
curl -sSL https://get.livekit.io/cli | bash

# Windows
winget install LiveKit.LiveKitCLI
```
Then deploy your agent:
```shell
lk agent create
```
Self-hosting:
If you need full infrastructure control or have specific compliance requirements, self-hosting is fully supported. Popular options include:
- Fly.io: Great for global distribution.
- Railway: Simple container deployments.
- AWS/GCP: Full control, more setup.
Environment variables
Set your production environment variables in your hosting platform.
Note: Never commit API keys to your repository.
Scaling considerations
- Each agent worker process handles one conversation at a time.
- Scale horizontally by running multiple instances.
- LiveKit Cloud auto-scales based on demand.
Cost estimates (with LiveKit Inference)
Using the recommended stack at $0.0502/min:
| Usage level | Monthly cost | Notes |
|---|---|---|
| 1,000 min | ~$95 | Ship plan ($50) + inference (~$45) |
| 5,000 min | ~$295 | Ship plan ($50) + inference (~$245) |
| 50,000 min | ~$2,950 | Scale plan ($500) + inference (~$2,450) |
Based on Deepgram Nova-3 + GPT-4.1 mini + Cartesia Sonic 3 via LiveKit Inference. See LiveKit Pricing for details.
Common challenges and solutions
Challenge: high latency
Cause: Non-streaming components or network issues.
Solution:
- Enable streaming for STT, LLM, and TTS.
- Use providers with low-latency APIs.
- Co-locate your agent with your AI models to minimize latency.
Challenge: poor audio quality
Cause: VAD misconfiguration or echo issues.
Solution:
- Tune VAD sensitivity thresholds.
- Enable echo cancellation in your WebRTC configuration.
- Test with headphones to isolate issues.
Challenge: context loss in long conversations
Cause: Exceeding LLM context window.
Solution:
- Implement conversation summarization.
- Use a sliding window for recent messages.
- Store key facts in a separate context.
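A sliding window can be as simple as keeping the system prompt plus the N most recent messages. This sketch assumes a plain list of `{"role", "content"}` dicts; `trim_history` is our own helper, not a LiveKit API:

```python
def trim_history(messages: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep the system prompt(s) plus the last max_turns messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

history = [{"role": "system", "content": "Be concise."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(50)]
trimmed = trim_history(history, max_turns=10)
len(trimmed)  # 11: the system prompt plus the 10 most recent messages
```

For longer sessions, pair this with a summarization step that compresses the dropped messages into a single context message instead of discarding them outright.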
Challenge: high costs
Cause: Inefficient provider usage.
Solution:
- Cache common responses to avoid redundant LLM calls for repeated inputs.
- Use smaller models for simple queries.
- Implement usage monitoring and alerts.
Next steps
You've built a working voice agent. Here's where to go next.
Build a frontend for your agent: Connect your agent to a web UI using the agent-starter-react starter app, a ready-to-use React frontend built for LiveKit Agents. For more frontend options across web, mobile, and telephony, see the Agent Frontends documentation.
Connect to phone calls: Give your agent a real phone number with LiveKit Phone Numbers. Purchase a number directly from LiveKit Cloud with no third-party SIP trunk required, and your agent can start handling inbound calls immediately.
Add function calling: Let your agent take actions like booking appointments, querying databases, or controlling smart devices. LiveKit Agents supports function calling out of the box.
Support multiple languages: Configure language detection in STT and use multilingual TTS voices for global users.
Create custom voices: Clone a voice with ElevenLabs or train a custom Cartesia voice for brand consistency.
Add sentiment analysis: Detect user frustration and adjust your agent's tone dynamically.
FAQ
How much does it cost to run a voice agent?
Using LiveKit Inference with the recommended stack, expect $0.0502/min of conversation:
- Agent session: $0.0100/min
- WebRTC connection: $0.0010/min
- LLM (GPT-4.1 mini): $0.0015/min
- STT (Deepgram Nova-3): $0.0077/min
- TTS (Cartesia Sonic 3): $0.0300/min
On the Build plan (free): 1,000 agent minutes are included monthly, plus $2.50 in inference credits. Enough for roughly 50 minutes of conversation with the recommended stack. Once your credits run out, you can bring your own API keys for STT, LLM, and TTS providers, or upgrade to the Ship plan. See LiveKit Pricing for the full breakdown.
What's the minimum latency possible?
With optimized streaming and the right providers, you can achieve 200–500ms total latency from end of user speech to start of agent response. This feels natural and conversational. WebRTC's persistent connections eliminate the round-trip overhead that makes HTTP-based solutions feel sluggish.
Can I use open source models?
Yes. For STT, Whisper can be self-hosted using the OpenAI plugin with a StreamAdapter and VAD. For LLMs, Llama and Mistral are well-supported through Ollama (local), Together AI, Fireworks, and OpenRouter, all of which have official LiveKit Agents plugins. For TTS, you can implement a custom TTS node to integrate any self-hosted engine. Trade-offs: more infrastructure complexity and potentially higher latency, but lower per-minute costs at scale.
How do I handle multiple languages?
Configure Deepgram with language="multi" for automatic language detection, or set the language explicitly based on user preference. For TTS, use multilingual voices from ElevenLabs or Cartesia. Your LLM prompt should instruct the model to respond in the detected language.
What's the difference between voice agents and chatbots?
Chatbots handle text asynchronously. Users type, wait, read. Voice agents handle speech in real-time. Users talk naturally, expect immediate responses, and can interrupt. This requires streaming at every layer, sophisticated turn-taking, and sub-second latency. The technical bar is significantly higher, but the result feels like a real conversation.
Get started
You now have everything you need to build production-ready voice agents. Start with the code in this tutorial, customize the system prompt for your use case, and deploy.
Sign up for LiveKit Cloud to get started for free, or check out the full documentation for advanced features.
Questions? Join the LiveKit Discourse. The community is active and helpful.