
Build a multilingual voice agent that automatically switches languages

One of the most common questions developers ask when building voice AI applications is: "How do I detect what language the user is speaking and respond in that same language?" This tutorial walks you through building a voice agent that does exactly that.

You'll create a multilingual voice assistant using LiveKit Agents, Deepgram STT, OpenAI, and Rime TTS. The agent listens for the user's language, detects when they switch languages mid-conversation, and dynamically updates the TTS configuration to respond with a native-sounding voice in that language.

Try the demo (clone and run locally; the hosted preview is not always available). For the full source code including the Next.js frontend, see the rime-multilingual-demo repository on GitHub. You can also watch a video demo of the multilingual agent in action.

What you'll build

By the end of this tutorial, you'll have a voice agent that:

  • Supports English, Hindi, Spanish, Arabic, French, Portuguese, German, Japanese, Hebrew, and Tamil
  • Automatically detects the language the user is speaking
  • Switches TTS language settings on the fly using a single Rime voice
  • Responds naturally in the detected language
  • Optionally syncs the current language to the frontend via participant attributes

The key technique involves overriding the STT node in your agent to intercept speech events, extract the detected language, and update the TTS configuration before the agent responds.

Prerequisites

Before you start, make sure you have:

  • Python 3.11 or later installed
  • uv package manager installed
  • A LiveKit Cloud account (free tier works)
  • API keys for the inference providers used in this tutorial: Deepgram (STT), OpenAI (LLM), and Rime (TTS)

Step 1: Set up the project

Create a new directory and initialize the project:

```shell
mkdir rime-multilingual-agent
cd rime-multilingual-agent
uv init --bare
```

Step 2: Install dependencies

Install the LiveKit Agents framework and the packages you need:

```shell
uv add \
  "livekit>=1.0.23" \
  "livekit-agents[silero,turn-detector]>=1.3.12" \
  "livekit-plugins-noise-cancellation>=0.2.5" \
  "python-dotenv>=1.2.1"
```

This installs:

  • livekit-agents: The core agents framework with unified inference (STT, LLM, TTS)
  • silero: Voice Activity Detection (VAD)
  • turn-detector: Contextually-aware turn detection for natural conversations

STT, LLM, and TTS are configured via the framework's inference API using provider-prefixed models (e.g. deepgram/nova-3-general, openai/gpt-4o, rime/arcana). You supply the corresponding API keys in your environment.

Step 3: Configure environment variables

Create a .env file in your project directory:

```shell
LIVEKIT_API_KEY=<your_api_key>
LIVEKIT_API_SECRET=<your_api_secret>
LIVEKIT_URL=wss://<project-subdomain>.livekit.cloud
```

You can get your LiveKit credentials from the LiveKit Cloud dashboard under Settings > API Keys.
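The tutorial's inference setup also expects the model providers' API keys in your environment. A sketch of the additional `.env` entries is below; the variable names are assumptions based on each provider plugin's usual conventions, so verify them against the plugin documentation for your setup:

```shell
# Provider keys for STT, LLM, and TTS inference
# (variable names assumed from plugin conventions; verify in the provider docs)
DEEPGRAM_API_KEY=<your_deepgram_key>
OPENAI_API_KEY=<your_openai_key>
RIME_API_KEY=<your_rime_key>
```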

Step 4: Create the agent

Create a file named main.py and add the following code. I'll break down each section to explain what it does.

Import dependencies and configure logging

```python
import logging
from typing import AsyncIterable
from dataclasses import dataclass
from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    JobProcess,
    ModelSettings,
    RoomOutputOptions,
    cli,
    stt,
    inference,
)
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit import rtc

logger = logging.getLogger("multilingual-agent")

load_dotenv()
```

Define language configurations

Next, create a dataclass to store TTS settings for each supported language. The current backend uses a single Rime voice (seraphina) and switches only the language code:

```python
# Default configuration constants
DEFAULT_LANGUAGE = "eng"
DEFAULT_TTS_MODEL = "arcana"
DEFAULT_VOICE = "seraphina"


@dataclass
class LanguageConfig:
    """Configuration for TTS settings per language."""

    lang: str
    model: str = DEFAULT_TTS_MODEL
```

The LanguageConfig dataclass holds the Rime language code and model name. The framework uses a single voice across languages; Rime handles pronunciation per language.
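As a quick sanity check of the dataclass (plain Python, runnable on its own): every language shares the `arcana` model unless a config overrides it.

```python
from dataclasses import dataclass

DEFAULT_TTS_MODEL = "arcana"


@dataclass
class LanguageConfig:
    """Per-language TTS settings; the model defaults to the shared constant."""

    lang: str
    model: str = DEFAULT_TTS_MODEL


cfg = LanguageConfig(lang="hin")
print(cfg.model)  # arcana
```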

Create the multilingual agent class

Now create the agent class that handles language detection and TTS switching:

```python
class MultilingualAgent(Agent):
    """A multilingual voice agent that detects user language and responds accordingly."""

    # TTS config per language. Keys are Rime 3-letter codes. Voice is always seraphina.
    LANGUAGE_CONFIGS = {
        "eng": LanguageConfig(lang="eng"),
        "hin": LanguageConfig(lang="hin"),
        "spa": LanguageConfig(lang="spa"),
        "ara": LanguageConfig(lang="ara"),
        "fra": LanguageConfig(lang="fra"),
        "por": LanguageConfig(lang="por"),
        "ger": LanguageConfig(lang="ger"),
        "jpn": LanguageConfig(lang="jpn"),
        "heb": LanguageConfig(lang="heb"),
        "tam": LanguageConfig(lang="tam"),
    }

    # Display names for instructions. Keys match LANGUAGE_CONFIGS.
    LANGUAGE_DISPLAY_NAMES = {
        "eng": "English",
        "hin": "Hindi",
        "spa": "Spanish",
        "ara": "Arabic",
        "fra": "French",
        "por": "Portuguese",
        "ger": "German",
        "jpn": "Japanese",
        "heb": "Hebrew",
        "tam": "Tamil",
    }

    # STT returns ISO 639-1 (e.g. "en", "es") or locale (e.g. "en-US"). Map to Rime codes.
    STT_TO_RIME = {
        "en": "eng",
        "hi": "hin",
        "es": "spa",
        "ar": "ara",
        "fr": "fra",
        "pt": "por",
        "de": "ger",
        "ja": "jpn",
        "he": "heb",
        "ta": "tam",
    }

    SUPPORTED_LANGUAGES = list(LANGUAGE_CONFIGS.keys())

    def __init__(self) -> None:
        super().__init__(instructions=self._get_instructions())
        self._current_language = DEFAULT_LANGUAGE
        self._room: rtc.Room | None = None

    def _get_instructions(self) -> str:
        """Get agent instructions in a clean, maintainable format."""
        supported_languages = ", ".join(
            self.LANGUAGE_DISPLAY_NAMES[lang] for lang in self.SUPPORTED_LANGUAGES
        )
        return (
            "You are a voice assistant powered by Rime's text-to-speech technology. "
            "You are here to showcase Rime's natural, expressive, and multilingual voice capabilities. "
            "You respond in the same language the user speaks in. "
            f"You support {supported_languages}. "
            "If the user speaks in any other language, respond in English and politely let them know: "
            f"'I only support {supported_languages}. Please speak in one of these languages.' "
            "Keep your responses concise and to the point since this is a voice conversation. "
            "Do not use emojis, asterisks, markdown, or other special characters in your responses. "
            "You are curious, friendly, and have a sense of humor."
        )
```

The LANGUAGE_CONFIGS dictionary maps Rime 3-letter language codes to TTS config. STT_TO_RIME maps the ISO codes returned by Deepgram to those Rime codes. The instructions are built from LANGUAGE_DISPLAY_NAMES so the list of supported languages stays in sync.
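To make the mapping concrete, here is a minimal, dependency-free sketch (with a trimmed-down mapping table) of the same normalization the agent performs: reduce a locale code like `en-US` to its base, map it to a Rime code, and fall back to the default for unsupported languages.

```python
# Standalone sketch of the STT-to-Rime language normalization used by the agent.
DEFAULT_LANGUAGE = "eng"
STT_TO_RIME = {"en": "eng", "hi": "hin", "es": "spa", "de": "ger"}
SUPPORTED = {"eng", "hin", "spa", "ger"}


def to_rime_code(language: str) -> str:
    """Reduce an STT code like 'en' or 'en-US' to a supported Rime code."""
    base = language.split("-")[0].lower() if language else ""
    rime_lang = STT_TO_RIME.get(base, base) if base else DEFAULT_LANGUAGE
    return rime_lang if rime_lang in SUPPORTED else DEFAULT_LANGUAGE


print(to_rime_code("en-US"))  # eng
print(to_rime_code("de"))     # ger
print(to_rime_code("zh"))     # eng (unsupported -> default)
```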

Override the STT node

This is the core technique for detecting language changes. Override the stt_node method to intercept speech-to-text events and check for language changes:

```python
    async def stt_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ) -> AsyncIterable[stt.SpeechEvent]:
        """
        Override STT node to detect language and update TTS configuration dynamically.

        This method intercepts speech events to detect language changes and updates
        the TTS settings to match the detected language for natural voice output.
        """
        default_stt = super().stt_node(audio, model_settings)

        async for event in default_stt:
            if self._is_transcript_event(event):
                await self._handle_language_detection(event)
            yield event

    def _is_transcript_event(self, event: stt.SpeechEvent) -> bool:
        """Check if event is a transcript event with language information."""
        return (
            event.type
            in [
                stt.SpeechEventType.INTERIM_TRANSCRIPT,
                stt.SpeechEventType.FINAL_TRANSCRIPT,
            ]
            and event.alternatives
        )

    async def _handle_language_detection(self, event: stt.SpeechEvent) -> None:
        """Update TTS from STT-detected language and sync to frontend via participant attributes."""
        detected_language = event.alternatives[0].language
        if not detected_language:
            return
        effective_language = self._update_tts_for_language(detected_language)
        if effective_language != self._current_language:
            self._current_language = effective_language
            await self._publish_language_update(effective_language)

    def _update_tts_for_language(self, language: str) -> str:
        """Update TTS configuration based on detected language.

        Returns the effective Rime language code (the one actually used for TTS).
        """
        base = language.split("-")[0].lower() if language else ""
        rime_lang = self.STT_TO_RIME.get(base, base) if base else DEFAULT_LANGUAGE
        effective_lang = rime_lang if rime_lang in self.LANGUAGE_CONFIGS else DEFAULT_LANGUAGE
        config = self.LANGUAGE_CONFIGS.get(effective_lang, self.LANGUAGE_CONFIGS[DEFAULT_LANGUAGE])
        logger.info(f"Updating TTS: detected={language} -> rime={effective_lang}")
        self.session.tts.update_options(
            model=f"rime/{config.model}",
            language=config.lang,
        )
        return effective_lang

    async def _publish_language_update(self, language_code: str) -> None:
        """Sync current language to the frontend via participant attributes (see LiveKit docs: participant attributes)."""
        if not self._room:
            return
        try:
            display_name = self.LANGUAGE_DISPLAY_NAMES.get(language_code, "English")
            await self._room.local_participant.set_attributes({"current_language": display_name})
        except Exception as e:
            logger.warning("Failed to publish language update: %s", e)
```

The stt_node method receives audio frames and yields speech events. By iterating through the default STT output and checking each event, you get the detected language from transcript events. When the language changes, _update_tts_for_language maps the STT language (e.g. en or en-US) to a Rime code, updates TTS with update_options(), and returns the effective language. _publish_language_update writes the current language to the room participant's attributes so a frontend can show it (see the full demo repo for an example UI).

Add the greeting

Override on_enter to publish the initial language and greet the user when they connect:

```python
    async def on_enter(self) -> None:
        """Called when the agent session starts. Generate initial greeting."""
        await self._publish_language_update(self._current_language)
        self.session.generate_reply(
            instructions="Greet the user and introduce yourself as a voice assistant powered by Rime's text-to-speech technology. Ask how you can help them."
        )
```

Set up the server and entrypoint

The agent uses the AgentServer API: register a prewarm function and an RTC session entrypoint that configures the agent session:

```python
def prewarm(proc: JobProcess) -> None:
    """Preload VAD model for faster startup."""
    proc.userdata["vad"] = silero.VAD.load()


server = AgentServer()
server.setup_fnc = prewarm


@server.rtc_session(agent_name="rime-multilingual-agent")
async def entrypoint(ctx: JobContext) -> None:
    """Main entry point for the multilingual agent worker."""
    ctx.log_context_fields = {"room": ctx.room.name}

    session = AgentSession(
        vad=ctx.proc.userdata["vad"],
        stt=inference.STT(model="deepgram/nova-3-general", language="multi"),
        llm=inference.LLM(model="openai/gpt-4o"),
        tts=inference.TTS(
            model=f"rime/{DEFAULT_TTS_MODEL}", voice=DEFAULT_VOICE, language=DEFAULT_LANGUAGE
        ),
        turn_detection=MultilingualModel(),
    )

    async def log_usage() -> None:
        for usage in session.usage.model_usage:
            logger.info(f"Usage: {usage.provider}/{usage.model}: {usage}")

    ctx.add_shutdown_callback(log_usage)

    agent = MultilingualAgent()
    agent._room = ctx.room
    await session.start(
        agent=agent,
        room=ctx.room,
        room_output_options=RoomOutputOptions(transcription_enabled=True),
    )


if __name__ == "__main__":
    cli.run_app(server)
```

Configuration notes:

  • inference.STT with model="deepgram/nova-3-general" and language="multi" enables automatic language detection.
  • inference.LLM and inference.TTS use provider-prefixed models (openai/gpt-4o, rime/arcana).
  • MultilingualModel for turn detection works with multilingual STT for natural turn-taking.
  • The agent is given a reference to the room (agent._room = ctx.room) so it can publish language updates to participant attributes.

Step 5: Download model files

Before running the agent for the first time, download the required model files for the turn detector and Silero VAD:

```shell
uv run main.py download-files
```

Step 6: Run the agent

Start by running the agent in console mode so you can test the voice pipeline locally with your microphone and speakers:

```shell
uv run main.py console
```

Want a visual debugging interface? Run the agent in dev mode (uv run main.py dev), then open the Agent Console, select your agent in the configuration panel, and start a session. Your agent will attach when dispatched (e.g. via LiveKit Cloud agent configuration). Interact with the agent to confirm language switching, and explore performance metrics and other aspects of your session.

Development mode

Connect to LiveKit Cloud for internet-accessible testing:

```shell
uv run main.py dev
```

Production mode

Run in production:

```shell
uv run main.py start
```

How it works

The language detection flow works like this:

  1. User speaks in any supported language.
  2. Deepgram STT (with language="multi") transcribes the speech and detects the language.
  3. The overridden stt_node intercepts the speech event and reads the detected language.
  4. If the language changed, _update_tts_for_language maps the STT code to a Rime code and updates TTS via update_options().
  5. Optionally, _publish_language_update writes the current language to the participant's attributes for the frontend.
  6. The LLM receives the transcript and generates a response in context.
  7. Rime TTS synthesizes the response using the updated language setting.

The instructions tell the LLM to respond in the same language as the user; the TTS update makes the spoken output use the correct Rime language.
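The flow above can be simulated in a few lines of plain Python (no LiveKit types involved; names here are illustrative). Note the asymmetry in the agent's code: the TTS mapping runs on every transcript event, while the participant-attribute publish fires only when the effective Rime language actually changes.

```python
# Toy simulation of the detection flow with a trimmed-down mapping table.
DEFAULT_LANGUAGE = "eng"
STT_TO_RIME = {"en": "eng", "es": "spa", "hi": "hin"}
SUPPORTED = {"eng", "spa", "hin"}


def to_rime_code(language: str) -> str:
    base = language.split("-")[0].lower() if language else ""
    rime = STT_TO_RIME.get(base, base) if base else DEFAULT_LANGUAGE
    return rime if rime in SUPPORTED else DEFAULT_LANGUAGE


current = DEFAULT_LANGUAGE
published = []
# Interim events often repeat the same language; only real changes publish an update.
for detected in ["en", "en-US", "es", "es", "es-MX", "hi", "en"]:
    effective = to_rime_code(detected)  # TTS options would be refreshed here each time
    if effective != current:
        current = effective
        published.append(effective)  # stands in for set_attributes(...)

print(published)  # ['spa', 'hin', 'eng']
```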

Summary

This tutorial covered how to build a multilingual voice agent that automatically detects and responds in the user's language. The key techniques include:

  • Overriding the stt_node to intercept speech events and detect language changes
  • Mapping STT language codes to Rime (or your TTS provider) and using update_options() to change TTS settings mid-conversation
  • Configuring Deepgram STT with multilingual mode for automatic language detection
  • Using the MultilingualModel turn detector for natural conversation flow
  • Optionally syncing the current language to a frontend via participant attributes


Complete code

Here is the complete main.py file.

```python
import logging
from typing import AsyncIterable
from dataclasses import dataclass
from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentServer,
    AgentSession,
    JobContext,
    JobProcess,
    ModelSettings,
    RoomOutputOptions,
    cli,
    stt,
    inference,
)
from livekit.plugins import silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel
from livekit import rtc


logger = logging.getLogger("multilingual-agent")


load_dotenv()


# Default configuration constants
DEFAULT_LANGUAGE = "eng"
DEFAULT_TTS_MODEL = "arcana"
DEFAULT_VOICE = "seraphina"


@dataclass
class LanguageConfig:
    """Configuration for TTS settings per language."""

    lang: str
    model: str = DEFAULT_TTS_MODEL


class MultilingualAgent(Agent):
    """A multilingual voice agent that detects user language and responds accordingly."""

    # TTS config per language. Keys are Rime 3-letter codes. Voice is always seraphina.
    LANGUAGE_CONFIGS = {
        "eng": LanguageConfig(lang="eng"),
        "hin": LanguageConfig(lang="hin"),
        "spa": LanguageConfig(lang="spa"),
        "ara": LanguageConfig(lang="ara"),
        "fra": LanguageConfig(lang="fra"),
        "por": LanguageConfig(lang="por"),
        "ger": LanguageConfig(lang="ger"),
        "jpn": LanguageConfig(lang="jpn"),
        "heb": LanguageConfig(lang="heb"),
        "tam": LanguageConfig(lang="tam"),
    }

    LANGUAGE_DISPLAY_NAMES = {
        "eng": "English",
        "hin": "Hindi",
        "spa": "Spanish",
        "ara": "Arabic",
        "fra": "French",
        "por": "Portuguese",
        "ger": "German",
        "jpn": "Japanese",
        "heb": "Hebrew",
        "tam": "Tamil",
    }

    STT_TO_RIME = {
        "en": "eng",
        "hi": "hin",
        "es": "spa",
        "ar": "ara",
        "fr": "fra",
        "pt": "por",
        "de": "ger",
        "ja": "jpn",
        "he": "heb",
        "ta": "tam",
    }

    SUPPORTED_LANGUAGES = list(LANGUAGE_CONFIGS.keys())

    def __init__(self) -> None:
        super().__init__(instructions=self._get_instructions())
        self._current_language = DEFAULT_LANGUAGE
        self._room: rtc.Room | None = None

    def _get_instructions(self) -> str:
        """Get agent instructions in a clean, maintainable format."""
        supported_languages = ", ".join(
            self.LANGUAGE_DISPLAY_NAMES[lang] for lang in self.SUPPORTED_LANGUAGES
        )
        return (
            "You are a voice assistant powered by Rime's text-to-speech technology. "
            "You are here to showcase Rime's natural, expressive, and multilingual voice capabilities. "
            "You respond in the same language the user speaks in. "
            f"You support {supported_languages}. "
            "If the user speaks in any other language, respond in English and politely let them know: "
            f"'I only support {supported_languages}. Please speak in one of these languages.' "
            "Keep your responses concise and to the point since this is a voice conversation. "
            "Do not use emojis, asterisks, markdown, or other special characters in your responses. "
            "You are curious, friendly, and have a sense of humor."
        )

    async def stt_node(
        self, audio: AsyncIterable[rtc.AudioFrame], model_settings: ModelSettings
    ) -> AsyncIterable[stt.SpeechEvent]:
        """
        Override STT node to detect language and update TTS configuration dynamically.

        This method intercepts speech events to detect language changes and updates
        the TTS settings to match the detected language for natural voice output.
        """
        default_stt = super().stt_node(audio, model_settings)

        async for event in default_stt:
            if self._is_transcript_event(event):
                await self._handle_language_detection(event)
            yield event

    def _is_transcript_event(self, event: stt.SpeechEvent) -> bool:
        """Check if event is a transcript event with language information."""
        return (
            event.type
            in [
                stt.SpeechEventType.INTERIM_TRANSCRIPT,
                stt.SpeechEventType.FINAL_TRANSCRIPT,
            ]
            and event.alternatives
        )

    async def _handle_language_detection(self, event: stt.SpeechEvent) -> None:
        """Update TTS from STT-detected language and sync to frontend via participant attributes."""
        detected_language = event.alternatives[0].language
        if not detected_language:
            return
        effective_language = self._update_tts_for_language(detected_language)
        if effective_language != self._current_language:
            self._current_language = effective_language
            await self._publish_language_update(effective_language)

    def _update_tts_for_language(self, language: str) -> str:
        """Update TTS configuration based on detected language.

        Returns the effective Rime language code (the one actually used for TTS).
        """
        base = language.split("-")[0].lower() if language else ""
        rime_lang = self.STT_TO_RIME.get(base, base) if base else DEFAULT_LANGUAGE
        effective_lang = rime_lang if rime_lang in self.LANGUAGE_CONFIGS else DEFAULT_LANGUAGE
        config = self.LANGUAGE_CONFIGS.get(effective_lang, self.LANGUAGE_CONFIGS[DEFAULT_LANGUAGE])
        logger.info(f"Updating TTS: detected={language} -> rime={effective_lang}")
        self.session.tts.update_options(
            model=f"rime/{config.model}",
            language=config.lang,
        )
        return effective_lang

    async def _publish_language_update(self, language_code: str) -> None:
        """Sync current language to the frontend via participant attributes (see LiveKit docs: participant attributes)."""
        if not self._room:
            return
        try:
            display_name = self.LANGUAGE_DISPLAY_NAMES.get(language_code, "English")
            await self._room.local_participant.set_attributes({"current_language": display_name})
        except Exception as e:
            logger.warning("Failed to publish language update: %s", e)

    async def on_enter(self) -> None:
        """Called when the agent session starts. Generate initial greeting."""
        await self._publish_language_update(self._current_language)
        self.session.generate_reply(
            instructions="Greet the user and introduce yourself as a voice assistant powered by Rime's text-to-speech technology. Ask how you can help them."
        )


def prewarm(proc: JobProcess) -> None:
    """Preload VAD model for faster startup."""
    proc.userdata["vad"] = silero.VAD.load()


server = AgentServer()
server.setup_fnc = prewarm


@server.rtc_session(agent_name="rime-multilingual-agent")
async def entrypoint(ctx: JobContext) -> None:
    """Main entry point for the multilingual agent worker."""
    ctx.log_context_fields = {"room": ctx.room.name}

    session = AgentSession(
        vad=ctx.proc.userdata["vad"],
        stt=inference.STT(model="deepgram/nova-3-general", language="multi"),
        llm=inference.LLM(model="openai/gpt-4o"),
        tts=inference.TTS(
            model=f"rime/{DEFAULT_TTS_MODEL}", voice=DEFAULT_VOICE, language=DEFAULT_LANGUAGE
        ),
        turn_detection=MultilingualModel(),
    )

    async def log_usage() -> None:
        """Log usage summary on shutdown."""
        for usage in session.usage.model_usage:
            logger.info(f"Usage: {usage.provider}/{usage.model}: {usage}")

    ctx.add_shutdown_callback(log_usage)

    agent = MultilingualAgent()
    agent._room = ctx.room
    await session.start(
        agent=agent,
        room=ctx.room,
        room_output_options=RoomOutputOptions(transcription_enabled=True),
    )


if __name__ == "__main__":
    cli.run_app(server)
```