Multilingual speech-to-text on your laptop: NVIDIA's Nemotron 3.5 ASR

NVIDIA Nemotron 3.5 ASR (nemotron-3.5-asr-streaming-0.6b) is a 600M-parameter streaming speech recognition model that transcribes 40 language-locales, and it's small enough to run on a laptop. A language-ID prompt steers decoding, so a single set of weights handles English, Spanish, German, Japanese, and more. And it's fast: on an NVIDIA GPU, end-of-utterance latency is around 100ms, quick enough that the transcript keeps pace with the talker.

This post is a tour of what it does and, mostly, how to use it — straight from NeMo, behind an OpenAI-compatible HTTP server, inside a LiveKit voice agent, and in a real app: a local teleprompter that scrolls under your voice in whatever language you're reading.

The features we'll be using

Multilingual from one checkpoint. Pick a language with a target_lang prompt (en-US, es-US, de-DE, fr-FR, ja-JP, …) or pass auto and let the model detect it from the audio. One set of weights covers 35 languages.
Real-time fast. It's a cache-aware FastConformer-RNNT: it processes each new audio chunk while reusing cached encoder context, so partial transcripts arrive while you're still talking, not after you stop. On an NVIDIA GPU, end-of-utterance latency is around 100ms, far ahead of cloud APIs that have to round-trip your audio over the network.
Local and cheap. 600M parameters can run on CPU, Apple Silicon (MPS), or an NVIDIA GPU. Your audio never leaves the machine, and there's no per-minute bill.
A latency/accuracy dial. att_context_size sets the encoder's lookahead — [56,0] for the snappiest deltas up to [56,13] for the most accurate, with [56,3] the balanced default.
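To make the dial concrete, here is a rough sketch of how much audio each setting has to buffer before the encoder can emit output. It assumes one encoder frame covers 80 ms of audio, as in other FastConformer streaming models; that figure is not stated in this post, so check it against the model card. The helper name chunk_duration_ms is made up for illustration.

```python
# Assumption: one encoder frame spans 80 ms of audio (typical for
# FastConformer streaming models; not confirmed for this checkpoint).
FRAME_MS = 80


def chunk_duration_ms(att_context_size, frame_ms=FRAME_MS):
    """Audio needed per step: the current frame plus the lookahead frames."""
    _left, right = att_context_size
    return (right + 1) * frame_ms


# The three settings named above, plus the [56, 1] used later in the NeMo example.
for ctx in ([56, 0], [56, 1], [56, 3], [56, 13]):
    print(ctx, chunk_duration_ms(ctx), "ms of audio per chunk")
```

Under that assumption, a larger second number means the model waits for more future audio before committing to a word, which is where the extra accuracy and the extra delay both come from.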

How to use it

Directly in NeMo

Load the model and pick a language. You set the target language once with a prompt, then stream audio through the model chunk by chunk:

import soundfile as sf
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.utils.streaming_utils import CacheAwareStreamingAudioBuffer

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/nemotron-3.5-asr-streaming-0.6b").eval()
model.set_inference_prompt("de-DE")                   # language (or "auto" to detect)
model.encoder.set_default_att_context_size([56, 1])   # latency/accuracy dial

That single set_inference_prompt() call is all the language wiring you need — the cache-aware step injects the prompt internally. From there it's the same loop whether the audio is a live mic, a pre-recorded file, or a batch of clips: NeMo's streaming buffer slices the audio into the model's chunks, you step each one through conformer_stream_step(), and read back a growing transcript:

# Load the whole file as float32. The sample rate is discarded here,
# so the file must already be at the rate the model expects.
audio, _ = sf.read("speech_de.wav", dtype="float32")
buffer = CacheAwareStreamingAudioBuffer(model, online_normalization=False)
buffer.append_audio(audio, stream_id=-1)

cfg = model.encoder.streaming_cfg
# Start the stream with empty encoder caches
ch, t, ch_len = model.encoder.get_initial_cache_state(batch_size=1)
hyps = None
for step, (chunk, chunk_len) in enumerate(buffer):
    # Each step returns updated caches and hypotheses that feed the next step
    _, _, ch, t, ch_len, hyps = model.conformer_stream_step(
        processed_signal=chunk, processed_signal_length=chunk_len,
        cache_last_channel=ch, cache_last_time=t, cache_last_channel_len=ch_len,
        previous_hypotheses=hyps,
        drop_extra_pre_encoded=cfg.drop_extra_pre_encoded if step else 0,
        keep_all_outputs=buffer.is_buffer_empty(),
        return_transcription=True,
    )
    print(hyps[0].text)   # the transcript so far
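Because each step hands back the whole transcript so far, a UI usually wants only what changed since the last step. A minimal sketch of that diffing, independent of NeMo, might look like the following. The function name transcript_delta is hypothetical, and the erase count is there only in case a later hypothesis revises earlier text.

```python
def transcript_delta(previous: str, current: str) -> tuple[int, str]:
    """Return (chars_to_erase, text_to_append) that turns `previous` into `current`."""
    common = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        common += 1
    return len(previous) - common, current[common:]


# Simulated sequence of growing transcripts, as the streaming loop would print them.
shown = ""
for text in ["hallo", "hallo wie", "hallo wie geht es"]:
    erase, append = transcript_delta(shown, text)
    print(f"erase {erase} chars, append {append!r}")
    shown = text
```

In the streaming loop, `text` would be `hyps[0].text` from each step.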

When you know the language up front, pin it with a target_lang code rather than auto. Decoding conditioned on a known language is both more accurate and more stable. auto is the right call when the language is unknown or mixed.

Behind an OpenAI-compatible server

For some apps you might not be able to use NeMo directly. The demo teleprompter's STT server wraps the model behind an OpenAI-style /v1/audio/transcriptions endpoint, so any OpenAI-compatible client works. The language field selects the prompt; omit it for auto-detect:

# Pin a language
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=nemotron-3.5-asr-streaming-0.6b \
  -F language=es-US

# Or stream deltas over Server-Sent Events
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.wav \
  -F model=nemotron-3.5-asr-streaming-0.6b \
  -F stream=true
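On the client side, the streamed response is a sequence of Server-Sent Events lines. The sketch below shows one way to pull text deltas out of them. It assumes the server mirrors OpenAI's streaming transcription events (a "transcript.text.delta" event with a "delta" field, then "transcript.text.done"); this post does not show the server's actual event schema, so treat the event names and the sample lines as assumptions.

```python
import json


def iter_transcript_deltas(lines):
    """Yield text deltas from an iterable of decoded SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        # Assumed event name, following OpenAI's streaming transcription format
        if event.get("type") == "transcript.text.delta":
            yield event.get("delta", "")


# Hand-written sample of what the stream might look like (not captured output).
sample_lines = [
    'data: {"type": "transcript.text.delta", "delta": "Hola"}',
    "",
    'data: {"type": "transcript.text.delta", "delta": " mundo"}',
    "",
    'data: {"type": "transcript.text.done", "text": "Hola mundo"}',
]
print("".join(iter_transcript_deltas(sample_lines)))
```

Any HTTP client that can iterate over response lines can feed this generator.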

There's also a live WebSocket endpoint (/v1/audio/stream) that takes raw PCM in and emits transcript deltas out — send {"type": "config", "language": "ja-JP"} before the audio to pin a language.
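The messages for that socket can be prepared with the standard library alone. The sketch below builds the config message quoted above and converts float samples into PCM frames. The post only says "raw PCM", so the 16-bit little-endian sample format and the 3200-byte frame size (100 ms at 16 kHz mono) are assumptions, and the helper names are hypothetical.

```python
import json
import struct


def config_message(language: str) -> str:
    """JSON text frame to send before any audio, pinning the language."""
    return json.dumps({"type": "config", "language": language})


def float_to_pcm16(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM (assumed format)."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm_frames(pcm: bytes, frame_bytes: int = 3200):
    """Split PCM bytes into fixed-size binary frames; the last one may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


print(config_message("ja-JP"))
print(len(list(pcm_frames(float_to_pcm16([0.0] * 4000)))), "frames")
```

A WebSocket client would send the config string first, then each frame as a binary message, and read transcript deltas as they come back.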

In a LiveKit voice agent

Because the server speaks the OpenAI API, it drops straight into LiveKit Agents via the OpenAI STT plugin:

from livekit.agents import AgentSession
from livekit.plugins import openai

session = AgentSession(
    stt=openai.STT(
        model="nemotron-3.5-asr-streaming-0.6b",
        base_url="http://localhost:8000/v1",
        api_key="unused",
        language="es-US",   # or omit for auto-detect
    ),
    # ... llm, tts, etc.
)

For true word-by-word streaming over the WebSocket, the teleprompter ships a small custom plugin, LocalNemotronSTT, that emits interim and final transcripts into the AgentSession and sends the language as a config message when the socket opens.

Putting it to work: a multilingual teleprompter

The model's combination — streaming, low-latency, multilingual, local — is exactly what a teleprompter wants. Open a script, hit start, read out loud, and the page scrolls under your voice word by word. With the multilingual model, the script can be in any of the supported languages, and the cursor follows the same way.

Because a LiveKit Agent syncs your voice with the frontend, you can join from any device: open your local app's URL on a tablet or phone and use the teleprompter there instead.

Setup

The whole thing is one command. Clone the teleprompter and run the launcher:

git clone https://github.com/ShayneP/local-teleprompter
cd local-teleprompter
./start.sh

start.sh boots four processes and wires them together: a local livekit-server in dev mode, the Nemotron STT server (port 8000), the LiveKit agent, and the Next.js frontend (port 3000). On the first run it installs everything — pulling torch and NVIDIA's NeMo toolkit is several GB and takes ~10 minutes; after that, it comes up in seconds. When it's ready, open http://localhost:3000, pick a script, and read. Ctrl-C brings it all down.

You'll need a few standard tools on the machine first — Python 3.10+, uv, Node 20+ with pnpm — and either macOS (Apple Silicon) or Linux with the appropriate NVIDIA drivers. A CUDA GPU is not required: on Apple Silicon the model runs on MPS automatically, and CPU works too, just slower.

Selecting the language

Picking the language is one environment variable on the agent:

# agent/src/agent.py
import os

# "auto" detects the language from the audio; pin one (es-US, de-DE, ja-JP, …) for best results.
stt_language = os.environ.get("STT_LANGUAGE", "auto")

session = AgentSession(
    stt=LocalNemotronSTT(base_url=stt_base_url, language=stt_language),
)

That's the whole control surface. Set STT_LANGUAGE=de-DE to pin German, or leave it on auto to let the model detect the language from your voice.

How the cursor follows your voice

The model hands you a stream of words. The interesting part is turning those words into the right position in the script while you stumble over a phrase, skip a sentence, or re-read a paragraph. The matching logic lives in position-tracker.ts, and it's deliberately small:

Forward-only by default. Each newly-spoken word scans an 18-word lookahead from the current cursor and advances to the match. Reading straight through just walks the cursor forward.
Bigram confirmation for big jumps. A far match (more than a couple of words ahead) only commits if the previous spoken word also matched nearby. This stops a stray "the" or "of" from yanking the cursor ten words down the page.
Fuzzy, but tightly scoped. Matching tolerates a single-character difference (Levenshtein-1) only for words of five characters or more; short words must match exactly, so "the" and "then" never collide.
Auto re-anchor when you jump around. After four unmatched words in a row — you skipped ahead, or started re-reading — it scans a trailing window of what you just said against the whole script and re-anchors the cursor if enough of it lines up.

The thresholds behind each of those rules are constants at the top of the file, so if the matcher feels too jumpy or too sticky for your reading style, you can tune it without touching the algorithm.
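To show how the four rules fit together, here is a compact Python sketch of a tracker in the same spirit. It is not a port of position-tracker.ts: the 18-word lookahead, the five-character fuzzy cutoff, and the four-miss trigger come from the description above, while NEAR_JUMP, ANCHOR_WINDOW, ANCHOR_MIN_HITS, the reading of "bigram confirmation" as "the previous spoken word matches the script word just before the candidate", and every name in the code are assumptions made for illustration.

```python
LOOKAHEAD = 18       # from the post: words scanned ahead of the cursor
FUZZY_MIN_LEN = 5    # from the post: shorter words must match exactly
MISS_LIMIT = 4       # from the post: unmatched words before re-anchoring
NEAR_JUMP = 2        # assumption: "a couple of words" ahead counts as near
ANCHOR_WINDOW = 4    # assumption: trailing spoken words used to re-anchor
ANCHOR_MIN_HITS = 3  # assumption: matches needed for "enough of it lines up"


def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution
    return a[i:] == b[i + 1:]          # one extra character in b


def words_match(spoken: str, script: str) -> bool:
    if min(len(spoken), len(script)) < FUZZY_MIN_LEN:
        return spoken == script
    return within_one_edit(spoken, script)


class PositionTracker:
    def __init__(self, script_words):
        self.script = [w.lower() for w in script_words]
        self.cursor = 0   # index of the next script word we expect to hear
        self.recent = []  # trailing spoken words
        self.misses = 0

    def feed(self, word: str) -> int:
        """Consume one spoken word and return the updated cursor."""
        word = word.lower()
        prev = self.recent[-1] if self.recent else None
        self.recent = (self.recent + [word])[-ANCHOR_WINDOW:]
        stop = min(len(self.script), self.cursor + LOOKAHEAD)
        for i in range(self.cursor, stop):
            if not words_match(word, self.script[i]):
                continue
            far = i - self.cursor > NEAR_JUMP
            confirmed = prev is not None and i > 0 and words_match(prev, self.script[i - 1])
            if far and not confirmed:
                continue  # likely a stray common word; keep looking
            self.cursor, self.misses = i + 1, 0
            return self.cursor
        self.misses += 1
        if self.misses >= MISS_LIMIT:
            self._reanchor()
        return self.cursor

    def _reanchor(self):
        """Align the trailing spoken words against the whole script."""
        n = len(self.recent)
        best_hits, best_start = 0, None
        for start in range(len(self.script) - n + 1):
            window = self.script[start:start + n]
            hits = sum(words_match(r, s) for r, s in zip(self.recent, window))
            if hits > best_hits:
                best_hits, best_start = hits, start
        if best_start is not None and best_hits >= ANCHOR_MIN_HITS:
            self.cursor, self.misses = best_start + n, 0


tracker = PositionTracker("the quick brown fox jumps over the lazy dog".split())
for spoken in ["the", "quick", "brawn", "fox"]:
    print(spoken, "->", tracker.feed(spoken))
```

Keeping the thresholds as module-level constants mirrors the tuning approach described above: changing a number changes how jumpy or sticky the cursor feels without touching the matching logic.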

None of this would feel natural without the model underneath it. The reason you can bounce around — drop a line, jump back, paraphrase — and watch the cursor keep up is that the transcript arrives both fast and accurately. Every time you change what you're saying, the matcher gets a correct word within milliseconds and can immediately change gears. A slower or sloppier transcript would leave the cursor lurching a beat behind your voice, and the whole illusion would fall apart. The clever matching is only half of it; the model's ~100ms, accurate stream is what makes it feel alive.

Why this is a big deal

Plenty of models speak many languages. What sets this one apart is speed: on an NVIDIA GPU, end-of-utterance latency lands around 100ms — fast enough that nothing else in the multilingual streaming space really compares. You get that on a single 600M checkpoint that covers 40 language-locales, runs on the laptop in front of you, and reduces the whole stack to one dependency and one target_lang string.

07.16.2026