A Practical Guide to Prompting Gemini 3.1 Flash TTS
A working set of rules for getting natural, emotionally appropriate speech out of gemini-3.1-flash-tts-preview. Every tip here comes from a specific failure we hit and fixed. 3.1 Flash Preview is currently available in LiveKit Inference in beta, so rough edges and longer-than-average response times are the norm.
That said, the model is very expressive and extremely flexible. There are already things you can do with 3.1 Flash Preview that aren't possible with any other model!
The Short Version
If you do nothing else:
- Structure your prompt with explicit section labels and a #### TRANSCRIPT delimiter. Without it, Gemini will often read your stage directions aloud. Google documents this as a failure mode; it is not a bug.
- Use commas between tagged clauses, not periods. Period-separated fragments sound chopped.
- Classify the emotional scene first, then pick tags. Don't force any universal templates. Laughter in an apology is bad!
Why Gemini TTS Is Different
Unlike traditional phoneme-based TTS, Gemini is an LLM that interprets your whole prompt as context. This means:
- Natural-language direction ("warm and unhurried") actually works
- But the model has to decide what's direction and what's speech
- Without discipline, the classifier confuses the two, and reads your direction aloud
The Gemini docs warn about this explicitly:
Vague prompts may fail to trigger the speech synthesis classifier, resulting in a rejected request or causing the model to read your style instructions and director's notes aloud. Validate your prompts by adding a clear preamble instructing the model to synthesize speech, and explicitly label where the actual spoken transcript begins.
This guide is the answer to what "validate your prompts" actually means in practice.
The Canonical Prompt Structure
Every prompt should look like this:
    Synthesize speech for the performance defined below. The profile, scene,
    performance notes, and context are direction only. Do NOT speak them.
    Speak ONLY the lines under #### TRANSCRIPT.

    # AUDIO PROFILE: <First Name + Last Initial of your persona>
    ## "<One-line persona>"

    ## SCENE: <Short scene name>
    <2 to 3 sentence scene. Setting, posture, a concrete detail, the vibe.>

    ### PERFORMANCE
    Style: <Tone and emotional register. Never "quiet" or "flat".>
    Pace: <One specific rhythmic moment. Don't quote literal transcript words.>
    Accent: <Short accent descriptor.>

    ### CONTEXT
    <1 to 2 sentences on who this character is and why they sound this way.>

    #### TRANSCRIPT
    <The actual words, with inline audio tags, joined by commas.>
The #### TRANSCRIPT header is load-bearing. The synthesize-speech preamble at the top is load-bearing. Everything else is stylistic.
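If you generate prompts programmatically, the structure above is easy to assemble from a handful of fields. The following is a minimal sketch, not an official helper: the function name `build_prompt` and its keyword arguments are our own illustrative choices, and only the preamble text and the section headers come from the template above.

```python
# Preamble copied from the canonical structure; it must come first.
PREAMBLE = (
    "Synthesize speech for the performance defined below. The profile, scene,\n"
    "performance notes, and context are direction only. Do NOT speak them.\n"
    "Speak ONLY the lines under #### TRANSCRIPT."
)


def build_prompt(*, name: str, persona: str, scene_title: str, scene: str,
                 style: str, pace: str, accent: str, context: str,
                 transcript: str) -> str:
    """Assemble a prompt in the section order of the canonical structure.

    All parameter names are hypothetical; they simply mirror the placeholders
    in the template.
    """
    sections = [
        PREAMBLE,
        f'# AUDIO PROFILE: {name}\n## "{persona}"',
        f"## SCENE: {scene_title}\n{scene}",
        f"### PERFORMANCE\nStyle: {style}\nPace: {pace}\nAccent: {accent}",
        f"### CONTEXT\n{context}",
        # The transcript header goes last so nothing after it is direction.
        f"#### TRANSCRIPT\n{transcript}",
    ]
    return "\n\n".join(sections)
```

Keeping the preamble and the #### TRANSCRIPT header in one place like this means nobody on the team can forget either of the two load-bearing pieces.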
The Nine Rules
1. Always prepend a synthesize-speech preamble
This single paragraph is what reliably triggers the speech-synthesis classifier path instead of the "read it all" path.
    Synthesize speech for the performance defined below. The profile, scene,
    performance notes, and context are direction only. Do NOT speak them.
    Speak ONLY the lines under #### TRANSCRIPT.
Skip this and you'll hit intermittent failures where Gemini reads your whole prompt out loud.
2. Use #### TRANSCRIPT as the delimiter, exact header
We tested alternatives. Lower-level headers (##### TRANSCRIPT) work, but #### TRANSCRIPT is what the official docs example uses, and it's what we saw the most reliable classifier behavior with.
3. Use short section labels, not the docs' verbose ones
The current iteration of the official Gemini docs uses ### DIRECTOR'S NOTES and ### SAMPLE CONTEXT. Don't. In our testing, the literal string DIRECTOR'S got spoken aloud.
Use these instead:
- ### PERFORMANCE (for Style / Pace / Accent)
- ### CONTEXT (for sample context)
Apostrophes and multi-word section headers are classifier hazards.
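Rules 1 through 3 are mechanical enough to check before a request goes out. Here is a rough sketch of such a check. The function name `check_structure` is hypothetical, and the apostrophe test is a heuristic of ours rather than anything the model or the docs define.

```python
def check_structure(prompt: str) -> list[str]:
    """Return a list of skeleton problems (empty if none were found)."""
    problems = []
    lines = [line.strip() for line in prompt.splitlines()]

    # Rule 1: the synthesize-speech preamble should open the prompt.
    if not prompt.lstrip().startswith("Synthesize speech"):
        problems.append("missing synthesize-speech preamble at the top")

    # Rule 2: exactly one line should be the literal delimiter header.
    if lines.count("#### TRANSCRIPT") != 1:
        problems.append("expected exactly one '#### TRANSCRIPT' header line")

    # Rule 3 (heuristic): flag apostrophes in header lines, e.g. DIRECTOR'S.
    for line in lines:
        if line.startswith("#") and "'" in line:
            problems.append(f"apostrophe in section header: {line}")

    return problems
```

An empty list only means the skeleton looks right. It says nothing about whether the scene, style notes, or tags are any good.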
4. Classify emotional register BEFORE picking audio tags
Don't use a universal tag template. Pick one register first.
| Register | When | Safe tags | Forbidden |
|---|---|---|---|
| EMPATHY | Customer upset, apologizing, acknowledging a problem | [sighs], [warmly], [thoughtfully], [gently] | [soft laugh], [cheerfully] |
| CLARIFY_PROBLEM | Confirming the details of a customer's issue | [thoughtfully], [warmly], [gently] | [soft laugh], [cheerfully], [sighs] |
| TRANSACTIONAL | Policy, transfers, troubleshooting, scheduling, handoffs | [warmly], [thoughtfully] | [soft laugh], [sighs], [cheerfully] |
| WARM_FRIENDLY | Greetings, closings, confirmations, upsells | [warmly], [thoughtfully], [cheerfully], [soft laugh] (max one) | (none) |
Never laugh at an upset customer. This is the easiest way to make an AI voice feel deeply wrong.
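The Forbidden column of the table translates directly into a lookup, which makes it easy to catch a stray laugh before it reaches a customer. This is a sketch under a couple of assumptions: the names `FORBIDDEN_TAGS` and `forbidden_tags_used` are ours, and we read "(max one)" as applying to [soft laugh] only.

```python
import re

# Forbidden tags per register, transcribed from the table above.
FORBIDDEN_TAGS = {
    "EMPATHY": {"soft laugh", "cheerfully"},
    "CLARIFY_PROBLEM": {"soft laugh", "cheerfully", "sighs"},
    "TRANSACTIONAL": {"soft laugh", "sighs", "cheerfully"},
    "WARM_FRIENDLY": set(),
}

# Matches inline audio tags such as [warmly] or [soft laugh].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def forbidden_tags_used(register: str, transcript: str) -> list[str]:
    """List the tags in the transcript that the register forbids."""
    tags = TAG_PATTERN.findall(transcript)
    found = [tag for tag in tags if tag in FORBIDDEN_TAGS[register]]
    # Assumption: "(max one)" in the table limits [soft laugh] only.
    if register == "WARM_FRIENDLY" and tags.count("soft laugh") > 1:
        found.append("soft laugh (more than one)")
    return found
```

The register itself still has to come from you (or an upstream classifier). This only enforces the palette once the register is chosen.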
5. Stick to documented audio tags
The docs list "commonly used tags", which implies the list is not exhaustive and that custom tags work. In practice, custom emotion tags produce noticeably weaker prosody. We tried [apologetically], [measured], [helpfully], [softly], [carefully]. All produced flatter delivery than the documented set.
The six tags we keep in rotation:
[warmly], [thoughtfully], [sighs], [gently], [soft laugh], [cheerfully]
For non-emotional modifiers (pacing, volume, character), custom tags are fine. Things like [whispers], [very slow], [like a cartoon dog] work well. It's the emotional adjectives specifically where coverage feels thin.
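A quick way to audit a transcript is to list every tag that falls outside the six in rotation and then review the list by hand. The sketch below does only that. The helper name `tags_outside_rotation` is hypothetical, and it cannot tell an emotional adjective from a pacing or character modifier, so a human still has to decide which flagged tags are fine.

```python
import re

# The six tags kept in rotation, as listed above.
ROTATION_TAGS = {"warmly", "thoughtfully", "sighs", "gently", "soft laugh", "cheerfully"}


def tags_outside_rotation(transcript: str) -> list[str]:
    """List tags not in the six-tag set, in order of first appearance."""
    flagged = []
    for tag in re.findall(r"\[([^\[\]]+)\]", transcript):
        if tag not in ROTATION_TAGS and tag not in flagged:
            flagged.append(tag)
    return flagged
```

For example, a transcript containing [apologetically] and [whispers] would flag both. Per the rule above, the first is worth replacing and the second is fine to keep.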
6. Write a scene, not a role label
Bad (too abstract, the model has nothing to latch onto):
    ## SCENE: Customer service rep
    A warm customer service rep explaining something clearly.
Good (concrete, sensory, specific):
    ## SCENE: Late afternoon at the clinic front desk
    Late afternoon light across the desk, calendar open to Tuesday. Mira has
    a pen in hand, confirming an easy appointment. The favorite part of the day.
Scene-rich preambles move realism meaningfully. Generic role labels don't. The docs' own example ("It's 10:00 PM in a glass-walled studio overlooking the moonlit London skyline...") models this.
7. Never instruct flatness
This one bit us. We wrote for an empathy utterance:
    Style: voice dropped, quiet, no rush
Gemini took "quiet" and "no rush" literally. The prosody score crashed. The register was right, the instruction was wrong.
Avoid these in Style and Pace notes:
- "quiet", "quietly"
- "flat", "monotone"
- "no rush" (reads as "go slow and flat")
- "careful" (reads as "over-precise, stiff")
- "whispered" (unless you actually want whispering)
Better phrasing for quieter moods:
- "warm and sincere"
- "voice dropped half an octave but full of feeling"
- "patient and unhurried"
- "measured but present"
The register is carried by the text content and the audio tags. Style notes should amplify rather than dampen, otherwise the delivery drifts toward a monotone, AI-like read.
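The avoid list above is short enough to lint for. This is a minimal sketch: `FLATNESS_TERMS` is just the list from this rule, the function name is ours, and whole-word matching is a choice we made so that a phrase like "not flatness" in a good style note is not flagged.

```python
import re

# Terms from the avoid list above.
FLATNESS_TERMS = ["quiet", "quietly", "flat", "monotone", "no rush", "careful", "whispered"]


def flatness_terms_in(note: str) -> list[str]:
    """Return the avoid-list terms that appear as whole words in a note."""
    lowered = note.lower()
    return [
        term for term in FLATNESS_TERMS
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]
```

Run against the style note that bit us ("voice dropped, quiet, no rush"), this flags "quiet" and "no rush". If you actually want whispering, ignore the "whispered" hit.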
8. Commas over periods in the transcript
Some TTS providers respond extremely well to extra periods in the transcript, especially when tone changes. This is because TTS generations tend to speak too quickly, and extra punctuation helps create a more "human-like" pacing while keeping the generation natural. This does not work for Gemini 3.1 Flash TTS.
Bad (sounds choppy):
    #### TRANSCRIPT
    [warmly] Okay. [thoughtfully] So your appointment. [warmly] That's all set.
    [cheerfully] Tuesday. [warmly] At three... [thoughtfully] PM.
Good (natural prose flow, tags still mark emotional pivots):
    #### TRANSCRIPT
    [warmly] Okay, [thoughtfully] so your appointment, [warmly] that's all set.
    [cheerfully] Tuesday, [warmly] at three... [thoughtfully] PM.
Rule of thumb:
- Commas between tagged clauses within a sentence
- Periods only where the original text has real sentence endings
- Ellipses (...) for a natural trailing pause (1 to 2 per utterance)
- Em-dashes (—) for a micro-pause mid-thought (1 per utterance)
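If your transcripts are built from tagged clauses, the rule of thumb can be applied at assembly time. The sketch below assumes each sentence arrives as a list of (tag, text) pairs, which is our own hypothetical representation. Both function names are illustrative.

```python
def join_tagged_clauses(sentences: list[list[tuple[str, str]]]) -> str:
    """Join clauses with commas; put a period only at real sentence ends."""
    rendered = []
    for clauses in sentences:
        body = ", ".join(f"[{tag}] {text}" for tag, text in clauses)
        # Keep existing terminal punctuation; otherwise close with a period.
        rendered.append(body if body.endswith((".", "!", "?")) else body + ".")
    return " ".join(rendered)


def pause_budget_exceeded(transcript: str) -> bool:
    """True if there are more than 2 ellipses or more than 1 em-dash."""
    return transcript.count("...") > 2 or transcript.count("\u2014") > 1
```

Usage, with the appointment example from this rule:

```python
line = join_tagged_clauses([
    [("warmly", "Okay"), ("thoughtfully", "so your appointment"), ("warmly", "that's all set")],
])
# line == "[warmly] Okay, [thoughtfully] so your appointment, [warmly] that's all set."
```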
9. Don't quote literal transcript words in Style/Pace notes
Quoting words from the transcript in your notes occasionally causes the engine to speak the director's notes aloud.
Bad:
    Pace: A small lift at "oh" at the start, like the thought just came up.
    Style: A small chuckle at "y'know", natural, not performative.
Good:
    Pace: A small lift at the opening, like the thought just came up.
    Style: A natural chuckle partway through. The tell of someone who actually
    believes it.
Describe the rhythm, don't name the words.
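This rule is also checkable: pull every double-quoted string out of the Style and Pace notes and see whether it appears in the transcript. The following is a rough sketch with a hypothetical function name. It assumes straight double quotes and does case-insensitive whole-word matching.

```python
import re


def quoted_transcript_words(notes: str, transcript: str) -> list[str]:
    """Return quoted strings in the notes that also occur in the transcript."""
    quoted = re.findall(r'"([^"]+)"', notes)
    spoken = transcript.lower()
    return [
        q for q in quoted
        if re.search(rf"\b{re.escape(q.lower())}\b", spoken)
    ]
```

Any hit is a candidate for rewording the note so it describes the moment ("at the opening", "partway through") instead of naming the word.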
Full Working Example
    Synthesize speech for the performance defined below. The profile, scene,
    performance notes, and context are direction only. Do NOT speak them.
    Speak ONLY the lines under #### TRANSCRIPT.

    # AUDIO PROFILE: Maria J.
    ## "The Senior Support Rep"

    ## SCENE: A tough moment in the call
    The customer has shared something frustrating. Maria leans a little closer
    to the mic, voice carrying real feeling, the kind of apology you actually mean.

    ### PERFORMANCE
    Style: Warm and sincere. Genuine concern. The voice carries feeling, not
    flatness. A soft exhale at the opening is real, not performative. Never
    amused, never casual.
    Pace: Natural, with a small settling pause early on. The beat of someone
    actually taking in what they heard.

    ### CONTEXT
    Maria is the rep who actually listens, and callers can hear the difference.
    She takes ownership of getting things fixed.

    #### TRANSCRIPT
    [sighs] Oh. [gently] I'm really sorry to hear that. [warmly] Lemme see
    [thoughtfully] what I can do. [warmly] We'll get this sorted out
    [gently] for you... [warmly] right away.
Gotchas We Hit So You Don't Have To
| Symptom | Cause | Fix |
|---|---|---|
| "DIRECTOR'S" is audibly in the output | Section header got read aloud | Use ### PERFORMANCE instead of ### DIRECTOR'S NOTES |
| Audio sounds monotone or dead | Style note says "quiet", "flat", "no rush" | Rewrite Style without flatness words |
| Character name gets spoken aloud | Phonetic or semantic collision between profile name and opening transcript word | Rename the character (Kiara D. becomes Morgan P.) |
| A word from CONTEXT bleeds into the start of the transcript | Section boundary ambiguous | Remove the collision word, or rephrase the CONTEXT ending |
| Empty content.parts or 500 errors | Documented "text tokens instead of audio" preview bug | Retry up to 5 times with backoff |
| Speech sounds robotic despite good content | Period-separated fragments | Rewrite TRANSCRIPT with commas between tagged clauses |
| Laughter in an apology | Unified tag template across all scenarios | Classify emotional register first, use register-specific tag palette |
| Custom tags like [apologetically] feel flat | Weak training coverage on non-documented adjectives | Stick to the documented set: [warmly], [thoughtfully], [sighs], [gently], [soft laugh], [cheerfully] |
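For the empty-response and 500 rows, a small retry wrapper with exponential backoff is usually enough. The sketch below is deliberately client-agnostic: `synthesize` is a hypothetical callable standing in for whatever function makes your TTS request, and we interpret "up to 5 times" as five attempts in total. The delay values are illustrative defaults, not recommendations from Google or LiveKit.

```python
import time
from typing import Callable, Optional


def synthesize_with_retry(
    synthesize: Callable[[str], Optional[bytes]],  # hypothetical TTS call
    prompt: str,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call synthesize(prompt), retrying on errors or empty audio."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            audio = synthesize(prompt)
            if audio:  # None or empty bytes counts as a failed attempt
                return audio
            last_error = RuntimeError("empty audio response")
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, 4s by default
    raise RuntimeError(f"synthesis failed after {max_attempts} attempts") from last_error
```

The `sleep` parameter exists so tests can substitute a no-op. In production code you would also want to retry only on the error types your client raises for transient failures.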
Quick Checklist
Before sending a prompt:
- Synthesize-speech preamble at the top
- #### TRANSCRIPT delimiter present
- Section labels are PERFORMANCE and CONTEXT (not the docs' verbose ones)
- Scene is concrete and sensory, not an abstract role
- Style and Pace notes don't instruct flatness
- Style and Pace notes don't quote literal transcript words
- Audio tags are register-appropriate (no laughing at upset customers)
- Audio tags are from the documented set
- Transcript uses commas between clauses, periods only at real sentence ends
Give it a try with LiveKit Agents and let us know how it works for your use case.