A Practical Guide to Prompting Gemini 3.1 Flash TTS

A working set of rules for getting natural, emotionally appropriate speech out of gemini-3.1-flash-tts-preview. Every tip here comes from a specific failure we hit and fixed. 3.1 Flash Preview is currently available in LiveKit Inference in beta, so rough edges and longer-than-average response times are the norm.

That said, the model is very expressive and extremely flexible. There are already things you can do with 3.1 Flash Preview that aren't possible with any other model!


The Short Version

If you do nothing else:

  1. Structure your prompt with explicit section labels and a #### TRANSCRIPT delimiter. Without it, Gemini will often read your stage directions aloud. Google documents this as a known failure mode, not a bug.
  2. Use commas between tagged clauses, not periods. Period-separated fragments sound chopped.
  3. Classify the emotional scene first, then pick tags. Don't force any universal templates. Laughter in an apology is bad!

Why Gemini TTS Is Different

Unlike traditional phoneme-based TTS, Gemini is an LLM that interprets your whole prompt as context. This means:

  • Natural-language direction ("warm and unhurried") actually works
  • But the model has to decide what's direction and what's speech
  • Without discipline, the classifier confuses the two, and reads your direction aloud

The Gemini docs warn about this explicitly:

Vague prompts may fail to trigger the speech synthesis classifier, resulting in a rejected request or causing the model to read your style instructions and director's notes aloud. Validate your prompts by adding a clear preamble instructing the model to synthesize speech, and explicitly label where the actual spoken transcript begins.

This guide is the answer to what "validate your prompts" actually means in practice.


The Canonical Prompt Structure

Every prompt should look like this:

Synthesize speech for the performance defined below. The profile, scene,
performance notes, and context are direction only. Do NOT speak them.
Speak ONLY the lines under #### TRANSCRIPT.

# AUDIO PROFILE: <First Name + Last Initial of your persona>
## "<One-line persona>"

## SCENE: <Short scene name>
<2 to 3 sentence scene. Setting, posture, a concrete detail, the vibe.>

### PERFORMANCE
Style: <Tone and emotional register. Never "quiet" or "flat".>
Pace: <One specific rhythmic moment. Don't quote literal transcript words.>
Accent: <Short accent descriptor.>

### CONTEXT
<1 to 2 sentences on who this character is and why they sound this way.>

#### TRANSCRIPT
<The actual words, with inline audio tags, joined by commas.>

The #### TRANSCRIPT header is load-bearing. The synthesize-speech preamble at the top is load-bearing. Everything else is stylistic.
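If you generate prompts programmatically, it can help to assemble them from named fields so the section order and the two load-bearing pieces never drift. The following is a minimal sketch of that idea. The `Performance` dataclass and `build_prompt` function are hypothetical names introduced here for illustration; they are not part of any Gemini or LiveKit API.

```python
from dataclasses import dataclass

# Preamble text copied from the template above.
PREAMBLE = (
    "Synthesize speech for the performance defined below. The profile, scene,\n"
    "performance notes, and context are direction only. Do NOT speak them.\n"
    "Speak ONLY the lines under #### TRANSCRIPT."
)


@dataclass
class Performance:
    """Hypothetical container for the fields of the template above."""
    name: str          # first name + last initial, e.g. "Maria J."
    persona: str       # one-line persona
    scene_title: str   # short scene name
    scene: str         # 2 to 3 sentence scene
    style: str
    pace: str
    accent: str
    context: str
    transcript: str    # spoken words with inline audio tags


def build_prompt(p: Performance) -> str:
    """Assemble the prompt in the canonical section order."""
    sections = [
        PREAMBLE,
        f'# AUDIO PROFILE: {p.name}\n## "{p.persona}"',
        f"## SCENE: {p.scene_title}\n{p.scene}",
        f"### PERFORMANCE\nStyle: {p.style}\nPace: {p.pace}\nAccent: {p.accent}",
        f"### CONTEXT\n{p.context}",
        f"#### TRANSCRIPT\n{p.transcript}",
    ]
    # Blank line between sections, transcript always last.
    return "\n\n".join(sections)
```

Keeping the transcript as the final section means nothing can accidentally be appended after the spoken lines.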


The Nine Rules

1. Always prepend a synthesize-speech preamble

This single paragraph is what reliably triggers the speech-synthesis classifier path instead of the "read it all" path.

Synthesize speech for the performance defined below. The profile, scene,
performance notes, and context are direction only. Do NOT speak them.
Speak ONLY the lines under #### TRANSCRIPT.

Skip this and you'll hit intermittent failures where Gemini reads your whole prompt out loud.

2. Use #### TRANSCRIPT as the delimiter, exact header

We tested alternatives. Lower-level headers (##### TRANSCRIPT) work, but #### TRANSCRIPT is what the official docs example uses, and it gave us the most reliable classifier behavior.

3. Use short section labels, not the docs' verbose ones

The current iteration of the official Gemini docs uses ### DIRECTOR'S NOTES and ### SAMPLE CONTEXT. Don't. In our testing, the literal string DIRECTOR'S got spoken aloud.

Use these instead:

  • ### PERFORMANCE (for Style / Pace / Accent)
  • ### CONTEXT (for sample context)

Apostrophes and multi-word section headers are classifier hazards.

4. Classify emotional register BEFORE picking audio tags

Don't use a universal tag template. Pick one register first.

| Register | When | Safe tags | Forbidden |
|---|---|---|---|
| EMPATHY | Customer upset, apologizing, acknowledging a problem | [sighs], [warmly], [thoughtfully], [gently] | [soft laugh], [cheerfully] |
| CLARIFY_PROBLEM | Confirming the details of a customer's issue | [thoughtfully], [warmly], [gently] | [soft laugh], [cheerfully], [sighs] |
| TRANSACTIONAL | Policy, transfers, troubleshooting, scheduling, handoffs | [warmly], [thoughtfully] | [soft laugh], [sighs], [cheerfully] |
| WARM_FRIENDLY | Greetings, closings, confirmations, upsells | [warmly], [thoughtfully], [cheerfully], [soft laugh] (max one) | (none) |

Never laugh at an upset customer. This is the easiest way to make an AI voice feel deeply wrong.
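One way to enforce the register rule before a prompt ever reaches the model is a small check that scans the transcript for tags the chosen register forbids. This is a sketch, not a definitive implementation: the `PALETTES` dictionary mirrors the four registers described above, and `check_tags` is a hypothetical helper name.

```python
import re

# Forbidden tags per register, as described in the register table.
PALETTES = {
    "EMPATHY": {"forbidden": {"soft laugh", "cheerfully"}},
    "CLARIFY_PROBLEM": {"forbidden": {"soft laugh", "cheerfully", "sighs"}},
    "TRANSACTIONAL": {"forbidden": {"soft laugh", "sighs", "cheerfully"}},
    "WARM_FRIENDLY": {"forbidden": set()},
}

# Matches inline audio tags such as [warmly] or [soft laugh].
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def check_tags(register: str, transcript: str) -> list[str]:
    """Return a list of problems; an empty list means the tags fit the register."""
    forbidden = PALETTES[register]["forbidden"]
    tags = [t.strip().lower() for t in TAG_RE.findall(transcript)]
    problems = [f"[{t}] is forbidden in {register}" for t in tags if t in forbidden]
    # Where [soft laugh] is allowed at all, it is capped at one use.
    if "soft laugh" not in forbidden and tags.count("soft laugh") > 1:
        problems.append("[soft laugh] used more than once")
    return problems
```

For example, `check_tags("EMPATHY", "[sighs] Oh, [soft laugh] I'm sorry.")` reports the laugh, while the same transcript with `[gently]` in its place passes.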

5. Stick to documented audio tags

The docs list "commonly used tags", which implies the list is not exhaustive and that custom tags work. In practice, custom emotion tags produce noticeably weaker prosody. We tried [apologetically], [measured], [helpfully], [softly], [carefully]. All produced flatter delivery than the documented set.

The six tags we keep in rotation:

[warmly], [thoughtfully], [sighs], [gently], [soft laugh], [cheerfully]

For non-emotional modifiers (pacing, volume, character), custom tags are fine. Things like [whispers], [very slow], [like a cartoon dog] work well. It's the emotional adjectives specifically where coverage feels thin.

6. Write a scene, not a role label

Bad (too abstract, the model has nothing to latch onto):

## SCENE: Customer service rep
A warm customer service rep explaining something clearly.

Good (concrete, sensory, specific):

## SCENE: Late afternoon at the clinic front desk
Late afternoon light across the desk, calendar open to Tuesday. Mira has
a pen in hand, confirming an easy appointment. The favorite part of the day.

Scene-rich preambles noticeably improve realism. Generic role labels don't. The docs' own example ("It's 10:00 PM in a glass-walled studio overlooking the moonlit London skyline...") models this.

7. Never instruct flatness

This one bit us. We wrote for an empathy utterance:

Style: voice dropped, quiet, no rush

Gemini took "quiet" and "no rush" literally. The prosody score crashed. The register was right, the instruction was wrong.

Avoid these in Style and Pace notes:

  • "quiet", "quietly"
  • "flat", "monotone"
  • "no rush" (reads as "go slow and flat")
  • "careful" (reads as "over-precise, stiff")
  • "whispered" (unless you actually want whispering)

Better phrasing for quieter moods:

  • "warm and sincere"
  • "voice dropped half an octave but full of feeling"
  • "patient and unhurried"
  • "measured but present"

The register is carried by the text content and the audio tags. Style notes should amplify that register rather than dampen it; otherwise the delivery turns monotone and "AI-like".
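The avoid-list above is easy to check mechanically. The sketch below flags those words in a Style or Pace note using whole-word matching, so that a phrase like "not flatness" is not flagged by accident. `FLATNESS_TERMS` and `flatness_terms` are hypothetical names, and the term list is only the one given in this rule.

```python
import re

# Words and phrases this rule says to avoid in Style and Pace notes.
FLATNESS_TERMS = ["quiet", "quietly", "flat", "monotone", "no rush", "careful", "whispered"]

# Longest terms first so "quietly" is tried before "quiet"; \b keeps matches to whole words.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(FLATNESS_TERMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def flatness_terms(note: str) -> list[str]:
    """Return the flatness terms found in a note, lowercased, in order of appearance."""
    return [m.lower() for m in _PATTERN.findall(note)]
```

Run against the note that bit us, `flatness_terms("voice dropped, quiet, no rush")` returns both offending terms, while "warm and sincere" comes back clean.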

8. Commas over periods in the transcript

Some TTS providers respond extremely well to extra periods in the transcript, especially when tone changes. This is because TTS generations tend to speak too quickly, and extra punctuation helps create a more "human-like" pacing while keeping the generation natural. This does not work for Gemini 3.1 Flash TTS.

Bad (sounds choppy):

#### TRANSCRIPT
[warmly] Okay. [thoughtfully] So your appointment. [warmly] That's all set.
[cheerfully] Tuesday. [warmly] At three... [thoughtfully] PM.

Good (natural prose flow, tags still mark emotional pivots):

#### TRANSCRIPT
[warmly] Okay, [thoughtfully] so your appointment, [warmly] that's all set.
[cheerfully] Tuesday, [warmly] at three... [thoughtfully] PM.

Rule of thumb:

  • Commas between tagged clauses within a sentence
  • Periods only where the original text has real sentence endings
  • Ellipses (...) for a natural trailing pause (1 to 2 per utterance)
  • Em-dashes (—) for a micro-pause mid-thought (1 per utterance)
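If your pipeline assigns tags clause by clause, the comma rule can be applied at render time rather than by hand. The sketch below assumes each sentence arrives as a list of (tag, clause) pairs. `render_sentence` is a hypothetical helper: it joins clauses with commas, leaves an ellipsis to carry its own pause, and adds a single sentence ending.

```python
def render_sentence(clauses: list[tuple[str, str]], end: str = ".") -> str:
    """Join (tag, clause) pairs into one tagged sentence.

    Clauses are separated by commas, except after a clause that already
    ends in an ellipsis, where the ellipsis marks the pause on its own.
    """
    out = ""
    for i, (tag, text) in enumerate(clauses):
        text = text.strip()
        out += f"[{tag}] {text}"
        if i == len(clauses) - 1:
            out += end          # one real sentence ending
        elif text.endswith("..."):
            out += " "          # ellipsis already provides the pause
        else:
            out += ", "         # comma between tagged clauses
    return out
```

Calling it with `[("warmly", "Okay"), ("thoughtfully", "so your appointment"), ("warmly", "that's all set")]` yields the comma-joined form of the first line in the good example.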

9. Don't quote literal transcript words in Style/Pace notes

Quoting transcript words inside the notes occasionally causes the model to speak the notes themselves.

Bad:

Pace: A small lift at "oh" at the start, like the thought just came up.
Style: A small chuckle at "y'know", natural, not performative.

Good:

Pace: A small lift at the opening, like the thought just came up.
Style: A natural chuckle partway through. The tell of someone who actually
believes it.

Describe the rhythm, don't name the words.
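A rough automated check for this rule is to look for quoted spans in a note and see whether they also occur in the transcript. The sketch below does exactly that with a plain substring test, so it can over-report on very short words; treat a hit as a prompt to re-read the note, not as proof of a problem. `quoted_transcript_words` is a hypothetical helper name.

```python
import re

# A span wrapped in straight or curly double quotes.
QUOTED_RE = re.compile('["\u201c]([^"\u201c\u201d]+)["\u201d]')


def quoted_transcript_words(note: str, transcript: str) -> list[str]:
    """Return quoted spans from the note that also appear in the transcript."""
    spoken = transcript.lower()
    return [q for q in QUOTED_RE.findall(note) if q.lower() in spoken]
```

With a transcript that opens on "Oh", the bad Pace note above is flagged for its quoted "oh", and the rewritten note passes because it contains no quoted text at all.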


Full Working Example

Synthesize speech for the performance defined below. The profile, scene,
performance notes, and context are direction only. Do NOT speak them.
Speak ONLY the lines under #### TRANSCRIPT.

# AUDIO PROFILE: Maria J.
## "The Senior Support Rep"

## SCENE: A tough moment in the call
The customer has shared something frustrating. Maria leans a little closer
to the mic, voice carrying real feeling, the kind of apology you actually mean.

### PERFORMANCE
Style: Warm and sincere. Genuine concern. The voice carries feeling, not
flatness. A soft exhale at the opening is real, not performative. Never
amused, never casual.
Pace: Natural, with a small settling pause early on. The beat of someone
actually taking in what they heard.

### CONTEXT
Maria is the rep who actually listens, and callers can hear the difference.
She takes ownership of getting things fixed.

#### TRANSCRIPT
[sighs] Oh. [gently] I'm really sorry to hear that. [warmly] Lemme see
[thoughtfully] what I can do. [warmly] We'll get this sorted out
[gently] for you... [warmly] right away.

Gotchas We Hit So You Don't Have To

| Symptom | Cause | Fix |
|---|---|---|
| "DIRECTOR'S" is audibly in the output | Section header got read aloud | Use ### PERFORMANCE instead of ### DIRECTOR'S NOTES |
| Audio sounds monotone or dead | Style note says "quiet", "flat", "no rush" | Rewrite Style without flatness words |
| Character name gets spoken aloud | Phonetic or semantic collision between profile name and opening transcript word | Rename the character (Kiara D. becomes Morgan P.) |
| A word from CONTEXT bleeds into the start of the transcript | Section boundary ambiguous | Remove the collision word, or rephrase the CONTEXT ending |
| Empty content.parts or 500 errors | Documented "text tokens instead of audio" preview bug | Retry up to 5 times with backoff |
| Speech sounds robotic despite good content | Period-separated fragments | Rewrite TRANSCRIPT with commas between tagged clauses |
| Laughter in an apology | Unified tag template across all scenarios | Classify emotional register first, use register-specific tag palette |
| Custom tags like [apologetically] feel flat | Weak training coverage on non-documented adjectives | Stick to the documented set: [warmly], [thoughtfully], [sighs], [gently], [soft laugh], [cheerfully] |
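The retry advice for empty responses and 500 errors can be sketched as a small wrapper with exponential backoff. This example makes no assumptions about the client library: `synthesize` is a hypothetical zero-argument callable that you supply, returning audio bytes (or nothing on an empty response) and raising on server errors. We read "up to 5 times" as five attempts in total; adjust `max_attempts` if you interpret it differently.

```python
import random
import time
from typing import Callable, Optional


def synthesize_with_retry(
    synthesize: Callable[[], Optional[bytes]],  # hypothetical: wraps your own TTS call
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call synthesize() until it returns audio, backing off between attempts."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            audio = synthesize()
            if audio:
                return audio
            # Empty or missing audio counts as a failed attempt.
            last_error = RuntimeError("empty audio response")
        except Exception as exc:  # narrow this to your client's error types
            last_error = exc
        if attempt < max_attempts - 1:
            # Exponential backoff plus a little jitter: roughly 0.5s, 1s, 2s, 4s.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError(f"TTS failed after {max_attempts} attempts") from last_error
```

The `sleep` parameter exists only so the wait can be stubbed out in tests. In production the default `time.sleep` is what you want.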

Quick Checklist

Before sending a prompt:

  • Synthesize-speech preamble at the top
  • #### TRANSCRIPT delimiter present
  • Section labels are PERFORMANCE and CONTEXT (not the docs' verbose ones)
  • Scene is concrete and sensory, not an abstract role
  • Style and Pace notes don't instruct flatness
  • Style and Pace notes don't quote literal transcript words
  • Audio tags are register-appropriate (no laughing at upset customers)
  • Audio tags are from the documented set
  • Transcript uses commas between clauses, periods only at real sentence ends
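The structural items on this checklist (preamble, delimiter, section labels) are simple enough to lint before every request. The sketch below covers only those three and is deliberately strict about the exact header text. `lint_structure` and its constants are hypothetical names, not part of any SDK.

```python
PREAMBLE_START = "Synthesize speech"
DELIMITER = "#### TRANSCRIPT"
# Header labels from the official docs that got read aloud in our testing.
RISKY_LABELS = ("DIRECTOR'S NOTES", "SAMPLE CONTEXT")


def lint_structure(prompt: str) -> list[str]:
    """Return structural problems found in a prompt; empty list means it passes."""
    problems = []
    lines = prompt.splitlines()

    if not prompt.lstrip().startswith(PREAMBLE_START):
        problems.append("missing synthesize-speech preamble")

    # The delimiter must sit on its own line, not merely be mentioned in the preamble.
    header_rows = [i for i, line in enumerate(lines) if line.strip() == DELIMITER]
    if not header_rows:
        problems.append("missing #### TRANSCRIPT header")
    elif not any(line.strip() for line in lines[header_rows[-1] + 1:]):
        problems.append("nothing to speak under #### TRANSCRIPT")

    for label in RISKY_LABELS:
        if any(line.lstrip().startswith("#") and label in line.upper() for line in lines):
            problems.append(f"risky section label: {label}")

    return problems
```

A prompt built from the canonical structure returns an empty list. One that still uses a `### DIRECTOR'S NOTES` header and lacks the preamble and delimiter reports all three problems.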

Give it a try with LiveKit Agents and let us know how it works for your use case.