A Practical Guide to Prompting Gemini 3.1 Flash TTS

A working set of rules for getting natural, emotionally appropriate speech out of gemini-3.1-flash-tts-preview. Every tip here comes from a specific failure we hit and fixed. 3.1 Flash Preview is currently available in beta, so rough edges and longer-than-average response times are the norm.

That said, the model is very expressive and extremely flexible. There are already things you can do with 3.1 Flash Preview that aren't possible with any other model!

The Short Version

If you do nothing else:

Structure your prompt with explicit section labels and a #### TRANSCRIPT delimiter. Without it, Gemini will often read your stage directions aloud. Google documents this as a failure mode rather than a bug.
Use commas between tagged clauses, not periods. Period-separated fragments sound chopped.
Classify the emotional scene first, then pick tags. Don't force one universal tag template onto every scene. Laughter in an apology is bad!

Why Gemini TTS Is Different

Unlike traditional phoneme-based TTS, Gemini is an LLM that interprets your whole prompt as context. This means:

Natural-language direction ("warm and unhurried") actually works
But the model has to decide what's direction and what's speech
Without a disciplined prompt structure, the classifier confuses the two and reads your direction aloud

The Gemini docs warn about this explicitly:

Vague prompts may fail to trigger the speech synthesis classifier, resulting in a rejected request or causing the model to read your style instructions and director's notes aloud. Validate your prompts by adding a clear preamble instructing the model to synthesize speech, and explicitly label where the actual spoken transcript begins.

This guide is the answer to what "validate your prompts" actually means in practice.

The Canonical Prompt Structure

Every prompt should look like this:

Synthesize speech for the performance defined below. The profile, scene,
performance notes, and context are direction only. Do NOT speak them.
Speak ONLY the lines under #### TRANSCRIPT.

# AUDIO PROFILE: <First Name + Last Initial of your persona>
## "<One-line persona>"

## SCENE: <Short scene name>
<2 to 3 sentence scene. Setting, posture, a concrete detail, the vibe.>

### PERFORMANCE
Style: <Tone and emotional register. Never "quiet" or "flat".>
Pace: <One specific rhythmic moment. Don't quote literal transcript words.>
Accent: <Short accent descriptor.>

### CONTEXT
<1 to 2 sentences on who this character is and why they sound this way.>

#### TRANSCRIPT
<The actual words, with inline audio tags, joined by commas.>

The #### TRANSCRIPT header is load-bearing. The synthesize-speech preamble at the top is load-bearing. Everything else is stylistic.
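If you assemble prompts in code, a small template type keeps the section order fixed so the two load-bearing pieces can't be dropped by accident. The following is a minimal sketch, not an official helper: `TtsPrompt`, its field names, and `render` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Preamble text copied from the canonical structure above.
PREAMBLE = (
    "Synthesize speech for the performance defined below. The profile, scene,\n"
    "performance notes, and context are direction only. Do NOT speak them.\n"
    "Speak ONLY the lines under #### TRANSCRIPT."
)


@dataclass
class TtsPrompt:
    """Hypothetical container for the fields of the canonical prompt."""
    persona_name: str
    persona_line: str
    scene_name: str
    scene: str
    style: str
    pace: str
    accent: str
    context: str
    transcript: str

    def render(self) -> str:
        # Sections are joined by blank lines, in the canonical order.
        sections = [
            PREAMBLE,
            f'# AUDIO PROFILE: {self.persona_name}\n## "{self.persona_line}"',
            f"## SCENE: {self.scene_name}\n{self.scene}",
            "### PERFORMANCE\n"
            f"Style: {self.style}\nPace: {self.pace}\nAccent: {self.accent}",
            f"### CONTEXT\n{self.context}",
            f"#### TRANSCRIPT\n{self.transcript}",
        ]
        return "\n\n".join(sections)
```

Because the preamble and the transcript header live in `render` rather than in caller code, every prompt built this way carries both.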

The Nine Rules

1. Always prepend a synthesize-speech preamble

This single paragraph is what reliably triggers the speech-synthesis classifier path instead of the "read it all" path.

Synthesize speech for the performance defined below. The profile, scene,
performance notes, and context are direction only. Do NOT speak them.
Speak ONLY the lines under #### TRANSCRIPT.

Skip this and you'll hit intermittent failures where Gemini reads your whole prompt out loud.

2. Use `#### TRANSCRIPT` as the delimiter, with that exact header

We tested alternatives. Lower-level headers (##### TRANSCRIPT) work, but #### TRANSCRIPT is what the official docs example uses, and it gave us the most reliable classifier behavior.
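A pre-flight check can catch a missing or duplicated delimiter before the request is sent. This is a sketch under our own assumptions: `split_prompt` is a hypothetical helper, and it accepts only the exact four-hash header on its own line, so a mention of `#### TRANSCRIPT` inside the preamble sentence is not counted.

```python
import re

# Matches the header only when it is alone on a line with exactly four hashes.
TRANSCRIPT_HEADER = re.compile(r"^#### TRANSCRIPT[ \t]*$", re.MULTILINE)


def split_prompt(prompt: str) -> tuple[str, str]:
    """Return (direction, transcript) or raise ValueError on a malformed prompt."""
    matches = list(TRANSCRIPT_HEADER.finditer(prompt))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one '#### TRANSCRIPT' header line, found {len(matches)}"
        )
    header = matches[0]
    direction = prompt[: header.start()].strip()
    transcript = prompt[header.end():].strip()
    if not transcript:
        raise ValueError("nothing to speak under '#### TRANSCRIPT'")
    return direction, transcript
```

Splitting here also gives you the transcript on its own, which is handy for the tag and punctuation checks in the later rules.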

3. Use short section labels, not the docs' verbose ones

The current iteration of the official Gemini docs uses ### DIRECTOR'S NOTES and ### SAMPLE CONTEXT. Don't. In our testing, the literal string DIRECTOR'S got spoken aloud.

Use these instead:

### PERFORMANCE (for Style / Pace / Accent)
### CONTEXT (for sample context)

Apostrophes and multi-word section headers are classifier hazards.
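The two hazards named above are easy to lint for. The sketch below is illustrative only: `risky_section_labels` is a hypothetical name, and limiting the check to three-hash labels is our assumption, since the canonical structure itself uses multi-word text on the `#` and `##` lines.

```python
import re

# Exactly three hashes at the start of a line, then the label text.
SECTION_LABEL = re.compile(r"^###(?!#)[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def risky_section_labels(prompt: str) -> list[str]:
    """Return ### labels that contain an apostrophe or more than one word."""
    risky = []
    for match in SECTION_LABEL.finditer(prompt):
        label = match.group(1)
        has_apostrophe = "'" in label or "\u2019" in label
        if has_apostrophe or len(label.split()) > 1:
            risky.append(label)
    return risky
```

Run against a prompt that uses the docs' labels, this returns `DIRECTOR'S NOTES` and `SAMPLE CONTEXT`. The short labels `PERFORMANCE` and `CONTEXT` pass.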

4. Classify emotional register BEFORE picking audio tags

Don't use a universal tag template. Pick one register first.

| Register | When | Safe tags | Forbidden |
| --- | --- | --- | --- |
| EMPATHY | Customer upset, apologizing, acknowledging a problem | `[sighs]`, `[warmly]`, `[thoughtfully]`, `[gently]` | `[soft laugh]`, `[cheerfully]` |
| CLARIFY_PROBLEM | Confirming the details of a customer's issue | `[thoughtfully]`, `[warmly]`, `[gently]` | `[soft laugh]`, `[cheerfully]`, `[sighs]` |
| TRANSACTIONAL | Policy, transfers, troubleshooting, scheduling, handoffs | `[warmly]`, `[thoughtfully]` | `[soft laugh]`, `[sighs]`, `[cheerfully]` |
| WARM_FRIENDLY | Greetings, closings, confirmations, upsells | `[warmly]`, `[thoughtfully]`, `[cheerfully]`, `[soft laugh]` (max one) | (none) |

Never laugh at an upset customer. This is the easiest way to make an AI voice feel deeply wrong.
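The register table translates directly into a lookup that can reject a transcript before synthesis. This is a minimal sketch: the forbidden sets and the one-laugh limit are copied from the table, while `check_register` and the message wording are hypothetical. How you classify the register in the first place is left to you.

```python
import re

TAG = re.compile(r"\[([^\]]+)\]")

# Forbidden tags per register, taken from the table above.
FORBIDDEN = {
    "EMPATHY": {"soft laugh", "cheerfully"},
    "CLARIFY_PROBLEM": {"soft laugh", "cheerfully", "sighs"},
    "TRANSACTIONAL": {"soft laugh", "sighs", "cheerfully"},
    "WARM_FRIENDLY": set(),
}
# "[soft laugh] (max one)" in the WARM_FRIENDLY row.
MAX_USES = {"soft laugh": 1}


def check_register(register: str, transcript: str) -> list[str]:
    """Return a list of problems; an empty list means the tags fit the register."""
    tags = [t.strip().lower() for t in TAG.findall(transcript)]
    forbidden = FORBIDDEN[register]
    problems = []
    for tag in dict.fromkeys(tags):  # unique tags, first-seen order
        count = tags.count(tag)
        if tag in forbidden:
            problems.append(f"[{tag}] is forbidden in {register}")
        elif tag in MAX_USES and count > MAX_USES[tag]:
            problems.append(f"[{tag}] appears {count} times, max {MAX_USES[tag]}")
    return problems
```

For example, `check_register("EMPATHY", "[sighs] Oh. [soft laugh] I'm sorry.")` reports the laugh, while the same tag used once in a `WARM_FRIENDLY` greeting passes.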

5. Stick to documented audio tags

The docs list "commonly used tags", which implies the list is non-exhaustive and that custom tags work. In practice, custom emotion tags produce noticeably weaker prosody. We tried [apologetically], [measured], [helpfully], [softly], [carefully]. All produced flatter delivery than the documented set.

The six tags we keep in rotation:

[warmly], [thoughtfully], [sighs], [gently], [soft laugh], [cheerfully]

For non-emotional modifiers (pacing, volume, character), custom tags are fine. Things like [whispers], [very slow], [like a cartoon dog] work well. It's the emotional adjectives specifically where coverage feels thin.
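To keep custom emotion tags from creeping back in, you can flag anything outside a reviewed allowlist. The sketch below is an assumption-heavy illustration: `unvetted_tags` is a hypothetical name, and the non-emotional allowlist contains only the three examples mentioned above, so extend it with the modifiers you have actually listened to.

```python
import re

# The six emotion tags kept in rotation in this guide.
ROTATION_TAGS = {
    "warmly", "thoughtfully", "sighs", "gently", "soft laugh", "cheerfully",
}
# Hypothetical allowlist of non-emotional modifiers you have already vetted.
VETTED_MODIFIERS = {"whispers", "very slow", "like a cartoon dog"}


def unvetted_tags(transcript: str) -> list[str]:
    """Return tags that are neither in rotation nor on the modifier allowlist."""
    found = []
    for raw in re.findall(r"\[([^\]]+)\]", transcript):
        tag = raw.strip().lower()
        known = tag in ROTATION_TAGS or tag in VETTED_MODIFIERS
        if not known and tag not in found:
            found.append(tag)
    return found
```

A flagged tag is not necessarily wrong. It only means nobody has confirmed how it sounds yet.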

6. Write a scene, not a role label

Bad (too abstract, the model has nothing to latch onto):

## SCENE: Customer service rep
A warm customer service rep explaining something clearly.

Good (concrete, sensory, specific):

## SCENE: Late afternoon at the clinic front desk
Late afternoon light across the desk, calendar open to Tuesday. Mira has
a pen in hand, confirming an easy appointment. The favorite part of the day.

Scene-rich descriptions noticeably improve realism. Generic role labels don't. The docs' own example ("It's 10:00 PM in a glass-walled studio overlooking the moonlit London skyline...") models this.

7. Never instruct flatness

This one bit us. For an empathy utterance, we wrote:

Style: voice dropped, quiet, no rush

Gemini took "quiet" and "no rush" literally. The prosody score crashed. The register was right, the instruction was wrong.

Avoid these in Style and Pace notes:

"quiet", "quietly"
"flat", "monotone"
"no rush" (reads as "go slow and flat")
"careful" (reads as "over-precise, stiff")
"whispered" (unless you actually want whispering)

Better phrasing for quieter moods:

"warm and sincere"
"voice dropped half an octave but full of feeling"
"patient and unhurried"
"measured but present"

The register is carried by the text content and the audio tags. Style notes should amplify rather than dampen, which avoids monotone, AI-like delivery.
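The avoid-list above is short enough to check mechanically. This is a rough sketch, assuming whole-word matching is what you want: `flatness_hits` is a hypothetical helper, and it deliberately ignores derived forms such as "flatness" or "carefully", which you may prefer to catch as well.

```python
import re

# Terms from the avoid-list above.
FLATNESS_TERMS = [
    "quiet", "quietly", "flat", "monotone", "no rush", "careful", "whispered",
]


def flatness_hits(note: str) -> list[str]:
    """Return the avoid-list terms that appear as whole words in a Style or Pace note."""
    hits = []
    for term in FLATNESS_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", note, re.IGNORECASE):
            hits.append(term)
    return hits
```

The Style note that caused the original failure, "voice dropped, quiet, no rush", comes back with two hits, while "patient and unhurried" comes back clean.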

8. Commas over periods in the transcript

Some TTS providers respond extremely well to extra periods in the transcript, especially when tone changes. This is because TTS generations tend to speak too quickly, and extra punctuation helps create a more "human-like" pacing while keeping the generation natural. This does not work for Gemini 3.1 Flash TTS.

Bad (sounds choppy):

#### TRANSCRIPT
[warmly] Okay. [thoughtfully] So your appointment. [warmly] That's all set.
[cheerfully] Tuesday. [warmly] At three... [thoughtfully] PM.

Good (natural prose flow, tags still mark emotional pivots):

#### TRANSCRIPT
[warmly] Okay, [thoughtfully] so your appointment, [warmly] that's all set.
[cheerfully] Tuesday, [warmly] at three... [thoughtfully] PM.

Rule of thumb:

Commas between tagged clauses within a sentence
Periods only where the original text has real sentence endings
Ellipses (...) for a natural trailing pause (1 to 2 per utterance)
Em-dashes (—) for a micro-pause mid-thought (1 per utterance)
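The rule of thumb can be turned into a few counts to review before sending a transcript. The sketch below is a heuristic, not a pass/fail gate: `punctuation_report` is a hypothetical helper that counts what punctuation sits immediately before each tag, plus total ellipses and em-dashes. It cannot tell a real sentence ending from a chopped fragment, so a high period count is a prompt to re-read, nothing more.

```python
import re

TAG = re.compile(r"\[[^\]]+\]")


def punctuation_report(transcript: str) -> dict[str, int]:
    """Count punctuation at tag boundaries and pause marks in a transcript."""
    text = " ".join(transcript.split())  # collapse line breaks and extra spaces
    periods = commas = 0
    for match in TAG.finditer(text):
        before = text[: match.start()].rstrip()
        if before.endswith(("...", "\u2026")):
            continue  # a trailing pause, not a hard stop
        if before.endswith("."):
            periods += 1
        elif before.endswith(","):
            commas += 1
    return {
        "periods_before_tags": periods,
        "commas_before_tags": commas,
        "ellipses": text.count("...") + text.count("\u2026"),
        "em_dashes": text.count("\u2014"),
    }
```

On the choppy example above this yields four periods and no commas at tag boundaries. On the rewritten version it yields one period and three commas.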

9. Don't quote literal transcript words in Style/Pace notes

Quoting transcript words in the notes occasionally causes the engine to speak the director's notes aloud.

Bad:

Pace: A small lift at "oh" at the start, like the thought just came up.
Style: A small chuckle at "y'know", natural, not performative.

Good:

Pace: A small lift at the opening, like the thought just came up.
Style: A natural chuckle partway through. The tell of someone who actually
believes it.

Describe the rhythm, don't name the words.
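A simple cross-check can catch quoted transcript words in the notes. This is a sketch with a hypothetical helper name, `quoted_transcript_words`. It looks only at text inside double quotes (straight or curly) and compares case-insensitively against the transcript with the audio tags removed.

```python
import re

# Text inside straight or curly double quotes.
QUOTED = re.compile(r'"([^"]+)"|\u201c([^\u201d]+)\u201d')


def quoted_transcript_words(notes: str, transcript: str) -> list[str]:
    """Return quoted phrases from the notes that also occur in the spoken text."""
    spoken = re.sub(r"\[[^\]]+\]", " ", transcript).lower()
    hits = []
    for match in QUOTED.finditer(notes):
        phrase = (match.group(1) or match.group(2)).strip()
        # Require non-word characters on both sides so "oh" does not match "ohm".
        pattern = rf"(?<!\w){re.escape(phrase.lower())}(?!\w)"
        if re.search(pattern, spoken):
            hits.append(phrase)
    return hits
```

Any hit is a candidate for rephrasing in terms of position or rhythm ("at the opening", "partway through") instead of the word itself.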

Full Working Example

Synthesize speech for the performance defined below. The profile, scene,
performance notes, and context are direction only. Do NOT speak them.
Speak ONLY the lines under #### TRANSCRIPT.

# AUDIO PROFILE: Maria J.
## "The Senior Support Rep"

## SCENE: A tough moment in the call
The customer has shared something frustrating. Maria leans a little closer
to the mic, voice carrying real feeling, the kind of apology you actually mean.

### PERFORMANCE
Style: Warm and sincere. Genuine concern. The voice carries feeling, not
flatness. A soft exhale at the opening is real, not performative. Never
amused, never casual.
Pace: Natural, with a small settling pause early on. The beat of someone
actually taking in what they heard.

### CONTEXT
Maria is the rep who actually listens, and callers can hear the difference.
She takes ownership of getting things fixed.

#### TRANSCRIPT
[sighs] Oh. [gently] I'm really sorry to hear that. [warmly] Lemme see
[thoughtfully] what I can do. [warmly] We'll get this sorted out
[gently] for you... [warmly] right away.
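For completeness, here is one way the prompt above might be sent. Treat it as a sketch under several assumptions: the call shape follows the `google-genai` Python SDK's speech-generation usage as documented for earlier Gemini TTS preview models, the voice name `Kore` is a placeholder, and the 24 kHz, 16-bit mono PCM output format is assumed rather than confirmed for this model. Check all three against the current docs before relying on them.

```python
import io
import wave

MODEL = "gemini-3.1-flash-tts-preview"  # model name as used in this guide
VOICE = "Kore"  # assumption: placeholder prebuilt voice, swap in your own


def pcm_to_wav_bytes(pcm: bytes, rate: int = 24000) -> bytes:
    """Wrap raw PCM in a WAV container (16-bit mono at `rate` is an assumption)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def synthesize(prompt: str) -> bytes:
    """Send the full prompt text and return WAV bytes."""
    # Imported lazily so the WAV helper is usable without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(
        model=MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=VOICE
                    )
                )
            ),
        ),
    )
    pcm = response.candidates[0].content.parts[0].inline_data.data
    return pcm_to_wav_bytes(pcm)
```

Note that `synthesize` does no error handling. An empty or missing audio part will raise here, which is exactly the case the retry advice in the gotchas table is meant for.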

Gotchas We Hit So You Don't Have To

| Symptom | Cause | Fix |
| --- | --- | --- |
| "DIRECTOR'S" is audibly in the output | Section header got read aloud | Use `### PERFORMANCE` instead of `### DIRECTOR'S NOTES` |
| Audio sounds monotone or dead | Style note says "quiet", "flat", "no rush" | Rewrite Style without flatness words |
| Character name gets spoken aloud | Phonetic or semantic collision between profile name and opening transcript word | Rename the character (`Kiara D.` becomes `Morgan P.`) |
| A word from CONTEXT bleeds into the start of the transcript | Section boundary ambiguous | Remove the collision word, or rephrase the CONTEXT ending |
| Empty `content.parts` or 500 errors | Documented "text tokens instead of audio" preview bug | Retry up to 5 times with backoff |
| Speech sounds robotic despite good content | Period-separated fragments | Rewrite TRANSCRIPT with commas between tagged clauses |
| Laughter in an apology | Unified tag template across all scenarios | Classify emotional register first, then use a register-specific tag palette |
| Custom tags like `[apologetically]` feel flat | Weak training coverage on non-documented adjectives | Stick to the documented set: `[warmly]`, `[thoughtfully]`, `[sighs]`, `[gently]`, `[soft laugh]`, `[cheerfully]` |
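The "retry up to 5 times with backoff" fix from the table can be sketched as a small wrapper. Everything here is illustrative: `call_with_retry` is a hypothetical helper, `request` is any zero-argument callable you supply that returns audio bytes, and the delay values are arbitrary. The broad `except Exception` is for brevity only. In real code, narrow it to the server-side error types your client raises.

```python
import random
import time


def call_with_retry(request, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `request` until it returns non-empty audio or attempts run out."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            audio = request()
            if audio:
                return audio
            # An empty payload is treated like a failed call.
            last_error = RuntimeError("empty audio payload")
        except Exception as exc:  # assumption: narrow this in production
            last_error = exc
        if attempt < max_attempts - 1:
            # Exponential backoff with a little jitter: about 1s, 2s, 4s, 8s.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    raise RuntimeError(f"TTS failed after {max_attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the wrapper testable without real waiting. In normal use, leave the default in place.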