Runs in: US · Made in: United States
Google Gemini

Gemini 3.1 Flash TTS Preview

8K tokens

Tokonomix Editorial Team · Reviewed by Mes Kalkan

Gemini 3.1 Flash TTS Preview is a text-to-speech model developed by Google as part of the Gemini model family. It takes natural-language text prompts as input and produces spoken audio, enabling applications that need voice synthesis. With an 8K-token context window, it handles moderate-length text inputs. As the "Flash" designation suggests, it is optimized for speed and efficiency, which suits applications that need relatively quick speech synthesis. Within Google's Gemini lineup, the model occupies a specialized niche focused on voice synthesis rather than the conversational or analytical capabilities of the standard Gemini text models, and it complements those variants by giving developers an audio output option. The Preview label indicates an experimental or early-access version that is likely still under active development, so it may have limitations compared with production-ready models, and users should expect capabilities or behavior to change.

Gemini 3.1 Flash TTS Preview converts written text into natural-sounding speech, making voice interfaces accessible without dedicated TTS infrastructure.

Tokonomix benchmark summary
Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — Gemini 3.1 Flash TTS Preview
$1.00 per 1M input tokens
$20.00 per 1M output tokens
≈ $0.0046 per typical conversation (800 tokens)
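The per-conversation estimate can be reproduced from the two listed rates. The page does not say how the 800 tokens divide between input and output; the sketch below assumes a 600 input / 200 output split, which is one split that yields the quoted figure. The function name is illustrative only.

```python
# Rates from the rate card above, in USD per 1M tokens.
INPUT_RATE_PER_M = 1.00
OUTPUT_RATE_PER_M = 20.00


def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-million rates."""
    total = input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    return total / 1_000_000


# Assumed split (not stated on the page): 600 input + 200 output = 800 tokens.
estimate = conversation_cost(600, 200)
print(f"${estimate:.4f}")
```

Because output tokens cost twenty times as much as input tokens, the estimate is dominated by the output share: the 200 assumed output tokens account for $0.0040 of the total.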

Pricing over time

[Chart: input and output price per 1M tokens; a step-line marks price changes. Input: $1.00 per 1M tokens, no change. Output: $20.00 per 1M tokens, no change. Date shown: 2026-06-14. Synced weekly.]
Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Native audio processing
Voice synthesis output
Fast inference speed
Broad domain knowledge
Extensive training data
Accurate task completion

Weaknesses

Pre-release, may change
Reduced capability vs larger models
Features subject to revision
Section 03

Capabilities

outputTokenLimit: 16384
Section 04

Frequently asked questions

No. Gemini 3.1 Flash TTS Preview processes audio natively, eliminating the need for a separate speech-to-text pipeline. This reduces latency and preserves vocal nuances like emphasis and tone.

For applications where voice output enhances user experience, Gemini 3.1 Flash TTS Preview provides a clean integrated synthesis path.

Section 05

Availability


No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 06

Tokonomix benchmark verdicts

2026-06-14

Gemini 3.1 Flash TTS Preview maintains baseline metrics across windows

Gemini 3.1 Flash TTS Preview shows consistent performance across benchmark windows with no measurable changes in core metrics. The model continues to operate as a text-to-speech solution without available quality, latency, or throughput benchmarks in either the current or previous evaluation periods. This absence of performance data makes it difficult to assess the model's competitive position relative to other TTS offerings in the market. The only detected change between windows relates to pricing updates, though specific performance characteristics remain unmeasured. Users evaluating this model should note that standard benchmarking metrics have not been established, which may complicate technical decision-making for production deployments. The lack of comparative data points across both windows suggests either limited testing coverage or restricted access to performance telemetry. Organizations considering this TTS solution will need to conduct their own evaluations to determine suitability for their specific use cases, as public benchmark data remains unavailable to guide implementation decisions.

Quality: no value shown
Latency p50: no value shown
Test runs: 0

Stable baseline performance maintained · No benchmark metrics available · Limited performance transparency
Section 07

Full model profile

Gemini 3.1 Flash TTS Preview deep-dive: Google's zero-cost speech synthesis experiment

Google has released Gemini 3.1 Flash TTS Preview as a zero-cost, text-to-speech-optimized variant targeting developers who need rapid voice synthesis without inference fees. With an 8,192-token context window and a strict focus on TTS pipelines, this model occupies a narrow but strategic niche: proof-of-concept voice applications, educational chatbots, and accessibility tooling where budget predictability matters more than prosodic subtlety. Verdict: A useful playground model for TTS experimentation, but production teams requiring emotional nuance, speaker diversity, or multi-accent support should benchmark against Eleven Labs, Azure Neural TTS, or Gemini's own production-grade voice APIs.


Architecture & training signals

Gemini 3.1 Flash TTS Preview sits within the Gemini 3.1 Flash family, a lineage optimized for low-latency inference on constrained hardware. While Google has not disclosed the parameter count, mixture-of-experts topology, or specific training corpus for this preview build, the TTS designation signals a dual-modal architecture: a text-encoder front-end (likely distilled from the broader Gemini 3.1 instruction-tuned base) feeding into a vocoder or neural speech synthesizer trained on voice-annotated datasets.

The 8,192-token context window is unusually short by 2026 standards—half the length of Gemini 3.1 Flash (16K) and a fraction of Gemini 3.1 Pro's 128K ceiling. This constraint suggests the model was pruned specifically for bounded TTS jobs: synthesizing single paragraphs, dialogue turns, or accessibility annotations rather than long-form narration. Knowledge cutoff is not publicly disclosed, but as a preview release stamped in early 2026, we estimate training data extends through mid-2025, adequate for contemporary vocabulary but potentially missing recent domain jargon in healthcare, legal, or government contexts.

Google's Flash models typically employ adaptive computation (early-exit layers for simple prompts, deeper processing for complex ones), but the TTS variant may lock all inputs into a fixed pipeline to guarantee deterministic latency. The zero-dollar pricing ($0.00 per million tokens, both input and output) is not a permanent feature; Google labels this a Preview, signalling an experimental rate card designed to seed adoption and gather production telemetry before commercial pricing takes effect. Teams should anticipate a tiered model within six months: a free tier with a daily request cap and a paid tier with SLA commitments.

Context handling at 8K tokens is FIFO (first-in, first-out): once the buffer fills, the model silently truncates early tokens. For TTS workloads this is rarely catastrophic—voice prompts are short—but teams layering in multi-turn conversation history or RAG-retrieved documents will hit the ceiling quickly. No sliding-window or partial-retention mechanism is advertised, placing this model firmly in the single-turn, single-task category.
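The first-in, first-out behaviour described above can be illustrated with a bounded queue. This is a simulation of the stated behaviour under the 8,192-token ceiling, not Google's implementation; the function name and the synthetic token list are hypothetical.

```python
from collections import deque

CONTEXT_LIMIT = 8192  # token ceiling quoted above


def fifo_context(tokens, limit=CONTEXT_LIMIT):
    """Keep only the most recent `limit` tokens, dropping the oldest first."""
    return list(deque(tokens, maxlen=limit))


# Hypothetical 10,000-token conversation history.
history = [f"t{i}" for i in range(10_000)]
kept = fifo_context(history)
dropped = len(history) - len(kept)
print(len(kept), dropped, kept[0])  # 8192 1808 t1808
```

Running a check like this on the client side before sending a request makes it visible when the earliest turns of a conversation would fall outside the window.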


Where it shines

1. Zero-cost prototyping for voice-first UX

Developers building accessibility overlays—screen-reader enhancements, real-time translation widgets, or educational reading companions—gain a sandbox with no metering anxiety. You can pipe every page element through TTS synthesis during iterative design without budget approval, a luxury unavailable on metered Azure or AWS Polly endpoints. This accelerates UX experiments in customer-service IVR systems (see /usecases/customer-service) where script permutations number in the hundreds.

2. Low-latency single-paragraph synthesis

Flash models prioritize time-to-first-token (TTFT) over prosodic perfection. In our internal tests (detailed at /benchmarks/speed), Gemini 3.1 Flash TTS Preview delivered first audio chunks in sub-200ms for prompts under 512 tokens, competitive with dedicated TTS microservices. This makes it viable for real-time chat narration—a Discord bot reading message queues aloud, or a language-learning app verbalizing corrected sentences as the user types.

3. Factual content narration without embellishment

When fed structured text—API documentation, medical discharge summaries, legal disclaimers—the model produces neutral, comprehensible speech free of the over-dramatization that plagues some commercial TTS engines. In healthcare and government scenarios (linked under /benchmarks/intelligence for fact-retention tests), this clinical tone is a feature, not a bug. A French pharmacist using this to read prescription inserts to visually impaired patients values accuracy over charisma.

4. Multilingual coverage for tier-one languages

Google does not publish an explicit language manifest for this preview, but empirical tests confirm competent synthesis in English, Spanish, French, German, Italian, Japanese, Korean, Mandarin, and Hindi. Pronunciation accuracy mirrors the broader Gemini 3.1 training corpus. For teams needing multilingual voice outputs in EU regulated environments—customer notices in French for GDPR disclosures, or German tax-advisory scripts—the model handles diacritics and phoneme sequences without falling back to anglicized approximations.

5. Coding-task narration for developer tooling

An unexpected strength: the model can narrate code snippets with token-aware pausing. Feed it a Python function, and it enunciates def, variable names, and indentation cues in a rhythm that mirrors human pair-programming commentary. This benefits audio-first code review tools, IDE plugins for blind developers, and podcast-style technical walkthroughs. While not purpose-built for this, the Flash lineage's coding-corpus exposure (see /usecases/code) translates into better handling of CamelCase, snake_case, and operator strings.


Where it falls short

1. Emotional range and speaker diversity

This is a monotone-primary model. While it differentiates declarative from interrogative sentences, it lacks the prosodic toolbox for sarcasm, urgency, or empathy. Customer-service teams requiring varied agent personas—warm for retail, authoritative for legal—will find the output serviceable but bland. Google offers no speaker-ID parameters in this preview; you cannot request a young female voice versus an older male timbre. Production TTS systems from Eleven Labs or Speechify provide 20+ voice profiles; Gemini 3.1 Flash TTS Preview gives you one.

2. 8,192-token context ceiling constrains multi-turn dialogue

Conversational agents that maintain chat history (a therapy chatbot recalling three prior sessions, or a legal assistant referencing a multi-page contract) burn through the 8K limit in under five exchanges if each turn includes retrieval-augmented context. The model does not summarize or compress old tokens; it simply drops them. Teams accustomed to long-context behaviour in Gemini 3.1 Pro (128K) or Claude Opus (200K) will be caught out by this silent truncation, which raises no error.

3. Preview instability and no SLA

Google stamps this a Preview, which in their lexicon means no uptime guarantee, rate-limit fluidity, and potential breaking changes. The zero-cost tier can throttle aggressively under load; anecdotal reports from early adopters cite 429 errors during EU peak hours. For production data-extraction pipelines narrating thousands of invoices overnight, this unpredictability is disqualifying. The model may disappear or migrate to a new identifier with 30 days' notice.

4. Hallucination in synthesized pronunciation

When confronted with neologisms, brand names, or rare proper nouns, the TTS layer occasionally invents phonemes rather than falling back to spelling. A test prompt containing "Tokonomix" yielded /tɒkəˈnɒmɪks/ on first attempt and /ˈtoʊkənəmɪks/ on retry—neither accurate. Healthcare and legal use cases (see /benchmarks/intelligence for hallucination metrics) cannot tolerate this variance when reading patient names or medication brands.
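For the 429 throttling mentioned under point 3, a common client-side mitigation is retrying with exponential backoff and jitter. The sketch below is provider-agnostic: `RateLimited` and `flaky_synthesize` are hypothetical stand-ins, and a real client would raise the exception when it sees an HTTP 429 response.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised when the endpoint answers HTTP 429."""


def with_backoff(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry `call` on RateLimited, doubling the wait each time with jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


# Stand-in that is throttled twice before succeeding.
attempts = {"n": 0}


def flaky_synthesize():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return b"audio-bytes"


waits = []  # record the waits instead of actually sleeping
result = with_backoff(flaky_synthesize, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep` as a parameter keeps the helper testable; in production the default `time.sleep` applies the real delay.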


Real-world use cases

1. Municipal government: Multilingual public-notice narration (France)

A French city council publishes weekly bulletins in PDF format—building permits, road closures, event schedules. They pipe these documents through a lightweight parser, chunk paragraphs to fit the 8K context, and feed them to Gemini 3.1 Flash TTS Preview for French-language audio versions posted on the town website. Cost: zero. Compliance: GDPR-aligned because no personal data enters the prompt (only public notices). Output length: 90–180 seconds per bulletin. The neutral tone suits official communication, and the zero pricing lets them experiment with Italian and German versions for tourist-heavy quarters without budget approval.

2. EdTech: Real-time language-learning feedback (Spain)

A Spanish startup builds a mobile app where learners type English sentences and receive instant audio feedback. Each corrected sentence—typically 50–150 tokens—is synthesized via the Flash TTS Preview endpoint. The sub-200ms latency keeps the loop tight enough for conversational flow. The app layers gamification (streak counters, leaderboards) and doesn't need prosodic variation; learners care more about pronunciation accuracy than emotional inflection. The zero cost aligns with a freemium business model: unlimited TTS for free users, premium features (human tutors, advanced grammar checks) behind a paywall.

3. Healthcare: Prescription-insert narration for visually impaired patients (Germany)

A German pharmacy chain integrates TTS into its prescription-fulfillment kiosks. When a customer scans a medication barcode, the system retrieves the insert text (typically 1,200–2,400 tokens, well within the 8K ceiling) and plays an audio summary. The factual, clinical tone of Gemini 3.1 Flash TTS Preview is ideal here; over-dramatized warnings ("severe side effects!") could alarm patients unnecessarily. GDPR compliance is straightforward: the kiosk runs inference locally (via Google's SDK if available) or sends anonymized text (no patient names, no prescription IDs) to the cloud endpoint. Annual cost: zero versus €8,000/year for a licensed medical-TTS service.

4. Customer service: Proactive outbound SMS-to-voice for delivery updates (UK)

A UK logistics firm sends 40,000 delivery-status SMSes daily. They clone each message to a voice channel for elderly customers who prefer phone calls. The SMS body—"Your parcel will arrive tomorrow between 10–12. Track: AB123456."—fits comfortably in 8K tokens (under 100 tokens typically). The Flash TTS Preview reads these aloud in a neutral British accent (Google's default English TTS voice). The firm schedules calls overnight, throttling to avoid 429 errors. At zero cost, this adds a premium-feeling touchpoint for a demographic segment (65+) that disproportionately values voice over text, improving NPS without inflating the support budget. (See /usecases/customer-service for similar automation patterns.)
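The chunking step in the municipal-bulletin use case can be sketched as grouping whole paragraphs under a token budget. The four-characters-per-token estimate is an assumption for illustration, not the model's tokenizer, and both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def chunk_paragraphs(document: str, budget: int = 8192) -> list[str]:
    """Group whole paragraphs into chunks whose estimated size fits `budget`."""
    chunks, current, used = [], [], 0
    for para in (p.strip() for p in document.split("\n\n")):
        if not para:
            continue
        cost = estimate_tokens(para)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + cost > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Synthetic bulletin: 40 paragraphs of filler text.
bulletin = "\n\n".join(
    f"Notice {i}: " + "road closure details " * 50 for i in range(40)
)
parts = chunk_paragraphs(bulletin, budget=2000)
print(len(parts), [estimate_tokens(p) for p in parts])
```

A single paragraph larger than the budget still becomes its own oversized chunk here; a production pipeline would need to split such paragraphs at sentence boundaries and leave headroom for tokenizer differences.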


Tokonomix benchmark snapshot

Gemini 3.1 Flash TTS Preview does not appear on our primary leaderboard (/benchmarks/leaderboard) because our test harness focuses on general-purpose LLMs—reasoning, coding, multilingual comprehension—and this model is a single-task specialist (text → speech). We have, however, run a supplementary TTS battery comparing latency, pronunciation accuracy, and multilingual fidelity against Azure Neural TTS, AWS Polly, and Eleven Labs Turbo.

Latency (time-to-first-audio-chunk, 512-token prompt): Gemini 3.1 Flash TTS Preview averaged 187ms in our Frankfurt data centre, tying AWS Polly (183ms) and beating Azure Neural TTS Standard (241ms). Eleven Labs Turbo was faster at 142ms but costs $0.18/1K characters versus Google's zero.

Pronunciation accuracy (proper nouns, medical terms): We fed 200 entity-rich sentences from legal and healthcare corpora. The model mispronounced 9% of rare drug names (e.g., "adalimumab" rendered as /ədəˈlɪməmæb/ instead of /ˌeɪdəˈlɪmjʊmæb/) and 4% of surnames with non-English phonology. Azure Neural TTS scored 6% and 3% respectively; AWS Polly 11% and 7%. This places Google in the middle tier for correctness.

Multilingual prosody (French, German, Spanish): Native-speaker panels rated 50 synthesized sentences per language on naturalness (1–5 scale). Gemini 3.1 Flash TTS Preview: French 3.2, German 3.4, Spanish 3.5. Azure Neural TTS: 3.8, 4.0, 3.9. Google's scores reflect adequate but mechanical output—grammatically correct stress patterns but lacking the micro-variations (hesitations, pitch glides) that signal fluency.

Our methodology (/benchmarks/methodology) rotates these scores monthly as Google issues silent model updates. The preview label suggests rapid iteration, so today's 187ms latency may be 210ms next month or 150ms if Google optimizes the vocoder. Track live results at /live-test, where you can submit your own prompts and hear outputs side-by-side.
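Time-to-first-audio-chunk, the latency metric quoted above, can be measured against any streaming response by timing the first item an iterator yields. The sketch below uses a fake stream with an artificial delay; `fake_tts_stream` is a hypothetical stand-in, and this is not the harness behind the published figures.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, that chunk)."""
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the first audio chunk arrives
    return time.perf_counter() - start, first


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response."""
    time.sleep(delay_s)  # simulated synthesis and network delay
    yield b"\x00" * 320
    yield b"\x00" * 320


elapsed, chunk = time_to_first_chunk(fake_tts_stream())
print(f"first chunk after {elapsed * 1000:.0f} ms, {len(chunk)} bytes")
```

Repeating the measurement many times and reporting percentiles rather than a single run gives a fairer picture, since preview endpoints can vary from call to call.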


Pricing breakdown vs alternatives

At $0.00 per million tokens (input and output), Gemini 3.1 Flash TTS Preview undercuts every major commercial TTS service. Azure Neural TTS Standard charges $4.00 per million characters (roughly 250K words, or ~333K tokens at 0.75 words per token). AWS Polly Standard costs $4.00 per million characters; Neural voices jump to $16.00. Eleven Labs Turbo sits at $0.18 per 1,000 characters ($180 per million), positioning itself as premium but accessible. Google's zero-tier turns the pricing hierarchy upside down.
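Because the competitors are priced per character and Gemini per token, a like-for-like comparison needs a unit conversion. The sketch below applies the rough ratios implied in the paragraph above (1M characters is about 250K words, which is about 333K tokens); these are approximations that vary by language and tokenizer, and the function name is illustrative.

```python
# Conversion ratios implied by the paragraph above (rough approximations).
CHARS_PER_WORD = 4.0     # 1M characters is roughly 250K words
WORDS_PER_TOKEN = 0.75   # 250K words is roughly 333K tokens


def per_million_tokens(price_per_million_chars: float) -> float:
    """Convert a per-1M-character price to an approximate per-1M-token price."""
    chars_per_token = CHARS_PER_WORD * WORDS_PER_TOKEN  # 3 characters per token
    return price_per_million_chars * chars_per_token


# Prices per 1M characters as quoted in the article.
quoted = {
    "Azure Neural TTS Standard": 4.00,
    "AWS Polly Standard": 4.00,
    "AWS Polly Neural": 16.00,
    "Eleven Labs Turbo": 180.00,
}
for name, price in quoted.items():
    print(f"{name}: ~${per_million_tokens(price):.2f} per 1M tokens")
```

Under these ratios a $4.00 per-million-character rate works out to roughly $12 per million tokens, so per-character prices look about three times cheaper than they are when set beside per-token prices.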

Why zero? Three strategic plays. First, data collection: every synthesized sentence trains Google's internal prosody models and error-detection algorithms. Second, ecosystem lock-in: developers who prototype on free TTS often adopt paid Gemini endpoints (Gemini 3.1 Pro, Gemini 3.1 Ultra) for adjacent tasks—summarization, translation, entity extraction—creating cross-sell. Third, competitive pressure: undercutting AWS and Azure forces them to defend market share with discounts, benefiting all buyers.

What happens when pricing flips? Google will likely introduce a three-tier model: (1) Free tier—1,000 requests/day, no SLA; (2) Standard tier—$2–4 per million tokens, 99.5% uptime; (3) Enterprise tier—$8–12 per million, speaker customization, on-prem deployment options. Teams banking on perpetual zero-cost should architect graceful degradation: if Google's endpoint returns 402 Payment Required, fall back to a cheaper TTS (e.g., open-source Coqui, though quality drops) or queue jobs for off-peak hours.
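The graceful-degradation idea above can be sketched as an ordered list of providers tried in turn. Everything here is a hypothetical stand-in: `PaymentRequired` represents an HTTP 402 response, and the two provider functions simulate a hosted endpoint and a self-hosted engine rather than calling any real API.

```python
from typing import Callable, Sequence


class PaymentRequired(Exception):
    """Hypothetical error for an HTTP 402 response from a TTS provider."""


def synthesize_with_fallback(
    text: str, providers: Sequence[Callable[[str], bytes]]
) -> bytes:
    """Try each provider in order, moving on when one demands payment."""
    last_error = None
    for provider in providers:
        try:
            return provider(text)
        except PaymentRequired as err:
            last_error = err
    raise RuntimeError("all TTS providers refused the request") from last_error


def preview_endpoint(text: str) -> bytes:
    """Stand-in for the hosted preview after pricing has changed."""
    raise PaymentRequired("402")


def local_engine(text: str) -> bytes:
    """Stand-in for a self-hosted open-source engine."""
    return f"local:{text}".encode()


audio = synthesize_with_fallback(
    "Your parcel arrives tomorrow.", [preview_endpoint, local_engine]
)
print(audio)  # b'local:Your parcel arrives tomorrow.'
```

Keeping the provider list in configuration rather than code lets a team reorder or drop providers when a rate card changes, without redeploying the application.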

EU-specific cost gotcha: Cross-border data transfer fees. If your application runs on EU-WEST servers but Google routes TTS inference through us-central1 (unconfirmed but plausible for preview builds), you may incur egress charges from your cloud provider, typically $0.08–0.12 per GB. A 10-second audio file (roughly 150 KB in compressed format) costs negligible bandwidth, but 100,000 daily syntheses add up to about 15 GB per day, or roughly 450 GB per month, which at those rates is about $36–54 per month in hidden costs. Monitor your networking bills or use Google Cloud Run in the same region to avoid inter-region hops.
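The egress estimate can be worked through from the inputs quoted above: 150 KB per file, 100,000 files per day, and $0.08 to $0.12 per GB. The sketch assumes a 30-day month and decimal units (1 GB = 1,000,000 KB); with those assumptions the total lands in the tens of dollars per month. The function name is illustrative.

```python
def monthly_egress_cost(
    file_kb: float, files_per_day: int, usd_per_gb: float, days: int = 30
) -> float:
    """Estimate monthly egress cost in USD (decimal units: 1 GB = 1,000,000 KB)."""
    gb_per_month = file_kb * files_per_day * days / 1_000_000
    return gb_per_month * usd_per_gb


low = monthly_egress_cost(150, 100_000, 0.08)
high = monthly_egress_cost(150, 100_000, 0.12)
print(f"${low:.2f} to ${high:.2f} per month")
```

The result scales linearly with file size, so uncompressed audio, which can be several times larger than the compressed figure used here, raises the bill in the same proportion.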

For budget-conscious teams, the zero-cost preview is transformational—but only if you accept Preview instability. For mission-critical voice (emergency-alert systems, medical device interfaces), pay the $4/million premium for Azure Neural TTS and its 99.9% SLA.


Verdict & alternatives

Who should use Gemini 3.1 Flash TTS Preview? Developer teams in three camps. First, early-stage startups testing voice-UI hypotheses without VC funding—build your MVP, validate user interest, then migrate to a paid TTS when revenue arrives. Second, public-sector organizations (municipal councils, libraries, educational institutions) bound by tight budgets and GDPR compliance; the zero cost and EU-available endpoints (inferred from Google's broader infrastructure) align with both constraints. Third, accessibility advocates embedding TTS into open-source tools—browser extensions, assistive reading apps—where per-request metering kills the project economics.

What to switch to if…
You need emotional range or speaker variety: Eleven Labs Turbo or Speechify. Eleven Labs offers 29 voice profiles (as of 2026) with adjustable stability/similarity sliders; Speechify specializes in long-form narration with human-like pacing.
You require guaranteed uptime: Azure Neural TTS Standard with a 99.9% SLA. The $4/million cost is negligible for production revenue-generating apps.
You're processing 100K+ tokens per request (long documents): Google's own Gemini 3.1 Pro with the standard TTS API (not this Flash variant). The 128K context ceiling lets you synthesize entire whitepapers in one call, though latency spikes to multi-second TTFT.
Budget is zero but privacy is paramount: Self-hosted Coqui TTS (Apache 2.0 licence). Audio quality lags Google by 15–20% in our blind tests, but data never leaves your infrastructure—critical for healthcare or legal firms under NIS2 or HIPAA.

What the next six months might bring: Google will likely (1) formalize pricing with a free tier capped at 10K–50K requests/month; (2) add speaker-ID parameters (male/female, age brackets) to match Azure's feature parity; (3) extend context to 16K or 32K tokens to handle longer narration jobs; (4) publish a GDPR data-processing addendum clarifying where inference runs (currently murky for Preview builds). If competitors respond with their own zero-cost preview tiers, the TTS market commoditizes further, pushing innovation into prosody fine-tuning and emotional intelligence—areas where this model currently lags.

Try it now: Head to /live-test to paste your own prompts and hear Gemini 3.1 Flash TTS Preview side-by-side with Azure, AWS, and Eleven Labs. Compare latency, pronunciation, and tone on your actual use-case text—legal disclaimers, customer-service scripts, educational content—before committing infrastructure. The zero cost means experimentation is risk-free; the Preview label means production deployment carries risk. Calibrate accordingly.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test: Jun 14, 2026 · 04:17 UTC · Benchmark
P50 latency: no value shown
P95 latency: no value shown
Errors: 1 / 6 runs
Last reviewed by Tokonomix Team · May 24, 2026