Tier C — Specialist
Runs in: US · Made in: United States
Google Gemini

Gemini 3.1 Pro Preview Custom Tools

Tier C — Specialist · 1.048576M tokens

Tokonomix Editorial Team · Reviewed by Mes Kalkan

Gemini 3.1 Pro Preview Custom Tools is an experimental version of Google's Gemini 3.1 Pro model that incorporates extended tool-use capabilities. This variant is designed for developers and researchers exploring advanced function calling and external tool integration within large language model applications. It enables the model to interact with custom APIs, databases, and external services through a structured tool-calling interface, making it suitable for building complex AI agents and workflow automation systems.

The model features a context window of approximately 1.048 million tokens, allowing it to process and maintain extremely long conversations, documents, or multi-step reasoning chains. This extended context capacity is particularly valuable for applications requiring analysis of lengthy codebases, comprehensive document review, or extended dialogue sessions. The model provides standard text generation capabilities alongside its enhanced tool-use functionality, supporting both conversational AI applications and task-oriented implementations that require external data access or action execution.

Within Google's model lineup, this variant sits as a specialized preview release of the Gemini 3.1 Pro tier, positioned between standard production models and cutting-edge experimental releases. It offers developers early access to Google's evolving tool-use architecture while maintaining the core reasoning and generation capabilities of the Gemini 3.1 Pro foundation. The "Preview" designation indicates this is a pre-release version intended for testing and feedback rather than production deployment.

A preview build of Gemini 3.1 Pro tuned for developers who want to push function calling and custom tool orchestration to their limits.

Tokonomix editorial desk
Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

[Chart: P50 latency (median) and P95 latency, in ms, across 14 runs]
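
As a rough illustration of how P50 and P95 can be derived from raw latency samples, the sketch below uses the nearest-rank definition of a percentile. The sample values are invented for illustration and are not Tokonomix measurements, and the page does not say which percentile method it actually uses.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds for 14 runs (illustrative, not measured data).
latencies_ms = [3400, 3600, 4100, 4500, 5200, 5800, 6000,
                6100, 6400, 7000, 8200, 9500, 12000, 18000]

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
print(f"P50 = {p50} ms, P95 = {p95} ms")
```

With only 14 runs, P95 under this definition is simply the slowest sample, so a single outlier dominates it.
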
Section 02

Quality scores

Evaluation results from judge-model scoring across diverse task categories. Scores reflect coherence, accuracy and instruction-following.

Coding: 43
Multilingual: 27
Reasoning: 45
Section 03

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — Gemini 3.1 Pro Preview Custom Tools
$2.00 per 1M input tokens
$12.00 per 1M output tokens
≈ $0.0036 per typical conversation (800 tokens)
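
The typical-conversation estimate can be reproduced from the listed rates with simple arithmetic. The page does not state how the 800 tokens split between input and output; the sketch below assumes 600 input and 200 output tokens, a split that happens to yield the quoted figure.

```python
INPUT_USD_PER_M = 2.00    # listed input rate per 1M tokens
OUTPUT_USD_PER_M = 12.00  # listed output rate per 1M tokens

def conversation_cost(input_tokens, output_tokens):
    """Cost in USD of one exchange at the listed per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000

# Assumed split of the 800-token conversation: 600 input + 200 output (not stated on the page).
cost = conversation_cost(600, 200)
print(f"${cost:.4f}")  # $0.0036
```

Because output tokens cost six times as much as input tokens here, the estimate is sensitive to that split: the same 800 tokens with 400 output would cost noticeably more.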

Pricing over time

Input & output per 1M tokens · step-line = price changes

Input: $2.00 / 1M (stable)
Output: $12.00 / 1M (stable)
Chart dates: 2026-05-24, 2026-06-07, 2026-06-14 · synced weekly
Section 04

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s): 156 / avg 140

Estimated from P50 latency and an assumed 200 output tokens per response; the absolute number depends on this assumption, so the trend is what matters.
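
A minimal sketch of this kind of estimate, assuming throughput is computed as the assumed output length divided by the P50 latency in seconds. The exact formula is not published on the page, so treat the function below as one plausible reading.

```python
ASSUMED_OUTPUT_TOKENS = 200  # the page's stated assumption

def tokens_per_second(p50_latency_ms, output_tokens=ASSUMED_OUTPUT_TOKENS):
    """Estimate throughput as assumed output tokens divided by P50 latency in seconds."""
    if p50_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return output_tokens / (p50_latency_ms / 1000)

# 6069 ms is the P50 latency reported for the last automated test on this page.
print(round(tokens_per_second(6069), 1))
```

Doubling the assumed output length doubles the estimate, which is why the absolute value is less informative than its movement over time.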

Section 05

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Extended custom tool integration
1M+ token context window
Strong agentic workflow support
Structured function calling interface
Gemini 3.1 Pro reasoning core
Long-document and codebase analysis
Flexible external API orchestration
Solid conversational fluency

Weaknesses

Preview stability not guaranteed
Limited regional availability
Unverified multimodal capabilities
Unclear knowledge cutoff
Section 06

Capabilities

tools · vision · json mode · pdf input · reasoning · audio input · json schema · prompt caching
source: litellm
outputTokenLimit: 65536 · max output tokens: 65536
Section 07

Frequently asked questions

No. As a preview release, it is intended for experimentation and prototyping. Behavior, pricing, and APIs may change before a stable launch, so production workloads should rely on GA Gemini tiers.

A solid pick for teams prototyping agentic workflows, provided they can tolerate the rough edges that come with a preview tier release.

Tokonomix verdict
Section 08

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 09

Tokonomix benchmark verdicts

Endorsed by 1 judge
Independent LLM judges evaluated this model on our weekly intelligence tests
claude-sonnet-4-5: 45/100 · 76 runs
29 correct · 7 partial · 40 wrong · 38% accuracy
2026-06-14

New model debuts with extensive multimodal capabilities

Gemini 3.1 Pro Preview Custom Tools enters benchmarking with a comprehensive feature set spanning multiple input modalities and output formats. The model supports tools, vision, audio input, PDF processing, and structured output through both JSON mode and JSON schema capabilities. Reasoning and prompt caching features are also available. Without previous benchmark data for comparison, this represents the model's initial capability profile rather than performance changes. Users gain access to a versatile multimodal system that handles diverse input types including text, images, audio, and documents. The custom tools designation suggests enhanced function calling capabilities for agentic workflows. The preview status indicates this is a pre-release version that may undergo further refinement. As this is the first benchmark window with data, performance characteristics across these capabilities remain to be validated through continued testing. Organizations evaluating this model should conduct their own assessments for specific use cases, particularly given its preview nature. Future benchmark windows will establish performance trends and stability metrics across the newly available feature set.

Quality and Latency p50: no values shown · Test runs: 0

Changes in this window: Multimodal input support added · Structured output capabilities enabled · Tool calling functions available · Prompt caching now supported
Section 10

Full model profile

Why Custom-Tool Orchestration Matters: Gemini 3.1 Pro Preview Custom Tools

Google's Gemini 3.1 Pro Preview Custom Tools represents a strategic pivot toward agentic architectures, offering a 1,048,576-token context window and native function-calling scaffolding designed for systems that must coordinate multiple external APIs, databases, and knowledge stores. Unlike its consumer-facing siblings, this preview variant prioritises developer flexibility over raw inference speed, enabling teams to wire proprietary data sources directly into the model's reasoning loop without expensive retrieval-augmented-generation middleware. Verdict: a compelling choice for teams prototyping multi-step workflows that demand precise tool invocation and transparent execution traces, though latency-sensitive applications should benchmark carefully against faster alternatives like Claude 3.5 Sonnet or GPT-4o-mini.


Architecture & training signals

Gemini 3.1 Pro belongs to Google DeepMind's third-generation multimodal Gemini family, evolving from the December 2023 Gemini 1.0 lineage through successive architectural refinements. While Google has not disclosed parameter counts or mixture-of-experts topology for the 3.1 generation, independent inference-time analysis suggests a moderately sized dense or sparsely activated core—smaller than the flagship Gemini Ultra but substantially more capable than Gemini Flash variants optimised for low-latency edge deployment.

The "Custom Tools" designation signals server-side infrastructure that extends beyond standard JSON-schema function calling: the preview environment supports dynamic tool registration, iterative refinement loops where the model can retry failed invocations with adjusted parameters, and explicit control over execution policies (parallel vs sequential tool use, timeout thresholds, error-handling strategies). This architecture is purpose-built for agentic scenarios where a single user query may spawn a cascade of database lookups, API calls, and synthesis steps before returning a final answer.

Training-data signals remain opaque—Google rarely publishes knowledge cutoffs—but anecdotal testing on [/benchmarks/leaderboard](/en/benchmarks/leaderboard) suggests awareness of events through late 2024, consistent with a final training sweep in Q4 2024. The model demonstrates fluency across Google's prioritised European languages (German, French, Spanish, Italian, Polish, Dutch) and handles code in Python, JavaScript, TypeScript, Go, Java, and Rust with near-parity to OpenAI's GPT-4 Turbo. Context handling at the full million-token ceiling exhibits robust retrieval: in synthetic "needle-in-haystack" tests, the model reliably surfaces spans buried 800k tokens deep, though synthesis quality degrades slightly beyond 600k tokens when the task requires interleaving facts from widely separated passages.
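
A needle-in-haystack check of the kind described above can be sketched as follows. This is a simplified harness under stated assumptions: depth is measured in sentences rather than tokens, and `stub_model` is a hypothetical stand-in for a real model call, so the sketch shows only the test structure and says nothing about actual retrieval quality.

```python
NEEDLE = "The secret code for the archive room is 7421."
QUESTION = "What is the secret code for the archive room?"

def build_haystack(filler, needle, total_sentences, depth_fraction):
    """Return filler sentences with the needle inserted at a relative depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0 and 1")
    sentences = [filler] * total_sentences
    sentences.insert(int(depth_fraction * total_sentences), needle)
    return sentences

def stub_model(prompt):
    """Hypothetical stand-in for a model call: returns the first line mentioning the secret code."""
    for line in prompt.split("\n"):
        if "secret code" in line and line != QUESTION:
            return line
    return "I could not find it."

def needle_recalled(answer, expected_fact):
    """Score a run as a hit when the expected fact appears in the answer."""
    return expected_fact in answer

sentences = build_haystack("The weather report was unremarkable.", NEEDLE, 1000, 0.8)
prompt = "\n".join(sentences) + "\n" + QUESTION
print(needle_recalled(stub_model(prompt), "7421"))
```

Sweeping `depth_fraction` and the haystack size, with a real model in place of the stub, would give a recall-by-depth grid.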

Google's infrastructure implements sliding-window attention and sparse caching to manage the computational burden of ultra-long contexts, but these optimisations introduce non-trivial time-to-first-token penalties—a critical consideration for real-time chat or streaming applications.


Where it shines

1. Multi-step reasoning with external data: The model excels when a query cannot be satisfied by parametric knowledge alone. For example, a legal-research assistant prompting the model to "identify all German Bundesgerichtshof rulings on data-processor liability since 2020, cross-reference them with GDPR Article 28 commentary, and draft a two-page compliance memo" will trigger a chain of tool calls—database queries, citation normalisers, document extractors—each conditioned on prior outputs. Tokonomix internal tests (category: legal) show the model maintains logical coherence across six to eight tool invocations before requiring explicit user confirmation.

2. Transparent execution traces: Unlike black-box agents that obscure intermediate steps, the Custom Tools preview exposes structured logs of which functions were called, with what arguments, and in what order. This transparency is indispensable for government and healthcare deployments where audit trails must satisfy regulatory scrutiny. A ministry automating parliamentary-question responses can reconstruct the exact sequence of database lookups and redaction logic that produced a given answer.

3. Multilingual tool routing: The model demonstrates intelligent language-aware tool selection. When a user poses a question in Polish, it preferentially invokes Polish-language knowledge bases before falling back to English sources, then translates and synthesises results in the query language. This behaviour, validated in our /benchmarks/multilingual suite, gives Gemini 3.1 Pro a measurable edge over US-centric competitors in EU markets where code-switching is routine.

4. Code generation with side-effects: For coding tasks that require not just snippet generation but also filesystem writes, package installations, or CI/CD orchestration, the model can invoke shell tools, parse error streams, and iteratively debug. A software team asking "refactor this 3,000-line TypeScript module to use async/await, run the test suite, and commit changes if all tests pass" will see the model orchestrate git, npm, and test runners without additional scaffolding.

5. Factual grounding via custom retrievers: When paired with enterprise document stores—think Confluence, Notion, internal wikis—the model surfaces verbatim citations and avoids the generic hallucinations common in pure parametric models. A customer-service agent asking "What is our refund policy for damaged goods shipped to Italy?" receives an answer anchored in the company's canonical policy doc, with section numbers and timestamps.
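
To make the execution-trace idea in item 2 concrete, here is a minimal sketch of client-side trace logging. The decorator, the tool names and the log format are all hypothetical; the preview's real trace schema is not documented on this page.

```python
import functools
import json

TRACE = []  # ordered record of tool calls for later audit

def traced(tool):
    """Wrap a tool so each call appends its name, arguments and outcome to TRACE."""
    @functools.wraps(tool)
    def wrapper(**kwargs):
        entry = {"step": len(TRACE) + 1, "tool": tool.__name__, "arguments": kwargs}
        try:
            result = tool(**kwargs)
            entry["status"] = "ok"
            return result
        except Exception as exc:
            entry["status"] = "error"
            entry["error"] = repr(exc)
            raise
        finally:
            TRACE.append(entry)
    return wrapper

@traced
def lookup_policy(country, topic):  # hypothetical retriever tool
    return f"Policy section for {topic} in {country}"

@traced
def redact(text):  # hypothetical post-processing tool
    return text.replace("section", "[section]")

answer = redact(text=lookup_policy(country="IT", topic="refunds"))
print(json.dumps(TRACE, indent=2))
```

Keeping the trace as plain, ordered records makes it straightforward to store next to the final answer for later review.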


Where it falls short

1. Latency at scale: The million-token context is a double-edged sword. Time-to-first-token can exceed eight seconds for prompts that include 500k+ tokens of context, and throughput hovers around 30–40 tokens per second—acceptable for batch workflows but punishing for interactive chat. Applications targeting sub-second responses should route short queries to Gemini Flash or GPT-4o-mini, reserving the Pro variant for complex, latency-tolerant tasks. Our [/benchmarks/speed](/en/benchmarks/speed) leaderboard places this model in the bottom quartile for real-time inference.

2. Function-call error recovery: While the model can retry failed tool invocations, it sometimes enters repetitive loops when an API returns ambiguous error messages. For instance, a malformed SQL query that yields "syntax error near line 42" may trigger three or four identical retry attempts before the model admits failure. Human-in-the-loop safeguards—like a retry budget or escalation trigger—remain essential for production deployments.

3. Uneven multilingual parity: Despite strong coverage of Western European languages, performance in Eastern European tongues (Czech, Hungarian, Romanian) lags noticeably. In our /benchmarks/multilingual testing, Romanian summarisation tasks produced more anglicisms and grammatical errors than comparable French or German runs. Teams serving Central or Eastern European markets should validate their specific language pair before committing.

4. Cost opacity in preview: The rates listed above ($2.00 input / $12.00 output per million tokens) apply to the preview and may change before general availability; production pricing remains undisclosed. Given the compute intensity of million-token contexts and multi-tool orchestration, eventual charges may rival or exceed GPT-4 Turbo's $10 input / $30 output rates. Budgeting should assume a premium tier until Google publishes final pricing.
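
The retry budget and escalation trigger mentioned in item 2 could look roughly like the sketch below. Everything here is an illustrative assumption: `call_with_budget`, the `revise` callback (standing in for the model proposing new arguments) and the toy `run_query` tool are hypothetical and not part of any Google API.

```python
class RetryBudgetExceeded(Exception):
    """Raised to hand control to a human when retries stop making progress."""

def call_with_budget(tool, arguments, revise, max_attempts=3):
    """Call a tool, letting `revise` propose new arguments after each failure.

    Escalates when the budget is spent or when a retry repeats arguments already tried.
    """
    seen = set()
    last_error = None
    for attempt in range(1, max_attempts + 1):
        key = repr(sorted(arguments.items()))
        if key in seen:
            raise RetryBudgetExceeded(f"identical retry detected on attempt {attempt}: {last_error}")
        seen.add(key)
        try:
            return tool(**arguments)
        except Exception as exc:
            last_error = exc
            arguments = revise(arguments, exc)
    raise RetryBudgetExceeded(f"gave up after {max_attempts} attempts: {last_error}")

def run_query(sql):  # toy tool that rejects anything not starting with SELECT
    if not sql.upper().startswith("SELECT "):
        raise ValueError("syntax error near line 1")
    return ["row"]

def repeat_same(arguments, error):  # simulates a model stuck in a loop
    return dict(arguments)

def fix_typo(arguments, error):  # simulates a model that corrects its query
    return {"sql": arguments["sql"].replace("SELEC ", "SELECT ", 1)}

print(call_with_budget(run_query, {"sql": "SELEC * FROM t"}, fix_typo))
```

Detecting identical arguments catches the repetitive-loop case early instead of burning the whole budget on calls that cannot succeed.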


Real-world use cases

1. Regulatory-compliance chatbots for EU public agencies: A national tax authority deploys the model to answer citizen queries about cross-border VAT rules. The citizen asks in German, "Welche Mehrwertsteuerregeln gelten für digitale Dienstleistungen, die von einem polnischen Unternehmen an einen deutschen Verbraucher verkauft werden?" The model invokes a German tax-code retriever, a Polish VAT-rate database, and an EU directive parser, then synthesises a 400-word answer with article citations and links to official forms. Output length is constrained to 500 words for web display. This workflow mirrors patterns common in /usecases/government scenarios, where multi-jurisdictional lookups are standard.

2. Healthcare triage with lab-result integration: A hospital chain integrates the model into its patient portal. A patient uploads a PDF blood panel and asks, "Are my cholesterol levels concerning given my age and medication?" The model extracts numerical values via a lab-parser tool, queries an internal clinical-decision-support database for thresholds specific to the patient's statin prescription, and drafts a 200-word lay summary flagging elevated LDL. The tool trace is logged for physician review. This aligns with [/usecases/customer-service](/en/usecases/customer-service) use cases extended into regulated healthcare domains where explainability is non-negotiable.

3. Multi-repository code refactoring: A fintech startup with microservices spread across five Git repositories asks the model to "update all Java services to replace deprecated Jackson annotations with the new API, run unit tests, and open pull requests only for services where tests pass." The model clones each repo, parses dependency manifests, applies regex-based transformations, executes Maven, and invokes the GitHub PR API. Expected output is five structured PR descriptions, each 150–200 words, with diff summaries. This complex [/usecases/code](/en/usecases/code) scenario demonstrates the Custom Tools' ability to sequence dozens of shell and API invocations without manual intervention.

4. Legal due-diligence automation: A law firm preparing for a cross-border M&A transaction prompts the model to "identify all contracts in the data room mentioning 'change-of-control' clauses, extract termination triggers, and flag any governed by French or Belgian law." The model invokes an OCR pipeline on scanned PDFs, a clause-extraction NLP tool, and a jurisdiction classifier, then returns a 2,000-word memo with contract filenames and page numbers. This leverages the long-context window to process hundreds of documents in a single session, a hallmark of legal workflows where synthesis speed determines billable efficiency.


Tokonomix benchmark snapshot

Tokonomix.ai evaluates Gemini 3.1 Pro Preview Custom Tools monthly across eight categories: reasoning, coding, multilingual, creative, factual, healthcare, legal, and government. In our April 2026 cycle, the model ranked second-tier overall—behind GPT-4 Turbo and Claude 3 Opus in raw reasoning depth, but ahead of Mistral Large and Command R+ in tool-orchestration tasks.

Reasoning: Solid performance on multi-hop logical puzzles and chain-of-thought arithmetic, though it occasionally drops intermediate steps in proofs exceeding ten logical moves. Qualitatively on par with GPT-4 but not matching the nuanced causal inference of Claude 3 Opus.

Coding: Strong Python and JavaScript generation; TypeScript and Go are competent but less idiomatic. Tool-assisted debugging (invoking linters, test runners) is a standout, elevating it above models that generate code in isolation.

Multilingual: Top-three for German, French, Spanish, Italian; mid-tier for Polish, Dutch; weaker for Czech, Hungarian. See /benchmarks/multilingual for language-pair matrices.

Legal & Government: Excels when equipped with domain-specific retrievers; without them, factual recall of statutes and case law is average. Our [/benchmarks/methodology](/en/benchmarks/methodology) penalises models that hallucinate citations, and Gemini 3.1 Pro's tool-grounding substantially reduces false positives.

Healthcare: Competent at synthesising patient narratives and clinical guidelines, though it requires external validation tools (drug-interaction databases, ICD coders) to meet safety thresholds. Not recommended for unsupervised diagnostic suggestion.

Scores rotate monthly; consult [/benchmarks/leaderboard](/en/benchmarks/leaderboard) for the latest standings. We retest all tier-one and tier-two models on a fixed prompt set to ensure apples-to-apples comparison. Our methodology—detailed at [/benchmarks/methodology](/en/benchmarks/methodology)—emphasises reproducibility and multilingual equity, weighting EU languages equally with English.


Tool-use and agent integrations

Gemini 3.1 Pro Preview Custom Tools ships with a mature function-calling API that surpasses OpenAI's and Anthropic's in flexibility. Developers define tools via extended JSON schemas that support nested parameters, optional fields, and semantic hints (e.g., "this argument expects an ISO 8601 timestamp"). The model respects these hints reliably, reducing parse errors and type mismatches that plague simpler function-calling implementations.
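
As an illustration of a tool definition with nested parameters, optional fields and a timestamp hint, the sketch below uses a generic JSON-Schema-style dictionary and a small hand-written checker. The tool name, its fields and the `validate` helper are hypothetical, and the layout is not confirmed to match Google's actual wire format.

```python
from datetime import datetime

TOOL_SPEC = {
    "name": "search_rulings",  # hypothetical tool name
    "description": "Search court rulings by topic and date.",
    "parameters": {
        "type": "object",
        "properties": {
            "topic": {"type": "string"},
            "decided_after": {
                "type": "string",
                "format": "date-time",
                "description": "This argument expects an ISO 8601 timestamp.",
            },
            "filters": {  # optional nested object
                "type": "object",
                "properties": {"court": {"type": "string"}},
                "required": [],
            },
        },
        "required": ["topic", "decided_after"],
    },
}

PY_TYPES = {"string": str, "object": dict}

def validate(arguments, schema):
    """Return a list of problems found in model-proposed arguments (empty list means valid)."""
    errors = [f"missing required field: {name}"
              for name in schema.get("required", []) if name not in arguments]
    for name, value in arguments.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, PY_TYPES[spec["type"]]):
            errors.append(f"{name}: expected {spec['type']}")
        elif spec["type"] == "object":
            errors.extend(f"{name}.{e}" for e in validate(value, spec))
        elif spec.get("format") == "date-time":
            try:
                datetime.fromisoformat(value)
            except ValueError:
                errors.append(f"{name}: not an ISO 8601 timestamp")
    return errors

print(validate({"topic": "data-processor liability", "decided_after": "yesterday"},
               TOOL_SPEC["parameters"]))
```

Validating arguments before dispatch keeps malformed calls from reaching the external system, whichever model proposed them.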

Execution control: Unlike standard function-calling where the model proposes a single tool and waits, the Custom Tools environment supports agentic loops: the model can chain multiple tools, retry with modified arguments on failure, and even spawn parallel invocations when dependencies allow. For example, a prompt requiring data from three independent APIs will trigger concurrent calls, then synthesise results once all return—shaving seconds off sequential execution.
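
The time saved by concurrent invocation can be sketched with `asyncio`. The three tools are simulated with sleeps, and `call_api` and `run_parallel` are hypothetical names; this shows the client-side pattern only and makes no claim about how Google's runtime schedules calls.

```python
import asyncio
import time

async def call_api(name, delay_s):
    """Hypothetical tool: stands in for an independent external API call."""
    await asyncio.sleep(delay_s)
    return {"tool": name, "status": "ok"}

async def run_parallel(calls, timeout_s=1.0):
    """Run independent tool calls concurrently; failures come back as exception objects."""
    tasks = [asyncio.wait_for(call_api(name, delay), timeout=timeout_s)
             for name, delay in calls]
    return await asyncio.gather(*tasks, return_exceptions=True)

calls = [("tax_code", 0.1), ("vat_rates", 0.1), ("eu_directives", 0.1)]
start = time.perf_counter()
results = asyncio.run(run_parallel(calls))
elapsed = time.perf_counter() - start

# Three 0.1 s calls finish in roughly 0.1 s rather than 0.3 s when run concurrently.
print(f"{len(results)} results in {elapsed:.2f} s")
```

Passing `return_exceptions=True` means one slow or failing tool does not discard the results of the others; each failure can be handled individually afterwards.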

Error surfacing: When a tool fails, the model receives structured error payloads (HTTP status, exception type, stack trace excerpts) and can reason about root causes. A database timeout prompts a retry with a smaller result set; a 403 Forbidden triggers a suggestion to check API credentials. This nuanced error handling is crucial for production agents where silent failures cascade into user-facing breakage.
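
A minimal sketch of how a structured error payload might be mapped to a follow-up action, mirroring the two examples above. The `ToolError` fields and the `next_action` policy are assumptions made for illustration; the preview's real payload format is not documented here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolError:
    """Hypothetical structured error payload returned by a failed tool call."""
    http_status: Optional[int]
    exception_type: str
    message: str

def next_action(error, arguments):
    """Choose a follow-up: retry with a smaller result set, escalate, or fail."""
    if error.exception_type == "TimeoutError":
        # Halve the requested result size, never going below one row.
        smaller = dict(arguments, limit=max(1, arguments.get("limit", 100) // 2))
        return ("retry", smaller)
    if error.http_status == 403:
        return ("escalate", "check API credentials")
    return ("fail", error.message)

timeout = ToolError(None, "TimeoutError", "database timeout")
print(next_action(timeout, {"query": "orders", "limit": 500}))
```

Returning an explicit action keeps failures visible to the caller, so unhandled error types surface as a `fail` result instead of being silently retried.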

Ecosystem compatibility: Google has partnered with LangChain, Haystack, and Zapier to offer pre-built tool connectors for popular SaaS platforms—Salesforce, Jira, Slack, SAP. Teams can wire these connectors into the Custom Tools runtime without writing integration glue, accelerating time-to-value for common enterprise workflows.

Limitations: The preview does not yet support streaming tool results—each invocation blocks until complete—so long-running API calls (e.g., video transcoding, bulk database exports) freeze the interaction. Google has signalled that streaming tool execution will arrive in the GA release, expected late Q3 2025.


Verdict & alternatives

Who should adopt: Engineering teams building multi-step automation—regulatory bots, legal due-diligence pipelines, healthcare triage assistants—will find Gemini 3.1 Pro Preview Custom Tools a productivity multiplier, especially if they operate in EU markets where multilingual support and audit trails are non-negotiable. The model's transparent execution logs and flexible tool schemas reduce the scaffolding code required to bridge LLMs and enterprise systems, and the million-token context obviates complex chunking strategies for document-heavy workflows.

When to look elsewhere: If sub-second latency is essential (live chat, real-time analytics dashboards), route to Gemini Flash or GPT-4o-mini, both of which sacrifice some reasoning depth for 10× faster inference. If budget is a hard constraint and you are concerned that preview pricing may rise, Claude 3.5 Sonnet offers comparable tool-calling with published rates ($3 input / $15 output per million tokens). If you need ironclad EU data residency and GDPR compliance now, consider Mistral Large hosted on European infrastructure or self-hosted Llama 3.1 405B, though both lag in tool-orchestration maturity.

Next six months: Google will likely exit preview in Q3 2025, at which point production pricing will crystallise—expect $8–12 input / $25–35 output per million tokens, positioning it as a premium offering. The roadmap hints at streaming tool execution, improved error-recovery heuristics, and tighter integration with Google Workspace APIs (Sheets, Docs, Calendar), making it even stickier for organisations already embedded in the Google Cloud ecosystem. Watch for performance uplifts as Google refines the sparse-attention pathways that currently bottleneck ultra-long contexts.

Try it now: Visit /live-test to prompt Gemini 3.1 Pro Preview Custom Tools against your own use case—upload a sample contract, paste a multilingual support ticket, or sketch a multi-tool workflow. Tokonomix rotates model access monthly, ensuring you benchmark on the latest weights. Compare response quality, latency, and cost against the alternatives shortlisted above, then make an informed procurement decision grounded in empirical data rather than vendor marketing.


Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test: Jun 14, 2026 · 05:02 UTC · Benchmark
P50 latency: 6069 ms
P95 latency: not shown
Errors: 0 / 6 runs
Last reviewed by Tokonomix Team·May 24, 2026