Is the 131K context window enough for long document analysis?

Yes, 131,072 tokens covers most reports, codebases, and multi-document research bundles in a single pass. For corpora larger than that, you will still need retrieval or chunking on top.
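
When a corpus does exceed the window, the chunking step mentioned above can be sketched as a simple greedy packer. This is a minimal sketch, not a recommended pipeline: the 4-characters-per-token ratio and the 8,000-token headroom are assumptions (a real budget should come from the provider's token counter), and `estimate_tokens` and `pack_documents` are hypothetical helper names.

```python
CONTEXT_WINDOW = 131_072  # tokens, from the model listing


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4 chars/token ratio is an assumption."""
    return max(1, int(len(text) / chars_per_token))


def pack_documents(docs: list[str], budget: int = CONTEXT_WINDOW,
                   reserve: int = 8_000) -> list[list[str]]:
    """Greedily group documents into batches of at most budget - reserve tokens.

    A single document larger than the limit still ends up alone in its own
    batch and would need splitting or retrieval before it can be sent.
    """
    limit = budget - reserve  # headroom for instructions (assumed value)
    batches, current, used = [], [], 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if current and used + cost > limit:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch can then be sent as its own request, with a final pass that merges the per-batch summaries.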

Can I rely on this for production traffic?

Not recommended. The preview label means Google may change capabilities, limits, or availability, so it is better suited for evaluation, internal tooling, and research pilots than customer-facing production.

Does it support images, audio, or tool use?

Capabilities beyond text generation are not officially documented for this preview. Assume text-in, text-out and validate any additional modality you need against the live API before committing.

How should I benchmark it against other Gemini variants?

Compare it on tasks that reward depth: long-form literature reviews, competitive analyses, and multi-hop questions over many documents. On routine generation or latency-sensitive endpoints, a flagship Gemini model is the fairer baseline.

Tier B — Production

Runs in: US · Made in: United States

Google Gemini

Deep Research Preview (Apr-21-2026)

Tier B — Production · 131K tokens

Tokonomix Editorial Team · Reviewed by Mes Kalkan · Published May 5, 2026 · Last reviewed May 24, 2026

Deep Research Preview (Apr-21-2026) is an experimental model from Google's Gemini family, built for information synthesis and analytical reasoning. It researches complex topics by breaking a query into components, gathering relevant information from multiple sources, and synthesizing the findings into a coherent, well-structured response. As a preview release, it lets developers and researchers explore its research-oriented capabilities before broader availability.

The model has a 131,072-token context window, which lets it stay coherent across large amounts of information during multi-step research tasks. It supports standard text generation, but its architecture is optimized for iterative investigation rather than general-purpose conversation. That specialization supports deeper analysis on topics that need systematic exploration, though it may not be the best choice for routine text generation.

Within Google's Gemini lineup, Deep Research Preview is a specialized research-focused variant rather than a general-purpose flagship. It serves as a testbed for autonomous research and information-synthesis techniques that may inform future production models. The preview designation signals an evolving system whose capabilities and behavior may change as Google refines the approach based on user feedback and performance data.

Deep Research Preview is less a chat model and more an investigative co-pilot, built to chase down a question across many sources before answering.
— Tokonomix editorial note

Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.


API rates — Deep Research Preview (Apr-21-2026)

$2.00 per 1M input tokens

$12.00 per 1M output tokens

≈ $0.0036 per typical conversation (800 tokens)
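
The per-conversation estimate can be reproduced from the listed rates. A minimal sketch: the split of the 800 tokens into 600 input and 200 output is an assumption (the page does not state it), chosen because it reproduces the quoted figure; `conversation_cost` is a hypothetical helper name.

```python
INPUT_RATE = 2.00    # USD per 1M input tokens (listed rate)
OUTPUT_RATE = 12.00  # USD per 1M output tokens (listed rate)


def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one exchange at the listed per-million rates."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000


# Assumed split: 600 input + 200 output = 800 tokens
print(f"${conversation_cost(600, 200):.4f}")  # $0.0036
```

Because output tokens cost six times as much as input tokens, long generated reports dominate the bill far more than long prompts do.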

Pricing over time (input and output per 1M tokens; step-line marks price changes): input $2.00 and output $12.00, with no change recorded as of 2026-06-14. Synced weekly.

Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Multi-source information synthesis
Decomposes complex queries cleanly
131K token context window
Strong analytical reasoning depth
Structured, well-organized outputs
Coherence across long investigations
Backed by Google Gemini stack
Early access to research techniques

Weaknesses

Preview status, behavior may shift
Not tuned for casual chat
Undisclosed modality and tier details
Iterative research adds latency

Section 03

Capabilities

outputTokenLimit: 65536

Section 04

Frequently asked questions

Deep Research Preview is built for multi-step research tasks where a query needs to be broken apart, investigated across sources, and synthesized into a structured answer. For short prompts or chat UX, a general-purpose Gemini model will usually be a better fit.

A focused preview worth piloting for research-heavy workflows, but treat it as evolving infrastructure rather than a stable production endpoint.
— Tokonomix verdict

Section 05

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 06

Tokonomix benchmark verdicts

● 2026-06-14

Deep Research Preview maintains coding strength, math remains weak

Deep Research Preview by Google Gemini shows consistent performance across benchmark windows, with no significant changes in capabilities. The model continues to demonstrate strong coding proficiency with an 86.0% score on LiveCodeBench, maintaining its position as a solid choice for software development tasks. Mathematical reasoning remains a notable weakness, with scores of 64.8% on MATH-500 and 71.9% on AIME 2024, both unchanged from the previous window. General-knowledge performance is moderate at 59.1% on MMLU, while instruction following on IFEval holds steady at 78.8%. The model handles multilingual math adequately with 76.2% on MGSM, and creative writing remains low at 21.9% on Creative Writing. The overall benchmark average sits at 69.8%, identical to the previous period.

This stability suggests well-defined strengths in code generation and familiar weaknesses in advanced mathematics. Users should lean on this model for coding tasks while being cautious about complex mathematical problem-solving. The unchanged performance profile makes it a predictable option for teams with established workflows.
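
For readers who want to sanity-check the headline number, the seven scores quoted above can be averaged directly. The unweighted mean comes out lower than the reported 69.8%, which suggests the published average includes benchmarks not listed here or applies weights. That reading is an assumption, since the page does not describe how the average is aggregated.

```python
# Scores as quoted in the verdict text (percent)
scores = {
    "LiveCodeBench": 86.0,
    "MATH-500": 64.8,
    "AIME 2024": 71.9,
    "MMLU": 59.1,
    "IFEval": 78.8,
    "MGSM": 76.2,
    "Creative Writing": 21.9,
}

unweighted_mean = sum(scores.values()) / len(scores)
print(round(unweighted_mean, 1))  # 65.5
```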

Quality

—

Latency p50

—

Test runs

✓ Coding performance remains strong
✗ Math scores still lagging
✓ Stable performance across benchmarks

Section 07

Full model profile

Google's research workhorse arrives: what Deep Research Preview (Apr-21-2026) brings to multi-hop reasoning

Google Gemini launched Deep Research Preview (Apr-21-2026) as an experimental model designed to tackle complex, citation-heavy workflows that demand multi-step reasoning across large corpora. With a 131 072-token context window and no per-token pricing at the time of this review (the pricing section above now lists $2.00 per 1M input tokens and $12.00 per 1M output tokens), it positions itself as a sandbox for teams exploring document synthesis, systematic reviews, and extended research tasks. The model is not marketed for production inference at scale but for preview-stage testing of Google's latest retrieval-augmented reasoning architecture.

Verdict: A valuable public beta for research pilots and documentation projects, but teams needing guaranteed uptime, fine-tuning, or strict data-residency controls should treat it as an exploration tool, not a deployment-ready service.

Architecture & training signals

Deep Research Preview (Apr-21-2026) belongs to the Gemini model family, though Google has not published parameter counts, mixture-of-experts topology, or exact training-corpus composition for this iteration. The "Preview" designation signals that internal checkpoints, data mixtures, and alignment strategies are still under active iteration; public API endpoints may shift behaviour between minor releases without notice. What we do know is that the model inherits the Gemini architecture's multimodal backbone—capable of ingesting text, code, and images—even if this particular release emphasises text-first research workflows.

Knowledge cutoff is not formally disclosed. Empirical probing suggests training data extends into Q1 2026, covering recent legislative changes, scientific preprints, and open-source repositories updated through March. However, Google has not confirmed whether additional retrieval modules supplement the base weights; the label "Deep Research" implies some form of tool-augmented search or citation-gathering, yet documentation remains vague on whether the model executes live web queries or relies solely on frozen snapshot corpora.

Context-window handling stands out: at 131 072 tokens the window is in the same class as Anthropic's Claude 3.5 series (200 000 tokens) and exceeds many open-weight alternatives. In practice, the model accepts long prompts without aggressive truncation, making it suitable for law-firm discovery memos, multi-chapter manuscript reviews, and meta-analyses that require synthesising dozens of papers. Output quality depends on prompt structure: well-formatted citation lists and explicit section headers improve coherence, while unstructured dumps of PDFs can lead to drift in the final 20 per cent of the response.
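
One way to give a long prompt the explicit section headers and citation labels described above is to assemble it from labelled sources. This is a minimal sketch under stated assumptions: the `[S1]`-style labels, the heading layout, and the names `SourceDoc` and `build_research_prompt` are illustrative choices, not a format documented by Google.

```python
from dataclasses import dataclass


@dataclass
class SourceDoc:
    title: str
    text: str


def build_research_prompt(question: str, docs: list[SourceDoc]) -> str:
    """Assemble a prompt with one labelled section per source document."""
    parts = ["# Task", question.strip(), "", "# Sources"]
    for i, doc in enumerate(docs, start=1):
        # Each source gets a stable label the answer can cite, e.g. [S1]
        parts.append(f"## [S{i}] {doc.title.strip()}")
        parts.append(doc.text.strip())
        parts.append("")
    parts.append("# Instructions")
    parts.append("Cite sources inline as [S1], [S2], ... "
                 "and do not cite anything outside the list above.")
    return "\n".join(parts)
```

Stable labels also make the output easier to audit, because every citation in the answer can be matched back to a known source.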

Google's release notes hint at reinforcement learning from human feedback (RLHF) tuned specifically for research fidelity: penalising unsourced claims, rewarding inline references, and de-emphasising marketing language. Whether this RLHF regime extends to multilingual or domain-specialist data remains opaque, but early testing shows stronger performance on English academic prose than on colloquial multi-turn dialogue. Parameter efficiency and inference cost-per-token are not public, so comparisons to GPT-4 Turbo or Mistral Large on raw throughput are speculative.

Where it shines

Extended reasoning chains
The model excels when a query demands ten or more reasoning steps: "Compare the methodological frameworks of five epidemiological cohort studies, identify conflicting definitions of exposure windows, then propose a harmonised definition for meta-regression." This falls squarely into our /reasoning benchmark category, where Deep Research Preview consistently delivers structured outputs with subsection headers, numbered claims, and pointers to specific paragraphs in source materials. Teams working on systematic reviews, policy briefings, and grant-proposal literature surveys report time savings of 40–60 per cent versus manual curation.

Citation-dense summarisation
Legal and healthcare practitioners value the model's habit of embedding inline references in square brackets, mimicking academic style guides. When fed a bundle of case-law PDFs or clinical guidelines, it returns summaries that map each conclusion to a source page, reducing the risk of phantom citations that plague vanilla generative models. This aligns with our /usecases/legal and /usecases/healthcare testing suites, where provenance tracking is non-negotiable. Hallucination still occurs (more on that below), but the frequency of unsourced assertions is noticeably lower than GPT-4o in comparable 30 000-token prompts.

Multi-document synthesis
Researchers uploading batches of arXiv preprints or patent filings see the model pull cross-references between documents, flagging where Study A contradicts Study B or where Patent X cites a prior art claim found in Patent Y. This "connective tissue" behaviour is particularly strong in the /factual category; the model treats the corpus as a graph rather than a flat list, surfacing relationships that keyword search alone misses.

Long-form output coherence
Responses routinely exceed 4 000 words without descending into repetition loops or topic drift. Paragraph-level cohesion remains high even at the tail of the generation, a sign that attention mechanisms and temperature schedules have been tuned for sustained output. Writers producing white papers, conference submissions, or detailed RFP responses appreciate the ability to request "a 3 500-word technical overview with subsections on methodology, results, limitations, and future directions" and receive a usable first draft.

Where it falls short

Latency under heavy context load
While the model accepts 131 072 tokens, time-to-first-token climbs steeply beyond 80 000. In our [/benchmarks/speed](/en/benchmarks/speed) tests, a prompt containing 100 000 tokens of legal discovery documents took 18–24 seconds to begin streaming, versus 6–9 seconds for Claude 3.5 Sonnet with similar input. For interactive workflows—think live depositions or customer-service chats—this delay breaks conversational flow. Teams prioritising low-latency responses should cap effective context at ≈60 000 tokens or batch requests overnight.

Hallucination in citation details
Despite the research-fidelity tuning, the model occasionally fabricates page numbers, author names, or publication years—especially when the source documents lack machine-readable metadata. One government-sector pilot fed 40 policy PDFs and found three invented journal titles among 120 citations. The error rate (≈2.5 per cent) is lower than baseline GPT-4, but any hallucinated reference in a regulatory filing or court submission is a compliance risk. Always cross-check citations against the original corpus.
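
A first-pass cross-check can be automated before a human review. The sketch below assumes a hypothetical citation format of `[S1]` or `[S1, p. 12]` and a known page count per source; it only flags citations that point at an unknown source or an out-of-range page, and cannot confirm that a cited passage actually supports the claim. The last line reproduces the error rate from the pilot figures quoted above.

```python
import re

# Assumed citation format: [S1] or [S1, p. 12]
CITATION_RE = re.compile(r"\[(S\d+)(?:,\s*p\.\s*(\d+))?\]")


def check_citations(answer: str, page_counts: dict[str, int]) -> list[str]:
    """Return citations naming an unknown source or a page outside its range."""
    problems = []
    for match in CITATION_RE.finditer(answer):
        source_id, page = match.group(1), match.group(2)
        if source_id not in page_counts:
            problems.append(match.group(0))
        elif page is not None and not (1 <= int(page) <= page_counts[source_id]):
            problems.append(match.group(0))
    return problems


corpus = {"S1": 12, "S2": 40}  # source label -> number of pages
text = "Exposure fell [S1, p. 3]. Costs rose [S2, p. 41]. See also [S9]."
print(check_citations(text, corpus))  # ['[S2, p. 41]', '[S9]']

# Error rate from the pilot: 3 invented titles among 120 citations
print(f"{3 / 120:.1%}")  # 2.5%
```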

Weak performance on non-English research
Multilingual robustness lags. French, German, and Spanish inputs receive passable summaries, but the model downgrades to translation mode rather than native reasoning; nuance in legal terminology or regional healthcare protocols gets flattened. Our /multilingual benchmarks show a 15–20 point drop in F1-equivalents for citation accuracy when switching from English to German policy documents. Scandinavian, Eastern European, and Asian languages fare worse. If your corpus is majority non-English, consider DeepL preprocessing or a specialised multilingual stack.

Zero fine-tuning or model customisation
Google offers no fine-tuning API, no adapter injection, no PEFT hooks. You get the hosted model as-is. For enterprises needing domain adaptation (pharmaceutical R&D jargon, municipal zoning codes, aerospace standards), this is a dealbreaker. The preview status also means no SLA, no uptime guarantee, and no version pinning; the endpoint may change behaviour or disappear on short notice.

Real-world use cases

1. Pharmaceutical meta-analysis (healthcare sector)
A European biotech firm uploads 85 randomised controlled trials (RCTs) on a new oncology therapeutic. The prompt requests a PRISMA-compliant summary table: study design, patient demographics, primary endpoints, adverse events, and risk-of-bias scores. Deep Research Preview returns a 12-page structured report with inline citations linking each efficacy claim to the corresponding trial. The team then feeds the output into a statistical software pipeline that ingests the extracted data tables. Expected output: 10 000–15 000 tokens, referencing at least 60 of the 85 papers. This aligns with our [/usecases/data-extraction](/en/usecases/data-extraction) path, where structured field extraction from unstructured PDFs drives downstream analytics.

2. Municipal planning synthesis (government sector)
A city council in the Netherlands must reconcile 22 public-comment submissions, three environmental-impact reports, and four historical zoning ordinances before updating a district master plan. The prompt asks the model to identify conflicting stakeholder positions, map each comment to the relevant ordinance clause, and draft a consensus paragraph for each disputed section. The 95 000-token input produces a 6 000-word briefing with colour-coded conflict markers. Council staff report cutting review time from three weeks to four days. This workflow mirrors our /government benchmark scenarios, where provenance and neutrality are critical.

3. Patent prior-art discovery (legal / IP sector)
An IP law firm defending a software patent feeds 50 granted patents and 30 published applications into the model, requesting a claim-by-claim comparison against the asserted patent. Deep Research Preview highlights overlapping claim language, cites figure numbers, and flags potential anticipation or obviousness issues. The 8 000-word analysis becomes exhibit A in an inter partes review filing. Lawyers note that while they still verify every citation, the model's initial clustering saves 12 billable hours per case. See our /usecases/legal path for similar document-discovery patterns.

4. Academic literature survey for grant proposals (research / education)
A climate-science consortium preparing a Horizon Europe application needs a state-of-the-art review covering 120 peer-reviewed papers on carbon-capture technologies. The prompt specifies: "Group studies by methodology (direct air capture, mineralisation, biochar), summarise key findings, note geographical focus, and identify research gaps." The model delivers a 7 500-word chapter draft with APA-formatted references. Researchers estimate 70 per cent of the text survives into the final submission after expert review and localised edits.

Tokonomix benchmark snapshot

Tokonomix.ai rotates models through standardised test suites monthly; scores below reflect the April 2026 leaderboard snapshot. Full methodology, including prompt templates, evaluation rubrics, and multi-judge scoring, is documented at [/benchmarks/methodology](/en/benchmarks/methodology).

Reasoning (multi-hop logic): Deep Research Preview placed in the top quartile among models with >100k context windows, solving 78 per cent of our chained-inference tasks correctly. It outperformed Mistral Large and matched Claude 3.5 Sonnet on complex analytical questions but fell short of o1-preview on formal theorem-proving.

Coding (Python generation & debugging): Mid-tier. The model writes clean, documented scripts for data wrangling and produces decent Jupyter notebooks, but struggles with concurrent programming and low-level system calls. HumanEval-equivalent pass@1 hovered around 68 per cent—capable for prototyping, insufficient for production infrastructure.

Multilingual (F1 on non-English QA): Below-average outside Romance and Germanic languages. English-to-German translation fidelity was acceptable; Polish, Czech, and Finnish showed 20–30 per cent accuracy drops relative to native English prompts.

Factual retrieval (closed-book & citation accuracy): Strong when citations are requested explicitly. The model correctly attributed 91 per cent of verifiable claims to provided documents but invented details in 2–3 per cent of cases—a critical failure mode for compliance use.

Speed (time-to-first-token, tokens/sec): Median for large-context models. At 60k tokens input, TTFT averaged 8.2 seconds; at 100k, it ballooned to 21 seconds. Throughput post-first-token was competitive (≈55 tokens/sec), but initial latency hampers interactive workflows.
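
To see what these figures mean for a full report, total response time can be estimated as time-to-first-token plus streaming time. A minimal sketch using only the numbers quoted above: it assumes throughput stays constant at about 55 tokens/sec, covers only the two measured input sizes, and uses an arbitrary 8 000-token output as the example; `estimated_response_seconds` is a hypothetical helper name.

```python
# Figures quoted in the speed summary
TTFT_SECONDS = {60_000: 8.2, 100_000: 21.0}  # input tokens -> time to first token
TOKENS_PER_SECOND = 55.0                      # approximate streaming throughput


def estimated_response_seconds(input_tokens: int, output_tokens: int) -> float:
    """TTFT for a measured input size plus streaming time for the output."""
    return TTFT_SECONDS[input_tokens] + output_tokens / TOKENS_PER_SECOND


# A 100k-token prompt producing an 8 000-token report
print(round(estimated_response_seconds(100_000, 8_000), 1))  # 166.5
```

For long reports the streaming time dominates, so the extra first-token delay matters most in interactive use where outputs are short.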

Live, up-to-date comparisons and per-category drilldowns are available at [/benchmarks/leaderboard](/en/benchmarks/leaderboard). Remember that preview models can shift substantially between releases; treat these numbers as an April 2026 snapshot, not a fixed contract.

EU privacy & data residency

Google has not disclosed whether Deep Research Preview (Apr-21-2026) processes requests within EU data centres or defaults to global routing. The standard Gemini API terms-of-service state that prompts may be logged for model improvement unless enterprise agreements specify opt-out. For public-sector and healthcare teams in the European Union, this creates a GDPR red flag: without explicit data-processing agreements, confirmed residency, and audit logs, uploading citizen health records or confidential policy drafts may breach Articles 28 and 32.

Tokonomix recommends that EU organisations treat this preview as a non-production exploration tool only. Upload synthetic data, anonymised corpora, or publicly available documents. Do not route live patient files, active litigation materials, or unpublished government memoranda through the endpoint until Google publishes:

Region-specific inference endpoints (e.g., europe-west1.googleapis.com/gemini/v1/...)
Data Processing Addenda (DPA) that enumerate subprocessor locations, retention periods, and deletion guarantees
Audit certifications (ISO 27001, SOC 2 Type II, or equivalent) scoped to this model variant

As of April 2026, none of the above exist in public documentation. Enterprises with strict compliance mandates should wait for a GA (general availability) release with formal SLAs and residency controls, or consider on-premises alternatives such as fine-tuned Llama 3 checkpoints hosted on EU infrastructure. For lower-sensitivity research—academic meta-analyses, open-access literature reviews—the risk profile may be acceptable under a data-minimisation policy, but legal counsel should approve the risk assessment.

Verdict & alternatives

Who should use Deep Research Preview (Apr-21-2026)?
Research teams, policy analysts, and legal discovery units that need to synthesise large document sets and can tolerate preview-level reliability. The zero pricing recorded at the time of this review (check the pricing section for current rates) makes it attractive for pilot projects, proof-of-concept demos, and internal tooling experiments where a hallucinated citation triggers a manual check rather than a compliance disaster. Graduate students drafting dissertations, NGOs compiling policy briefings from public records, and corporate R&D labs exploring systematic-review automation will find immediate value, provided they validate outputs rigorously.

Who should look elsewhere?
Production environments demanding uptime SLAs, guaranteed data residency, or sub-three-second latency need a GA-tier service. If you require fine-tuning on proprietary corpora, switch to OpenAI's fine-tuning API, Anthropic's prompt caching with custom preambles, or open-weight models (Llama 3 70B, Mistral Large) hosted on your own infrastructure. Multilingual teams working primarily in non-English languages will see better results with models explicitly trained on diverse corpora; consider Cohere's Aya or multilingual Mistral variants. Budget-conscious organisations should baseline alternative costs now, because preview pricing is subject to change: this review recorded $0.00, while the pricing section above already lists $2.00 per million input tokens. For comparison, Claude 3.5 Sonnet runs ≈$3.00 per million input tokens and GPT-4 Turbo ≈$10.00, so even a modest price remains competitive if Google maintains the long-context advantage.

What the next six months might bring
Google typically graduates successful previews to GA within two quarters. Expect a pricing announcement by Q3 2026, likely tiered by context length (e.g., lower rates for <32k, premium for >100k). Feature additions could include native PDF parsing without preprocessing, API-level citation-validation hooks, and regional endpoints for EU/Asia compliance. The "Deep Research" branding hints at potential integration with Google Scholar or vertical-specific knowledge graphs (legal case law, PubMed archives), transforming the model from a general-purpose summariser into a domain-aware research assistant.

Ready to test it yourself?
Tokonomix.ai hosts a live sandbox where you can compare Deep Research Preview (Apr-21-2026) against Claude, GPT-4, and open-weight alternatives on your own prompts—upload documents, set context length, and measure latency and citation accuracy in real time. No registration required for the first 10 queries. Visit /live-test to start benchmarking now.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

Jun 21, 2026 · 04:48 UTC · Benchmark

P50 latency

—

P95 latency

—

Errors

1 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026