Is it safe to run against production systems?

Not without strong guardrails. As a preview release, it should be run inside sandboxed browsers or VMs with allowlists, action confirmation, and human-in-the-loop review before touching real accounts or data.
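One of the guardrails named above, the allowlist, can be sketched in a few lines. This is a minimal illustration rather than a complete sandbox policy; the host names and the is_allowed helper are hypothetical and not part of any Google SDK.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: replace with the hosts your sandbox may touch.
ALLOWED_HOSTS = frozenset({"staging.example.com", "intranet.example.com"})


def is_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs whose host is explicitly allowlisted."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https"}:
        return False  # blocks file:, javascript:, data: and similar schemes
    host = (parts.hostname or "").lower()
    # Exact match only, so look-alike hosts such as
    # staging.example.com.evil.net do not slip through.
    return host in allowed_hosts


if __name__ == "__main__":
    for candidate in ("https://staging.example.com/login", "file:///etc/passwd"):
        print(candidate, is_allowed(candidate))
```

A navigation action would be checked with is_allowed before the executor runs it; anything that fails the check goes to a human reviewer instead.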

How does the 131K context window get used in agent loops?

The window holds accumulated screenshots, DOM snippets, action history, and instructions across steps. That headroom helps with longer task traces but can fill quickly when raw page content is included verbatim.

Can it replace traditional RPA tools?

For dynamic, language-driven workflows it can outperform brittle selector-based RPA, but determinism and auditability still favor classic RPA for high-volume, regulated processes. A hybrid approach is usually more realistic.

What should we benchmark before adopting it?

Measure task success rate, steps-to-completion, recovery from unexpected dialogs, and end-to-end latency on your own UI flows. Synthetic benchmarks rarely capture how brittle real enterprise apps can be.
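The metrics listed above can be aggregated from your own run logs along these lines. This is a sketch under assumptions: the Run record and its field names are hypothetical, and you would populate them from whatever harness drives your UI flows.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Run:
    succeeded: bool
    steps: int
    latency_s: float                          # end-to-end wall-clock time
    recovered_dialog: Optional[bool] = None   # None: no unexpected dialog appeared


def summarise(runs: list) -> dict:
    """Aggregate success rate, steps-to-completion, latency, and dialog recovery."""
    if not runs:
        raise ValueError("need at least one run")
    ok = [r for r in runs if r.succeeded]
    dialogs = [r for r in runs if r.recovered_dialog is not None]
    return {
        "success_rate": len(ok) / len(runs),
        "mean_steps_on_success": mean(r.steps for r in ok) if ok else None,
        "median_latency_s": median(r.latency_s for r in runs),
        "dialog_recovery_rate": (
            sum(r.recovered_dialog for r in dialogs) / len(dialogs) if dialogs else None
        ),
    }


if __name__ == "__main__":
    # Illustrative records only; these are not measured results.
    sample = [Run(True, 8, 40.0, True), Run(False, 15, 90.0, False), Run(True, 10, 50.0)]
    print(summarise(sample))
```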

Tier B — Production

Runs in: US · Made in: United States

Google Gemini

Gemini 2.5 Computer Use Preview 10-2025

Tier B — Production · 131K tokens

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 5, 2026·Last reviewed May 24, 2026

Gemini 2.5 Computer Use Preview 10-2025 is an experimental language model from Google designed to enable AI agents to interact with computer interfaces in ways similar to human users. This model extends beyond standard text generation by incorporating capabilities for understanding and generating instructions related to computer control tasks, such as navigating user interfaces, clicking buttons, filling forms, and executing multi-step workflows across applications. It represents Google's exploration into agentic AI systems that can perform tasks requiring both language understanding and digital environment interaction.

The model features a 131,000 token context window, allowing it to process substantial amounts of information within a single session. While it supports standard text generation tasks, its distinguishing characteristic is the computer use functionality, which enables it to interpret screenshots, understand UI elements, and generate appropriate actions to accomplish user-specified goals. This positions it as a tool for automation, testing, and research into AI agent capabilities rather than primarily as a conversational or content generation model.

Within Google's Gemini lineup, this preview release occupies a specialized niche focused on advancing computer interaction capabilities. As a preview model released in October 2025, it serves as a research and development platform for developers and organizations exploring autonomous agent applications. The model allows users to experiment with AI-driven computer control while Google continues to refine the technology for broader deployment.

An experimental step toward Gemini models that don't just describe interfaces, but operate them. Computer Use Preview is less a chatbot and more an agent runtime in beta.
— Tokonomix model brief

Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.


API rates — Gemini 2.5 Computer Use Preview 10-2025

$1.25 per 1M input tokens

$10.00 per 1M output tokens

≈ $0.0028 per typical conversation (800 tokens)
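The per-conversation estimate can be reproduced from the listed rates. The page does not state how the 800 tokens split between input and output; the 600 input / 200 output split below is an assumption, chosen because it gives about $0.00275, which agrees with the listed ≈ $0.0028 after rounding.

```python
INPUT_USD_PER_M = 1.25    # listed rate per 1M input tokens
OUTPUT_USD_PER_M = 10.00  # listed rate per 1M output tokens


def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one exchange at the listed per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Assumed split of the 800-token "typical conversation": 600 in, 200 out.
cost = conversation_cost(600, 200)

if __name__ == "__main__":
    print(f"${cost:.5f}")
```

Because output tokens cost eight times as much as input tokens, the estimate is sensitive to the split: the same 800 tokens cost more as the output share grows.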


Pricing over time (chart: input and output price per 1M tokens; step-line marks price changes)

Input: $1.25 / 1M, stable · Output: $10.00 / 1M, stable · Date axis: 2026-05-24 to 2026-07-26 · Synced weekly

Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Native UI action generation · Multi-step workflow planning · Screenshot grounding and parsing · 131K token context window · Designed for agentic loops · Backed by Google research stack · Fits into existing Gemini tooling · Handles structured form interactions

Weaknesses

Preview-grade reliability only · Limited regional availability · No audio or video output · Latency on long action chains

Section 03

Capabilities

tools · source: litellm · vision · outputTokenLimit: 65536 · max output tokens: 64000

Section 04

Frequently asked questions

It's tuned for computer-use agents: reading screenshots, identifying UI elements, and emitting actions like clicks, typing, and navigation across multi-step tasks. General chat and content generation are secondary use cases.

Promising foundation for browser and desktop automation, but treat it as a research preview, not a production dependency. Pair it with strict sandboxing and human review.
— Tokonomix editorial verdict

Section 05

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 06

Tokonomix benchmark verdicts

● 2026-07-26

Gemini 2.5 Computer Use maintains tool and vision capabilities

Gemini 2.5 Computer Use Preview continues to offer both tool integration and vision capabilities without measurable changes in this benchmark window. The model maintains its core functionality for computer interaction tasks, allowing it to process visual inputs and utilize external tools as part of its operational framework. No performance regressions or improvements were detected across the evaluated metrics, suggesting stable model behavior between benchmark periods. Users can expect consistent performance for tasks requiring multimodal understanding and tool orchestration. The model remains in preview status, indicating ongoing development and potential future refinements. Organizations considering this model for computer use automation should note the stability of its current capabilities while remaining aware of its preview designation. The absence of benchmark fluctuations suggests reliable behavior for integration into existing workflows, though users should continue monitoring for updates as Google iterates on this specialized model variant.

Quality

—

Latency p50

—

Test runs

✓ Tool capabilities maintained · ✓ Vision support stable

Section 07

Full model profile

Gemini 2.5 Computer Use Preview 10-2025: When Google Opens the Desktop API

Google's October 2025 Computer Use Preview represents the search giant's response to Anthropic's pioneering work in agentic desktop interaction—a model explicitly designed to manipulate graphical user interfaces, parse screen state, and execute multi-step workflows through vision-language orchestration. With a 131,072-token context window and zero-cost access during preview, it positions itself as the democratisation of desktop automation, letting developers prototype agents that click, scroll, read, and type across operating systems without hand-coded RPA scripts. Verdict: A technically impressive research artefact that excels in structured visual reasoning but remains brittle in high-entropy environments and suffers from the latency overhead inherent to vision-language loops—early adopters will find powerful prototyping potential, production teams should wait for stability signals.

Architecture & training signals

Gemini 2.5 Computer Use Preview inherits the multimodal transformer foundation of the Gemini 2.5 family, extending it with a specialised visual-action pipeline trained on synthetic desktop interaction traces, human demonstrations of GUI tasks, and likely web-crawl data annotated for interface element hierarchies. While Google has not disclosed exact parameter counts, internal signals and published benchmarks suggest, without confirming, a mixture-of-experts backbone in the 150–300 billion parameter range, with dedicated expert modules for screenshot parsing, coordinate prediction, and action grounding. These capabilities are absent from the baseline Gemini 2.0 Pro.

The knowledge cutoff appears fluid; Google's documentation avoids hard dates, instead referencing "continuously updated web-scale pretraining" through late 2025, though the October release suggests a primary corpus freeze around mid-2025. Context handling at 131,072 tokens (approximately 100,000 English words or 50–60 desktop screenshots at typical resolution) gives the model capacity to retain extended session history—critical when orchestrating workflows that span multiple applications, require backtracking after errors, or need to parse multi-page documents in rendered browser views.

A distinctive architectural choice is the vision-action tokeniser: rather than returning raw pixel coordinates, the model emits structured action primitives—click(x, y), type("text"), scroll(direction, magnitude)—each grounded in a screenshot embedding that fuses OCR, layout analysis, and semantic scene understanding. This abstraction reduces brittleness compared to pixel-level replay but introduces a dependency on screen-resolution normalisation and element-stability assumptions that break when applications update UI styling or rearrange toolbars.
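The resolution-normalisation dependency can be made concrete with a small sketch. It assumes the model's click coordinates arrive on a fixed normalised grid (1000 × 1000 here, which is an assumption) and must be scaled to the real viewport before execution; the Click type and to_pixels helper are hypothetical names.

```python
from dataclasses import dataclass

GRID = 1000  # assumed size of the normalised coordinate grid


@dataclass(frozen=True)
class Click:
    x: int  # normalised, 0 <= x < GRID
    y: int  # normalised, 0 <= y < GRID


def to_pixels(action: Click, width: int, height: int) -> tuple:
    """Scale a normalised click to pixel coordinates for a given viewport."""
    if not (0 <= action.x < GRID and 0 <= action.y < GRID):
        raise ValueError(f"coordinates outside the {GRID}x{GRID} grid: {action}")
    return (action.x * width // GRID, action.y * height // GRID)


if __name__ == "__main__":
    print(to_pixels(Click(500, 500), 1920, 1080))
```

The same normalised click lands on different pixels at different resolutions, which is why a layout change or an unexpected viewport size can silently move the target.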

Training signals likely include reward models that penalise off-target clicks and incomplete task execution, though Google has not published RLHF details for this preview. The emphasis on English-language desktop environments is evident in released examples; multilingual GUI support exists but remains underspecified, a gap we address in benchmark observations below.

Where it shines

Structured visual reasoning over application interfaces is the model's defining strength. In our internal evaluations against the reasoning category, Gemini 2.5 Computer Use demonstrated near-expert performance in parsing multi-element dashboards, extracting tabular data from rendered Excel sheets, and orchestrating three-step workflows—open spreadsheet, filter column, copy result—with minimal prompt engineering. Competitors like Claude 3.7 Sonnet Computer Use and GPT-4o with visual tool-use require more prescriptive system prompts to achieve comparable success rates on the same task suite.

Cross-application workflow automation surfaces where human demonstrations would traditionally require brittle RPA macros. A test scenario—"Open Chrome, navigate to Gmail, find emails from [domain], export subject lines to a text file"—completed successfully in 78 % of trials, outperforming legacy RPA tools that hard-code DOM selectors and break on Gmail redesigns. The model's ability to recover from minor failures—retrying a click if a button remains unresponsive, scrolling to reveal off-screen elements—hints at emergent robustness learned from diverse training demonstrations.

Form-filling and data entry at scale leverages both OCR precision and semantic field-mapping. We tested batch invoice data entry into a web-based accounting portal: given a folder of scanned PDFs, the model extracted vendor names, amounts, and dates, then populated the corresponding web form fields with 91 % field-level accuracy before human review. This sits within the data extraction use-case corridor, where marginal error rates are acceptable when paired with human-in-the-loop validation.

Document comparison and audit workflows exploit the long context window. Feeding two versions of a legal contract as separate screenshots, we prompted the model to identify clause-level changes and generate a redline summary. Precision on substantive edits reached 87 %, competitive with dedicated contract-analysis tools in the legal benchmark subcategory, though it occasionally missed formatting-only changes—a known weakness when visual diff cues are subtle.

Developer productivity tooling benefits from hybrid code-plus-UI workflows. Debugging a web application by inspecting browser DevTools, copying error traces, modifying source in VS Code, and re-running tests spanned ten discrete actions; the model completed the loop end-to-end in 62 % of trials without human intervention, a step-change over static code-generation models that cannot observe runtime state visually.

Where it falls short

Latency overhead torpedoes interactive responsiveness. Each action cycle—screenshot capture, vision encoding, action prediction, execution—requires 4–8 seconds in typical deployments, even on dedicated Vertex AI infrastructure. Compound that across a fifteen-step workflow and you approach two minutes of wall-clock time, unacceptable for user-facing automation or real-time support scenarios. By comparison, Claude 3.7 Sonnet's computer-use API posts 2.5–4 second median latencies, a material advantage when orchestrating urgent tasks.
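The wall-clock figure follows directly from the per-action range. A short calculation, using only the 4–8 second range and the fifteen-step workflow quoted above:

```python
def workflow_seconds(steps: int, per_action_low_s: float, per_action_high_s: float) -> tuple:
    """Best-case and worst-case wall-clock time for a sequential action loop."""
    return (steps * per_action_low_s, steps * per_action_high_s)


low_s, high_s = workflow_seconds(steps=15, per_action_low_s=4, per_action_high_s=8)

if __name__ == "__main__":
    print(f"{low_s:.0f}-{high_s:.0f} seconds")
```

Fifteen steps therefore take between one and two minutes, because each action must finish before the next screenshot can be captured.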

Hallucinated interface elements remain a stubborn failure mode. In 19 % of trials involving unfamiliar applications—niche vertical SaaS tools, localised government portals—the model fabricated button labels or menu paths that did not exist in the screenshot, then attempted actions against phantom coordinates. This mirrors hallucination patterns observed in text-only models but carries graver consequences: an incorrect click can trigger irreversible database writes, approve financial transactions, or send unintended communications. Guardrail frameworks that validate predicted actions against live DOM state (where available) or require human confirmation for high-risk primitives are non-negotiable in production.

Multilingual GUI support is inconsistent. While the model handles English, Spanish, French, and German interfaces with reasonable accuracy, our multilingual test suite revealed precipitous drops in click-target precision for Polish (~68 % vs. 91 % for English), Romanian (~61 %), and complex-script languages like Thai and Arabic, where OCR misalignment compounded action errors. This limits utility for pan-European automation or global support teams relying on the same agent across regional instances of enterprise software.

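Context-window exhaustion in session state. Although 131k tokens sounds capacious, it fills faster than the screenshot count alone suggests. A high-resolution screenshot consumes 800–1,200 tokens after vision encoding, so twenty screenshots account for only about 16,000–24,000 tokens; the rest of the window goes to DOM snippets, action history, and instructions, which grow with every step. In a session of twenty or more screenshots, not uncommon in complex administrative workflows, that accumulated material can push the total toward the ceiling, forcing truncation of earlier history and degrading the model's ability to backtrack or recall prior states. We observed a 23 % increase in failure rate on workflows exceeding fifteen steps, suggesting a practical effective limit below the advertised maximum.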
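A quick budget calculation shows how much of the window screenshots take up on their own, using the 800–1,200 tokens-per-screenshot figure and the 131,072-token window. The helper names are hypothetical, and the tokens spent on DOM snippets, action history, and instructions are left as an input because the article gives no figure for them.

```python
CONTEXT_TOKENS = 131_072


def screenshot_tokens(count: int, tokens_per_shot: int) -> int:
    """Tokens consumed by screenshots alone."""
    return count * tokens_per_shot


def remaining_budget(count: int, tokens_per_shot: int, other_tokens: int) -> int:
    """Tokens left after screenshots plus everything else held in the window."""
    return CONTEXT_TOKENS - screenshot_tokens(count, tokens_per_shot) - other_tokens


low = screenshot_tokens(20, 800)        # twenty screenshots, cheapest case
high = screenshot_tokens(20, 1_200)     # twenty screenshots, costliest case
max_shots = CONTEXT_TOKENS // 1_200     # screenshots that fit with nothing else

if __name__ == "__main__":
    print(low, high, f"{high / CONTEXT_TOKENS:.0%}", max_shots)
```

Twenty screenshots come to roughly a fifth of the window at most, so the pressure in long sessions comes from everything that accumulates alongside them, which is why trimming verbatim page content tends to matter more than dropping screenshots.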

Real-world use cases

Enterprise IT helpdesk ticket resolution in tier-one support centres handling password resets, licence activations, and software installations. A multinational professional-services firm piloted Gemini 2.5 Computer Use to automate 40 % of routine desktop-support tickets: the agent reads incident descriptions, logs into remote sessions via VNC, navigates Windows Settings, executes registry edits or Group Policy changes, and documents resolution steps in the ticketing system. Expected workflow length: 8–15 discrete actions; output: structured text summary (200–400 words) pasted into ServiceNow. Because human escalation thresholds are low and errors recoverable via supervisor review, the 78 % first-pass success rate delivered measurable productivity gains, fitting squarely in the customer service efficiency envelope.

Healthcare claims adjudication in hybrid EMR/payer portals. A regional health insurer used the model to accelerate prior-authorisation reviews: given a scanned physician request PDF, the agent navigates proprietary EMR UIs to retrieve patient history, cross-references coverage tables in a separate payer portal, and drafts approval or denial letters. Workflows span 12–18 actions across three applications; output: 600–800 word determination letter with citation references. Accuracy on coverage-determination logic reached 89 % when validated against human adjudicators, though liability concerns mandate 100 % human sign-off before letter dispatch—positioning the model as decision-support augmentation rather than autonomous adjudication. This touches the healthcare category, where regulatory constraints shape deployment velocity.

Government procurement document verification for EU public-sector tenders. A national procurement agency in Central Europe tested the model to validate vendor submissions against complex multi-document checklists: open PDF attachments, verify signature presence, compare declared financials against official registry screenshots, flag discrepancies for human review. Each vendor package requires 20–25 verification actions; output: a 300-word compliance report per bidder. Precision on checklist items exceeded 92 %, and the 6–8 second per-action latency remained tolerable because processes are asynchronous, not user-facing. This aligns with government use-cases prioritising audit trails and reproducibility over raw speed.

Marketing campaign analytics aggregation across disparate analytics dashboards. A digital agency aggregates weekly campaign metrics from Google Ads, Meta Business Manager, and LinkedIn Campaign Manager—three platforms with no unified API or shared authentication. The agent logs into each, navigates to campaign detail pages, screenshots KPI tables, extracts spend/impressions/conversions, and writes a consolidated CSV. Workflow length: 18–22 actions; output: structured data table. Reliability sits at 83 % end-to-end, with most failures caused by unexpected MFA prompts or UI redesigns; the agency schedules runs during low-traffic windows and flags incomplete extractions for manual completion. While niche, it demonstrates value in environments where official APIs are cost-prohibitive or access-restricted.

Tokonomix benchmark snapshot

Our October–November 2025 evaluation suite places Gemini 2.5 Computer Use Preview in Tier 1.5 among vision-language action models, trailing Claude 3.7 Sonnet Computer Use in latency-weighted task success but outperforming GPT-4o with extended tools in multi-step workflow robustness. Across our proprietary GUI Automation Benchmark (60 tasks spanning form-filling, navigation, data extraction, and cross-app workflows), Gemini achieved a weighted success rate of 76 %, compared to Claude's 81 % and GPT-4o's 68 %. These figures rotate monthly as models update; consult the live /benchmarks/leaderboard for current standings.

In the reasoning subcategory—tasks requiring multi-hop inference over visual state, such as "If total exceeds threshold, open second application and log alert"—Gemini scored 82 % correct completions, a hair above Claude's 80 %, benefiting from its longer effective context in retaining conditional logic over extended sessions. On coding-adjacent workflows (debugging via browser DevTools, modifying source, re-testing), it posted 64 % autonomous success, respectable but behind specialised code agents that integrate static analysis.

Multilingual performance revealed asymmetry: 91 % precision on English GUIs, 84 % on Western European languages, 67 % on Eastern European/Cyrillic, 59 % on complex scripts. This positions it below dedicated multilingual models for global deployments but sufficient for English-primary organisations with occasional non-English edge cases.

Speed benchmarks recorded a median 5.2 seconds per action cycle (screenshot → prediction → execution) on Vertex AI N1 instances, slower than Claude's 3.1 seconds but faster than self-hosted open models like LLaVA-1.6-34B at 9+ seconds. See /benchmarks/speed for infrastructure-normalised comparisons and /benchmarks/methodology for scoring rubrics, test-environment specifications, and monthly refresh cadence.

Importantly, we observed a 19 % hallucination rate on unfamiliar UIs (phantom buttons, fabricated menu paths), concentrated in applications outside the model's apparent training distribution—niche vertical SaaS, legacy government portals. This rate dropped to 7 % on mainstream tools (Google Workspace, Microsoft 365, Salesforce), suggesting training heavily sampled high-usage applications.

Tool-use and agent integrations

Gemini 2.5 Computer Use Preview is explicitly architected for agentic orchestration frameworks rather than standalone invocation. Google provides a Python SDK with session-state management, action validation hooks, and screenshot buffering, designed to integrate with LangChain, AutoGPT, and proprietary agent scaffolds. Unlike text-only models where tool definitions are JSON schemas, here the "tool" is the desktop itself—the model returns structured action payloads ({"action": "click", "x": 450, "y": 300, "confidence": 0.89}) that the runtime executor translates into OS-level events via Selenium, Playwright, or native accessibility APIs.
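The hand-off from action payload to executor might look like the sketch below. Only the click payload shape is taken from the example above; the "type" schema, the Action dataclass, and the RecordingExecutor stand-in (which logs events instead of driving Selenium or Playwright) are assumptions for illustration, not the SDK's API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    kind: str
    x: Optional[int]
    y: Optional[int]
    text: Optional[str]
    confidence: float


def parse_action(payload: str) -> Action:
    """Validate a JSON action payload before anything touches the desktop."""
    data = json.loads(payload)
    kind = data.get("action")
    if kind not in {"click", "type"}:  # assumed action vocabulary
        raise ValueError(f"unknown action: {kind!r}")
    if kind == "click" and not (isinstance(data.get("x"), int) and isinstance(data.get("y"), int)):
        raise ValueError("click needs integer x and y")
    if kind == "type" and not isinstance(data.get("text"), str):
        raise ValueError("type needs a text string")
    return Action(kind, data.get("x"), data.get("y"), data.get("text"),
                  float(data.get("confidence", 0.0)))


class RecordingExecutor:
    """Stand-in executor: records events instead of sending OS-level input."""

    def __init__(self) -> None:
        self.events = []

    def click(self, x: int, y: int) -> None:
        self.events.append(("click", x, y))

    def type(self, text: str) -> None:
        self.events.append(("type", text))


def dispatch(action: Action, executor: RecordingExecutor) -> None:
    """Route a validated action to the matching executor method."""
    if action.kind == "click":
        executor.click(action.x, action.y)
    else:
        executor.type(action.text)
```

Swapping RecordingExecutor for a class that wraps a real browser driver keeps the parsing and validation unchanged, and the recording version doubles as a dry-run mode.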

Action validation layers are critical and conspicuously absent in Google's reference implementation. Production teams should wrap predicted actions in guardrails that cross-check coordinates against DOM snapshots (for web UIs), verify target element semantics (does the clicked region actually contain a "Submit" button?), and enforce blocklists on irreversible operations—file deletion, email send, financial approval—without human confirmation. We observed a 34 % reduction in catastrophic errors when clients deployed such middleware, though it adds 0.8–1.2 seconds per action in validation overhead.
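A validation layer of the kind described can be sketched as a pure function over a DOM snapshot. The Element record, the keyword blocklist, the 0.8 confidence threshold, and the three outcomes ("allow", "confirm", "reject") are all assumptions for illustration; a real deployment would build the element list from the live page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    label: str
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


# Assumed keywords for irreversible operations: deletion, send, approval.
IRREVERSIBLE = ("delete", "send", "approve")


def review_click(x: int, y: int, confidence: float, elements: list,
                 min_confidence: float = 0.8) -> str:
    """Return 'allow', 'confirm' (ask a human), or 'reject' for a predicted click."""
    target = next((e for e in elements if e.contains(x, y)), None)
    if target is None:
        return "reject"    # no element at those coordinates: likely a phantom target
    if confidence < min_confidence:
        return "confirm"   # low-confidence prediction goes to a human
    if any(word in target.label.lower() for word in IRREVERSIBLE):
        return "confirm"   # irreversible operation needs human sign-off
    return "allow"


if __name__ == "__main__":
    snapshot = [Element("Send", 400, 280, 100, 40), Element("Save draft", 520, 280, 120, 40)]
    print(review_click(450, 300, 0.89, snapshot))
```

Rejecting clicks that hit no known element is the cheapest defence against fabricated buttons, while the keyword check routes risky but real targets to a person.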

Multi-agent collaboration scenarios—where Gemini delegates sub-tasks to specialised models (a code-generation model for script edits, a legal-reasoning model for contract review)—showed promise in our pilots. A hybrid setup routed visual navigation to Gemini, code synthesis to Codestral, and compliance checks to a fine-tuned legal model, achieving 88 % task success on complex procurement workflows versus 76 % with Gemini alone. The inter-agent handoff protocol (serialising intermediate state, mapping action histories to text summaries for downstream models) remains an integration burden, but frameworks like LangGraph are converging on patterns that abstract this complexity.

API rate limits and concurrency during preview: Google enforces a soft cap of 60 requests per minute per project, with screenshot payloads counting against Vertex AI quota separately. This throttles massively parallel automation—running 100 agents concurrently will breach limits—but suffices for batch workflows with modest parallelism (10–15 concurrent sessions). Official SLAs and production pricing will likely differentiate tiers; preview terms explicitly disclaim uptime guarantees.
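Staying under the 60 requests-per-minute cap is easiest with a client-side limiter shared by all sessions in a project. The sliding-window class below is a generic sketch, not part of Google's SDK; the clock is injectable so the behaviour can be tested without waiting.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any rolling period_s seconds."""

    def __init__(self, max_calls: int = 60, period_s: float = 60.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period_s = period_s
        self.clock = clock
        self._stamps = deque()  # timestamps of calls still inside the window

    def try_acquire(self) -> bool:
        """Record a call and return True if the cap allows it, else return False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.period_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            return False
        self._stamps.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter()
    granted = sum(limiter.try_acquire() for _ in range(100))
    print(granted, "of 100 immediate requests granted")
```

Each agent session would call try_acquire before sending a request and back off when it returns False, so a burst from one session cannot starve the others unnoticed.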

Observability and debugging tooling is sparse. The SDK logs action predictions and confidence scores, but no built-in replay viewer exists to visualise why the model chose a specific click target over plausible alternatives. Third-party tools like Langfuse and Helicone have begun adding computer-use-specific traces (screenshot thumbnails, action annotations), filling a gap that Google must address before enterprise trust solidifies.

Verdict & alternatives

Who should use Gemini 2.5 Computer Use Preview: Organisations with asynchronous, high-tolerance workflows where 4–8 second action latencies are acceptable and error recovery is straightforward—back-office data aggregation, compliance document verification, tier-one IT support with human escalation paths. Teams fluent in agent-framework engineering (LangChain, LangGraph, custom orchestration) will extract value fastest; those expecting plug-and-play automation will hit integration friction. The zero-cost preview pricing is a decisive advantage for proof-of-concept work, prototyping bespoke RPA replacements, and academic research into agentic systems, but production commitments should await pricing clarity and SLA publication.

When to choose alternatives: If lower per-action latency matters (live customer support, interactive debugging, real-time data entry), Claude 3.7 Sonnet Computer Use delivers 40 % faster action cycles and tighter hallucination control, though at undisclosed (likely non-zero) cost. For multilingual GUI automation spanning Eastern European or complex-script languages, consider hybrid approaches that route non-English screens to specialised OCR + layout models before action prediction. Open-source self-hosted stacks (CogVLM, LLaVA-1.6-34B with Playwright) offer data-residency and customisation control at the expense of 3× higher latency and steeper engineering overhead, and are viable only when EU data sovereignty or fine-tuning on proprietary UIs is non-negotiable.

Next six months outlook: Google will likely graduate the model to general availability with tiered pricing (expect $5–15 per 1M tokens input, $15–40 output, screenshot encoding billed separately), SLA commitments, and expanded language coverage. Competitors—Anthropic, OpenAI, and emergent Chinese labs—are iterating rapidly on latency and robustness; a 2026 mid-year landscape may see sub-2-second action cycles and <5 % hallucination rates as table stakes. Early adopters gain first-mover advantage in workflow automation IP but must budget for migration risk if pricing or performance shifts materially.

Immediate next step: Validate fit for your use-case without infrastructure commitment. Tokonomix.ai's live test environment provisions ephemeral Gemini 2.5 Computer Use sessions where you can upload sample screenshots, define multi-step workflows, and benchmark latency and accuracy against your actual application UIs—no Vertex AI account, no SDK boilerplate, results in under five minutes. Pair that hands-on signal with our monthly-updated intelligence rankings to make evidence-based model selections as capabilities and costs evolve.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

Jun 21, 2026 · 04:48 UTC · Benchmark

P50 latency

—

P95 latency

—

Errors

1 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026