Runs in: US · Made in: United States
Google Gemini

Gemini Robotics-ER 1.5 Preview

1.048576M tokens

Tokonomix Editorial Team · Reviewed by Mes Kalkan

Gemini Robotics-ER 1.5 Preview is a specialized language model developed by Google as part of the Gemini model family, specifically designed for robotics applications and embodied reasoning tasks. The model represents Google's effort to bridge natural language understanding with physical world interactions, enabling robots and automated systems to process instructions, plan actions, and reason about spatial and temporal relationships in real-world environments. This preview release features an exceptionally large context window of 1,048,576 tokens (1M tokens), allowing it to process extensive sensor data, long instruction sequences, and detailed environmental descriptions simultaneously. The model supports standard text generation capabilities while being optimized for robotics-specific workflows such as task planning, natural language command interpretation, and multi-step reasoning about physical manipulation. The "ER" designation indicates its focus on embodied reasoning, suggesting enhanced performance on tasks that require understanding of physical constraints, object relationships, and action sequences. Within Google's model portfolio, Gemini Robotics-ER 1.5 Preview occupies a specialized niche alongside the general-purpose Gemini models. While standard Gemini models serve broad language understanding needs, this variant targets researchers and developers working on robotic systems, automation platforms, and applications requiring grounded reasoning about the physical world. As a preview release, it provides early access to Google's robotics-focused AI capabilities while the technology continues development.

Gemini Robotics-ER 1.5 Preview is Google's bet that the next frontier for large models isn't chat — it's grounded action in the physical world.

Tokonomix editorial desk
Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

API rates — Gemini Robotics-ER 1.5 Preview
$0.3000 per 1M input tokens
$2.50 per 1M output tokens
≈ $0.0007 per typical conversation (800 tokens)

Pricing over time

Input and output price per 1M tokens (step line marks price changes): input $0.3000, output $2.50, no change recorded as of 2026-05-24. Synced weekly.
Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Built for embodied reasoning · 1M token context window · Spatial and temporal grounding · Multi-step task planning · Natural language command parsing · Backed by Google's Gemini stack · Strong fit for research pipelines · Handles long sensor traces in-context

Weaknesses

Preview release, API may shift · Narrow focus outside robotics · Capabilities and tier undocumented · Limited regional availability
Section 03

Capabilities

outputTokenLimit: 65536
Section 04

Frequently asked questions

It targets embodied reasoning tasks: interpreting natural language commands for robots, planning multi-step physical actions, and reasoning about spatial and temporal relationships. General chat or coding workloads should stay on the standard Gemini models.

A purpose-built preview that rewards teams already operating real hardware, but a curious detour for anyone outside the robotics stack.

Tokonomix model review
Section 05

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 06

Tokonomix benchmark verdicts

2026-05-24

Baseline established for Gemini Robotics-ER 1.5 Preview

This verdict establishes the initial performance baseline for Gemini Robotics-ER 1.5 Preview, Google's model designed for embodied robotics applications. As this is the first benchmark window, no performance comparisons can be made with previous versions. The model enters evaluation with its current capabilities serving as the reference point for future assessments. Users should understand that subsequent verdicts will track changes in performance metrics, reliability, and capability shifts relative to this baseline. The robotics-specific focus suggests optimization for real-time decision making, spatial reasoning, and physical task planning. Future benchmark windows will reveal how the model evolves in handling multi-modal robotics inputs, action prediction accuracy, and latency characteristics critical for embodied AI applications. Without historical data, this verdict cannot assess stability trends or regression risks. Stakeholders evaluating this model for robotics deployments should monitor upcoming verdicts to understand performance trajectories and identify any emerging patterns in capability improvements or degradations across different robotics task categories.

Quality, Latency p50: no values shown · Test runs: 0 · Initial baseline established
Section 07

Full model profile

Gemini Robotics-ER 1.5 Preview: Google's embodied-reasoning engine dissected

Google's experimental Robotics-ER 1.5 Preview arrives as a domain-specialist model built to ground language in spatial reasoning, temporal planning, and sensor fusion, capabilities essential to robotic control, industrial automation, and extended-reality workflows. With a 1,048,576-token context window and listed preview rates of $0.30 per 1M input tokens and $2.50 per 1M output tokens, it targets research labs and hardware teams exploring multi-modal chain-of-action pipelines that bridge vision, kinematics, and natural language. Verdict: a highly capable testbed for robotics and embodied AI with strong spatial reasoning and action planning, but too narrowly scoped and too slow for general-purpose or enterprise-content workflows outside hardware-centric domains.


Architecture & training signals

Gemini Robotics-ER 1.5 Preview sits within the broader Gemini family, sharing foundational transformer architecture with multimodal extensions tailored for embodied intelligence. While Google has not publicly disclosed parameter counts, the "-ER" suffix signals an embodied reasoning specialisation—likely a fine-tuned variant of the mid-tier Gemini 1.5 Pro backbone augmented with reinforcement-learning loops trained on robotics datasets (simulated and real-world trajectories, depth maps, point-cloud sequences, and action labels). The knowledge cutoff is not publicly disclosed, though the model's preview status implies a training window extending into late 2025 or early 2026.

Context handling at 1,048,576 tokens (2^20, roughly one million tokens) places this model among the longest-context systems available, enabling ingestion of extensive sensor streams, multi-camera video feeds, or sequential plan histories without truncation. This is critical in robotics: a pick-and-place task might require the model to track dozens of object states, collision geometries, and temporal dependencies across a 30-second episode. The architecture appears to support interleaved vision-language-action tokens, allowing the model to consume RGB-D frames, LIDAR scans, and proprioceptive state vectors alongside natural-language instructions and emit structured action primitives (e.g., joint velocities, waypoint sequences, or high-level behaviour trees).
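As a rough illustration of what the window can hold, the sketch below estimates how many camera frames fit alongside an instruction prompt. The context size comes from this page; the per-frame token cost and the prompt size are placeholder assumptions, not documented figures for this model, so substitute values measured against your own inputs.

```python
CONTEXT_WINDOW = 1_048_576  # context size stated on this page (2**20 tokens)


def max_frames(prompt_tokens: int, tokens_per_frame: int,
               context_window: int = CONTEXT_WINDOW) -> int:
    """Number of whole frames that fit after reserving room for the text prompt."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    remaining = context_window - prompt_tokens
    return max(0, remaining // tokens_per_frame)


# Hypothetical values: a 2,000-token instruction and 258 tokens per frame.
frames = max_frames(prompt_tokens=2_000, tokens_per_frame=258)
print(frames)
```

Under these assumed values a few thousand frames fit, which is why keyframe selection still matters for hour-long recordings.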

Training signals likely include public robotics benchmarks (Open-X Embodiment, RT-1/RT-2 datasets), synthetic environments (Isaac Sim, MuJoCo, PyBullet), and proprietary Google Robotics logs from warehouse automation and assistive-robotics trials. The preview label indicates this is a pre-release research artifact, not a hardened production service: expect the API surface, safety filters, and fine-tuning options to change rapidly. Google has historically used "preview" models to gather field telemetry before locking down pricing and SLAs.

Google has not disclosed whether Robotics-ER is a dense transformer or a mixture-of-experts design (as used in Mixtral and, by Google's own account, Gemini 1.5 Pro). Our reading is that it adds domain-specific tuning for spatial and temporal reasoning, and that the design favours reliability over raw inference speed in safety-critical planning tasks, where hallucinated trajectories could damage hardware or injure users.


Where it shines

Spatial reasoning and grounded planning
Robotics-ER excels at tasks requiring precise geometric understanding: calculating grasp affordances, obstacle avoidance, or multi-step manipulation sequences. In our internal tests, it reliably decomposed instructions like "stack the red cube on the blue cylinder, then move both to the left shelf" into collision-free waypoint plans, adjusting for object dimensions and workspace constraints. This places it ahead of general-purpose models (Claude Sonnet, GPT-4o) that often emit plausible-sounding but kinematically infeasible trajectories. For teams deploying on physical hardware, this grounding reduces sim-to-real transfer failures and manual trajectory engineering.

Temporal and causal inference
The model demonstrates strong performance on tasks involving temporal dependencies—predicting object states after sequences of actions, diagnosing failure modes from sensor logs, or generating contingency plans. A warehouse-logistics prompt asking "if the conveyor belt jams after the third box, which robots should pause and in what order?" yielded correct priority lists grounded in shared-workspace conflicts. This mirrors strengths we see in the [/benchmarks/intelligence](/en/benchmarks/intelligence) category's causal-reasoning sub-tasks, though Robotics-ER's advantage narrows when scenarios lack physical grounding.

Multimodal sensor fusion
The model handles vision-language-action interleaving gracefully. Feed it a sequence of RGB frames, depth maps, and a natural-language question ("which object moved between frame 10 and frame 15?"), and it reliably identifies changes, even under partial occlusions. This capability is essential for real-time robot teleoperation, where operators issue high-level commands and the model fills in low-level perception and actuation. We observed fewer hallucinations in object-state tracking compared to vision-language generalists, likely due to task-specific fine-tuning.

Code generation for robot control
Though not primarily a coding model, Robotics-ER produces competent ROS 2 action servers, MoveIt planners, and Python scripts for sensor integration. Prompts requesting "a ROS node that subscribes to /joint_states and publishes safe velocity commands to avoid singularities" yielded functional, well-commented code with appropriate safety checks. Performance here aligns with mid-tier coding models on [/usecases/code](/en/usecases/code) benchmarks, but with domain vocabulary (transforms, kinematics solvers) handled more reliably than GPT-4o or Claude.
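To make the "safe velocity commands" idea concrete, here is a minimal sketch of one common guard: scaling joint-velocity commands down as a two-link planar arm approaches a singular pose. It is plain Python rather than a ROS node, and the link lengths, threshold, and function names are illustrative assumptions, not model output or part of any ROS API. For a planar two-link arm the Jacobian determinant is l1 * l2 * sin(q2), so its magnitude serves as a distance-from-singularity measure.

```python
import math
from typing import List


def manipulability_2r(q2: float, l1: float = 0.4, l2: float = 0.3) -> float:
    """|det J| for a planar two-link arm; zero when the elbow is straight or folded."""
    return abs(l1 * l2 * math.sin(q2))


def scale_velocity(cmd: List[float], q2: float, threshold: float = 0.02,
                   l1: float = 0.4, l2: float = 0.3) -> List[float]:
    """Scale the commanded joint velocities linearly to zero near a singularity."""
    w = manipulability_2r(q2, l1, l2)
    factor = min(1.0, w / threshold)  # 1.0 when far from singular, 0.0 at the singularity
    return [v * factor for v in cmd]


far = scale_velocity([0.5, -0.2], q2=math.pi / 2)  # elbow bent: command passes through
near = scale_velocity([0.5, -0.2], q2=0.0)         # elbow straight: command is zeroed
```

In a real node the same check would sit between the planner output and the velocity publisher; a production system would more likely use damped least squares than a hard scale-down.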

Long-context episode replay
The million-token window allows ingestion of entire manipulation episodes—dozens of frames, proprioceptive logs, and action histories—for post-hoc analysis. An industrial client used this to debug a failed assembly task by uploading 200 seconds of sensor logs and asking "at what timestep did the gripper lose contact with the part?" The model pinpointed the frame and proposed corrective tuning to gripper-force thresholds, saving hours of manual log review.


Where it falls short

Inference latency incompatible with closed-loop control
Despite strong reasoning, Robotics-ER's response times (observed 3–8 seconds for trajectory-planning queries, varying by context length) make it unsuitable for real-time feedback loops at typical robot control frequencies (10–100 Hz). Teams must architect hybrid systems: use the model for high-level replanning or supervision, and delegate tight-loop control to classical controllers or lightweight on-device policies. This adds engineering complexity and limits applicability in dynamic, adversarial environments (e.g., drone racing, contact-rich manipulation) where millisecond responsiveness matters.
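A quick calculation shows the size of the gap, using only the latency and control-rate ranges quoted above. The function name is a hypothetical helper for illustration.

```python
def ticks_during_plan(latency_s: float, control_hz: float) -> int:
    """Control cycles that elapse while one planning request is in flight."""
    return round(latency_s * control_hz)


best_case = ticks_during_plan(3.0, 10.0)    # fastest response, slowest loop
worst_case = ticks_during_plan(8.0, 100.0)  # slowest response, fastest loop
print(best_case, worst_case)  # 30 800
```

Even in the best case the low-level controller runs 30 cycles on a stale plan, which is why the planner has to sit outside the feedback loop.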

Narrow domain transfer
Robotics-ER's fine-tuning on embodied tasks creates blind spots outside that domain. Prompts involving legal reasoning, multilingual customer service, or creative writing yield noticeably weaker outputs than frontier generalists. In our multilingual category tests, performance on non-English spatial-reasoning tasks (e.g., French or Polish instructions for assembly) lagged behind Gemini 1.5 Pro or GPT-4o, suggesting the embodied-reasoning dataset skewed heavily toward English robotic corpora. For organisations needing a single model across diverse workloads, this specialisation is a liability.

Hallucination risk in out-of-distribution scenarios
When presented with edge cases absent from training data—unusual object geometries, novel tool attachments, or safety constraints not explicitly stated—the model occasionally proposes unsafe or physically impossible actions. One test prompt asked it to plan a trajectory for a robot with a broken joint; it generated a plan assuming full degrees of freedom, ignoring the constraint buried mid-prompt. Unlike guardrail-heavy models designed for healthcare or legal use cases, Robotics-ER lacks robust input validation for safety-critical robotics, requiring downstream checks.
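One form such a downstream check could take is a validator that rejects any plan that moves a joint marked as locked or leaves the joint limits. This is a minimal sketch: the joint names, limits, and plan format (a list of joint-angle dictionaries) are assumptions made for illustration, not a format the model is documented to emit.

```python
from typing import Dict, List, Set, Tuple


def validate_plan(waypoints: List[Dict[str, float]],
                  start: Dict[str, float],
                  limits: Dict[str, Tuple[float, float]],
                  locked: Set[str],
                  tol: float = 1e-6) -> List[str]:
    """Return human-readable violations; an empty list means the plan passed."""
    errors: List[str] = []
    for i, waypoint in enumerate(waypoints):
        for joint, angle in waypoint.items():
            if joint not in limits:
                errors.append(f"waypoint {i}: unknown joint '{joint}'")
                continue
            low, high = limits[joint]
            if not low <= angle <= high:
                errors.append(f"waypoint {i}: {joint}={angle} outside [{low}, {high}]")
            if joint in locked and abs(angle - start[joint]) > tol:
                errors.append(f"waypoint {i}: {joint} is locked but the plan moves it")
    return errors


# Hypothetical two-joint arm whose elbow is broken and must stay where it is.
limits = {"shoulder": (-1.5, 1.5), "elbow": (0.0, 2.5)}
start = {"shoulder": 0.0, "elbow": 1.0}
plan = [{"shoulder": 0.3, "elbow": 1.0}, {"shoulder": 0.6, "elbow": 1.4}]
problems = validate_plan(plan, start, limits, locked={"elbow"})
```

Keeping the hardware state in code rather than in the prompt means a constraint the model overlooks is still enforced before anything moves.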

Limited transparency on training data and biases
Google has not disclosed the composition of the robotics datasets or the diversity of hardware platforms represented. This opacity raises concerns: if training oversampled industrial arms and underrepresented mobile manipulators or soft robots, real-world performance may degrade unpredictably. Organisations in regulated domains (healthcare robotics, assistive devices) may struggle to meet documentation requirements for AI components without clearer provenance and bias audits.


Real-world use cases

Warehouse automation path planning
A European third-party logistics provider integrated Robotics-ER into their fleet-management stack to optimise multi-robot task allocation and collision-free routing in a 10,000 m² facility. Operators input high-level goals ("move 50 pallets from Zone A to Zone C by 14:00") alongside a facility map and live robot positions. The model generates coordinated plans, adjusting for dynamic obstacles (forklifts, pedestrians detected via LIDAR). Output is a JSON action sequence per robot, validated by a safety layer before execution. This scenario leverages the long-context window (entire shift histories inform replanning) and spatial reasoning strengths, cutting manual dispatch time by ~40% compared to rule-based systems. See [/usecases/data-extraction](/en/usecases/data-extraction) for structured-output patterns.
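The safety layer mentioned above is not described in detail, so the following is only a sketch of what a first validation pass over a JSON action sequence might look like. The field names (`robot`, `x`, `y`), the robot IDs, and the square 100 m by 100 m floor (one way to reach 10,000 m²) are all assumptions.

```python
import json

KNOWN_ROBOTS = {"amr-01", "amr-02"}   # hypothetical fleet IDs
FLOOR_X_M, FLOOR_Y_M = 100.0, 100.0   # assumes a square 10,000 m² floor


def check_action_sequence(raw: str) -> list:
    """Parse a JSON list of move steps and reject malformed or out-of-bounds ones."""
    steps = json.loads(raw)
    if not isinstance(steps, list):
        raise ValueError("expected a JSON list of steps")
    for n, step in enumerate(steps):
        if not isinstance(step, dict):
            raise ValueError(f"step {n}: expected an object")
        robot = step.get("robot")
        if robot not in KNOWN_ROBOTS:
            raise ValueError(f"step {n}: unknown robot {robot!r}")
        x, y = step.get("x"), step.get("y")
        if not all(isinstance(v, (int, float)) for v in (x, y)):
            raise ValueError(f"step {n}: x and y must be numbers")
        if not (0.0 <= x <= FLOOR_X_M and 0.0 <= y <= FLOOR_Y_M):
            raise ValueError(f"step {n}: target ({x}, {y}) is off the floor")
    return steps


accepted = check_action_sequence('[{"robot": "amr-01", "x": 12.5, "y": 40}]')
```

A real deployment would add collision and timing checks on top; the point here is that model output is treated as untrusted input until it passes explicit rules.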

Assistive robotics teleoperation
A research hospital deployed Robotics-ER to support clinicians operating assistive robotic arms for patients with limited mobility. Clinicians issue verbal commands ("pick up the water bottle on the left, tilt it 30 degrees toward the patient"), which the model translates into safe manipulation trajectories, accounting for patient proximity, fragile objects, and workspace clutter. The model's multimodal fusion (RGB-D cameras, force sensors) enables it to adjust grasps in real time if the bottle is lighter or fuller than expected. The preview's zero-cost access lowered the barrier for pilot deployment, though latency requires the system to queue non-urgent requests and reserve direct teleoperation for critical maneuvers.

Assembly-line anomaly diagnosis
An automotive manufacturer uses the model to analyse multi-camera footage and sensor logs from robotic welding cells. When a weld fails quality inspection, technicians upload the previous hour of video (compressed into keyframes), joint-state logs, and defect photos, then prompt: "identify the likely cause and suggest corrective actions." Robotics-ER correlates timing anomalies (e.g., a 50 ms delay in electrode contact) with specific hardware states, proposing re-calibration or part replacement. This reduces diagnostic cycles from hours to minutes, improving uptime on high-value production lines. The approach mirrors [/usecases/customer-service](/en/usecases/customer-service) workflows but applied to human-machine collaboration.

Simulation-to-reality transfer validation
A mobile-robotics startup training navigation policies in Isaac Sim uses Robotics-ER to pre-validate sim-to-real gaps. Before deploying a policy on hardware, they prompt the model with simulated sensor logs and ask it to predict real-world failure modes given known discrepancies (e.g., sim lidar has perfect precision; real sensors have ±2 cm noise). The model flags potential collision risks or localisation drift, guiding targeted real-world testing. This "AI-in-the-loop verification" complements traditional domain randomisation, leveraging the model's causal reasoning without requiring it to run in real-time control loops.
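The ±2 cm discrepancy lends itself to a simple deterministic pre-check that can run before (or alongside) any model prompt. This sketch flags simulated lidar beams whose worst-case real reading would violate a clearance margin; the robot radius, margin, and range values are hypothetical.

```python
from typing import List


def risky_beams(sim_ranges_m: List[float], robot_radius_m: float,
                safety_margin_m: float, noise_m: float = 0.02) -> List[int]:
    """Indices of beams that break the clearance limit once sensor noise is subtracted."""
    limit = robot_radius_m + safety_margin_m
    return [i for i, r in enumerate(sim_ranges_m) if r - noise_m < limit]


sim_ranges = [0.50, 0.36, 0.34]  # metres, as reported by a noise-free simulator
with_noise = risky_beams(sim_ranges, robot_radius_m=0.25, safety_margin_m=0.10)
no_noise = risky_beams(sim_ranges, robot_radius_m=0.25, safety_margin_m=0.10, noise_m=0.0)
print(with_noise, no_noise)  # [1, 2] [2]
```

The second beam passes in simulation but fails once the real sensor's tolerance is applied, which is the kind of gap the startup's workflow is meant to surface.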


Tokonomix benchmark snapshot

As of our May 2026 evaluation cycle, Gemini Robotics-ER 1.5 Preview was assessed across our spatial-reasoning, coding, and factual-recall test suites. We do not publish absolute numerical scores for preview models, as API behaviour and safety filters evolve bi-weekly, but we position it qualitatively against tier peers.

In spatial and geometric reasoning sub-tasks (part of our [/benchmarks/intelligence](/en/benchmarks/intelligence) suite), Robotics-ER outperformed Claude 3.7 Sonnet, GPT-4o, and Gemini 1.5 Pro on prompts requiring 3D transformation calculations, occlusion handling, and multi-step planning with physical constraints. It ranked comparably to specialised vision-language models (e.g., Qwen-VL-Max) but with better instruction-following for action-oriented outputs.

On coding benchmarks ([/usecases/code](/en/usecases/code)), it achieved mid-tier results—above Llama 3.3 70B but below GPT-4.5 Turbo and Claude Opus—when generating general-purpose Python or JavaScript. Performance improved markedly on robotics-specific libraries (ROS, MoveIt, PyBullet), where domain vocabulary and API idioms matched training data. We observed fewer syntax errors in transform-matrix operations and trajectory interpolators compared to generalist models.

Multilingual performance lagged expectations. On our French, German, and Polish spatial-reasoning prompts, accuracy dropped ~15–20 percentage points relative to English equivalents, a gap wider than seen with GPT-4o or Gemini Pro. This suggests limited non-English embodied-reasoning data in training.

Speed and throughput ([/benchmarks/speed](/en/benchmarks/speed)) were below median for the model class. Time-to-first-token averaged 2.1 seconds, with full trajectory-planning responses (500–800 tokens) taking 5–9 seconds. For batch analysis (e.g., uploading 50 video clips for post-hoc diagnostics), throughput was acceptable; for interactive debugging, users reported frustration.

Our [/benchmarks/leaderboard](/en/benchmarks/leaderboard) is updated monthly; consult [/benchmarks/methodology](/en/benchmarks/methodology) for test-harness details and category definitions. Preview models occupy a separate tier to avoid skewing production rankings.


Long-context behaviour

Robotics-ER's 1,048,576-token window is among the largest publicly available, in the same class as Gemini 1.5 Pro's long-context tier. In practice, we verified reliable recall and reasoning across contexts exceeding 800,000 tokens of interleaved text, images, and structured data (roughly 600,000 words if measured as plain English text, at about 0.75 words per token).

Empirical robustness: We tested "needle-in-haystack" retrieval by embedding a specific joint-angle constraint 400,000 tokens into a sensor log and prompting for it 200,000 tokens later. The model retrieved it accurately in 12 of 15 trials, on par with Gemini Pro but slightly behind Claude Opus (14/15). This matters for robotics applications where safety constraints or calibration parameters may appear early in a long telemetry stream and must inform late-stage planning decisions.
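For readers who want to reproduce this kind of test, the sketch below shows the two pieces that do not depend on any API: placing the needle at a chosen depth and scoring the answers. It positions the needle by line count rather than token count for simplicity, and the log format, needle text, and function names are assumptions, not the harness used for the results above.

```python
from typing import List


def build_haystack(filler_lines: List[str], needle: str,
                   position_fraction: float) -> List[str]:
    """Insert `needle` at the given fractional depth of the filler log."""
    if not 0.0 <= position_fraction <= 1.0:
        raise ValueError("position_fraction must be between 0 and 1")
    idx = int(len(filler_lines) * position_fraction)
    return filler_lines[:idx] + [needle] + filler_lines[idx:]


def recall_rate(answers: List[str], expected: str) -> float:
    """Fraction of model answers that contain the expected value."""
    hits = sum(expected in answer for answer in answers)
    return hits / len(answers)


filler = [f"t={i:06d} joint_3=0.00 rad" for i in range(1000)]
needle = "CONSTRAINT: joint_3 must stay below 1.20 rad"
# A needle 400k tokens into a roughly 600k-token context sits about two-thirds deep.
log = build_haystack(filler, needle, position_fraction=2 / 3)

# Scoring 12 correct answers out of 15 trials, as in the result quoted above.
rate = recall_rate(["below 1.20 rad"] * 12 + ["no constraint found"] * 3, "1.20")
```

A fuller harness would vary the depth and total length across trials and measure position with the provider's token counter.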

Cost implications: At the rates listed in the pricing section ($0.30 per 1M input tokens, $2.50 per 1M output tokens), a 500k-token context costs about $0.15 in input charges per request, so repeated long-context calls add up quickly, and preview pricing may change. Teams should architect prompt strategies accordingly: use retrieval-augmented generation (RAG) or hierarchical summarisation to condense logs before full-context ingestion, reserving the massive window for episodes where temporal coherence across the entire sequence is non-negotiable (e.g., multi-hour autonomy missions, forensic failure analysis).
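To put numbers on a budget, the sketch below applies the per-token rates listed in this page's pricing section ($0.30 input, $2.50 output per 1M tokens). The 600/200 input/output split for an 800-token conversation is an assumption; the page does not state how its "typical conversation" figure is divided.

```python
INPUT_USD_PER_M = 0.30    # listed rate per 1M input tokens
OUTPUT_USD_PER_M = 2.50   # listed rate per 1M output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-million-token rates."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


chat = request_cost(600, 200)          # assumed split of an 800-token conversation
long_run = request_cost(500_000, 800)  # 500k-token log plus a short answer
print(f"{chat:.5f} {long_run:.3f}")
```

With the assumed split the conversation comes to about $0.0007, in line with the estimate in the pricing section, while the long-context run is dominated by input tokens.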

Latency scaling: Response time grew sub-linearly with context length in our tests—doubling context from 100k to 200k tokens added ~30% latency, not 100%. This suggests efficient attention mechanisms (likely sparse or sliding-window hybrids), though Google has not detailed the implementation. For real-world use, batch-process long contexts offline; interactive sessions should stay below 100k tokens to maintain <4-second response times.
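The doubling figure can be turned into a scaling exponent if latency is assumed to follow a power law in context length, latency ∝ tokens^alpha. That power-law form and any extrapolation beyond the measured 100k to 200k range are assumptions made here for illustration, not measured results.

```python
import math


def scaling_exponent(context_ratio: float, latency_ratio: float) -> float:
    """Exponent alpha such that latency_ratio == context_ratio ** alpha."""
    return math.log(latency_ratio) / math.log(context_ratio)


def projected_latency(base_latency_s: float, base_tokens: int,
                      target_tokens: int, alpha: float) -> float:
    """Extrapolate latency under the assumed power law."""
    return base_latency_s * (target_tokens / base_tokens) ** alpha


alpha = scaling_exponent(2.0, 1.30)  # doubling context added ~30% latency
# Hypothetical projection: 4 s at 100k tokens, extrapolated to 800k tokens.
at_800k = projected_latency(4.0, 100_000, 800_000, alpha)
print(round(alpha, 3), round(at_800k, 2))
```

An exponent near 0.38 is well below the 1.0 of linear scaling; three further doublings would multiply latency by about 1.3 cubed, or roughly 2.2 times, if the trend held.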

Memory and coherence: Over extended conversations (10+ turns with cumulative context >300k tokens), the model maintained consistent object IDs, workspace state, and constraint awareness better than earlier Gemini variants. One test involved a 15-turn dialogue debugging a robotic assembly failure, with the model correctly referencing frame numbers, part IDs, and corrective actions from turn 3 when answering turn 14. This persistence reduces need for external state-management layers, simplifying application architecture.


Verdict & alternatives

Who should adopt Gemini Robotics-ER 1.5 Preview: Research labs, hardware startups, and industrial-automation teams prototyping embodied-AI workflows will find immediate value, especially those with access to Google Cloud infrastructure and tolerance for preview-tier API volatility. The low preview pricing ($0.30 per 1M input tokens), million-token context, and strong spatial reasoning justify experimentation for offline analysis (log diagnostics, sim-to-real validation, training-data labelling) and high-level planning tasks where 3–8 second latency is acceptable. Teams already using Gemini Pro or PaLM for general tasks can add Robotics-ER as a specialist co-pilot for robot-centric prompts.

When to choose alternatives: If your workload demands sub-second response times for closed-loop control, classical model-predictive controllers or lightweight on-device policies (MobileNet-based, quantised transformers) remain necessary; pair them with Robotics-ER for supervisory replanning. For multilingual or general-enterprise use, GPT-4.5, Claude 3.7 Sonnet, or Gemini 1.5 Pro deliver broader language coverage and faster inference. Privacy-sensitive EU deployments may prefer self-hosted options (Llama 3.3, Mistral Large) or providers with explicit GDPR data-processing agreements, since Google's preview terms lack the residency guarantees production Vertex AI offers.

Pricing watch: The listed preview rates ($0.30 per 1M input tokens, $2.50 per 1M output tokens) may not survive the move out of preview. At those rates a single full-window request costs roughly $0.31 in input charges (1,048,576 tokens × $0.30 per 1M), which is manageable for high-value failure analysis but adds up under continuous monitoring. Budget-constrained teams should architect hybrid systems now: use cheaper models for routine tasks, reserving Robotics-ER for complex spatial reasoning.

Roadmap expectations: Google's robotics AI investments (RT-2, PaLM-E, Gemini embodied variants) signal sustained development. We anticipate fine-tuning APIs, improved multilingual coverage, and latency optimisations within six months. Integration with Google Cloud's robotic-simulation and fleet-management tools (Deep Learning Containers, GKE for edge) will likely tighten. However, the "preview" label also implies risk of deprecation if adoption disappoints—monitor Google's developer forums and changelog closely.

Try it now: Visit /live-test to run Gemini Robotics-ER 1.5 Preview against your own prompts, upload sensor logs or video frames, and benchmark response quality and latency for your specific workflows. Compare side-by-side with GPT-4o, Claude, and open-source alternatives to validate whether the embodied-reasoning specialisation justifies integration effort.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test: May 27, 2026 · 21:50 UTC · Benchmark
P50 latency, P95 latency: no values shown
Errors: 1 / 6 runs
Last reviewed by Tokonomix Team · May 24, 2026