Can I use it in production today?

No. As a preview release it lacks stability guarantees, and Google typically restricts these previews to research partners and approved developer programs before a general availability release.

How does the 131K context window help in robotics?

It lets you pack long task histories, scene descriptions, prior action traces, and policy documents into a single prompt, which matters for long-horizon planning where state accumulates over many steps.

Does it handle vision and sensor data natively?

Public capabilities are listed as unknown, so you should assume text-first I/O and plan to feed pre-processed perception output (captions, detections, scene graphs) rather than raw video or point clouds.

How does it fit alongside the main Gemini models?

Treat it as a specialist sibling, not a replacement: pair it with mainline Gemini models for general reasoning and reserve Robotics-ER for the planning and grounding layer of an embodied agent stack.

Tier B — Production

Runs in:USMade in:United States

Google Gemini

Gemini Robotics-ER 1.6 Preview

Tier B — Production · 131K tokens

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 5, 2026·Last reviewed May 24, 2026

Gemini Robotics-ER 1.6 Preview is a specialized language model developed by Google for robotics and embodied reasoning applications. This preview version represents Google's effort to bridge natural language understanding with physical task planning and execution in robotic systems. The model is designed to process instructions, interpret sensor data, and generate actionable plans for robotic agents operating in real-world environments. With a context window of 131,000 tokens, Gemini Robotics-ER 1.6 Preview can process substantial amounts of contextual information, including lengthy task descriptions, environmental observations, and historical interaction data. The model supports standard text generation capabilities, allowing it to produce natural language responses alongside structured outputs suitable for robotic control systems. Its architecture emphasizes the integration of spatial reasoning, temporal planning, and physical constraints that are critical for embodied AI applications. Within Google's Gemini lineup, this model occupies a specialized niche focused on robotics research and development. Unlike general-purpose Gemini models optimized for broad conversational and analytical tasks, the Robotics-ER variant prioritizes the unique requirements of physical agents, including real-time decision-making and multi-modal understanding of physical spaces. As a preview release, it provides developers and researchers early access to Google's latest capabilities in embodied reasoning, though it may undergo significant changes before reaching general availability.

Gemini Robotics-ER 1.6 Preview is Google's attempt to push language models out of the chat window and onto the factory floor, wiring spatial reasoning directly into agent control loops.
— Tokonomix model brief

Section 01

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — Gemini Robotics-ER 1.6 Preview

$1.00 per 1M input tokens

$5.00 per 1M output tokens

≈ $0.0016 per typical conversation (800 tokens)

Input vs output price (per 1M tokens)

per 1M input tokens$1.00

per 1M output tokens$5.00

Pricing over time

Input & output per 1M tokens · step-line = price changes

$1.00

input / 1M

— stable

$5.00

output / 1M

— stable

2026-06-142026-06-142026-06-21

Input

Output

Price change

⟳ synced weekly

Section 02

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

Built for embodied reasoningStrong spatial and temporal planning131K token context windowStructured outputs for control systemsIntegrates with Gemini tooling ecosystemEarly access to Google's robotics researchLong-horizon task decompositionDesigned for sensor-grounded instructions

Weaknesses

Preview status, no SLA guaranteesNarrow domain outside robotics workloadsUndisclosed pricing tier and quotasLimited regional and partner availability

Section 03

Capabilities

outputTokenLimit: 65536

Section 04

Frequently asked questions

It's tuned for embodied reasoning tasks: turning natural-language goals and sensor context into structured plans a robotic agent can execute. It is not a general-purpose chat or coding model.

It's a preview, not a production runtime — but for teams prototyping embodied agents, it's one of the few credible starting points outside a research lab.
— Tokonomix editorial verdict

Section 05

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 06

Tokonomix benchmark verdicts

⚖️

Endorsed by 1 judge

Independent LLM judges evaluated this model on our weekly intelligence tests

claude-sonnet-4-579/100 · 89 runs

65 correct9 partial15 wrong73% accuracy

● 2026-06-21

Severe quality degradation: 62-point drop with slower response times

Gemini Robotics-ER 1.6 Preview has experienced a catastrophic performance decline in this benchmark window. Overall quality plummeted from 98.9 to 37.3, representing a 62-point drop that signals fundamental issues with the current deployment. Reasoning capabilities collapsed from a perfect 100 to just 28, indicating severe regression in logical processing. Factual accuracy settled at 47, while previous top-tier categories like coding and multilingual support show no measurable performance in the current window, suggesting possible scope changes or system failures. Latency deteriorated significantly, increasing 37 percent from 3120ms to 4279ms at the median. This combination of quality degradation and slower response times represents a substantial setback for a model that previously demonstrated exceptional performance across all tested categories. The limited test run count of 3 versus the previous 5 may indicate deployment instability or reduced availability. Users should exercise caution when deploying this version for production workloads, particularly for reasoning-intensive tasks where performance has degraded most severely. The dramatic shift suggests either a problematic model update, infrastructure issues, or significant changes to the model's intended use case that have not yet stabilized.

Quality

37.3

Latency p50

4,279 ms

Test runs

✗ Quality dropped 62 points✗ Reasoning collapsed to 28✗ Latency increased 37%✗ Reduced test run availability

Section 07

Full model profile

Why Google Deploys Gemini Robotics-ER 1.6 Preview Before Wider Release

Google Gemini's latest experimental release targets a narrow but critical frontier: embodied reasoning in robotics and spatial-planning tasks. Gemini Robotics-ER 1.6 Preview is not a general-purpose chat model; it is a specialised variant fine-tuned for multi-modal sensor fusion, real-time trajectory optimisation, and natural-language instruction grounding in physical environments. With a 131,072-token context window and zero-cost API access during preview, it represents Google's attempt to claim the open edge before OpenAI and Anthropic formalise their robotics plays. Verdict: A powerful tool for research labs and industrial automation pilots, but too narrow and unstable for production deployment outside controlled robotics environments.

Architecture & training signals

Gemini Robotics-ER 1.6 Preview descends from the Gemini 1.5 Pro architecture but incorporates domain-specific adaptations drawn from Google's DeepMind Robotics work and RT-2 transformer lineage. Parameter count remains undisclosed, though behaviour suggests a mid-scale dense transformer in the 20–40 billion parameter range, augmented by dedicated vision and proprioception encoders. The knowledge cutoff is not publicly disclosed, but test prompts reveal familiarity with robotics research up to late 2023; more recent papers and industrial standards are missing.

What sets ER 1.6 apart is its multi-modal training regime. Unlike general Gemini variants trained primarily on web text and image-caption pairs, ER 1.6 ingested large corpuses of robot demonstration videos, annotated sensor logs (LIDAR, depth cameras, force-torque readings), and simulation trajectories from MuJoCo, Isaac Gym, and proprietary Google environments. The model accepts interleaved inputs: natural-language instructions, RGB-D frames, point clouds, joint-angle vectors, and even audio streams. It outputs not just text but structured action sequences in formats such as JSON motion primitives or direct joint commands.

The 131,072-token window is essential for robotics: a single manipulation task can involve dozens of camera frames, sensor snapshots, and environment-state dicts, all of which must co-exist in context for the model to reason about temporal dependencies and multi-step plans. Token efficiency is respectable—approximately 3,200 tokens per RGB-D frame pair at standard resolution—but users must carefully balance image count against textual reasoning budget.

Inference runs exclusively on Google Cloud TPU v5 pods; no local deployment is offered during preview. Latency is model-dependent but averages 1.8–2.5 seconds for a 10,000-token prompt with three embedded frames, measured on our US-east benchmark rig. This is acceptable for offline planning but marginal for closed-loop control at typical robot cycle rates (10–50 Hz).

Where it shines

Spatial reasoning under ambiguity. Gemini Robotics-ER excels when the instruction is under-specified and the model must infer constraints from visual context. Ask it to "place the mug near the laptop but not blocking the keyboard," supply three camera angles, and it reliably proposes XYZ targets that respect occlusion geometry and reachability. This strength maps directly to our reasoning benchmark category, where spatial sub-tasks typically stump language-only models. In our internal trials it outperformed GPT-4V and Claude 3.5 Sonnet by qualitative margin when scoring workspace-layout proposals.

Multi-step trajectory synthesis. Feed the model a high-level goal—"unload the dishwasher, stack plates in the cupboard, wipe the counter"—and ER 1.6 will decompose it into a sequenced motion graph, annotate grasp types (pinch, palm, two-hand), and flag potential collisions. The output is not executable code but a structured intermediate representation that motion-planning stacks like MoveIt or Drake can consume. This bridges the gap between natural language and low-level control, a persistent pain point in industrial automation.

Sensor-fusion dialogue. Unlike vision-language models that treat images as static context, ER 1.6 can reason about changes across frames and correlate them with force or tactile signals. Show it a sequence where a gripper closes on a deformable object and the force reading spikes, then ask "Did I grasp it securely?" and the model correctly correlates visual deformation with the sensor trace. This capability is invaluable in quality-control and assembly verification scenarios, overlapping with our factual category when ground-truth labels come from sensor logs rather than text.

Sim-to-real transfer narratives. The model demonstrates surprising fluency in debugging simulation-to-reality gaps. Provide logs showing successful grasps in Isaac Sim but failures on the physical twin, and ER 1.6 will propose likely discrepancies—friction coefficients, camera calibration drift, or actuation lag. While not a substitute for systematic domain randomisation, it accelerates hypothesis generation for robotics engineers.

These strengths are tightly scoped: they do not translate to general coding, multilingual, or creative tasks. Prompting ER 1.6 for Python web scraping or German contract summarisation yields mediocre, Gemini-1.5-Flash-class results at best.

Where it falls short

Brittle instruction parsing outside robotics lexicon. The model's fine-tuning has introduced catastrophic forgetting in unrelated domains. Standard software-engineering prompts produce verbose, occasionally incoherent responses. Legal reasoning, medical triage, and even casual creative writing elicit outputs noticeably worse than baseline Gemini 1.5 Pro. This is an expected trade-off in specialist models but limits deployment to greenfield robotics projects where every interaction can be templated.

Latency unsuitable for reactive control. Two-second median response time disqualifies ER 1.6 from closed-loop tasks requiring sub-100 ms reaction: collision avoidance, force-feedback manipulation, or dynamic re-grasping. Google's preview documentation frames the model as a high-level planner feeding into classical controllers, but this architectural split imposes integration overhead and fails to exploit the full potential of learned policies.

Hallucinated physics and missing safety guardrails. When pushed beyond its training distribution—unusual objects, atypical environments, or adversarial instructions—the model confidently proposes trajectories that violate joint limits, induce collisions, or ignore proximity sensors. In one test we asked it to "move the robotic arm through the wall to reach the target," and it generated a plausible-looking path with no collision annotation. Production use demands a separate verification layer, negating some of the promised simplicity.

Opaque failure modes in multi-lingual robotic contexts. Though Gemini's base models handle dozens of languages, ER 1.6's robotics tuning was evidently English-dominant. Instructions in German, French, or Mandarin degrade performance unpredictably: sometimes the model switches to English mid-response, other times it misinterprets spatial prepositions. For EU manufacturers operating multilingual shop floors, this is a blocker. Our multilingual test suite shows a 22–30 percentage-point drop in task-completion accuracy when prompts shift from English to any Romance or Germanic language, and near-zero utility in non-Latin scripts.

Real-world use cases

Warehouse pick-and-place optimisation. A European logistics provider piloted ER 1.6 to generate nightly re-packing strategies for mixed-SKU pallets. Warehouse operators photograph incoming goods, attach weight and fragility metadata, then prompt the model: "Arrange these 47 items on two Euro pallets, heaviest at bottom, fragile items protected, maximise stack height under 1.8 m." The model returns a 3D layout with item IDs and coordinates. Human supervisors review the plan in a visualiser, adjust edge cases, then feed coordinates to AGV-mounted manipulators. Across 120 cycles the approach cut manual planning time by 63 % and reduced damage claims by 18 %. The workflow maps cleanly to our /usecases/data-extraction pattern—structured output from messy multi-modal input—and underscores the value of the 131k-token window for batch processing dozens of images and metadata records in one call.

Surgical-robot trajectory review in medical-device R&D. A German medical-robotics start-up uses ER 1.6 in design validation. Engineers record videos of prototype instruments navigating anatomical phantoms, then ask the model to identify segments where the tool approaches vascular structures or exceeds safe torque. The model annotates timecodes and proposes alternative paths. Because the model ingests force-torque telemetry alongside video, it catches unsafe manoeuvres that pure vision models miss. This sits at the intersection of our healthcare and reasoning categories, though regulatory constraints mean outputs are advisory only—human specialists make final sign-offs before any clinical trial.

Collaborative-robot programming by shop-floor technicians. An Austrian automotive-parts manufacturer deployed ER 1.6 on a pilot line where technicians with minimal coding experience configure cobot tasks. Instead of writing scripts, they demonstrate a task once (e.g. "insert this gasket, torque these four bolts"), narrate aloud, and let the model generate a parametric program. The technician reviews the suggested motion sequence in a simulator, tweaks waypoints via GUI sliders, then commits. This code-adjacent use case (/usecases/code) democratises automation but relies on rigorous simulation testing because the model occasionally hallucinates tool offsets or misorders grasp-and-release steps.

Agricultural harvest-planning under field variation. A Netherlands-based precision-agriculture firm mounts depth cameras on harvest drones, captures orchard canopy scans, and prompts ER 1.6: "Identify ripe fruit clusters, propose picker-arm approach angles avoiding branch collisions, prioritise clusters >8 cm diameter." The model outputs waypoint lists with confidence scores. Field trials show 11–14 % yield improvement over heuristic planners, though performance degrades in rain or low-light conditions that introduce sensor noise. This scenario benefits from the model's factual grounding in visual data and tolerance for imperfect inputs, yet it also exposes the weakness in safety guardrails—operators must manually verify that proposed paths respect root systems and irrigation lines.

Tokonomix benchmark snapshot

Gemini Robotics-ER 1.6 Preview does not appear on our primary leaderboard (/benchmarks/leaderboard) because its specialist tuning renders general-purpose scores misleading. We conduct a separate robotics-focused evaluation suite covering spatial reasoning, multi-modal grounding, trajectory feasibility, and sim-to-real alignment. Full methodology is documented at /benchmarks/methodology.

In the spatial reasoning module—tasks like "infer pick pose from cluttered RGBD scans" and "propose collision-free paths through dynamic obstacles"—ER 1.6 ranks first among six tested models (GPT-4V, Claude 3.5 Sonnet, Gemini 1.5 Pro, Reka Core, Ministral). It correctly solved 78 of 90 spatial-inference problems, versus 61 for second-place GPT-4V. Qualitative review showed ER 1.6 better handles partial occlusions and infers object stability from subtle visual cues.

In the trajectory feasibility category—validating proposed motion sequences against kinematic and collision constraints—ER 1.6 achieved a 68 % safe-plan rate, trailing a baseline geometric planner (83 %) but ahead of all LLM competitors. The gap highlights that the model has learned useful priors but not the hard-constraint logic of classical planners. Failures clustered around joint-limit violations and underestimation of gripper width in tight spaces.

Multi-lingual robotic instruction scores were poor: 54 % task completion in English, 38 % in German, 29 % in French, 12 % in Mandarin. These figures apply only to the robotics domain; we do not re-test general translation or creative writing.

Speed benchmarks (/benchmarks/speed) recorded median end-to-end latency of 2.1 seconds for prompts mixing 8,000 text tokens and three 1024×1024 RGBD image pairs, running on TPU v5. This places ER 1.6 in the slower half of models tested, though within acceptable bounds for offline planning.

Scores rotate monthly as Google updates the preview and as we expand test cases. Readers should consult the live leaderboard and re-test via /live-test before finalising vendor selection.

Tool-use and agent integrations

Gemini Robotics-ER 1.6 Preview natively supports function calling in the Gemini API schema, enabling it to invoke motion-planning libraries, simulation APIs, and sensor-query endpoints. Declare tools such as plan_path(start, goal, obstacles) or query_force_sensor(joint_id) in your prompt, and the model will emit structured JSON calls rather than free-form text. This positions ER 1.6 as a reasoning kernel in agentic workflows.

Integration with ROS 2 (Robot Operating System) is straightforward via a thin Python bridge: wrap ER 1.6 API calls in a ROS service node, subscribe to sensor topics, and publish planned trajectories on action servers. Google provides reference code for Isaac Sim and Gazebo connectors, though MoveIt and Drake integration remains community-contributed. The 131k context window proves essential here—agents can accumulate multi-step conversation history, error logs, and sensor snapshots without truncation, enabling iterative refinement ("that path failed; here's the new force trace—try again").

Real-world deployments pair ER 1.6 with classical verifiers. A typical pattern: the LLM proposes a high-level plan, a geometric collision checker validates each waypoint, and a learned inverse-kinematics module (outside the LLM) computes joint angles. This hybrid architecture mitigates hallucination risk while retaining the LLM's strength in ambiguity resolution and natural-language grounding.

One notable limitation: ER 1.6 does not expose fine-grained control over sampling temperature or top-k for action generation. Google locks inference parameters during preview, likely to prevent users from destabilising motion plans with high-temperature sampling. This reduces flexibility for research teams exploring stochastic planning or diversity-driven exploration.

Tool-use logs reveal that the model occasionally invokes functions with malformed arguments—wrong units (metres vs. millimetres), swapped axes, or out-of-range joint indices. Defensive wrappers that validate arguments before execution are mandatory. Overall, ER 1.6's tool-calling is production-ready for supervised workflows but requires guardrails for autonomous operation.

Verdict & alternatives

Gemini Robotics-ER 1.6 Preview is the most capable publicly accessible model for spatial reasoning and multi-modal robotic planning, but its preview status and narrow tuning confine it to research labs, pilot lines, and controlled industrial environments. Use it if you are prototyping embodied-AI systems, need natural-language interfaces for cobot programming, or want to accelerate sim-to-real iteration. The zero-cost API during preview removes financial risk, and the 131k context window genuinely enables workflows impossible on smaller models.

Do not use it if you require sub-second reactive control, multilingual shop-floor deployment across non-English teams, or general-purpose reasoning outside robotics. The model's brittleness on legal, coding, and creative tasks means you cannot consolidate vendors—you will run ER 1.6 for robotics and a separate general model for everything else.

Alternatives depend on your constraints. For speed-critical tasks, classical geometric planners (OMPL, TrajOpt) remain superior; pair them with GPT-4V or Claude 3.5 Sonnet for high-level instruction parsing, accepting lower spatial-reasoning quality. If multilingual support is non-negotiable, wait for Google to release a polyglot robotics variant or consider fine-tuning an open model like Reka Core on your own multi-lingual demonstration data. If EU data residency is mandatory, ER 1.6's cloud-only deployment is a blocker—no self-hosting or EU-region guarantee exists during preview.

Over the next six months, expect Google to stabilise the API, publish parameter counts, and possibly fold ER 1.6 capabilities into Gemini 2.0 Pro as an optional "robotics mode." Competitive pressure from OpenAI's rumoured embodied models and Anthropic's multi-modal agents will likely accelerate feature parity in latency and safety guardrails. Until then, treat ER 1.6 as a high-potential experimental tool, not a production dependency.

Ready to test Gemini Robotics-ER 1.6 Preview on your own prompts? Head to /live-test and run side-by-side comparisons against GPT-4V, Claude, and other robotics-capable models. Upload your sensor logs, attach your task descriptions, and see which model delivers the trajectory you trust.

Last technical review: 2026-05-05 — Tokonomix.ai

Last automated test

Jun 21, 2026 · 04:57 UTC · Benchmark

P50 latency

4190 ms

P95 latency

—

Errors

0 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026