How to read the score: note position encodes the action category, duration its type, and dynamics the gripper force. Phrases play legato, waypoints blending without stops.
SPARK (Sequential Planning via Anchored Robotic Keypoints) is a training-free manipulation system that reaches 43.7% on the six LIBERO-Pro position-and-task cells, more than doubling every baseline. LIBERO-Pro perturbs object positions and task descriptions. Vision-language-action models top simulated benchmarks but collapse under these shifts, and CaP-Agent0 re-queries an LLM every turn to reach only 18.2%, at an order of magnitude more frontier-model calls per trial. Both spend their test-time compute refining the plan, yet the layer that fails under a position or task shift is perception, so SPARK spends its compute there. A single Gemini call emits a typed behavior tree over composable primitives that encapsulate the low-level control (motion, grasping, depth geometry) a code-generation agent would otherwise regenerate as code each trial. In simulation, one more Gemini call proposes three alternative text prompts per object and keeps whichever SAM3 detects most cleanly, and a re-grounding recovery loop retries a failed primitive against freshly detected objects with no new LLM call. Against a language-only baseline, prompt self-consistency adds 27.7 points on spatial and 10.0 on object, and re-grounding recovery adds about 5, all under the strict fairness conditions of CaP-Agent0: language input only, no privileged state, no per-task tuning. The same primitive grammar runs unchanged on three robot families (UR10e, Franka FR3, and a bimanual Franka), and across eleven task-embodiment cells spanning nine unique tasks, at twenty trials each, SPARK averages 68%. Every step is a typed primitive with a checkable post-condition, so each failure traces to the planner, perception, or a kinematic limit, and we release every logged trial as a labeled dataset.
SAM3 grounds the scene from the platform cameras. One Gemini call writes the whole plan as a typed behavior tree, the score, and the robot sight-reads it without any training. The plan is symbolic, so moving an object or rewording the instruction barely changes it. “Put the bowl on the plate” calls for the same score whether the bowl starts on the left or the right; only the pixels that the label bowl binds to have moved. Perception is the layer that breaks under that shift, so that is where SPARK spends its compute.
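To make the point about symbolic plans concrete, here is a minimal sketch, not SPARK's actual plan format. The `Step` type, the primitive names `pick` and `place_on`, and the scene coordinates are all hypothetical and chosen only for illustration. It shows one plan bound against two scenes in which only the bowl has moved.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pose = Tuple[float, float, float]  # x, y, z in metres (illustrative values only)


@dataclass(frozen=True)
class Step:
    primitive: str  # hypothetical primitive name, e.g. "pick" or "place_on"
    target: str     # a keypoint label, never a coordinate


# Hypothetical plan for "Put the bowl on the plate".
PLAN: List[Step] = [Step("pick", "bowl"), Step("place_on", "plate")]


def bind(plan: List[Step], scene: Dict[str, Pose]) -> List[Tuple[str, Pose]]:
    """Resolve each label to a 3D position in the current scene."""
    return [(step.primitive, scene[step.target]) for step in plan]


# Two scenes that differ only in where the bowl starts.
scene_left = {"bowl": (-0.20, 0.40, 0.02), "plate": (0.10, 0.45, 0.01)}
scene_right = {"bowl": (0.25, 0.40, 0.02), "plate": (0.10, 0.45, 0.01)}

bound_left = bind(PLAN, scene_left)
bound_right = bind(PLAN, scene_right)

# The sequence of primitives is identical; only the bound positions differ.
same_structure = [p for p, _ in bound_left] == [p for p, _ in bound_right]
```

In this sketch the plan object is never edited between scenes; all of the change is absorbed by the label-to-position lookup, which is the part perception supplies.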
Each spatial argument is a keypoint label that the executor resolves to a 3D pose against live perception at the moment the robot acts. When a primitive’s post-condition fails, SPARK retracts, re-renders, re-runs SAM3, and retries the same plan with no new LLM call. The plan structure stays fixed while the spatial bindings are corrected. In simulation, a single extra call proposes three text prompts per object and keeps the cleanest SAM3 detection, which raises the spatial mean by 27.7 points and the object mean by 10.0. Five base primitives compose into multi-step behaviors with no task-specific code, and the full grammar of thirty typed skills adds the force calibration and retry logic a position-only set cannot express. Because execution flows through that grammar, every trial logs a labeled episode: trajectories for the policies that fail under these shifts, collected with no teleoperation.
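The two perception-side mechanisms described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released implementation: `select_prompt`, `run_with_regrounding`, the callback signatures, the numeric detection scores, and the retry budget of three are all hypothetical, and the detector is a stub rather than SAM3. How "cleanest" is actually measured is not specified here, so a single score per prompt stands in for it.

```python
from typing import Callable, Dict, List, Tuple

Pose = Tuple[float, float, float]
Scene = Dict[str, Pose]


def select_prompt(candidates: List[str], score: Callable[[str], float]) -> str:
    """Keep the candidate text prompt whose detection scores highest.

    `score` is an assumed stand-in for whatever detection-quality
    measure is used to judge the cleanest detection.
    """
    return max(candidates, key=score)


def run_with_regrounding(
    label: str,
    detect: Callable[[], Scene],
    execute: Callable[[Pose], None],
    check: Callable[[], bool],
    retract: Callable[[], None],
    max_attempts: int = 3,  # assumed retry budget
) -> int:
    """Retry one primitive against fresh detections; no planner call is made.

    Returns the number of attempts used, or raises if the budget runs out.
    """
    for attempt in range(1, max_attempts + 1):
        scene = detect()        # re-ground: perception is re-run every attempt
        execute(scene[label])   # the same step, with a freshly resolved pose
        if check():             # post-condition of the primitive
            return attempt
        retract()               # back off before looking again
    raise RuntimeError(f"post-condition for {label!r} failed after {max_attempts} attempts")


# --- Stubbed demo: the first detection is off, the second is correct. ---
TRUE_POSE: Pose = (0.30, 0.10, 0.02)
detections = iter([{"bowl": (0.10, 0.10, 0.02)}, {"bowl": TRUE_POSE}])
state = {"reached": None, "retracts": 0}


def detect() -> Scene:
    return next(detections)


def execute(pose: Pose) -> None:
    state["reached"] = pose


def check() -> bool:
    return state["reached"] == TRUE_POSE


def retract() -> None:
    state["retracts"] += 1


attempts = run_with_regrounding("bowl", detect, execute, check, retract)

# Made-up scores for three candidate prompts for one object.
scores = {"bowl": 0.41, "white ceramic bowl": 0.87, "dish": 0.30}
best_prompt = select_prompt(list(scores), lambda p: scores[p])
```

Note that the retry loop takes only a label and perception callbacks as input. The plan step itself is never rewritten, which mirrors the claim that the plan structure stays fixed while the spatial bindings are corrected.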
LIBERO-Pro success rates (%) on the six position-and-task cells under matched fairness conditions: task language only, no privileged state, no per-task tuning.
Per-task success (%, 100 trials per task) against CaP-Agent0 under the same protocol. CaP-Agent0 values are the reported approximations from Fu et al.
On hardware the same pipeline averages 68% across eleven task-embodiment cells (nine unique tasks, twenty trials each) over three robot families, with no retraining.
The UR10e, the Franka FR3, the bimanual Franka, and the simulated benchmarks running the same pipeline. Objects and placements are randomized per trial.
Mug pour (Franka FR3)
Sponge wash (Franka FR3)
Sweep to dustpan (Franka FR3)
T-shirt fold (Franka FR3)
Silverware sort (Franka FR3)
T-shirt fold (Bimanual Franka)
Plushie in bowl (UR10e)
Utensils in tray (UR10e)
Ockham's Razor? (Franka FR3)
Sponge wash (Franka FR3, repeat trial)
Mug pour (Franka FR3, bird's-eye camera)
T-shirt fold failure case (Bimanual Franka, partial fold on blue shirt)
T-shirt fold failure case (Bimanual Franka, missed hem pinch)
Lift (CaP-Bench sim, 100% at 100 trials)
Wipe (CaP-Bench sim)
Nut assembly failure case (CaP-Bench sim, shared kinematic ceiling)
Bowl on stove (LIBERO sim, goal suite)
Both moka pots failure case (LIBERO sim, SAM3 finds only 1 of 2 identical pots)
Chocolate pudding to basket (LIBERO-Pro, position perturbation, pass)
Alphabet soup to basket (LIBERO-Pro, task perturbation, pass)
Bryce Grant is supported by an NSF Graduate Research Fellowship. This work was sponsored by an NVIDIA Academic Grant.
@misc{grant2026spark,
  title  = {Sequential Planning via Anchored Robotic Keypoints},
  author = {Grant, Bryce and Rothenberg, Aryeh and Senning, Logan
            and Chua, Zonghe and Patterson, Zach and Wang, Peng},
  year   = {2026}
}