SPARK

Sequential Planning via Anchored Robotic Keypoints

¹Electrical, Computer & Systems Engineering, ²Mechanical & Aerospace Engineering
Case Western Reserve University

How to read the score: note position encodes the action category, duration its type, and dynamics the gripper force. Phrases play legato, waypoints blending without stops.

Abstract

SPARK (Sequential Planning via Anchored Robotic Keypoints) is a training-free manipulation system that reaches 43.7% on the six LIBERO-Pro position-and-task cells, more than doubling every baseline. LIBERO-Pro perturbs object positions and task descriptions. Vision-language-action models top simulated benchmarks but collapse under these shifts, and CaP-Agent0 re-queries an LLM every turn to reach only 18.2%, at an order of magnitude more frontier-model calls per trial. Both spend their test-time compute refining the plan, yet the layer that fails under a position or task shift is perception, so SPARK spends its compute there. A single Gemini call emits a typed behavior tree over composable primitives that encapsulate the low-level control (motion, grasping, depth geometry) a code-generation agent would otherwise regenerate as code each trial. In simulation, one more Gemini call proposes three alternative text prompts per object and keeps whichever SAM3 detects most cleanly, and a re-grounding recovery loop retries a failed primitive against freshly detected objects with no new LLM call. Against a language-only baseline, prompt self-consistency adds 27.7 points on spatial and 10.0 on object, and re-grounding recovery adds about 5, all under the strict fairness conditions of CaP-Agent0: language input only, no privileged state, no per-task tuning. The same primitive grammar runs unchanged on three robot families (UR10e, Franka FR3, and a bimanual Franka), and across eleven task-embodiment cells spanning nine unique tasks, at twenty trials each, SPARK averages 68%. Every step is a typed primitive with a checkable post-condition, so each failure traces to the planner, perception, or a kinematic limit, and we release every logged trial as a labeled dataset.

Composing the score

SPARK architecture: perception, plan, execute
SAM3 grounds each object to 3D keypoints, sharpened by adaptive perception self-consistency. One Gemini call composes a typed behavior tree over five base primitives and the skills they extend. The robot resolves each keypoint label to a pose at runtime under per-primitive post-condition checks, re-grounding perception on a failed check with no new LLM call.
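The perception self-consistency step can be sketched as below. This is a minimal illustration and not the SPARK implementation: `Detection`, `cleanliness`, `pick_prompt`, and `fake_detect` are hypothetical names, the detector is a stub rather than a real SAM3 call, and the scoring rule (top confidence minus a penalty for extra detections) is only an assumed stand-in for however "cleanest" is measured in the actual system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Detection:
    """One candidate mask for a prompt (hypothetical stand-in for a detector result)."""
    confidence: float  # detector score in [0, 1]


# A detector maps a text prompt to zero or more detections.
Detector = Callable[[str], List[Detection]]


def cleanliness(detections: List[Detection]) -> float:
    """Assumed score: reward a confident top hit, penalize ambiguous extras."""
    if not detections:
        return 0.0
    top = max(d.confidence for d in detections)
    extras = len(detections) - 1
    return top - 0.1 * extras


def pick_prompt(candidates: List[str], detect: Detector) -> Tuple[str, float]:
    """Run the detector on every candidate prompt and keep the cleanest one."""
    scored = [(cleanliness(detect(p)), p) for p in candidates]
    best_score, best_prompt = max(scored, key=lambda sp: sp[0])
    return best_prompt, best_score


# Stub detector with invented outputs, used only to exercise the selection logic.
_FAKE_SCENE: Dict[str, List[Detection]] = {
    "bowl": [Detection(0.55), Detection(0.50)],  # two ambiguous hits
    "black bowl": [Detection(0.90)],             # one confident hit
    "round dish": [],                            # nothing found
}


def fake_detect(prompt: str) -> List[Detection]:
    return _FAKE_SCENE.get(prompt, [])


best_prompt, best_score = pick_prompt(
    ["bowl", "black bowl", "round dish"], fake_detect
)
print(best_prompt, best_score)
```

In this sketch the three candidate prompts would come from the one extra LLM call, and selection itself needs no further model queries beyond the detector.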

SAM3 grounds the scene from the platform cameras. One Gemini call writes the whole plan as a typed behavior tree (the score), and the robot sight-reads it without any training. The plan is symbolic, so moving an object or rewording the instruction barely changes it. “Put the bowl on the plate” calls for the same score whether the bowl starts on the left or the right; only the pixels that the label bowl binds to have moved. Perception is the layer that breaks under that shift, so that is where SPARK spends its compute.
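To make the symbolic-plan point concrete, here is a small sketch under stated assumptions. A flat list of steps stands in for the typed behavior tree, the primitive names (`move_to`, `grasp`, `release`) are illustrative rather than SPARK's actual grammar, and the keypoint coordinates are invented. The only idea it is meant to show is that the plan stores labels, and poses are looked up when the plan is bound to a scene.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pose = Tuple[float, float, float]  # x, y, z in metres (orientation omitted)


@dataclass(frozen=True)
class Step:
    """One plan step: a primitive name plus a keypoint label, not a coordinate."""
    primitive: str
    target: str


# Hypothetical plan for "Put the bowl on the plate".
PLAN: List[Step] = [
    Step("move_to", "bowl"),
    Step("grasp", "bowl"),
    Step("move_to", "plate"),
    Step("release", "plate"),
]


def bind(plan: List[Step], keypoints: Dict[str, Pose]) -> List[Tuple[str, Pose]]:
    """Resolve each label against the current perception result."""
    return [(step.primitive, keypoints[step.target]) for step in plan]


# Two invented scenes: the bowl moves, the plate does not.
bowl_left = {"bowl": (0.40, -0.20, 0.05), "plate": (0.50, 0.10, 0.02)}
bowl_right = {"bowl": (0.40, 0.25, 0.05), "plate": (0.50, 0.10, 0.02)}

left = bind(PLAN, bowl_left)
right = bind(PLAN, bowl_right)

# Same sequence of primitives in both scenes; only the bowl pose differs.
print([name for name, _ in left] == [name for name, _ in right])
print(left[0][1], right[0][1])
```

The plan object is identical across both scenes, which is why a position shift leaves the planner's output untouched and puts the whole burden on perception.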

Each spatial argument is a keypoint label that the executor resolves to a 3D pose against live perception at the moment the robot acts. When a primitive’s post-condition fails, SPARK retracts, re-renders, re-runs SAM3, and retries the same plan with no new LLM call. The plan structure stays fixed while the spatial bindings are corrected. In simulation, a single extra call proposes three text prompts per object and keeps the cleanest SAM3 detection, which raises the spatial mean by 27.7 points and the object mean by 10.0. Five base primitives compose into multi-step behaviors with no task-specific code, and the full grammar of thirty typed skills adds the force calibration and retry logic a position-only set cannot express. Because execution flows through that grammar, every trial logs a labeled episode: trajectories for the policies that fail under these shifts, collected with no teleoperation.
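The re-grounding recovery loop described above might look roughly like the following. This is a hedged sketch, not the released code: `run_with_regrounding` and its callback parameters are hypothetical, the retry budget of three attempts is an assumption, and the toy scene at the bottom uses invented poses. The point it illustrates is that a retry changes only the label-to-pose binding and never calls the planner again.

```python
from typing import Callable, Dict, Optional, Tuple

Pose = Tuple[float, float, float]
Keypoints = Dict[str, Pose]


def run_with_regrounding(
    label: str,
    perceive: Callable[[], Keypoints],
    act: Callable[[Pose], None],
    post_condition: Callable[[], bool],
    retract: Callable[[], None],
    max_attempts: int = 3,  # assumed retry budget, not taken from the source
) -> int:
    """Execute one primitive; on a failed check, retract, re-perceive, retry.

    Returns the number of attempts used. No planner or LLM is called here.
    """
    for attempt in range(1, max_attempts + 1):
        keypoints = perceive()   # fresh detection on every attempt
        act(keypoints[label])    # bind the label to its current pose
        if post_condition():
            return attempt
        retract()                # back off before re-grounding
    raise RuntimeError(
        f"{label!r}: post-condition failed after {max_attempts} attempts"
    )


# Toy scene: the first detection is off, the second is correct.
true_pose: Pose = (0.40, 0.25, 0.05)
detections = iter([{"bowl": (0.40, -0.20, 0.05)}, {"bowl": true_pose}])
gripper_at: Optional[Pose] = None
retract_count = 0


def perceive() -> Keypoints:
    return next(detections)


def act(pose: Pose) -> None:
    global gripper_at
    gripper_at = pose


def post_condition() -> bool:
    return gripper_at == true_pose


def retract() -> None:
    global retract_count
    retract_count += 1


attempts = run_with_regrounding("bowl", perceive, act, post_condition, retract)
print(attempts, retract_count)
```

Because each primitive carries its own post-condition, a failure that survives the retry budget can be logged against that specific step.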

Results

LIBERO-Pro success rates (%) on the six position-and-task cells under matched fairness conditions: task language only, no privileged state, no per-task tuning.

Per-suite LIBERO-Pro success rates by method
Per-suite breakdown. Fair receives task language only, matching CaP-Agent0; Adaptive adds the three-prompt SAM3 self-consistency (the full system). OpenVLA and π0 score 0 in every cell and are omitted; the +BDDL-names ablation tracks Fair (31.2 vs 31.9 mean).
Per-task LIBERO-Pro success, one panel per suite: spatial, object, goal
Per-task success (%) for each LIBERO-Pro suite. SPARK (top two rows) holds up under task perturbation, while MolmoAct2 (bottom two rows) collapses to near zero.

CaP-Bench (Robosuite)

Per-task success (%, 100 trials per task) against CaP-Agent0 under the same protocol. CaP-Agent0 values are the reported approximations from Fu et al.

CaP-Bench per-task success, CaP-Agent0 vs SPARK

On hardware the same pipeline averages 68% across eleven task-embodiment cells (nine unique tasks, twenty trials each) over three robot families, with no retraining.

Real-robot rollouts

The UR10e, the Franka FR3, the bimanual Franka, and the simulated benchmarks all run the same pipeline. Objects and placements are randomized per trial.


Mug pour (Franka FR3)

Sponge wash (Franka FR3)

Sweep to dustpan (Franka FR3)

T-shirt fold (Franka FR3)

Silverware sort (Franka FR3)

T-shirt fold (Bimanual Franka)

Plushie in bowl (UR10e)

Utensils in tray (UR10e)

Ockham's Razor? (Franka FR3)

Sponge wash (Franka FR3, repeat trial)

Mug pour (Franka FR3, bird's-eye camera)

T-shirt fold failure case (Bimanual Franka, partial fold on blue shirt)

T-shirt fold failure case (Bimanual Franka, missed hem pinch)

Lift (CaP-Bench sim, 100% at 100 trials)

Wipe (CaP-Bench sim)

Nut assembly failure case (CaP-Bench sim, shared kinematic ceiling)

Bowl on stove (LIBERO sim, goal suite)

Both moka pots failure case (LIBERO sim, SAM3 finds only 1 of 2 identical pots)

Chocolate pudding to basket (LIBERO-Pro, position perturbation, pass)

Alphabet soup to basket (LIBERO-Pro, task perturbation, pass)

Acknowledgements

Bryce Grant is supported by an NSF Graduate Research Fellowship. This work was sponsored by an NVIDIA Academic Grant.

BibTeX

@misc{grant2026spark,
  title  = {Sequential Planning via Anchored Robotic Keypoints},
  author = {Grant, Bryce and Rothenberg, Aryeh and Senning, Logan
            and Chua, Zonghe and Patterson, Zach and Wang, Peng},
  year   = {2026}
}