Not All Features Are Created Equal

A Mechanistic Study of Vision-Language-Action Models

1 Electrical, Computer, and Systems Engineering, Case Western Reserve University
2 Mechanical and Aerospace Engineering, Case Western Reserve University
ICLR 2026 Multimodal Intelligence Workshop (Oral)

6 VLA Architectures
420,000+ Rollout Episodes
520+ SAE Checkpoints
79 Manipulation Concepts
4 Benchmarks

Abstract

A fine-tuned Vision-Language-Action (VLA) policy will pick up the alphabet soup and place it in the basket on demand, then drop the soup off the table when an evaluator shifts the basket five centimeters left. We use activation injection to ask what the policy is doing: inject task A's activations into task B's scene at the action expert, and π0.5 executes task A's motor trajectory in 99.6% of episodes (n = 1,968); X-VLA does so in 99.8%. The injected program is bound to absolute workspace coordinates rather than the visible scene, which mechanistically explains the perturbation brittleness reported in concurrent benchmark work. The same intervention framework applied across six VLAs (π0.5, OpenVLA-OFT, X-VLA, SmolVLA, GR00T N1.5, and ACT as a language-free control) on LIBERO, MetaWorld, SimplerEnv, and ALOHA over 420,000+ rollouts surfaces three further findings: language is encoded by every architecture yet behaviorally ignored when vision uniquely identifies the goal; SAE pooling preference splits along architecture lines (π0.5 per-token, X-VLA mean-pool, SmolVLA indifferent); and pathway specialization replicates wherever expert and VLM pathways are separable. SmolVLA's interleaved fusion attenuates the headline to 52.1% LIBERO override, scoping universality. We release Action Atlas, an interactive feature-exploration platform covering all 520+ trained SAEs and 79 identified concepts.

Method

Per-token Sparse Autoencoders + causal interventions for VLA mechanistic interpretability

Methodology overview. Top: activations are recorded from VLA backbone and action expert layers during rollout episodes, then replayed under counterfactual conditions (null prompts, cross-task scenes) to establish causal relationships via behavioral change. Middle: per-token SAEs decompose layer activations into sparse features. Bottom: features are clustered, searched, and causally validated through ablation and steering experiments, with results visualized in Action Atlas.

Sparse Autoencoders (SAEs)

We train 520+ TopK SAEs (k=64) across all six models with 4–8× expansion. Each action token is processed independently to preserve temporal structure. Feature importance is scored via frequency-weighted contrastive selection (Cohen's d × top-k frequency), recovering 79 aligned manipulation concepts.
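As a rough illustration of the TopK encoding and the Cohen's d × frequency score described above, the NumPy sketch below processes each token row independently. It is not the released training code: the weight layout, the ReLU on the kept activations, the equal-weight pooled standard deviation, the toy data, and all function names are assumptions made for the example.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=64):
    """Encode each token row of x (n_tokens, d_model) independently, keeping k features."""
    pre = (x - b_dec) @ W_enc + b_enc                    # (n_tokens, d_sae)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]       # k largest pre-activations per token
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)   # assumed ReLU on the kept values
    recon = codes @ W_dec + b_dec
    return codes, recon

def contrastive_score(codes_concept, codes_other, eps=1e-8):
    """Cohen's d between concept and non-concept tokens, weighted by firing frequency."""
    diff = codes_concept.mean(axis=0) - codes_other.mean(axis=0)
    pooled_sd = np.sqrt(0.5 * (codes_concept.var(axis=0, ddof=1)
                               + codes_other.var(axis=0, ddof=1)))
    cohens_d = diff / (pooled_sd + eps)
    freq = (codes_concept > 0).mean(axis=0)              # how often the feature survives TopK
    return cohens_d * freq

# Toy setup (hypothetical sizes): width 64 with a 4x expansion.
rng = np.random.default_rng(0)
d_model, expansion = 64, 4
d_sae = d_model * expansion
W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()
b_enc, b_dec = np.zeros(d_sae), np.zeros(d_model)

tokens_concept = rng.standard_normal((200, d_model)) + 0.5   # stand-in "concept present" tokens
tokens_other = rng.standard_normal((200, d_model))
codes_c, recon_c = topk_sae_forward(tokens_concept, W_enc, b_enc, W_dec, b_dec)
codes_o, _ = topk_sae_forward(tokens_other, W_enc, b_enc, W_dec, b_dec)
scores = contrastive_score(codes_c, codes_o)
top_features = np.argsort(scores)[::-1][:5]                  # highest-scoring candidate features
```

Weighting the effect size by firing frequency down-ranks features that separate the two groups strongly but fire on only a handful of tokens.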

Activation Injection & Causal Interventions

We test causality with four injection conditions: null injection (the correct prompt replaced by an empty string), same-scene steering (redirecting to alternate targets), cross-task injection (transferring activations across visual scenes), and a norm-matched random-direction control, which yields source behavior in 22% of episodes versus 99.6% for directed injection.
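The mechanics of cross-task injection and its random-direction control can be sketched on a toy layer stack. This is only an illustration under stated assumptions: injection is modeled as full replacement of one layer's output, and `run_stack`, `norm_matched_random`, and the random weights are hypothetical stand-ins rather than any model's real interface.

```python
import numpy as np

def run_stack(x, weights, override=None):
    """Toy layer stack. override maps layer index -> activations replacing that layer's output."""
    override = override or {}
    h, cache = x, []
    for i, W in enumerate(weights):
        h = np.tanh(h @ W)
        if i in override:
            h = override[i]              # full replacement (assumed form of injection)
        cache.append(h)
    return h, cache

def norm_matched_random(acts, rng):
    """Random directions rescaled so each row has the same L2 norm as the source row."""
    noise = rng.standard_normal(acts.shape)
    scale = (np.linalg.norm(acts, axis=-1, keepdims=True)
             / np.linalg.norm(noise, axis=-1, keepdims=True))
    return noise * scale

rng = np.random.default_rng(0)
d, n_layers, layer = 16, 4, 1
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
x_task_a = rng.standard_normal((8, d))   # stand-in for task A inputs
x_task_b = rng.standard_normal((8, d))   # stand-in for task B inputs

out_a, cache_a = run_stack(x_task_a, weights)
out_injected, _ = run_stack(x_task_b, weights, override={layer: cache_a[layer]})
control = norm_matched_random(cache_a[layer], rng)
out_control, _ = run_stack(x_task_b, weights, override={layer: control})
```

In this toy, everything downstream depends only on the overwritten layer, so the injected run reproduces task A's output exactly. In a real VLA the visual stream keeps feeding later computation, which is why the measured override rates are an empirical result and not a tautology.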

Concept ablation zeros out specific SAE features during live rollouts, while feature steering scales feature activations by α to amplify or suppress encoded behaviors.
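Ablation and steering can both be written as edits in SAE feature space followed by decoding. The sketch below is a minimal version under assumptions: it uses a plain ReLU encoder instead of TopK for brevity, adds the SAE reconstruction error back so untouched features pass through unchanged, and the name `intervene` is hypothetical.

```python
import numpy as np

def intervene(x, W_enc, W_dec, ablate=(), steer=(), alpha=1.0):
    """Edit SAE features of x (n_tokens, d_model) and map back to activation space."""
    codes = np.maximum(x @ W_enc, 0.0)             # simple ReLU encoder for brevity
    error = x - codes @ W_dec                      # what the SAE fails to reconstruct
    edited = codes.copy()
    edited[:, np.asarray(steer, dtype=int)] *= alpha    # steering: scale chosen features
    edited[:, np.asarray(ablate, dtype=int)] = 0.0      # ablation: zero chosen features
    return edited @ W_dec + error                  # add the error back (assumed convention)

rng = np.random.default_rng(0)
d_model, d_sae = 32, 128
W_enc = rng.standard_normal((d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.standard_normal((d_sae, d_model)) / np.sqrt(d_sae)
x = rng.standard_normal((10, d_model))

# Pick the most active feature as a stand-in for a concept feature.
target = int(np.maximum(x @ W_enc, 0.0).mean(axis=0).argmax())

unchanged = intervene(x, W_enc, W_dec)
ablated = intervene(x, W_enc, W_dec, ablate=[target])
steered = intervene(x, W_enc, W_dec, steer=[target], alpha=3.0)
```

With no features selected the function returns the input unchanged, so any behavioral difference in a rollout can be attributed to the edited features and not to SAE reconstruction error.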

Linear Probes

Ridge regression probes recover prompt-category labels at > 99% accuracy from the residual stream of every model tested. A projection operator test confirms causality: projecting out the probe direction drops downstream R² to chance.
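A minimal sketch of the probe and the projection step, on synthetic data. The data, the regularization strength, and the helper names are assumptions. The example shows only that the fitted probe reads nothing after its directions are projected out; it does not reproduce the downstream R² measurement.

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def project_out(X, W):
    """Remove the subspace spanned by the probe weights from every row of X."""
    Q, _ = np.linalg.qr(W)                 # orthonormal basis covering the probe directions
    return X - (X @ Q) @ Q.T

rng = np.random.default_rng(0)
n_per_class, d, n_classes = 200, 32, 3
labels = np.repeat(np.arange(n_classes), n_per_class)
class_means = 4.0 * np.eye(n_classes, d)   # toy, well-separated prompt categories
X = class_means[labels] + rng.standard_normal((labels.size, d))
X = X - X.mean(axis=0)                     # center so no bias term is needed
Y = np.eye(n_classes)[labels]              # one-hot category targets

W = fit_ridge(X, Y)
acc_before = float((np.argmax(X @ W + Y.mean(axis=0), axis=1) == labels).mean())
X_proj = project_out(X, W)
probe_signal_after = float(np.abs(X_proj @ W).max())   # numerically zero after projection
```

Removing the directions of one fitted probe does not guarantee that a freshly trained probe finds nothing, which is why the behavioral or downstream check matters in addition to the projection itself.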

Counterfactual Prompting

We test language grounding with 6 prompt variations across 11,226 counterfactual episodes spanning all five language-capable VLAs: baseline, null (empty string), negation, motor commands, object swap, and mid-episode temporal switches. SmolVLA is additionally evaluated across MetaWorld's four difficulty tiers.
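To make the six conditions concrete, the sketch below builds them from a single task prompt. Every template string here is hypothetical: the page names the conditions but does not give their exact wording, so the negation and motor-command phrasings are placeholders.

```python
def prompt_variants(prompt, obj, other_obj):
    """Hypothetical templates for five of the six conditions; the sixth is a schedule."""
    return {
        "baseline": prompt,
        "null": "",
        "negation": f"do not {prompt}",
        "motor_command": "move the gripper left, then close it",
        "object_swap": prompt.replace(obj, other_obj),
    }

def temporal_switch(prompt_a, prompt_b, switch_step):
    """Return a function giving the active prompt at each control step."""
    def prompt_at(step):
        return prompt_a if step < switch_step else prompt_b
    return prompt_at

base = "pick up the alphabet soup and place it in the basket"
variants = prompt_variants(base, "alphabet soup", "cream cheese")
schedule = temporal_switch(base, variants["object_swap"], switch_step=50)
```

Here `schedule(step)` returns the baseline prompt before step 50 and the swapped prompt from step 50 onward, which is one simple way to express a mid-episode switch.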

Key Findings

A five-step argument across six VLA architectures

1

Visual Context Dominates Action Generation

Replacing the prompt with an empty string and replaying the image stream from a different episode restores actions to cosine ≥ 0.99 against the original, with rollout success recovering to 14–77% across all six architectures. Zeroing the visual pathway collapses every model to noise.

2

Motor Programs Bind to Absolute Spatial Coordinates

Cross-task injection rebinds the trajectory, not the goal: π0.5 99.6% (n=1,968), X-VLA 99.8% (n=3,150), OpenVLA-OFT 96.4% (n=1,079, tie-conditional), GR00T 57.0%, SmolVLA 52.1% on LIBERO. A norm-matched random-direction control yields 22%; a vision-only ACT controller produces 0.0% override by construction.

3

Language Becomes Necessary Only When Vision Is Ambiguous

Linear probes recover prompt-category labels at > 99% accuracy: the goal is encoded. Yet on visually unambiguous suites, replacing the prompt does almost nothing. Of the 11,226 counterfactual episodes, the 3,396 run with π0.5 on LIBERO-Object show no significant effect of prompt condition (F(4,3391)=1.23, p=0.247, η²=0.0015). Language matters only when scenes are visually ambiguous.

4

Pathway Specialization & Architecture-Tied SAE Structure

In all three multi-pathway architectures (π0.5, SmolVLA, GR00T), expert pathways encode HOW and VLM pathways encode WHAT. SmolVLA's expert carries 78.6% of cross-task override vs. 25.6% for the VLM. Single-feature ablation produces 28–92% zero-effect rates depending on architecture; SAE pooling preference is architecture-tied (per-token for π0.5 / OFT, mean-pool for X-VLA, indifferent on SmolVLA).
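The two pooling regimes compared above differ only in how an activation tensor is turned into SAE training rows. A small sketch, with toy tensor sizes and hypothetical helper names:

```python
import numpy as np

def per_token_rows(acts):
    """(n_steps, n_tokens, d) -> (n_steps * n_tokens, d): every token is its own sample."""
    return acts.reshape(-1, acts.shape[-1])

def mean_pool_rows(acts):
    """(n_steps, n_tokens, d) -> (n_steps, d): tokens averaged within each step."""
    return acts.mean(axis=1)

rng = np.random.default_rng(0)
acts = rng.standard_normal((100, 50, 32))   # toy sizes: 100 steps, 50 action tokens, width 32
X_token = per_token_rows(acts)              # 5,000 rows, temporal structure preserved per token
X_pooled = mean_pool_rows(acts)             # 100 rows, one summary vector per step
```

Per-token rows keep position-specific structure at the cost of many more training samples, while mean-pooling trades that structure for a single summary per step. Which regime a model prefers is the empirical question the finding addresses.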

Results

Cross-task displacement override rates

Cross-task injection override rates. π0.5 99.6% (n=1,968) and X-VLA 99.8% (n=3,150) show near-complete source behavior transfer; OpenVLA-OFT 96.4% (n=1,079, tie-conditional; 77.9% tie-inclusive); GR00T 57.0%; SmolVLA 52.1% on LIBERO with strong pathway split (78.6% expert vs. 25.6% VLM). Vision-only ACT control: 0.0%.
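The gap between a tie-conditional and a tie-inclusive rate comes from the denominator. The sketch below shows one assumed reading, not the paper's stated definition: each episode is labeled by comparing the injected trajectory's cosine similarity to source and target reference trajectories, near-equal similarities count as ties, and the `margin` value and function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_episode(injected, source_ref, target_ref, margin=0.05):
    """Label an episode by which reference trajectory the injected rollout resembles."""
    s, t = cosine(injected, source_ref), cosine(injected, target_ref)
    if abs(s - t) < margin:
        return "tie"
    return "source" if s > t else "target"

def override_rates(labels):
    """Return (tie_conditional, tie_inclusive) source-override rates."""
    n, n_src, n_tie = len(labels), labels.count("source"), labels.count("tie")
    tie_conditional = n_src / (n - n_tie) if n > n_tie else float("nan")
    tie_inclusive = n_src / n
    return tie_conditional, tie_inclusive

rng = np.random.default_rng(0)
source_ref = rng.standard_normal((50, 7))   # toy 50-step, 7-DOF action trajectories
target_ref = rng.standard_normal((50, 7))
injected = source_ref + 0.1 * rng.standard_normal((50, 7))
example_label = label_episode(injected, source_ref, target_ref)

toy_labels = ["source"] * 8 + ["target"] * 1 + ["tie"] * 1
tie_conditional, tie_inclusive = override_rates(toy_labels)   # 8/9 and 8/10
```

Under this reading, ties lower the tie-inclusive rate without touching the tie-conditional one, which matches the direction of the 96.4% versus 77.9% gap reported for OpenVLA-OFT.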

Language ignored despite internal distinction

Language is encoded yet behaviorally ignored on visually unambiguous suites. π0.5 / LIBERO-Object across 3,396 counterfactual episodes shows no significant behavioral difference (F(4,3391)=1.23, p=0.247, η²=0.0015), even though layer-17 classifiers distinguish the same prompts at > 99% accuracy.
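The reported statistics can be sanity-checked with a plain one-way ANOVA. The degrees of freedom F(4, 3391) correspond to five prompt conditions and 3,396 episodes, and η² follows from F through η² = F·df₁ / (F·df₁ + df₂), which gives about 0.00145 for F = 1.23, consistent with the reported 0.0015 up to rounding of F. The group sizes, the binary success outcome, and the 90% success rate in the sketch are assumptions chosen only to reproduce the degrees of freedom.

```python
import numpy as np

def one_way_anova(groups):
    """Return F, eta-squared, and (df_between, df_within) for a list of 1-D arrays."""
    all_vals = np.concatenate(groups)
    grand_mean = all_vals.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df1, df2 = len(groups) - 1, len(all_vals) - len(groups)
    F = (ss_between / df1) / (ss_within / df2)
    eta2 = ss_between / (ss_between + ss_within)
    return F, eta2, (df1, df2)

def eta2_from_f(F, df1, df2):
    """Algebraically equivalent form of eta-squared in terms of F and its dfs."""
    return F * df1 / (F * df1 + df2)

rng = np.random.default_rng(0)
sizes = [680, 679, 679, 679, 679]          # hypothetical split of 3,396 episodes
groups = [(rng.random(n) < 0.9).astype(float) for n in sizes]   # toy success indicators

F, eta2, dfs = one_way_anova(groups)
eta2_reported = eta2_from_f(1.23, 4, 3391)  # plug in the reported F and dfs
```

An η² this small means prompt condition accounts for well under 1% of the variance in the outcome, which is the quantitative content of "behaviorally ignored".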

Concept ablation causal sensitivity

Single-feature concept ablation across five language-capable VLAs. SmolVLA (480-dim expert) is the most sensitive at 28% zero-effect rate; OpenVLA-OFT (4096-dim) and X-VLA (1024-dim) are the most resilient at 92% and 82%. Causal sensitivity does not follow representation width; it is architecture-tied.

Causal sensitivity does not follow representation width

Model            Hidden Dim   Zero-Effect Rate   Destruction Rate   Profile
SmolVLA Expert   480          28%                6.3%               Narrow, most sensitive
π0.5 Expert      1024         54%                14%                Bimodal, kill-switches at L8
GR00T N1.5       1536–2048    59%                9.1%               Pathway-dependent (DiT > Eagle)
X-VLA            1024         82%                2.7%               Resilient despite narrow width
OpenVLA-OFT      4096         92%                0.5%               Wide, most resilient


PUT concept ablation (L8): “Put the cream cheese in the bowl.” Top: baseline picks up cream cheese and places it in the bowl (91 steps). Bottom: with PUT features zeroed, the robot drops the cream cheese into the bowl, knocking it over (300 steps, task failure).

OPEN concept ablation (L8): “Open the middle drawer of the cabinet.” Top: baseline reaches for the middle drawer, grasps and pulls it open (140 steps). Bottom: with OPEN features zeroed, the robot opens the bottom drawer instead of the middle, misdirecting the motor program to the wrong target (300 steps, task failure).

Models Studied

Five VLAs and one language-free control, 80M to 7B parameters, three action generation paradigms

π0.5
3B parameters, cross-task override 99.6% (n=1,968)
PaliGemma VLM (SigLIP + Gemma 2.6B) with dedicated Gemma action expert (18L, 1024-dim). Flow matching with 50-step denoising and 50-step action chunking. Dual-pathway: VLM conditions expert.
Flow Matching, Dual Pathway

OpenVLA-OFT
7B parameters, cross-task override 96.4% (tie-conditional)
Llama-2 7B with DINOv2 + SigLIP vision encoders. MLPResNet action head produces all 7 DOFs in one forward pass with 8-step chunking. Optimized Fine-Tuning via LoRA adapters preserves pretrained representation geometry.
Continuous L1 Regression

ACT
80M parameters, cross-task override 0.0% (n=10)
ResNet encoder + transformer decoder. CVAE with action chunking for bimanual manipulation (TransferCube, Insertion). Vision-only by construction; cross-task injection is bit-identical to baseline. Serves as the non-VLA control.
CVAE, Vision-only Control

GR00T N1.5
3B parameters, cross-task override 57.0% (n=270)
NVIDIA's triple-pathway foundation model: 16 DiT layers, 12 Eagle LM layers, and 4 VL Self-Attention layers (32 total). DiT carries the motor signal: zeroing any DiT layer collapses every suite to 0%, while the four VL-SA layers tolerate ablation with 40 to 80% residual success.
Flow-Matching DiT, Triple Pathway

SmolVLA
450M parameters, LIBERO 52.1%, MetaWorld 12.4%
Interleaved dual-pathway: 32 VLM layers (960-dim) and 32 expert layers (480-dim) execute in alternation. The interleaved fusion mechanically attenuates cross-task override (expert pathway 78.6% vs. VLM pathway 25.6% on LIBERO); we treat SmolVLA as a scoping case for the universality claim.
Flow Matching, Interleaved

X-VLA
~1B parameters, cross-task override 99.8% (n=3,150)
Florence-2 VLM with soft-prompted flow-matching action head. Cross-embodiment design tested on LIBERO and SimplerEnv (WidowX + Google Robot). Override is reported under action-vector cosine; an EEF-velocity classifier yields 52.8% (both disclosed in the paper's limitations).
Flow Matching + Soft Prompts


Cross-model capability radar. Five VLAs scored on baseline success, visual override strength, language sensitivity, SAE fidelity, cross-task transfer rate, and pathway specialization. OFT lacks pathway specialization (single-pathway architecture); SmolVLA and GR00T show the strongest pathway specialization alongside π0.5.

Benchmarks

LIBERO

4 suites, 40 tasks
MuJoCo tabletop

SimplerEnv

10 tasks, 2 embodiments
WidowX + Google Robot

ALOHA-sim

Bimanual tasks
TransferCube, Insertion

MetaWorld

50 manipulation tasks
Multi-task evaluation

Experimental Scale

Model         Episodes   SAEs Trained   Concepts ID'd   Benchmark(s)
π0.5          71,500+    72             43              LIBERO
OpenVLA-OFT   70,700+    96             45              LIBERO
X-VLA         50,000+    120            82              LIBERO, SimplerEnv
SmolVLA       58,000+    146            45              LIBERO, MetaWorld
GR00T N1.5    164,700+   87             36              LIBERO
ACT           1,870      -              -               ALOHA
Total         420,000+   520+           79 unique       4 benchmarks

Action Atlas

Interactive visualization platform for VLA interpretability, inspired by Neuronpedia

Explore VLA Representations

Feature Explorer

UMAP scatter plots of 4,096+ SAE features with semantic search via SBERT embeddings

Layer Wires

Architecture diagrams showing information flow and concept density across transformer layers

Video Library

308,000+ rollout videos filterable by model, suite, experiment type, and outcome

Ablation Studies

Side-by-side baseline vs. ablated behavior with success comparison

Perturbation Testing

Vision perturbation results across models with displacement analysis and cross-embodiment data

Launch Action Atlas: action-atlas.com

Citation

@misc{grant2026featurescreatedequalmechanistic,
  title         = {Not All Features Are Created Equal: A Mechanistic
                   Study of Vision-Language-Action Models},
  author        = {Bryce Grant and Xijia Zhao and Peng Wang},
  year          = {2026},
  eprint        = {2603.19233},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2603.19233}
}