VLA models combine vision, language, and motor control in one architecture—but nobody really knows what's going on inside them. We present the first large-scale mechanistic interpretability study of VLAs, covering six models from 80M to 7B parameters: Pi0.5 (3B, flow-matching), OpenVLA-OFT (7B, continuous regression), X-VLA (1B, soft-prompted flow-matching), SmolVLA (450M, interleaved VLM-expert), GR00T N1.5 (3B, DiT-Eagle hybrid), and ACT (80M, CVAE). Using activation injection, counterfactual prompting, sparse autoencoders, and linear probes across 351,000+ rollout episodes on four benchmarks (LIBERO, MetaWorld, SimplerEnv, ALOHA), we find five things that hold across architectures.
Visual pathway activations determine behavior across all six models (cosine similarity = 0.999). Fine-tuned VLAs ignore language—null and negated prompts produce identical behavior despite 99.3% internal prompt decodability. Cross-task injection fails universally (0–2% across 3,600+ pairs), but displacement analysis shows the robot actually executes the source task's motor program in the wrong scene (99.8% source-dominant in X-VLA). In multi-pathway architectures, expert pathways cause 2× more behavioral displacement than VLM pathways. Per-token SAE processing matters, and narrow architectures (1024-dim) have catastrophic kill-switch features while wider ones (4096-dim) spread information redundantly. We release Action Atlas, an interactive platform for exploring VLA concept representations across all six models with 388 trained SAEs and 82+ identified manipulation concepts.
Figure 1: Core Findings. (A) Injecting PaliGemma activations from a baseline episode into a null-prompt episode recovers near-identical actions (cos = 0.999)—vision drives behavior, not language. (B) Per-token SAEs maintain 94% task success while mean-pooled SAEs cause 88% failure. (C) Same-scene steering improves performance by +23-26pp, while cross-task transfer fails (0-2%).
What we found by looking inside six VLA architectures
- The visual pathway drives behavior across all six architectures: inject baseline activations into a null-prompt episode and the robot does the same thing. (cos = 0.999, all 6 models)
- Task identity lives in the first transformer layer: injecting just Layer 0 recovers 73% of task performance. (L0: 73% recovery)
- Activation injection overrides language prompts with 93% success, improving task performance by +23-26 percentage points. (93% override rate)
- Cross-task transfer fails across all six models, but the robot is actually running the source task's motor program in the wrong scene: displacement analysis shows 99.8% source-dominant trajectories in X-VLA. (0-2% across 3,600+ pairs, 6 models)
- Fine-tuned VLAs ignore what you tell them: null, negated, and contradictory prompts all produce the same behavior, even though the model internally distinguishes prompts with 99.3% accuracy. (p = 0.25 on Pi0.5; confirmed on SmolVLA and X-VLA)
- Per-token processing is needed for most architectures, but the picture is more complex than expected: X-VLA mean-pooled SAEs actually achieve better rollout fidelity than per-token ones. (388 SAEs across 6 models)
- Narrow models (1024-dim Pi0.5) blow up when you ablate features; wide models (4096-dim OFT) degrade gracefully because information is spread across more features. (1024: fragile, 4096: resilient)
- A single SAE trained on mixed data works across all four LIBERO benchmark suites and four different fine-tuned OFT model variants. (99.2%, 119/120)
- Robots commit to their trajectory early: ablating step 0 alone causes a -49% drop, while perturbing later steps barely matters (-1%). (Step 0: -49%)
Step 0: -49%Per-token Sparse Autoencoders + causal interventions for VLA mechanistic interpretability
Figure 2: Architecture Comparison. Spider diagram comparing properties across architectures. In dual-pathway (Pi0.5, SmolVLA) and triple-pathway (GR00T) models, VLM components encode WHAT while expert components encode HOW, with expert pathways causing 2× more behavioral displacement.
We train 388 TopK SAEs (k=64) across all six models with 4–8x expansion: 1024→8192 (Pi0.5 expert, X-VLA), 4096→32768 (OFT), 960→4096/480→4096 (SmolVLA VLM/expert), and 1536–2048→12288–16384 (GR00T DiT/Eagle/VL-SA). Each action token is processed independently to preserve the temporal structure needed for motor control. How pooling strategy affects rollout fidelity depends on the architecture—mean-pooling helps some models and hurts others.
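As a concrete sketch (not the released training code), a TopK SAE of the kind described above fits in a few lines of PyTorch. The dimensions below assume the 1024→8192 expert configuration with k=64; note that each token is a separate training row, which is what "per-token" means here.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder sketch: keep the k largest
    pre-activations per token, zero the rest, reconstruct the input."""

    def __init__(self, d_in=1024, d_hidden=8192, k=64):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)
        self.b_pre = nn.Parameter(torch.zeros(d_in))

    def encode(self, x):
        pre = self.enc(x - self.b_pre)
        topk = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre)
        # Scatter only the top-k (rectified) values back; at most k
        # features are active per token.
        z.scatter_(-1, topk.indices, torch.relu(topk.values))
        return z

    def forward(self, x):
        z = self.encode(x)
        return self.dec(z) + self.b_pre, z

# Per-token training: each action token is its own row, so the temporal
# structure of the action chunk is preserved rather than mean-pooled.
sae = TopKSAE()
acts = torch.randn(32, 1024)          # 32 tokens x 1024-dim activations
recon, z = sae(acts)
loss = ((recon - acts) ** 2).mean()   # plain MSE reconstruction loss
```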
Feature importance is scored via frequency-weighted contrastive selection using Cohen's d effect size between concept-present and concept-absent episodes, multiplied by activation frequency.
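The selection rule above can be sketched in NumPy; the inputs and the planted concept feature are synthetic, but the score is the one described: |Cohen's d| between concept-present and concept-absent episodes, weighted by activation frequency.

```python
import numpy as np

def feature_importance(z_pos, z_neg):
    """Frequency-weighted contrastive score per SAE feature.
    z_pos, z_neg: (episodes, features) mean activations for
    concept-present and concept-absent episodes."""
    mu_p, mu_n = z_pos.mean(axis=0), z_neg.mean(axis=0)
    var_p = z_pos.var(axis=0, ddof=1)
    var_n = z_neg.var(axis=0, ddof=1)
    pooled = np.sqrt((var_p + var_n) / 2) + 1e-8
    cohens_d = (mu_p - mu_n) / pooled      # effect size per feature
    freq = (z_pos > 0).mean(axis=0)        # how often the feature fires
    return np.abs(cohens_d) * freq

rng = np.random.default_rng(0)
z_neg = rng.exponential(0.1, (50, 8192))
z_pos = z_neg.copy()
z_pos[:, 7] += 1.0                         # plant the concept in feature 7
scores = feature_importance(z_pos, z_neg)
best = int(scores.argmax())                # -> 7, the planted feature
```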
We test causality with four injection conditions: null injection (correct prompt to empty string), same-scene steering (redirect to alternate targets), cross-task injection (transfer across visual scenes), and cross-seed (same task, different initial conditions).
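A toy illustration of the null-injection condition, using a stand-in linear "policy" (hypothetical, not any of the six models): cache the internal activations from a correct-prompt episode, then replay them under an empty prompt. Here the recovered actions match by construction; on the real models the match is cos = 0.999.

```python
import numpy as np

rng = np.random.default_rng(0)
W_vis = rng.normal(size=(8, 16))   # toy visual pathway
W_act = rng.normal(size=(16, 4))   # toy action head

def policy(image, prompt_vec, override=None):
    """Toy stand-in for a VLA: internal activations h mix visual
    features with a weak prompt term; `override` replaces h, which is
    what activation injection does."""
    h = image @ W_vis + 0.05 * prompt_vec
    if override is not None:
        h = override
    return h @ W_act, h

image = rng.normal(size=(8,))
correct_prompt = rng.normal(size=(16,))
null_prompt = np.zeros(16)

# Baseline episode: correct prompt, cache the activations.
a_base, h_base = policy(image, correct_prompt)
# Null injection: empty prompt, but replay the cached activations.
a_inj, _ = policy(image, null_prompt, override=h_base)

cos = a_base @ a_inj / (np.linalg.norm(a_base) * np.linalg.norm(a_inj))
```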
Concept ablation zeros out specific SAE features during live rollouts, while feature steering scales feature activations by alpha to amplify or suppress encoded behaviors.
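Both interventions are edits to the SAE code vector before it is decoded back into the residual stream; a minimal sketch (indices and shapes illustrative):

```python
import numpy as np

def intervene(z, ablate_idx=(), steer_idx=(), alpha=1.0):
    """Edit SAE feature activations z of shape (tokens, features):
    concept ablation zeros the selected features; feature steering
    scales the selected features by alpha (>1 amplifies, <1 suppresses)."""
    z = z.copy()
    z[..., list(ablate_idx)] = 0.0
    z[..., list(steer_idx)] *= alpha
    return z

z = np.ones((4, 8192))                       # toy per-token SAE codes
z_ablated = intervene(z, ablate_idx=[3, 7])  # kill two concept features
z_steered = intervene(z, steer_idx=[5], alpha=2.0)  # amplify one feature
```

The edited codes are then decoded and patched into the model mid-rollout, so the behavioral effect is measured live rather than offline.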
Figure 3: SAE Explained Variance. Layer-wise analysis of SAE reconstruction quality across all 18 layers for both Expert and PaliGemma pathways, with concept density heatmaps showing how different concepts distribute across the network depth.
Ridge regression probes hit 97-98% R² across all action dimensions. We validate this with a projection operator test: projecting out the probe direction drops R² to 0%, confirming those directions actually matter for action prediction.
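The projection operator test can be sketched on synthetic data with scikit-learn (illustrative setup, not the paper's exact pipeline): fit a ridge probe, project its direction out of the activations, and watch R² collapse.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 64))                 # toy hidden activations
w_true = rng.normal(size=(64,))
y = H @ w_true + 0.01 * rng.normal(size=2000)   # one action dimension

probe = Ridge(alpha=1.0).fit(H, y)
r2 = probe.score(H, y)                          # near 1: direction is decodable

# Projection operator test: remove the probe direction and refit.
w = probe.coef_ / np.linalg.norm(probe.coef_)
H_proj = H - np.outer(H @ w, w)                 # project out the probe direction
r2_proj = Ridge(alpha=1.0).fit(H_proj, y).score(H_proj, y)  # near 0
```

The drop to ~0% R² after projection is what licenses the causal reading: the probe found the subspace that actually carries the action signal, not an incidental correlate.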
We test language grounding with 6 prompt variations across Pi0.5 (3,396+ episodes, ANOVA p = 0.25), SmolVLA (MetaWorld, 4 difficulty levels), and X-VLA (LIBERO + SimplerEnv): baseline, null (empty string), negation ("don't move"), motor commands, object swap, and temporal switches. The models mostly ignore language, though SmolVLA shows some sensitivity on harder tasks.
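The statistical test is a standard one-way ANOVA over per-episode success across prompt groups; a sketch with simulated data (illustrative numbers, not the paper's):

```python
import numpy as np
from scipy.stats import f_oneway

# Simulate per-episode success under 6 prompt variants with an identical
# underlying success rate, then run the same one-way ANOVA used above.
rng = np.random.default_rng(0)
groups = [rng.binomial(1, 0.7, size=500).astype(float) for _ in range(6)]
stat, p = f_oneway(*groups)
# A large p-value means success rates are statistically indistinguishable
# across prompt types, i.e. behavior does not track the language input.
```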
Figure 4: Temporal Criticality. (A) Feature ablation by episode phase: early steps matter most (-49% at step 0), late phases are fine (-1%). (B) Linear probes hit 97-98% R² while SAE ablation shows only 2% effect due to redundancy.
Figure 5: The Goldilocks Effect. Unlike LLMs where you can smoothly dial features up or down, VLAs are all-or-nothing: any deviation from natural activation levels (boosting or dampening) causes failure. These are precise motor control signals, not soft preferences.
Figure 6: Language Is Ignored. (A) Success rates don't change across 6 prompt types on Pi0.5 (ANOVA p = 0.25); same pattern on SmolVLA and X-VLA. (B) The weird part: Layer 17 classifiers distinguish prompt types with 99.3% accuracy, but the model's behavior doesn't change.
Figure 7: Linear Probes vs SAE Ablation. Per-dimension R² values and causality validation. Linear probes find the exact subspace used for action generation; SAE ablation is limited by feature redundancy.
| Model | Hidden Dim | 30-Feature Ablation | Tasks Affected | Interpretation |
|---|---|---|---|---|
| Pi0.5 Expert | 1024 | Catastrophic (-60 to -100pp) | 8-10 / 10 | Narrow = concentrated = fragile |
| X-VLA | 1024 | Similar narrow profile | All layers critical | Narrow = concentrated = fragile |
| OpenVLA-OFT | 4096 | Sparse / zero (0pp to -33pp) | 0-3 / 10 | Wide = redundant = resilient |
| GR00T N1.5 | 1536-2048 | Universal features devastate DiT layers | Layer-type dependent | Mixed = pathway-specialized |
Qualitative Results. Ablating specific SAE features kills specific behaviors. Each row shows baseline (works) vs. ablated (fails) for concepts PUT, OPEN, PUSH, and STOVE/INTERACT. The failure modes match the ablated concept.
Six architectures, 80M to 7B parameters, three action generation paradigms
- LIBERO: 4 suites, 40 tasks, multi-task evaluation
- MetaWorld: 50 manipulation tasks, MuJoCo tabletop manipulation
- SimplerEnv: 10 tasks, 2 embodiments (WidowX + Google Robot)
- ALOHA: bimanual tasks (TransferCube, Insertion)
| Model | Episodes | SAEs Trained | Concepts ID'd | Benchmark(s) |
|---|---|---|---|---|
| Pi0.5 | 31,600+ | 36 | 43 | LIBERO |
| OpenVLA-OFT | 70,700+ | 32 | 45 | LIBERO |
| X-VLA | 50,000+ | 96 | 82 | LIBERO, SimplerEnv |
| SmolVLA | 37,100+ | 128 | 45 | LIBERO, MetaWorld |
| GR00T N1.5 | 164,700+ | 96 | 36 | LIBERO |
| ACT | 1,870 | — | — | ALOHA |
| Total | 351,000+ | 388 | 82+ | 4 benchmarks |
Coming Soon: Physical Robot Experiments on UR5 and Franka Panda hardware
Action Atlas is an interactive visualization platform for VLA interpretability, inspired by Neuronpedia:
- UMAP scatter plots of 4,096+ SAE features with semantic search via SBERT embeddings
- Architecture diagrams showing information flow and concept density across transformer layers
- 200,000+ rollout videos filterable by model, suite, experiment type, and outcome
- Side-by-side baseline vs. ablated behavior with success comparison
- Vision perturbation results across models, with displacement analysis and cross-embodiment data
action-atlas.com
Platform sections: Feature Explorer, Layer Wires, Ablation Studies, Demo Videos
Summary. All findings across six models: activation injection, SAE analysis, concept ablation, linear probing, and temporal dynamics.
@article{vla_interp_2026,
  title  = {Not All Features Are Created Equal: A Mechanistic
            Study of Vision-Language-Action Models},
  author = {Grant, Bryce and Zhao, Xijia and Wang, Peng},
  year   = {2026},
  url    = {https://arxiv.org/abs/}
}
Coming Soon: Code & Data Release
Full codebase (SAE training, concept identification, ablation/steering, linear probing) and the Action Atlas platform will be released on paper acceptance. Pre-trained SAE checkpoints and activation datasets included.