Not All Features Are Created Equal

A Mechanistic Study of Vision-Language-Action Models

1Electrical, Computer, and Systems Engineering, Case Western Reserve University
2Mechanical and Aerospace Engineering, Case Western Reserve University
6 VLA Models · 351,000+ Rollout Episodes · 388 SAE Checkpoints · ~7 TB Activation Data · 4 Benchmarks

Abstract

VLA models combine vision, language, and motor control in a single architecture, yet their internal mechanisms remain largely opaque. We present the first large-scale mechanistic interpretability study of VLAs, covering six models from 80M to 7B parameters: Pi0.5 (3B, flow matching), OpenVLA-OFT (7B, continuous regression), X-VLA (1B, soft-prompted flow matching), SmolVLA (450M, interleaved VLM-expert), GR00T N1.5 (3B, DiT-Eagle hybrid), and ACT (80M, CVAE). Using activation injection, counterfactual prompting, sparse autoencoders, and linear probes across 351,000+ rollout episodes on four benchmarks (LIBERO, MetaWorld, SimplerEnv, ALOHA), we report five findings that hold across architectures.

First, visual-pathway activations determine behavior across all six models (cosine similarity = 0.999). Second, fine-tuned VLAs ignore language: null and negated prompts produce identical behavior despite 99.3% internal prompt decodability. Third, cross-task injection fails universally (0-2% across 3,600+ pairs), yet displacement analysis shows the robot actually executes the source task's motor program in the wrong scene (99.8% source-dominant in X-VLA). Fourth, in multi-pathway architectures, expert pathways cause 2x more behavioral displacement than VLM pathways. Fifth, per-token SAE processing matters: narrow architectures (1024-dim) harbor catastrophic kill-switch features, while wider ones (4096-dim) spread information redundantly. We release Action Atlas, an interactive platform for exploring VLA concept representations across all six models, with 388 trained SAEs and 82+ identified manipulation concepts.

Core findings: activation injection, per-token SAEs, and steering results

Figure 1: Core Findings. (A) Injecting PaliGemma activations from a baseline episode into a null-prompt episode recovers near-identical actions (cos = 0.999)—vision drives behavior, not language. (B) Per-token SAEs maintain 94% task success while mean-pooled SAEs cause 88% failure. (C) Same-scene steering improves performance by +23-26pp, while cross-task transfer fails (0-2%).

Key Findings

What we found by looking inside six VLA architectures

1

Visual Pathway Dominates

The visual pathway drives behavior across all six architectures. Inject baseline activations into a null-prompt episode and the robot does the same thing.

cos = 0.999, all 6 models
2

Layer 0 Is Sufficient

Task identity lives in the first transformer layer. Injecting just Layer 0 recovers 73% of task performance.

L0: 73% recovery
3

Same-Scene Steering Works

Activation injection overrides language prompts with 93% success, improving task performance by +23-26 percentage points.

93% override rate
4

Cross-Task Transfer Fails

Transfer fails across all six models (0-2%), but the robot is actually running the source task's motor program in the wrong scene. Displacement analysis shows 99.8% source-dominant trajectories in X-VLA.

0-2% across 3,600+ pairs, 6 models
5

Language Is Ignored

Fine-tuned VLAs ignore what you tell them. Null, negated, and contradictory prompts all produce the same behavior—even though the model internally distinguishes prompts with 99.3% accuracy.

p = 0.25 (Pi0.5); confirmed on SmolVLA, X-VLA
6

Per-Token SAEs Essential

Per-token processing is needed for most architectures, but the picture is more complex than expected: X-VLA mean-pooled SAEs actually achieve better rollout fidelity than per-token.

388 SAEs across 6 models
7

Width Determines Fragility

Narrow models (1024-dim Pi0.5) blow up when you ablate features. Wide models (4096-dim OFT) degrade gracefully because information is spread across more features.

1024: fragile, 4096: resilient
8

Cross-Suite SAE Generalization

A single SAE trained on mixed data works across all four LIBERO benchmark suites and four different fine-tuned OFT model variants.

99.2% (119/120)
9

Temporal Early Commitment

Robots commit to their trajectory early. Ablating features at step 0 alone causes a -49% success drop, while the same ablation at later steps barely matters (-1%).

Step 0: -49%

Method

Per-token Sparse Autoencoders + causal interventions for VLA mechanistic interpretability

VLA architecture comparison

Figure 2: Architecture Comparison. Spider diagram comparing properties across architectures. In dual-pathway (Pi0.5, SmolVLA) and triple-pathway (GR00T) models, VLM components encode WHAT while expert components encode HOW, with expert pathways causing 2× more behavioral displacement.

Sparse Autoencoders (SAEs)

We train 388 TopK SAEs (k=64) across all six models with 4–8x expansion: 1024→8192 (Pi0.5 expert, X-VLA), 4096→32768 (OFT), 960→4096/480→4096 (SmolVLA VLM/expert), and 1536–2048→12288–16384 (GR00T DiT/Eagle/VL-SA). Each action token is processed independently to preserve the temporal structure needed for motor control. How pooling strategy affects rollout fidelity depends on the architecture—mean-pooling helps some models and hurts others.
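The TopK SAE forward pass described above can be sketched in a few lines (a minimal NumPy illustration with random weights, not the trained checkpoints; dimensions mirror the Pi0.5 expert configuration, 1024→8192 with k=64):

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=64):
    """TopK SAE forward pass: encode, keep the top-k latents per token, decode.

    x: (n_tokens, d_model). Each action token is processed independently,
    with no pooling across the chunk, preserving temporal structure.
    """
    pre = x @ W_enc + b_enc                                   # (n_tokens, d_sae)
    # Zero all but the k largest pre-activations on each token.
    idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    z = np.maximum(z, 0.0)                                    # ReLU on survivors
    x_hat = z @ W_dec + b_dec                                 # reconstruction
    return z, x_hat

# Toy dimensions mirroring the Pi0.5 expert pathway (1024 -> 8192, 8x expansion).
rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 1024, 8192, 50
x = rng.standard_normal((n_tokens, d_model)).astype(np.float32)
W_enc = rng.standard_normal((d_model, d_sae)).astype(np.float32) * 0.01
W_dec = rng.standard_normal((d_sae, d_model)).astype(np.float32) * 0.01
z, x_hat = topk_sae_forward(x, W_enc, np.zeros(d_sae, np.float32),
                            W_dec, np.zeros(d_model, np.float32), k=64)
assert (z > 0).sum(axis=-1).max() <= 64   # at most k active latents per token
```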

Feature importance is scored via frequency-weighted contrastive selection using Cohen's d effect size between concept-present and concept-absent episodes, multiplied by activation frequency.
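This scoring rule can be sketched as follows (a minimal NumPy version on synthetic activations; the function name and the exact frequency definition are illustrative assumptions):

```python
import numpy as np

def contrastive_feature_scores(acts_present, acts_absent, eps=1e-8):
    """Score SAE features by Cohen's d between concept-present and
    concept-absent episodes, weighted by activation frequency.

    acts_*: (n_episodes, d_sae) mean feature activation per episode.
    """
    mu_p, mu_a = acts_present.mean(0), acts_absent.mean(0)
    var_p = acts_present.var(0, ddof=1)
    var_a = acts_absent.var(0, ddof=1)
    pooled_sd = np.sqrt((var_p + var_a) / 2.0) + eps
    cohens_d = (mu_p - mu_a) / pooled_sd
    # Fraction of concept-present episodes in which the feature fires at all.
    freq = (acts_present > 0).mean(0)
    return np.abs(cohens_d) * freq

rng = np.random.default_rng(1)
present = np.abs(rng.standard_normal((40, 128)))
absent = np.abs(rng.standard_normal((40, 128)))
present[:, 7] += 3.0   # feature 7 fires strongly when the concept is present
scores = contrastive_feature_scores(present, absent)
assert scores.argmax() == 7
```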

Activation Injection & Causal Interventions

We test causality with four injection conditions: null injection (correct prompt to empty string), same-scene steering (redirect to alternate targets), cross-task injection (transfer across visual scenes), and cross-seed (same task, different initial conditions).
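The patching mechanic shared by all four conditions can be sketched as below (a NumPy toy; `inject_activation` and `token_slice` are illustrative names, not the paper's code):

```python
import numpy as np

def inject_activation(target_act, source_act, token_slice=slice(None)):
    """Activation injection: overwrite a layer's activations in the target
    episode (e.g. null-prompt) with cached activations from a source episode
    (e.g. correctly prompted) at the same layer and timestep."""
    patched = target_act.copy()
    patched[token_slice] = source_act[token_slice]
    return patched

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy check: patching the full token range makes the stream match the source.
rng = np.random.default_rng(4)
src = rng.standard_normal((50, 1024))
tgt = rng.standard_normal((50, 1024))
patched = inject_activation(tgt, src)
assert cosine(patched, src) > 0.999
```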

Concept ablation zeros out specific SAE features during live rollouts, while feature steering scales feature activations by alpha to amplify or suppress encoded behaviors.
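Both interventions can be sketched in one helper (a NumPy toy; `intervene_on_features` is an illustrative name, and in a live rollout the edited latents would be decoded back through the SAE decoder before being written into the residual stream):

```python
import numpy as np

def intervene_on_features(z, ablate_ids=(), steer_ids=(), alpha=1.0):
    """Edit SAE latents z (n_tokens, d_sae) during a rollout step.

    - concept ablation: zero the listed features outright
    - feature steering: scale the listed features by alpha
      (alpha > 1 amplifies, alpha < 1 suppresses the encoded behavior)
    """
    z = z.copy()
    if len(ablate_ids):
        z[:, list(ablate_ids)] = 0.0
    if len(steer_ids):
        z[:, list(steer_ids)] *= alpha
    return z

z = np.ones((5, 16))
z_out = intervene_on_features(z, ablate_ids=[2], steer_ids=[3], alpha=2.0)
assert z_out[:, 2].sum() == 0.0 and np.allclose(z_out[:, 3], 2.0)
```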

SAE explained variance across layers

Figure 3: SAE Explained Variance. Layer-wise analysis of SAE reconstruction quality across all 18 layers for both Expert and PaliGemma pathways, with concept density heatmaps showing how different concepts distribute across the network depth.

Linear Probes

Ridge-regression probes achieve 97-98% R² across all action dimensions. We validate causality with a projection-operator test: projecting the probe direction out of the activations drops R² to 0%, confirming those directions are actually used for action prediction.
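The projection-operator test can be sketched as follows (a least-squares stand-in for the ridge probes, on synthetic data where actions are exactly linear in the activations):

```python
import numpy as np

def r2(X, y):
    """In-sample R^2 of a least-squares probe X @ w ~ y."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ w
    return 1.0 - resid.var() / y.var()

def projection_test(X, y):
    """Fit a probe, project its direction out of X, refit.
    If R^2 collapses, the probed direction is actually used."""
    r2_full = r2(X, y)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = w / np.linalg.norm(w)
    X_proj = X - np.outer(X @ u, u)   # remove the probe direction from X
    return r2_full, r2(X_proj, y)

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 32))    # toy activations
y = X @ rng.standard_normal(32)       # toy action dim, linear in activations
r2_full, r2_projected = projection_test(X, y)
assert r2_full > 0.99 and r2_projected < 0.5
```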

Counterfactual Prompting

We test language grounding with 6 prompt variations across Pi0.5 (3,396+ episodes, ANOVA p = 0.25), SmolVLA (MetaWorld, 4 difficulty levels), and X-VLA (LIBERO + SimplerEnv): baseline, null (empty string), negation ("don't move"), motor commands, object swap, and temporal switches. The models mostly ignore language, though SmolVLA shows some sensitivity on harder tasks.
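The per-condition ANOVA can be sketched with a hand-rolled F-statistic (NumPy only; the real analysis would use e.g. scipy.stats.f_oneway, and the 0/1 success data here are synthetic):

```python
import numpy as np

def one_way_anova_F(groups):
    """F-statistic for a one-way ANOVA across prompt conditions.
    groups: list of 1-D arrays of per-episode success indicators."""
    all_x = np.concatenate(groups)
    grand = all_x.mean()
    k, n = len(groups), len(all_x)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

rng = np.random.default_rng(3)
# Six prompt conditions drawn from the SAME underlying success rate,
# mirroring the "language is ignored" null result (expect F near 1).
groups = [rng.binomial(1, 0.6, 500).astype(float) for _ in range(6)]
F = one_way_anova_F(groups)
```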

Results

Temporal ablation results

Figure 4: Temporal Criticality. (A) Feature ablation by episode phase: early steps matter most (-49% at step 0), late phases are fine (-1%). (B) Linear probes hit 97-98% R² while SAE ablation shows only 2% effect due to redundancy.

Goldilocks effect

Figure 5: The Goldilocks Effect. Unlike LLMs where you can smoothly dial features up or down, VLAs are all-or-nothing: any deviation from natural activation levels (boosting or dampening) causes failure. These are precise motor control signals, not soft preferences.

Language is ignored

Figure 6: Language Is Ignored. (A) Success rates don't change across 6 prompt types on Pi0.5 (ANOVA p = 0.25); same pattern on SmolVLA and X-VLA. (B) The weird part: Layer 17 classifiers distinguish prompt types with 99.3% accuracy, but the model's behavior doesn't change.

Linear probes

Figure 7: Linear Probes vs SAE Ablation. Per-dimension R² values and causality validation. Linear probes find the exact subspace used for action generation; SAE ablation is limited by feature redundancy.

Width Determines Ablation Resilience

Model | Hidden Dim | 30-Feature Ablation | Tasks Affected | Interpretation
Pi0.5 Expert | 1024 | Catastrophic (-60 to -100pp) | 8-10 / 10 | Narrow = concentrated = fragile
X-VLA | 1024 | Similar narrow profile | All layers critical | Narrow = concentrated = fragile
OpenVLA-OFT | 4096 | Sparse / zero (0pp to -33pp) | 0-3 / 10 | Wide = redundant = resilient
GR00T N1.5 | 1536-2048 | Universal features devastate DiT layers | Layer-type dependent | Mixed = pathway-specialized
Qualitative concept ablation results

Qualitative Results. Ablating specific SAE features kills specific behaviors. Each row shows baseline (works) vs. ablated (fails) for concepts PUT, OPEN, PUSH, and STOVE/INTERACT. The failure modes match the ablated concept.

Models Studied

Six architectures, 80M to 7B parameters, three action generation paradigms

π
Pi0.5
3B parameters
Dual-pathway architecture (PaliGemma VLM + Gemma action expert). Flow matching with 50-step denoising and chunks of 50 action tokens.
Flow Matching
V
OpenVLA-OFT
7B parameters
LLaMA-2 7B backbone fine-tuned with the Optimized Fine-Tuning (OFT) recipe. Continuous L1-regression action head. 7-DOF action tokens with 8-step chunking.
L1 Regression
A
ACT-ALOHA
80M parameters
ResNet encoder + transformer decoder. CVAE with action chunking for bimanual manipulation (TransferCube, Insertion).
CVAE Decoder
G
GR00T N1.5
3B parameters
NVIDIA's foundation model with triple-pathway architecture: Eagle VLM for visual encoding, VL-SA for vision-language fusion, and DiT (Diffusion Transformer) action head for humanoid control.
Diffusion Transformer
S
SmolVLA
450M parameters
Compact VLA with interleaved VLM and expert pathways. Continuous action generation with SmolVLM backbone. Tested on LIBERO (4 suites) and MetaWorld (50 tasks).
Interleaved VLM+Expert
X
X-VLA
1B parameters
Florence-2 VLM with soft-prompted flow-matching action head. Cross-embodiment design tested on LIBERO and SimplerEnv (WidowX + Google Robot).
Flow Matching + Soft Prompts

Benchmarks

LIBERO

4 suites, 40 tasks
MuJoCo tabletop manipulation

SimplerEnv

10 tasks, 2 embodiments
WidowX + Google Robot

ALOHA-sim

Bimanual tasks
TransferCube, Insertion

MetaWorld

50 manipulation tasks
Multi-task evaluation

Experimental Scale

Model | Episodes | SAEs Trained | Concepts ID'd | Benchmark(s)
Pi0.5 | 31,600+ | 36 | 43 | LIBERO
OpenVLA-OFT | 70,700+ | 32 | 45 | LIBERO
X-VLA | 50,000+ | 96 | 82 | LIBERO, SimplerEnv
SmolVLA | 37,100+ | 128 | 45 | LIBERO, MetaWorld
GR00T N1.5 | 164,700+ | 96 | 36 | LIBERO
ACT | 1,870 | — | — | ALOHA
Total | 351,000+ | 388 | 82+ | 4 benchmarks

Coming Soon: Physical Robot Experiments on UR5 and Franka Panda hardware

Action Atlas

Interactive visualization platform for VLA interpretability, inspired by Neuronpedia

Explore VLA Representations

Feature Explorer

UMAP scatter plots of 4,096+ SAE features with semantic search via SBERT embeddings

Layer Wires

Architecture diagrams showing information flow and concept density across transformer layers

Video Library

200,000+ rollout videos filterable by model, suite, experiment type, and outcome

Ablation Studies

Side-by-side baseline vs. ablated behavior with success comparison

Perturbation Testing

Vision perturbation results across models with displacement analysis and cross-embodiment data

Launch Action Atlas

action-atlas.com

Summary of all findings

Summary. All findings across six models: activation injection, SAE analysis, concept ablation, linear probing, and temporal dynamics.

Citation

@article{2026vla_interp,
  title   = {Not All Features Are Created Equal: A Mechanistic
             Study of Vision-Language-Action Models},
  author  = {Grant, Bryce and Zhao, Xijia and Wang, Peng},
  year    = {2026},
  url     = {https://arxiv.org/abs/}
}

Coming Soon: Code & Data Release

Full codebase (SAE training, concept identification, ablation/steering, linear probing) and the Action Atlas platform will be released on paper acceptance. Pre-trained SAE checkpoints and activation datasets included.