VLA models combine vision, language, and motor control in one architecture—but nobody really knows what's going on inside them. We present the first large-scale mechanistic interpretability study of VLAs, covering six models from 80M to 7B parameters: Pi0.5 (3B, flow-matching), OpenVLA-OFT (7B, continuous regression), X-VLA (1B, soft-prompted flow-matching), SmolVLA (450M, interleaved VLM-expert), GR00T N1.5 (3B, DiT-Eagle hybrid), and ACT (80M, CVAE). Using activation injection, counterfactual prompting, sparse autoencoders, and linear probes across 351,000+ rollout episodes on four benchmarks (LIBERO, MetaWorld, SimplerEnv, ALOHA), we find five things that hold across architectures.
Visual pathway activations determine behavior across all six models (cosine similarity = 0.999). Fine-tuned VLAs ignore language—null and negated prompts produce identical behavior despite 99.3% internal prompt decodability. Cross-task injection fails universally (0–2% across 3,600+ pairs), but displacement analysis shows the robot actually executes the source task's motor program in the wrong scene (99.8% source-dominant in X-VLA). In multi-pathway architectures, expert pathways cause 2× more behavioral displacement than VLM pathways. Per-token SAE processing matters, and narrow architectures (1024-dim) have catastrophic kill-switch features while wider ones (4096-dim) spread information redundantly. We release Action Atlas, an interactive platform for exploring VLA concept representations across all six models with 388 trained SAEs and 82+ identified manipulation concepts.
Figure 1: Core Findings. (A) Injecting PaliGemma activations from a baseline episode into a null-prompt episode recovers near-identical actions (cos = 0.999)—vision drives behavior, not language. (B) Per-token SAEs maintain 94% task success while mean-pooled SAEs cause 88% failure. (C) Same-scene steering improves performance by +23-26pp, while cross-task transfer fails (0-2%).
What we found by looking inside six VLA architectures
- The visual pathway drives behavior across all six architectures: inject baseline activations into a null-prompt episode and the robot does the same thing. (cos = 0.999, all 6 models)
- Task identity lives in the first transformer layer: injecting just Layer 0 recovers 73% of task performance. (L0: 73% recovery)
- Activation injection overrides language prompts with 93% success, improving task performance by +23-26 percentage points. (93% override rate)
- Cross-task transfer fails across all six models, but the robot is actually running the source task's motor program in the wrong scene: displacement analysis shows 99.8% source-dominant trajectories in X-VLA. (0-2% across 3,600+ pairs, 6 models)
- Fine-tuned VLAs ignore what you tell them: null, negated, and contradictory prompts all produce the same behavior, even though the model internally distinguishes prompts with 99.3% accuracy. (p = 0.25 on Pi0.5; confirmed on SmolVLA and X-VLA)
- Per-token processing is needed for most architectures, but the picture is more complex than expected: X-VLA mean-pooled SAEs actually achieve better rollout fidelity than per-token ones. (388 SAEs across 6 models)
- Narrow models (1024-dim Pi0.5) blow up when you ablate features; wide models (4096-dim OFT) degrade gracefully because information is spread across more features. (1024: fragile, 4096: resilient)
- A single SAE trained on mixed data works across all four LIBERO benchmark suites and four different fine-tuned OFT model variants. (99.2%, 119/120)
- Robots commit to their trajectory early: ablating step 0 alone causes a -49% drop, while perturbing later steps barely matters (-1%). (Step 0: -49%)
Step 0: -49%Per-token Sparse Autoencoders + causal interventions for VLA mechanistic interpretability
Figure 2: Architecture Comparison. Spider diagram comparing properties across architectures. In dual-pathway (Pi0.5, SmolVLA) and triple-pathway (GR00T) models, VLM components encode WHAT while expert components encode HOW, with expert pathways causing 2× more behavioral displacement.
We train 388 TopK SAEs (k=64) across all six models with 4–8x expansion: 1024→8192 (Pi0.5 expert, X-VLA), 4096→32768 (OFT), 960→4096/480→4096 (SmolVLA VLM/expert), and 1536–2048→12288–16384 (GR00T DiT/Eagle/VL-SA). Each action token is processed independently to preserve the temporal structure needed for motor control. How pooling strategy affects rollout fidelity depends on the architecture—mean-pooling helps some models and hurts others.
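As a concrete sketch (not the released training code), a TopK SAE of the kind described above fits in a few lines of PyTorch. The dimensions below assume the 1024→8192 expert configuration with k=64; note that each token is a separate training row, which is what "per-token" means here.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder sketch: keep the k largest
    pre-activations per token, zero the rest, reconstruct the input."""

    def __init__(self, d_in=1024, d_hidden=8192, k=64):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)
        self.b_pre = nn.Parameter(torch.zeros(d_in))

    def encode(self, x):
        pre = self.enc(x - self.b_pre)
        topk = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre)
        # Scatter only the top-k (rectified) values back; at most k
        # features are active per token.
        z.scatter_(-1, topk.indices, torch.relu(topk.values))
        return z

    def forward(self, x):
        z = self.encode(x)
        return self.dec(z) + self.b_pre, z

# Per-token training: each action token is its own row, so the temporal
# structure of the action chunk is preserved rather than mean-pooled.
sae = TopKSAE()
acts = torch.randn(32, 1024)          # 32 tokens x 1024-dim activations
recon, z = sae(acts)
loss = ((recon - acts) ** 2).mean()   # plain MSE reconstruction loss
```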
Feature importance is scored via frequency-weighted contrastive selection using Cohen's d effect size between concept-present and concept-absent episodes, multiplied by activation frequency.
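The selection rule above can be sketched in NumPy; the inputs and the planted concept feature are synthetic, but the score is the one described: |Cohen's d| between concept-present and concept-absent episodes, weighted by activation frequency.

```python
import numpy as np

def feature_importance(z_pos, z_neg):
    """Frequency-weighted contrastive score per SAE feature.
    z_pos, z_neg: (episodes, features) mean activations for
    concept-present and concept-absent episodes."""
    mu_p, mu_n = z_pos.mean(axis=0), z_neg.mean(axis=0)
    var_p = z_pos.var(axis=0, ddof=1)
    var_n = z_neg.var(axis=0, ddof=1)
    pooled = np.sqrt((var_p + var_n) / 2) + 1e-8
    cohens_d = (mu_p - mu_n) / pooled      # effect size per feature
    freq = (z_pos > 0).mean(axis=0)        # how often the feature fires
    return np.abs(cohens_d) * freq

rng = np.random.default_rng(0)
z_neg = rng.exponential(0.1, (50, 8192))
z_pos = z_neg.copy()
z_pos[:, 7] += 1.0                         # plant the concept in feature 7
scores = feature_importance(z_pos, z_neg)
best = int(scores.argmax())                # -> 7, the planted feature
```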
We test causality with four injection conditions: null injection (correct prompt to empty string), same-scene steering (redirect to alternate targets), cross-task injection (transfer across visual scenes), and cross-seed (same task, different initial conditions).
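A toy illustration of the null-injection condition, using a stand-in linear "policy" (hypothetical, not any of the six models): cache the internal activations from a correct-prompt episode, then replay them under an empty prompt. Here the recovered actions match by construction; on the real models the match is cos = 0.999.

```python
import numpy as np

rng = np.random.default_rng(0)
W_vis = rng.normal(size=(8, 16))   # toy visual pathway
W_act = rng.normal(size=(16, 4))   # toy action head

def policy(image, prompt_vec, override=None):
    """Toy stand-in for a VLA: internal activations h mix visual
    features with a weak prompt term; `override` replaces h, which is
    what activation injection does."""
    h = image @ W_vis + 0.05 * prompt_vec
    if override is not None:
        h = override
    return h @ W_act, h

image = rng.normal(size=(8,))
correct_prompt = rng.normal(size=(16,))
null_prompt = np.zeros(16)

# Baseline episode: correct prompt, cache the activations.
a_base, h_base = policy(image, correct_prompt)
# Null injection: empty prompt, but replay the cached activations.
a_inj, _ = policy(image, null_prompt, override=h_base)

cos = a_base @ a_inj / (np.linalg.norm(a_base) * np.linalg.norm(a_inj))
```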
Concept ablation zeros out specific SAE features during live rollouts, while feature steering scales feature activations by alpha to amplify or suppress encoded behaviors.
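Both interventions are edits to the SAE code vector before it is decoded back into the residual stream; a minimal sketch (indices and shapes illustrative):

```python
import numpy as np

def intervene(z, ablate_idx=(), steer_idx=(), alpha=1.0):
    """Edit SAE feature activations z of shape (tokens, features):
    concept ablation zeros the selected features; feature steering
    scales the selected features by alpha (>1 amplifies, <1 suppresses)."""
    z = z.copy()
    z[..., list(ablate_idx)] = 0.0
    z[..., list(steer_idx)] *= alpha
    return z

z = np.ones((4, 8192))                       # toy per-token SAE codes
z_ablated = intervene(z, ablate_idx=[3, 7])  # kill two concept features
z_steered = intervene(z, steer_idx=[5], alpha=2.0)  # amplify one feature
```

The edited codes are then decoded and patched into the model mid-rollout, so the behavioral effect is measured live rather than offline.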
Figure 3: SAE Explained Variance. Layer-wise analysis of SAE reconstruction quality across all 18 layers for both Expert and PaliGemma pathways, with concept density heatmaps showing how different concepts distribute across the network depth.
Ridge regression probes hit 97-98% R² across all action dimensions. We validate this with a projection operator test: projecting out the probe direction drops R² to 0%, confirming those directions actually matter for action prediction.
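The projection operator test can be sketched on synthetic data with scikit-learn (illustrative setup, not the paper's exact pipeline): fit a ridge probe, project its direction out of the activations, and watch R² collapse.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
H = rng.normal(size=(2000, 64))                 # toy hidden activations
w_true = rng.normal(size=(64,))
y = H @ w_true + 0.01 * rng.normal(size=2000)   # one action dimension

probe = Ridge(alpha=1.0).fit(H, y)
r2 = probe.score(H, y)                          # near 1: direction is decodable

# Projection operator test: remove the probe direction and refit.
w = probe.coef_ / np.linalg.norm(probe.coef_)
H_proj = H - np.outer(H @ w, w)                 # project out the probe direction
r2_proj = Ridge(alpha=1.0).fit(H_proj, y).score(H_proj, y)  # near 0
```

The drop to ~0% R² after projection is what licenses the causal reading: the probe found the subspace that actually carries the action signal, not an incidental correlate.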
We test language grounding with 6 prompt variations across Pi0.5 (3,396+ episodes, ANOVA p = 0.25), SmolVLA (MetaWorld, 4 difficulty levels), and X-VLA (LIBERO + SimplerEnv): baseline, null (empty string), negation ("don't move"), motor commands, object swap, and temporal switches. The models mostly ignore language, though SmolVLA shows some sensitivity on harder tasks.
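The statistical test is a standard one-way ANOVA over per-episode success across prompt groups; a sketch with simulated data (illustrative numbers, not the paper's):

```python
import numpy as np
from scipy.stats import f_oneway

# Simulate per-episode success under 6 prompt variants with an identical
# underlying success rate, then run the same one-way ANOVA used above.
rng = np.random.default_rng(0)
groups = [rng.binomial(1, 0.7, size=500).astype(float) for _ in range(6)]
stat, p = f_oneway(*groups)
# A large p-value means success rates are statistically indistinguishable
# across prompt types, i.e. behavior does not track the language input.
```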
Figure 4: Temporal Criticality. (A) Feature ablation by episode phase: early steps matter most (-49% at step 0), late phases are fine (-1%). (B) Linear probes hit 97-98% R² while SAE ablation shows only 2% effect due to redundancy.
Figure 5: The Goldilocks Effect. Unlike LLMs where you can smoothly dial features up or down, VLAs are all-or-nothing: any deviation from natural activation levels (boosting or dampening) causes failure. These are precise motor control signals, not soft preferences.
Figure 6: Language Is Ignored. (A) Success rates don't change across 6 prompt types on Pi0.5 (ANOVA p = 0.25); same pattern on SmolVLA and X-VLA. (B) The weird part: Layer 17 classifiers distinguish prompt types with 99.3% accuracy, but the model's behavior doesn't change.
Figure 7: Linear Probes vs SAE Ablation. Per-dimension R² values and causality validation. Linear probes find the exact subspace used for action generation; SAE ablation is limited by feature redundancy.
| Model | Hidden Dim | 30-Feature Ablation | Tasks Affected | Interpretation |
|---|---|---|---|---|
| Pi0.5 Expert | 1024 | Catastrophic (-60 to -100pp) | 8-10 / 10 | Narrow = concentrated = fragile |
| X-VLA | 1024 | Similar narrow profile | All layers critical | Narrow = concentrated = fragile |
| OpenVLA-OFT | 4096 | Sparse / zero (0pp to -33pp) | 0-3 / 10 | Wide = redundant = resilient |
| GR00T N1.5 | 1536-2048 | Universal features devastate DiT layers | Layer-type dependent | Mixed = pathway-specialized |
Qualitative Results. Ablating specific SAE features kills specific behaviors. Each row shows baseline (works) vs. ablated (fails) for concepts PUT, OPEN, PUSH, and STOVE/INTERACT. The failure modes match the ablated concept.
Six architectures, 80M to 7B parameters, three action generation paradigms
- LIBERO: 4 suites, 40 tasks, multi-task evaluation
- MetaWorld: 50 manipulation tasks, MuJoCo tabletop manipulation
- SimplerEnv: 10 tasks, 2 embodiments (WidowX + Google Robot)
- ALOHA: bimanual tasks (TransferCube, Insertion)
| Model | Episodes | SAEs Trained | Concepts ID'd | Benchmark(s) |
|---|---|---|---|---|
| Pi0.5 | 31,600+ | 36 | 43 | LIBERO |
| OpenVLA-OFT | 70,700+ | 32 | 45 | LIBERO |
| X-VLA | 50,000+ | 96 | 82 | LIBERO, SimplerEnv |
| SmolVLA | 37,100+ | 128 | 45 | LIBERO, MetaWorld |
| GR00T N1.5 | 164,700+ | 96 | 36 | LIBERO |
| ACT | 1,870 | — | — | ALOHA |
| Total | 351,000+ | 388 | 82+ | 4 benchmarks |
Coming Soon: Physical Robot Experiments on UR5 and Franka Panda hardware
Action Atlas is an interactive visualization platform for VLA interpretability, inspired by Neuronpedia:
- UMAP scatter plots of 4,096+ SAE features with semantic search via SBERT embeddings
- Architecture diagrams showing information flow and concept density across transformer layers
- 200,000+ rollout videos filterable by model, suite, experiment type, and outcome
- Side-by-side baseline vs. ablated behavior with success comparison
- Vision perturbation results across models, with displacement analysis and cross-embodiment data
action-atlas.com
Platform sections: Feature Explorer, Layer Wires, Ablation Studies, Demo Videos
Summary. All findings across six models: activation injection, SAE analysis, concept ablation, linear probing, and temporal dynamics.
@article{vla_interp_2026,
  title  = {Not All Features Are Created Equal: A Mechanistic
            Study of Vision-Language-Action Models},
  author = {Grant, Bryce and Zhao, Xijia and Wang, Peng},
  year   = {2026},
  url    = {https://arxiv.org/abs/}
}
Coming Soon: Code & Data Release
Full codebase (SAE training, concept identification, ablation/steering, linear probing) and the Action Atlas platform will be released on paper acceptance. Pre-trained SAE checkpoints and activation datasets included.