Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which uses predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from O(N) clicks to a single text query. The model processes each frame at 1008×1008 resolution in ~57ms (~18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications.
TrianguLang takes uncalibrated RGB images and a single text query (e.g. "the red mug") and produces per-view segmentation masks plus a 3D centroid in world coordinates. No camera poses, no SLAM, no per-scene optimization.
The pipeline composes two frozen foundation models with a lightweight trained decoder: SAM3 (frozen, 848M) extracts text-conditioned semantic features, DA3-NESTED (frozen, 1.4B) jointly estimates metric depth, intrinsics, and extrinsics from the input images, and the GASA decoder (trained, 13.7M) fuses them with geometry-aware cross-view attention. Only 0.54% of the total parameters are trained.
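To make the data flow concrete, here is a shape-level NumPy sketch of the three-stage composition. The SAM3 and DA3 calls are stubs (the real models are large frozen networks), and the function names and tensor shapes are illustrative assumptions, not the actual interfaces:

```python
import numpy as np

def sam3_features(images, text):
    """Stub for the frozen SAM3 encoder: text-conditioned semantic features."""
    n, h, w, _ = images.shape
    return np.zeros((n, h // 16, w // 16, 256))   # assumed patch grid + feature dim

def da3_geometry(images):
    """Stub for frozen DA3-NESTED: per-view depth, intrinsics, extrinsics."""
    n, h, w, _ = images.shape
    depth = np.ones((n, h, w))                    # metric depth maps
    K = np.tile(np.eye(3), (n, 1, 1))             # intrinsics
    poses = np.tile(np.eye(4), (n, 1, 1))         # camera-to-world extrinsics
    return depth, K, poses

def gasa_decoder(feats, depth, K, poses):
    """Stub for the trained GASA decoder: per-view masks + one 3D centroid."""
    masks = np.zeros(depth.shape, dtype=bool)
    centroid = np.zeros(3)
    return masks, centroid

def triangulang(images, text):
    feats = sam3_features(images, text)           # semantics (frozen)
    depth, K, poses = da3_geometry(images)        # geometry (frozen)
    return gasa_decoder(feats, depth, K, poses)   # fusion (trained, 13.7M params)
```

Only the decoder stage carries trainable weights; the two frozen backbones supply semantics and geometry independently.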
Standard cross-attention matches features by semantic similarity alone, producing false correspondences between visually similar but spatially distant regions (e.g. two identical mugs). GASA introduces a geometric veto: each token carries a 3D position from depth unprojection, and a learned distance kernel penalizes attention between tokens that are far apart in metric space. If two features look similar but are meters apart, the geometric bias suppresses that match.
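The mechanism can be sketched in a few lines of NumPy. The paper learns the distance kernel; here a fixed quadratic bias with hand-set `alpha`/`beta` stands in for it, so treat the kernel form and parameters as illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gasa_attention(q, k, v, pos_q, pos_k, alpha=1.0, beta=0.0):
    """Cross-view attention with a geometric veto (illustrative kernel).

    q:          (Nq, d) query token features
    k, v:       (Nk, d) key/value token features from another view
    pos_q/pos_k: (Nq, 3) / (Nk, 3) metric 3D positions from depth unprojection
    """
    d = q.shape[-1]
    sim = q @ k.T / np.sqrt(d)                                    # semantic logits
    dist = np.linalg.norm(pos_q[:, None] - pos_k[None], axis=-1)  # pairwise metric distance
    bias = -alpha * dist**2 + beta                                # far apart -> large negative bias
    attn = softmax(sim + bias, axis=-1)                           # geometry gates semantics
    return attn @ v, attn
```

With two semantically identical keys, the additive bias drives almost all attention mass onto the one that is geometrically consistent with the query.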
We also replace standard 2D positional encodings with world-space positional encoding: each pixel is unprojected to 3D using DA3 depth and camera parameters, then encoded with sinusoidal functions. The same physical point gets the same embedding regardless of viewpoint.
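A minimal sketch of the two steps, pinhole unprojection followed by sinusoidal encoding. The frequency schedule and embedding size are assumptions for illustration, not the paper's exact choices:

```python
import numpy as np

def unproject(u, v, depth, K, cam_to_world):
    """Lift pixel (u, v) with metric depth (z-depth) to a 3D world point."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # normalized camera ray, z = 1
    cam_pt = ray * depth                             # point in the camera frame
    return cam_to_world[:3, :3] @ cam_pt + cam_to_world[:3, 3]

def world_pe(xyz, num_freqs=4):
    """Sinusoidal encoding of a 3D world position (illustrative frequencies)."""
    freqs = 2.0 ** np.arange(num_freqs)              # assumed geometric schedule
    args = xyz[:, None] * freqs[None]                # (3, num_freqs)
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1).ravel()
```

Because the encoding is a function of the world coordinate alone, two cameras observing the same physical point produce identical embeddings, which is exactly the invariance a 2D positional encoding cannot offer.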
Spatial qualifiers like "nearest chair" or "mug left of the keyboard" are handled through direct geometric computation on depth-derived 3D positions, no LLM required. The pipeline parses spatial keywords, generates mask candidates, computes 3D centroids via depth unprojection, and selects the candidate satisfying the constraint. This runs in ~60ms, compared to 1-10+ seconds for LLM-based spatial reasoning.
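The selection step reduces to arithmetic on centroids, as this sketch shows. The function name is hypothetical, only the nearest/farthest qualifiers are handled, and the real parser covers more relations (e.g. "left of"):

```python
import numpy as np

def resolve_spatial_query(query, candidates, camera_pos=np.zeros(3)):
    """Pick the mask candidate satisfying a spatial qualifier.

    candidates: list of (name, centroid) pairs, centroids computed by
                unprojecting each candidate mask's pixels with predicted depth.
    """
    if "nearest" in query:
        return min(candidates, key=lambda c: np.linalg.norm(c[1] - camera_pos))
    if "farthest" in query:
        return max(candidates, key=lambda c: np.linalg.norm(c[1] - camera_pos))
    return candidates[0]   # no qualifier: fall back to the top-scoring candidate
```

No language model is in the loop, which is why the whole resolution step stays in the tens of milliseconds.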
On ScanNet++, a single text query achieves 62.4% mIoU, beating MV-SAM's 51.0% with 12 click prompts (+11.4 points). Cross-dataset transfer reveals the biggest gap: TrianguLang trained on ScanNet++ scores 75.7% mIoU on uCO3D, more than doubling MV-SAM's 32.2%. On LERF-OVS, TrianguLang closely matches LangSplat-V2 (58.1% vs. 59.9% mIoU) while running roughly four orders of magnitude faster (~58ms vs. 10-45 minutes of per-scene optimization).
| Model | Setting | Train → Eval | mIoU ↑ | mAcc ↑ |
|---|---|---|---|---|
| MV-SAM | In-domain | ScanNet++ → ScanNet++ | 0.510 | 0.694 |
| MV-SAM | In-domain | uCO3D → uCO3D | 0.910 | 0.965 |
| MV-SAM | Cross-domain | uCO3D → ScanNet++ | 0.194 | 0.251 |
| MV-SAM | Cross-domain | ScanNet++ → uCO3D | 0.322 | 0.517 |
| MV-SAM | Large-scale | SA-1B → ScanNet++ | 0.489 | 0.635 |
| MV-SAM | Large-scale | SA-1B → uCO3D | 0.877 | 0.950 |
| TrianguLang (Ours) | In-domain | ScanNet++ → ScanNet++ | 0.624 | 0.774 |
| TrianguLang (Ours) | In-domain | uCO3D → uCO3D | 0.946 | 0.983 |
| TrianguLang (Ours) | Cross-domain | uCO3D → ScanNet++ | 0.279 | 0.685 |
| TrianguLang (Ours) | Cross-domain | ScanNet++ → uCO3D | 0.757 | 0.796 |
| Method | mIoU ↑ | Loc. Acc. ↑ | Per-scene Opt. | Time |
|---|---|---|---|---|
| LERF | 37.4 | 73.6 | Yes | ~45 min |
| LangSplat | 51.4 | 84.3 | Yes | ~10 min |
| LangSplat-V2 | 59.9 | 84.1 | Yes | ~10 min |
| TrianguLang (zero-shot) | 58.1 | 83.5 | No | ~58ms |
Each component matters. Removing the GASA kernel drops mIoU by 5.3 points; removing world-space PE drops it by 5.4. Swapping the learned distance kernel for a fixed RBF costs 10.7 points. The model runs at ~57ms per frame (1008×1008, single A100), while per-scene methods require 30-60 minutes per new scene.
TrianguLang's predicted depth and segmentation masks can be fused into a TSDF volume to produce watertight mesh reconstructions of queried objects. Select an object by text, and explore the extracted 3D mesh below.
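The fusion step follows the standard TSDF integration recipe, restricted to the queried object's mask. A single-view update might look like the NumPy sketch below; the function name, voxel layout, and truncation value are illustrative assumptions rather than the project's actual implementation:

```python
import numpy as np

def tsdf_update(tsdf, weights, voxel_xyz, depth, mask, K, world_to_cam, trunc=0.05):
    """Integrate one view's masked depth into a TSDF volume (illustrative).

    tsdf, weights : (N,) running signed-distance values and integration weights
    voxel_xyz     : (N, 3) world-space voxel centers
    depth, mask   : (H, W) predicted metric depth and boolean query mask
    """
    # Project voxel centers into the view.
    cam = voxel_xyz @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = cam[:, 2]
    uv = cam @ K.T
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    H, W = depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    valid[valid] &= mask[v[valid], u[valid]]        # keep only the queried object
    # Truncated signed distance along the ray; drop voxels far behind the surface.
    sdf = depth[v[valid], u[valid]] - z[valid]
    keep = sdf > -trunc
    idx = np.where(valid)[0][keep]
    d = np.clip(sdf[keep] / trunc, -1.0, 1.0)
    # Running weighted average, as in classic TSDF fusion.
    tsdf[idx] = (tsdf[idx] * weights[idx] + d) / (weights[idx] + 1)
    weights[idx] += 1
    return tsdf, weights
```

Running this over all views and extracting the zero level set (e.g. with marching cubes) yields the watertight mesh of the selected object.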
```bibtex
@misc{grant2026triangulanggeometryawaresemanticconsensus,
  title={TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization},
  author={Bryce Grant and Aryeh Rothenberg and Atri Banerjee and Peng Wang},
  year={2026},
  eprint={2603.08096},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08096},
}
```