Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which uses predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from O(N) clicks to a single text query. The model processes each frame at 1008×1008 resolution in ~57ms (~18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications.
TrianguLang takes uncalibrated RGB images and a single text query (e.g. "the red mug") and produces per-view segmentation masks plus a 3D centroid in world coordinates. No camera poses, no SLAM, no per-scene optimization.
The pipeline composes two frozen foundation models with a lightweight trained decoder: SAM3 (frozen, 848M) extracts text-conditioned semantic features, DA3-NESTED (frozen, 1.4B) jointly estimates metric depth, intrinsics, and extrinsics from the input images, and the GASA decoder (trained, 13.7M) fuses them with geometry-aware cross-view attention. Only 0.54% of the total parameters are trained.
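To make the data flow concrete, here is a shape-level NumPy sketch of the three-stage composition. The SAM3 and DA3 calls are stubs (the real models are large frozen networks), and the function names and tensor shapes are illustrative assumptions, not the actual interfaces:

```python
import numpy as np

def sam3_features(images, text):
    """Stub for the frozen SAM3 encoder: text-conditioned semantic features."""
    n, h, w, _ = images.shape
    return np.zeros((n, h // 16, w // 16, 256))   # assumed patch grid + feature dim

def da3_geometry(images):
    """Stub for frozen DA3-NESTED: per-view depth, intrinsics, extrinsics."""
    n, h, w, _ = images.shape
    depth = np.ones((n, h, w))                    # metric depth maps
    K = np.tile(np.eye(3), (n, 1, 1))             # intrinsics
    poses = np.tile(np.eye(4), (n, 1, 1))         # camera-to-world extrinsics
    return depth, K, poses

def gasa_decoder(feats, depth, K, poses):
    """Stub for the trained GASA decoder: per-view masks + one 3D centroid."""
    masks = np.zeros(depth.shape, dtype=bool)
    centroid = np.zeros(3)
    return masks, centroid

def triangulang(images, text):
    feats = sam3_features(images, text)           # semantics (frozen)
    depth, K, poses = da3_geometry(images)        # geometry (frozen)
    return gasa_decoder(feats, depth, K, poses)   # fusion (trained, 13.7M params)
```

Only the decoder stage carries trainable weights; the two frozen backbones supply semantics and geometry independently.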
Standard cross-attention matches features by semantic similarity alone, producing false correspondences between visually similar but spatially distant regions (e.g. two identical mugs). GASA introduces a geometric veto: each token carries a 3D position from depth unprojection, and a learned distance kernel penalizes attention between tokens that are far apart in metric space. If two features look similar but are meters apart, the geometric bias suppresses that match.
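The mechanism can be sketched in a few lines of NumPy. The paper learns the distance kernel; here a fixed quadratic bias with hand-set `alpha`/`beta` stands in for it, so treat the kernel form and parameters as illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gasa_attention(q, k, v, pos_q, pos_k, alpha=1.0, beta=0.0):
    """Cross-view attention with a geometric veto (illustrative kernel).

    q:          (Nq, d) query token features
    k, v:       (Nk, d) key/value token features from another view
    pos_q/pos_k: (Nq, 3) / (Nk, 3) metric 3D positions from depth unprojection
    """
    d = q.shape[-1]
    sim = q @ k.T / np.sqrt(d)                                    # semantic logits
    dist = np.linalg.norm(pos_q[:, None] - pos_k[None], axis=-1)  # pairwise metric distance
    bias = -alpha * dist**2 + beta                                # far apart -> large negative bias
    attn = softmax(sim + bias, axis=-1)                           # geometry gates semantics
    return attn @ v, attn
```

With two semantically identical keys, the additive bias drives almost all attention mass onto the one that is geometrically consistent with the query.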
We also replace standard 2D positional encodings with world-space positional encoding: each pixel is unprojected to 3D using DA3 depth and camera parameters, then encoded with sinusoidal functions. The same physical point gets the same embedding regardless of viewpoint.
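A minimal sketch of the two steps, pinhole unprojection followed by sinusoidal encoding. The frequency schedule and embedding size are assumptions for illustration, not the paper's exact choices:

```python
import numpy as np

def unproject(u, v, depth, K, cam_to_world):
    """Lift pixel (u, v) with metric depth (z-depth) to a 3D world point."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # normalized camera ray, z = 1
    cam_pt = ray * depth                             # point in the camera frame
    return cam_to_world[:3, :3] @ cam_pt + cam_to_world[:3, 3]

def world_pe(xyz, num_freqs=4):
    """Sinusoidal encoding of a 3D world position (illustrative frequencies)."""
    freqs = 2.0 ** np.arange(num_freqs)              # assumed geometric schedule
    args = xyz[:, None] * freqs[None]                # (3, num_freqs)
    return np.concatenate([np.sin(args), np.cos(args)], axis=-1).ravel()
```

Because the encoding is a function of the world coordinate alone, two cameras observing the same physical point produce identical embeddings, which is exactly the invariance a 2D positional encoding cannot offer.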
Spatial qualifiers like "nearest chair" or "mug left of the keyboard" are handled through direct geometric computation on depth-derived 3D positions, no LLM required. The pipeline parses spatial keywords, generates mask candidates, computes 3D centroids via depth unprojection, and selects the candidate satisfying the constraint. This runs in ~60ms, compared to 1-10+ seconds for LLM-based spatial reasoning.
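The selection step reduces to arithmetic on centroids, as this sketch shows. The function name is hypothetical, only the nearest/farthest qualifiers are handled, and the real parser covers more relations (e.g. "left of"):

```python
import numpy as np

def resolve_spatial_query(query, candidates, camera_pos=np.zeros(3)):
    """Pick the mask candidate satisfying a spatial qualifier.

    candidates: list of (name, centroid) pairs, centroids computed by
                unprojecting each candidate mask's pixels with predicted depth.
    """
    if "nearest" in query:
        return min(candidates, key=lambda c: np.linalg.norm(c[1] - camera_pos))
    if "farthest" in query:
        return max(candidates, key=lambda c: np.linalg.norm(c[1] - camera_pos))
    return candidates[0]   # no qualifier: fall back to the top-scoring candidate
```

No language model is in the loop, which is why the whole resolution step stays in the tens of milliseconds.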
On ScanNet++, a single text query achieves 62.4% mIoU, beating MV-SAM's 51.0% with 12 click prompts (+11.4 points). Cross-dataset transfer reveals the biggest gap: TrianguLang trained on ScanNet++ scores 75.7% mIoU on uCO3D, more than doubling MV-SAM's 32.2%. On LERF-OVS, TrianguLang closely matches LangSplat-V2 (58.1% vs. 59.9% mIoU) while running roughly four orders of magnitude faster (~58ms vs. 10-45 minutes of per-scene optimization).
| Model | Setting | Train → Eval | mIoU ↑ | mAcc ↑ |
|---|---|---|---|---|
| MV-SAM | In-domain | ScanNet++ → ScanNet++ | 0.510 | 0.694 |
| MV-SAM | In-domain | uCO3D → uCO3D | 0.910 | 0.965 |
| MV-SAM | Cross-domain | uCO3D → ScanNet++ | 0.194 | 0.251 |
| MV-SAM | Cross-domain | ScanNet++ → uCO3D | 0.322 | 0.517 |
| MV-SAM | Large-scale | SA-1B → ScanNet++ | 0.489 | 0.635 |
| MV-SAM | Large-scale | SA-1B → uCO3D | 0.877 | 0.950 |
| TrianguLang (Ours) | In-domain | ScanNet++ → ScanNet++ | 0.624 | 0.774 |
| TrianguLang (Ours) | In-domain | uCO3D → uCO3D | 0.946 | 0.983 |
| TrianguLang (Ours) | Cross-domain | uCO3D → ScanNet++ | 0.279 | 0.685 |
| TrianguLang (Ours) | Cross-domain | ScanNet++ → uCO3D | 0.757 | 0.796 |
| Method | mIoU ↑ | Loc. Acc. ↑ | Per-scene Opt. | Time |
|---|---|---|---|---|
| LERF | 37.4 | 73.6 | Yes | ~45 min |
| LangSplat | 51.4 | 84.3 | Yes | ~10 min |
| LangSplat-V2 | 59.9 | 84.1 | Yes | ~10 min |
| TrianguLang (zero-shot) | 58.1 | 83.5 | No | ~58ms |
Each component matters. Removing the GASA kernel drops mIoU by 5.3 points; removing world-space PE drops it by 5.4. Swapping the learned distance kernel for a fixed RBF costs 10.7 points. The model runs at ~57ms per frame (1008×1008, single A100), while per-scene methods require 30-60 minutes per new scene.
TrianguLang's predicted depth and segmentation masks can be fused into a TSDF volume to produce watertight mesh reconstructions of queried objects. Select an object by text, and explore the extracted 3D mesh below.
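The fusion step follows the standard TSDF integration recipe, restricted to the queried object's mask. A single-view update might look like the NumPy sketch below; the function name, voxel layout, and truncation value are illustrative assumptions rather than the project's actual implementation:

```python
import numpy as np

def tsdf_update(tsdf, weights, voxel_xyz, depth, mask, K, world_to_cam, trunc=0.05):
    """Integrate one view's masked depth into a TSDF volume (illustrative).

    tsdf, weights : (N,) running signed-distance values and integration weights
    voxel_xyz     : (N, 3) world-space voxel centers
    depth, mask   : (H, W) predicted metric depth and boolean query mask
    """
    # Project voxel centers into the view.
    cam = voxel_xyz @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    z = cam[:, 2]
    uv = cam @ K.T
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    H, W = depth.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    valid[valid] &= mask[v[valid], u[valid]]        # keep only the queried object
    # Truncated signed distance along the ray; drop voxels far behind the surface.
    sdf = depth[v[valid], u[valid]] - z[valid]
    keep = sdf > -trunc
    idx = np.where(valid)[0][keep]
    d = np.clip(sdf[keep] / trunc, -1.0, 1.0)
    # Running weighted average, as in classic TSDF fusion.
    tsdf[idx] = (tsdf[idx] * weights[idx] + d) / (weights[idx] + 1)
    weights[idx] += 1
    return tsdf, weights
```

Running this over all views and extracting the zero level set (e.g. with marching cubes) yields the watertight mesh of the selected object.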
```bibtex
@misc{grant2026triangulanggeometryawaresemanticconsensus,
  title={TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization},
  author={Bryce Grant and Aryeh Rothenberg and Atri Banerjee and Peng Wang},
  year={2026},
  eprint={2603.08096},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.08096},
}
```