TrianguLang

Geometry-Aware Semantic Consensus for Pose-Free 3D Localization

Bryce Grant1, Aryeh Rothenberg1, Atri Banerjee1, Peng Wang1
1Case Western Reserve University

Abstract

Localizing objects and parts from natural language in 3D space is essential for robotics, AR, and embodied AI, yet existing methods face a trade-off between the accuracy and geometric consistency of per-scene optimization and the efficiency of feed-forward inference. We present TrianguLang, a feed-forward framework for 3D localization that requires no camera calibration at inference. Unlike prior methods that treat views independently, we introduce Geometry-Aware Semantic Attention (GASA), which utilizes predicted geometry to gate cross-view feature correspondence, suppressing semantically plausible but geometrically inconsistent matches without requiring ground-truth poses. Validated on five benchmarks including ScanNet++ and uCO3D, TrianguLang achieves state-of-the-art feed-forward text-guided segmentation and localization, reducing user effort from O(N) clicks to a single text query. The model processes each frame at 1008×1008 resolution in ~57ms (~18 FPS) without optimization, enabling practical deployment for interactive robotics and AR applications.

How It Works

TrianguLang takes uncalibrated RGB images and a single text query (e.g. "the red mug") and produces per-view segmentation masks plus a 3D centroid in world coordinates. No camera poses, no SLAM, no per-scene optimization.

The pipeline composes two frozen foundation models with a lightweight trained decoder: SAM3 (frozen, 848M) extracts text-conditioned semantic features, DA3-NESTED (frozen, 1.4B) jointly estimates metric depth, intrinsics, and extrinsics from the input images, and the GASA decoder (trained, 13.7M) fuses them with geometry-aware cross-view attention. Only 0.54% of the total parameters are trained.
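The frozen/trained split above can be sketched as a small parameter budget; the `Component` class and field names are illustrative stand-ins, not the released API.

```python
from dataclasses import dataclass

# Hypothetical sketch of the three-component pipeline and which parts train.
@dataclass
class Component:
    name: str
    params_m: float  # parameter count in millions
    trainable: bool

PIPELINE = [
    Component("SAM3", 848.0, False),         # text-conditioned semantic features
    Component("DA3-NESTED", 1400.0, False),  # metric depth, intrinsics, extrinsics
    Component("GASA decoder", 13.7, True),   # geometry-aware cross-view fusion
]

trainable_m = sum(c.params_m for c in PIPELINE if c.trainable)
total_m = sum(c.params_m for c in PIPELINE)
print(f"trainable: {trainable_m}M of ~{total_m:.0f}M listed parameters")
```

Only the small decoder receives gradients; the two foundation models stay frozen.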

Overview of the TrianguLang architecture

Standard cross-attention matches features by semantic similarity alone, producing false correspondences between visually similar but spatially distant regions (e.g. two identical mugs). GASA introduces a geometric veto: each token carries a 3D position from depth unprojection, and a learned distance kernel penalizes attention between tokens that are far apart in metric space. If two features look similar but are meters apart, the geometric bias suppresses that match.
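A minimal sketch of this geometric veto, assuming the learned distance kernel is replaced by a fixed linear penalty `alpha * dist` for illustration (the paper's kernel is learned):

```python
import numpy as np

# Geometry-gated attention sketch: semantic similarity minus a metric
# distance penalty, then softmax. alpha is a fixed illustrative scale.
def gasa_attention(q, k, v, pos_q, pos_k, alpha=2.0):
    """q: (Nq,d), k: (Nk,d), v: (Nk,dv); pos_q/pos_k: metric 3D positions."""
    sim = q @ k.T / np.sqrt(q.shape[-1])                   # semantic logits
    dist = np.linalg.norm(pos_q[:, None] - pos_k[None], axis=-1)
    logits = sim - alpha * dist                            # geometric veto
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v, w

# Two keys with identical features ("two identical mugs"), one 0.1 m away
# and one 3 m away: attention collapses onto the geometrically consistent one.
q = np.ones((1, 4)); k = np.ones((2, 4)); v = np.eye(2)
_, w = gasa_attention(q, k, v, pos_q=np.zeros((1, 3)),
                      pos_k=np.array([[0.1, 0.0, 0.0], [3.0, 0.0, 0.0]]))
```

With identical semantic logits, the distance term alone decides the match, which is exactly the failure mode standard cross-attention cannot resolve.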

We also replace standard 2D positional encodings with world-space positional encoding: each pixel is unprojected to 3D using DA3 depth and camera parameters, then encoded with sinusoidal functions. The same physical point gets the same embedding regardless of viewpoint.
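The unproject-then-encode step can be sketched as follows; the pinhole parameters and frequency count are illustrative, and this sketch stays in the camera frame (the real model maps points into a shared world frame using DA3's estimated extrinsics, so the same physical point matches across views).

```python
import math

def unproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point via a pinhole model."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def world_pe(point, num_freqs=4):
    """Sinusoidal encoding of a 3D point; depends only on the point itself."""
    enc = []
    for c in point:
        for i in range(num_freqs):
            f = 2.0 ** i
            enc.extend([math.sin(f * c), math.cos(f * c)])
    return enc

# A pixel at the principal point unprojects straight along the optical axis.
p = unproject(640, 360, 2.0, fx=600, fy=600, cx=640, cy=360)
```

Because the encoding is a function of the 3D point only, any pixel in any view that lands on the same point receives the same embedding.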

Overview of the GASA decoder

Spatial qualifiers like "nearest chair" or "mug left of the keyboard" are handled through direct geometric computation on depth-derived 3D positions; no LLM is required. The pipeline parses spatial keywords, generates mask candidates, computes 3D centroids via depth unprojection, and selects the candidate that satisfies the constraint. This runs in ~60ms, compared to 1-10+ seconds for LLM-based spatial reasoning.
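A toy version of this LLM-free resolution step: pick among mask candidates by direct computation on their 3D centroids. The keyword handling and the camera-frame convention (x right, y down, z forward, camera at the origin) are assumptions for illustration, not the paper's parser.

```python
import math

def resolve_spatial(query, candidates, anchor=None):
    """candidates: {name: (x, y, z) centroid}; anchor: reference centroid
    for relational qualifiers like "left of". Illustrative keyword set."""
    if "nearest" in query:
        # Smallest Euclidean distance from the camera at the origin.
        return min(candidates, key=lambda n: math.dist((0, 0, 0), candidates[n]))
    if "left of" in query and anchor is not None:
        # Candidates with smaller x than the anchor, preferring the closest.
        left = {n: c for n, c in candidates.items() if c[0] < anchor[0]}
        return min(left, key=lambda n: anchor[0] - left[n][0]) if left else None
    return next(iter(candidates))

chairs = {"chair_a": (0.5, 0.0, 1.2), "chair_b": (-0.3, 0.0, 3.5)}
print(resolve_spatial("nearest chair", chairs))  # chair_a
```

The whole decision is a handful of arithmetic comparisons over centroids already produced by depth unprojection, which is why it runs in milliseconds rather than the seconds an LLM round-trip costs.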

Spatial language demo (static/images/spatial_demo.png)

Results

On ScanNet++, a single text query achieves 62.4% mIoU, beating the 51.0% that MV-SAM obtains with 12 click prompts (+11.4 points). Cross-dataset transfer reveals the biggest gap: TrianguLang trained on ScanNet++ scores 75.7% mIoU on uCO3D, more than doubling MV-SAM's 32.2%. On LERF-OVS, TrianguLang closely matches LangSplat-V2 (58.1% vs. 59.9% mIoU) while running three orders of magnitude faster (~58ms vs. 10-45 minutes of per-scene optimization).

Model               Setting       Train → Eval            mIoU ↑   mAcc ↑
MV-SAM              In-domain     ScanNet++ → ScanNet++   0.510    0.694
MV-SAM              In-domain     uCO3D → uCO3D           0.910    0.965
MV-SAM              Cross-domain  uCO3D → ScanNet++       0.194    0.251
MV-SAM              Cross-domain  ScanNet++ → uCO3D       0.322    0.517
MV-SAM              Large-scale   SA-1B → ScanNet++       0.489    0.635
MV-SAM              Large-scale   SA-1B → uCO3D           0.877    0.950
TrianguLang (Ours)  In-domain     ScanNet++ → ScanNet++   0.624    0.774
TrianguLang (Ours)  In-domain     uCO3D → uCO3D           0.946    0.983
TrianguLang (Ours)  Cross-domain  uCO3D → ScanNet++       0.279    0.685
TrianguLang (Ours)  Cross-domain  ScanNet++ → uCO3D       0.757    0.796
Method                   mIoU   Loc. Acc.   Per-scene Opt.   Time
LERF                     37.4   73.6        Yes              ~45 min
LangSplat                51.4   84.3        Yes              ~10 min
LangSplat-V2             59.9   84.1        Yes              ~10 min
TrianguLang (zero-shot)  58.1   83.5        No               ~58ms

Each component matters. Removing the GASA kernel drops mIoU by 5.3 points; removing world-space PE drops it by 5.4. Swapping the learned distance kernel for a fixed RBF costs 10.7 points. The model runs at ~57ms per frame (1008×1008, single A100), while per-scene methods require 30-60 minutes per new scene.

Segmentation results on ScanNet++
Qualitative comparison on LERF-OVS
Segmentation results on uCO3D
Segmentation results on SPIn-NeRF
NVOS results
NVOS spatial reasoning results

Interactive Demo

Interactive 3D viewer coming soon.
Type a query to highlight objects in the scene.

3D Reconstruction

TrianguLang's predicted depth and segmentation masks can be fused into a TSDF volume to produce watertight mesh reconstructions of queried objects. Select an object by text, and explore the extracted 3D mesh below.
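A minimal projective TSDF update for one depth frame illustrates the fusion step; real pipelines integrate many masked views into a voxel volume (e.g. with a library such as Open3D), and the array layout here is a simplification.

```python
import numpy as np

# Single-frame TSDF update: signed distance to the observed surface,
# truncated, then averaged into the running volume with per-voxel weights.
def tsdf_integrate(tsdf, weight, voxel_z, surface_depth, trunc=0.05):
    """voxel_z: depth of each voxel along its camera ray; surface_depth:
    observed depth at the voxel's projection. Arrays share one shape."""
    sdf = surface_depth - voxel_z        # + in front of the surface, - behind
    valid = sdf > -trunc                 # skip voxels far behind the surface
    d = np.clip(sdf / trunc, -1.0, 1.0)  # truncate to [-1, 1]
    new_w = weight + valid
    fused = np.where(valid, (tsdf * weight + d) / np.maximum(new_w, 1), tsdf)
    return fused, new_w

# Three voxels along a ray hitting a surface at 2.00 m: free space, the
# zero crossing, and a voxel behind the truncation band.
z = np.array([1.90, 2.00, 2.10])
tsdf, w = tsdf_integrate(np.zeros(3), np.zeros(3), z, np.full(3, 2.0))
```

The mesh is then extracted from the zero level set of the fused volume (e.g. via marching cubes), restricted to voxels covered by the text-selected masks.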

TSDF mesh viewer coming soon.
Rotate, zoom, and inspect reconstructed object meshes.

Citation

@misc{grant2026triangulanggeometryawaresemanticconsensus,
      title={TrianguLang: Geometry-Aware Semantic Consensus for Pose-Free 3D Localization},
      author={Bryce Grant and Aryeh Rothenberg and Atri Banerjee and Peng Wang},
      year={2026},
      eprint={2603.08096},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.08096},
}