One Scene, Two Depths: Probing Geometric Ambiguity in Monocular Foundation Models
Abstract
A faithful 3D world representation should account for layered geometry, where a single camera ray may contain multiple visible and geometrically valid surfaces. Monocular depth estimation, however, reduces this structure to one scalar depth per pixel. Transparent scenes make this ambiguity measurable: the same ray can pass through foreground glass and observe the background, turning the supervised target into a convention of annotation, data, and training rather than a scene-intrinsic truth. A learned predictor exposes this convention as its depth-layer preference. We introduce MultiDepth-3k (MD-3k), a sparse two-layer ordinal benchmark for measuring depth-layer preference and multi-layer spatial relationship accuracy (ML-SRA). On MD-3k, leading depth foundation models exhibit diverse layer preferences under standard RGB input, showing that the same layered geometry can be resolved differently across models. We further find that Laplacian Visual Prompting (LVP), a training-free spectral input transformation, can substantially change the reported layer for certain frozen models. The strongest RGB/LVP pair, DAv2-L, reaches 75.5% ML-SRA. These results suggest that depth foundation models may express complementary geometric hypotheses that standard RGB inference leaves unexpressed. We invite the community to rethink depth supervision and evaluation through an ambiguity-aware lens, where multiple valid 3D interpretations are treated as geometric structure to be measured, preserved, and expressed.
Community
One Scene, Two Depths studies a simple but overlooked question in monocular depth foundation models: under layered visibility, when one visual ray contains multiple visible and geometrically valid depths, which depth does the model choose?
Our key view is that single-depth prediction under ambiguity exposes a model’s depth-layer preference, rather than an unbiased scene-intrinsic truth. The label itself can become a convention shaped by sensors, annotation, datasets, training mixtures, and evaluation metrics.
We introduce MultiDepth-3k (MD-3k), a real-world transparent-scene benchmark with sparse two-layer ordinal annotations, to measure whether a model reports the transparent foreground or the visible background. We further propose Laplacian Visual Prompting (LVP), a training-free spectral input transformation that queries the same frozen model differently.
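To make the idea of querying a frozen model with a transformed input concrete, here is a minimal sketch of a Laplacian-style input transformation. The page does not specify the exact operator used by LVP, so the 4-neighbour discrete Laplacian, the edge padding, and the function name `laplacian_prompt` are all assumptions made for illustration only.

```python
import numpy as np

def laplacian_prompt(rgb):
    """Apply a 4-neighbour discrete Laplacian to each channel of an HxWx3 image.

    Assumption: this kernel stands in for the paper's LVP transform, whose
    exact definition is not given on this page.
    """
    img = rgb.astype(np.float32)
    # Replicate border pixels so the output keeps the input's spatial size.
    p = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Sum of up, down, left, right neighbours minus four times the centre.
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * img

# Hypothetical usage with a frozen depth model `model` (not defined here):
#   depth_rgb = model(rgb)
#   depth_lvp = model(laplacian_prompt(rgb))
```

Because the transformation acts only on the input, the model weights stay untouched, which is what makes the approach training-free.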
A key finding is that some frozen single-output depth models can express complementary depth hypotheses under RGB and LVP inputs. On MD-3k, the strongest RGB/LVP pair reaches 75.5% ML-SRA, above the strict 56.4% ceiling attainable when a single depth hypothesis is duplicated for both layers. It also reaches 52.2% on Reverse cases, where by construction one depth map cannot satisfy both valid layer relations.
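The following toy sketch illustrates why a duplicated single hypothesis cannot succeed on Reverse cases. The scoring rule is an assumption, not the paper's definition of ML-SRA: each sample has one annotated point pair with an ordinal relation per layer, and a sample counts as correct only if the first prediction matches the layer-1 relation and the second matches the layer-2 relation. All field names and depth values are hypothetical.

```python
def relation(depth_a, depth_b):
    """Ordinal relation of a point pair: +1 if a is farther than b, else -1."""
    return 1 if depth_a > depth_b else -1

def ml_sra(samples):
    """Fraction of samples where both layer relations are matched.

    Assumed rule: pred_1 is scored against rel_1, pred_2 against rel_2.
    """
    hits = 0
    for s in samples:
        ok1 = relation(*s["pred_1"]) == s["rel_1"]
        ok2 = relation(*s["pred_2"]) == s["rel_2"]
        hits += int(ok1 and ok2)
    return hits / len(samples)

# A Reverse case: the two layers order the same point pair oppositely.
duplicated = [{"rel_1": 1, "rel_2": -1, "pred_1": (2.0, 1.0), "pred_2": (2.0, 1.0)}]
complementary = [{"rel_1": 1, "rel_2": -1, "pred_1": (2.0, 1.0), "pred_2": (1.0, 3.0)}]

print(ml_sra(duplicated), ml_sra(complementary))
```

Under this assumed rule, reusing one depth map for both layers always fails a Reverse sample, while two differing predictions (for example, from RGB and LVP inputs) can satisfy it.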
The broader implication is that single-depth prediction may be an incomplete interface for learned 3D world models: standard RGB inference may reveal only one preferred slice of richer multi-layer geometric knowledge.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs (2026)
- αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion (2026)
- World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible (2026)
- DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images (2026)
- GeoEdit: Geometry-Aware Object Editing via Dual-Branch Denoising (2026)
- PRISM: Feed-Forward Single-Image 3D Reconstruction via Geometric Warp-Residual Modeling (2026)
- Unified Panoramic Geometry Estimation via Multi-View Foundation Models (2026)
Get this paper in your agent:
hf papers read 2606.29600