Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

This is the model checkpoint for GeoVR, a paradigm that restructures an MLLM’s intrinsic representations with geometric awareness, using purely 2D videos, for spatial intelligence.

Sample Usage

To use this model, first clone the official repository: the sample imports its custom modeling files (models.qwen3vl_geo and utils.utils).

import torch
from utils.utils import *
from transformers import AutoProcessor
from models.qwen3vl_geo import Qwen3VLForConditionalGeneration

device = 'cuda:0'
model_id = "WHB139426/GeoVR-VGGT-Qwen3-VL-2B"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    geometry_encoder_path=None,
    metric_model_path=None,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    add_camera=False,
    add_scale=False,
    add_depth=False,
    distill_geometry_feature=False,
)
model.load_geometric_weights(model_id)
model.to(device)

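num_frames = 32
processor = AutoProcessor.from_pretrained(model_id)
# Both size bounds scale with the number of sampled frames:
# 384*32*32 per frame for "longest_edge" and 4*32*32 per frame for "shortest_edge".
processor.video_processor.size = {"longest_edge": 384*num_frames*32*32, "shortest_edge": 4*num_frames*32*32}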

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": './assets/scene0111_02.mp4',},
            # Adjacent string literals are concatenated; "\n" keeps the prompt's line breaks.
            {"type": "text", "text": (
                "Measuring distance from the nearest points, select the closest object (trash bin, door, table, refrigerator) to the tv. If multiple exist, use the nearest instance.\n"
                "Options:\n"
                "A. trash bin\n"
                "B. door\n"
                "C. table\n"
                "D. refrigerator\n"
                "Answer with the option's letter from the given choices directly."
            )},
        ],
    }
]

generation_kwargs = {
    'do_sample': True,
    'top_p': 0.8,
    'top_k': 20,
    'temperature': 0.7,
    'repetition_penalty': 1.0,
    'max_new_tokens': 32*1024,
}

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=num_frames,
    fps=None,
    enable_thinking=False,
).to(model.device)

# torch.autocast replaces the deprecated torch.cuda.amp.autocast context manager.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        generated_ids = model.generate(**inputs, **generation_kwargs)
        output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].strip()
print(output_text)
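
The sample question follows a fixed multiple-choice layout: the question, an "Options:" line, lettered options, and a closing instruction. To ask other questions in the same layout, a small helper along the following lines may be convenient. This is a minimal sketch; build_mcq_prompt is a hypothetical name introduced here for illustration and is not part of the official repository.

```python
from string import ascii_uppercase


def build_mcq_prompt(question: str, options: list[str]) -> str:
    """Format a question and its options in the layout used by the sample prompt.

    Hypothetical helper for illustration; not part of the GeoVR repository.
    """
    if not 1 <= len(options) <= len(ascii_uppercase):
        raise ValueError("expected between 1 and 26 options")
    lines = [question, "Options:"]
    # Label the options A, B, C, ... in the order given.
    lines += [f"{letter}. {option}" for letter, option in zip(ascii_uppercase, options)]
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)


prompt = build_mcq_prompt(
    "Measuring distance from the nearest points, select the closest object "
    "(trash bin, door, table, refrigerator) to the tv. "
    "If multiple exist, use the nearest instance.",
    ["trash bin", "door", "table", "refrigerator"],
)
print(prompt)
```

The returned string can be used as the "text" entry of the user message in place of the hand-written prompt.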

Citation

If you find this work useful, please consider citing:

@article{wang2026learning,
  title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
  author={Wang, Haibo and Huang, Lifu},
  journal={arXiv preprint arXiv:2606.05833},
  year={2026}
}