Paper: Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models (arXiv:2606.05833)
This is the model checkpoint for GeoVR, a paradigm that restructures an MLLM's intrinsic representations with geometric awareness, using purely 2D videos, for spatial intelligence.
To use this model, you need to clone the official repository to access the custom modeling files.
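The usage script imports `utils.utils` and `models.qwen3vl_geo`, which come from the cloned repository rather than from an installed package, so Python has to be able to find the repository root. A minimal sketch of one way to arrange that when the script is not run from inside the clone is shown here. The helper name `add_repo_to_path` and the clone location `./GeoVR` are hypothetical, and the check for `models` and `utils` folders is an assumption inferred from the import statements.

```python
import sys
from pathlib import Path


def add_repo_to_path(repo_dir, required=("models", "utils")):
    """Put a cloned repository root on sys.path so its packages can be imported.

    `required` lists the folders the usage script imports from; this layout is
    an assumption based on the import statements, not a documented guarantee.
    """
    repo = Path(repo_dir).expanduser().resolve()
    missing = [name for name in required if not (repo / name).is_dir()]
    if missing:
        raise FileNotFoundError(f"{repo} is missing expected folders: {missing}")
    # Insert at the front so the repository's packages win over same-named ones.
    if str(repo) not in sys.path:
        sys.path.insert(0, str(repo))
    return repo


# Hypothetical usage, assuming the repository was cloned to ./GeoVR:
# add_repo_to_path("./GeoVR")
```

Running the script from the repository root has the same effect without any path changes; note that the video path `./assets/scene0111_02.mp4` is relative to the working directory either way.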
import torch
from utils.utils import *
from transformers import AutoProcessor
from models.qwen3vl_geo import Qwen3VLForConditionalGeneration
device = 'cuda:0'
model_id = "WHB139426/GeoVR-VGGT-Qwen3-VL-2B"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    geometry_encoder_path=None,
    metric_model_path=None,
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    add_camera=False,
    add_scale=False,
    add_depth=False,
    distill_geometry_feature=False,
)
model.load_geometric_weights(model_id)
model.to(device)
num_frames = 32
processor = AutoProcessor.from_pretrained(model_id)
processor.video_processor.size = {"longest_edge": 384*num_frames*32*32, "shortest_edge": 4*num_frames*32*32}
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "./assets/scene0111_02.mp4"},
            {
                "type": "text",
                # Adjacent string literals are concatenated; "\n" keeps the prompt's line breaks.
                "text": (
                    "Measuring distance from the nearest points, select the closest object (trash bin, door, table, refrigerator) to the tv. If multiple exist, use the nearest instance.\n"
                    "Options:\n"
                    "A. trash bin\n"
                    "B. door\n"
                    "C. table\n"
                    "D. refrigerator\n"
                    "Answer with the option's letter from the given choices directly."
                ),
            },
        ],
    }
]
generation_kwargs = {
    'do_sample': True,
    'top_p': 0.8,
    'top_k': 20,
    'temperature': 0.7,
    'repetition_penalty': 1.0,
    'max_new_tokens': 32*1024,
}
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    num_frames=num_frames,
    fps=None,
    enable_thinking=False,
).to(model.device)
# Generate under bfloat16 autocast with gradient tracking disabled.
# torch.autocast replaces the deprecated torch.cuda.amp.autocast.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        generated_ids = model.generate(**inputs, **generation_kwargs)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0].strip()
print(output_text)
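With the stock Transformers `generate`, a decoder-only model returns the prompt tokens followed by the new tokens, so the decoded string can begin with the question text. Whether the custom `Qwen3VLForConditionalGeneration` class in the repository behaves this way is not stated here, so the following is only a sketch under that assumption. The helper name `strip_prompt_tokens` is hypothetical, and it leaves a sequence untouched when it does not start with its prompt.

```python
def strip_prompt_tokens(prompt_ids, output_ids):
    """Return only the newly generated token ids for each sequence in a batch.

    Both arguments are lists of lists of ints (for tensors, call .tolist() first).
    A sequence is trimmed only if it really begins with its prompt, so outputs
    that already contain just the new tokens pass through unchanged.
    """
    trimmed = []
    for prompt, output in zip(prompt_ids, output_ids):
        prompt, output = list(prompt), list(output)
        if output[:len(prompt)] == prompt:
            output = output[len(prompt):]
        trimmed.append(output)
    return trimmed


# Hypothetical usage with the variables from the script above:
# new_ids = strip_prompt_tokens(inputs["input_ids"].tolist(), generated_ids.tolist())
# answer = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
```

The prefix check makes the helper safe to apply in either case, which matters because the prompt asks for a single option letter and a stray echo of the question would complicate parsing the answer.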
If you find this work useful, please consider citing:
@article{wang2026learning,
  title={Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models},
  author={Wang, Haibo and Huang, Lifu},
  journal={arXiv preprint arXiv:2606.05833},
  year={2026}
}