---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# Visual Spatial Tuning: VST-7B-SFT
This model is described in the paper [Visual Spatial Tuning](https://huggingface.co/papers/2511.05491).

TL;DR: VST is a comprehensive framework designed to cultivate Vision-Language Models (VLMs) with human-like visuospatial abilities, from spatial perception to advanced spatial reasoning.

## 💡 Key Highlights

✨ **VST-P**: 4.1M samples across 19 skills, spanning single images, multi-image scenarios, and videos, boosting spatial perception in VLMs.

✨ **VST-R**: 135K curated samples that teach models to reason in space, including step-by-step reasoning traces and rule-based data for reinforcement learning.

✨ **Progressive Training Pipeline**: Start with supervised fine-tuning to build foundational spatial knowledge, then reinforce spatial reasoning abilities via RL. VST achieves state-of-the-art results on spatial benchmarks (34.8% on MMSI-Bench, 61.2% on VSI-Bench) without compromising general capabilities.

✨ **Vision-Language-Action Models Enhanced**: The VST paradigm also strengthens Vision-Language-Action models, paving the way for more physically grounded AI.

### 📈 Spatial & General Benchmarks

| Models     | CV   | 3DSR | MMSI | BLINK | VSI  | MMStar | MMB  | RealWorldQA | MMMU | OCRB | AI2D |
|------------|------|------|------|-------|------|--------|------|-------------|------|------|------|
| VST-3B-SFT | 84.4 | 54.1 | 30.2 | 59.1  | 57.9 | 58.0   | 80.9 | 68.4        | 45.2 | 83.7 | 82.5 |
| VST-3B-RL  | 84.2 | 56.5 | 31.3 | 57.2  | 57.7 | 58.9   | 80.5 | 68.5        | 49.8 | 80.9 | 82.4 |
| VST-7B-SFT | 85.5 | 54.6 | 32.0 | 62.1  | 60.6 | 63.1   | 83.3 | 72.2        | 50.6 | 85.5 | 84.9 |
| VST-7B-RL  | 86.5 | 60.1 | 34.8 | 62.6  | 61.2 | 63.5   | 83.0 | 68.5        | 49.4 | 86.1 | 83.5 |

### 📈 VSI-Bench

| Methods    | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
|------------|------|------------|------------|-----------|-----------|------------|-----------|------------|-------------|
| VST-3B-SFT | 57.9 | 69.3       | 45.4       | 71.8      | 62.4      | 59.0       | 46.0      | 38.7       | 70.2        |
| VST-3B-RL  | 57.7 | 66.6       | 45.0       | 72.8      | 60.9      | 59.9       | 47.6      | 40.7       | 68.3        |
| VST-7B-SFT | 60.6 | 72.0       | 44.4       | 74.3      | 68.3      | 59.7       | 55.8      | 44.9       | 65.2        |
| VST-7B-RL  | 61.2 | 71.6       | 43.8       | 75.5      | 69.2      | 60.0       | 55.6      | 44.3       | 69.2        |

## Sample Usage

To get started, install the necessary libraries:

```bash
pip install transformers
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils
```

### Using 🤗 Transformers to Chat

Here is a code snippet showing how to chat with the model using `transformers` and `qwen_vl_utils`:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "rayruiyang/VST-7B-SFT"

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained(model_path, min_pixels=256*28*28, max_pixels=1280*28*28)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/train2017/000000039685.jpg",
            },
            {"type": "text", "text": "Consider the real-world 3D locations of the objects. Is the flag directly underneath the airplane?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
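### Video Inputs

VSI-Bench and several of the spatial skills involve video and multi-image inputs. Below is a minimal sketch of running the same chat pipeline on a video, reusing the `model` and `processor` objects created above. The local path `file:///path/to/video.mp4`, the question text, and the sampling settings (`max_pixels`, `fps`) are placeholders to adapt to your own data.

```python
# Minimal video-input sketch (assumes `model`, `processor`, and `process_vision_info`
# from the snippet above). The video path and sampling settings are placeholders.
video_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",  # placeholder local path
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "How many chairs are in the room shown in this video?"},
        ],
    }
]

text = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0])
```

For long videos, `fps` and `max_pixels` control how many frames are sampled and at what resolution, which is the main memory knob on a single GPU.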
## Citation

If you find our work helpful, feel free to cite it:

```bibtex
@article{vst,
  title={Visual Spatial Tuning},
  author={Rui Yang and Ziyu Zhu and Yanwei Li and Jingjia Huang and Shen Yan and Siyuan Zhou and Zhe Liu and Xiangtai Li and Shuangye Li and Wenqian Wang and Yi Lin and Hengshuang Zhao},
  journal={arXiv preprint arXiv:2511.05491},
  year={2025}
}
```