Model Card
Salamandra-VL-7B-2512 is the latest version of the Salamandra vision model family. This version brings significant improvements in both architecture and training data.
Main Improvements
- Image Encoder: Upgraded to SigLIP 2 Giant (Patch 16).
- LLM: Implements the base version of Salamandra 7B, fine-tuned with the latest instruction data and a specialized focus on European languages using the Aya collection.
- Visual Instruction Tuning: Incorporated PixMo datasets to improve fine-grained visual grounding and counting capabilities.
To visit the model cards of other Salamandra versions, please refer to the Model Index.
This model has been optimized to engage in conversation but has NOT been aligned through RLHF to filter or avoid sensitive topics. As a result, it may generate harmful or inappropriate content. The team is actively working to enhance its performance through further instruction tuning and alignment with RL techniques.
Model Details
Architecture
The model architecture builds upon the Llava Onevision framework, integrating:
- Vision Tower: SigLIP 2 Giant (Patch 16).
- Language Model: A customized version of Salamandra 7B Base, tuned for better multilingual instruction following.
- Projector: 2-layer MLP.
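For illustration only, the sketch below shows how these components are typically wired in a LLaVA-OneVision-style stack: the vision tower produces patch features, the 2-layer MLP projector maps them into the LLM embedding space, and the projected tokens are concatenated with the text embeddings. The hidden sizes, module names, and patch count are assumptions for readability, not the model's actual configuration.

import torch
import torch.nn as nn

# Illustrative sketch only; hidden sizes are assumptions, not the released configuration.
VISION_HIDDEN = 1536   # assumed SigLIP 2 Giant feature width
LLM_HIDDEN = 4096      # assumed Salamandra 7B embedding width

class TwoLayerProjector(nn.Module):
    """2-layer MLP mapping vision-tower patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.fc2(self.act(self.fc1(patch_features)))

projector = TwoLayerProjector(VISION_HIDDEN, LLM_HIDDEN)
dummy_patches = torch.randn(1, 576, VISION_HIDDEN)  # arbitrary number of patch tokens
visual_tokens = projector(dummy_patches)            # (1, 576, LLM_HIDDEN)
# The projected visual tokens are concatenated with the text token embeddings
# and fed to the language model as a single sequence.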
Hyperparameters
The full list of hyperparameters can be found here.
Framework
We trained the vision model with the Llava Onevision framework.
How to use
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
import requests
import torch
from PIL import Image

path = "BSC-LT/Salamandra-VL-7B-2511"

# Load the processor (tokenizer + image preprocessing) and the model weights.
processor = AutoProcessor.from_pretrained(path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    path, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda")

# Download an example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a single-turn conversation with one image and one text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe la imagen con el mayor detalle posible."},
        ],
    },
]

# Render the chat template and prepare the model inputs.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

# Generate a response (do_sample=True is required for temperature to take effect).
output = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=1024)
print(processor.decode(output[0], skip_special_tokens=True))
With this template, each turn is preceded by the <|im_start|> delimiter and the role of the speaker
(user for content supplied by the user, assistant for model responses), and is finished with the <|im_end|> token.
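As an illustration, assuming the processor follows the ChatML-style convention described above, the rendered prompt for the example conversation would look roughly like this (the exact whitespace and the image placeholder token may differ from the template shipped with the processor):

<|im_start|>user
<image>
Describe la imagen con el mayor detalle posible.<|im_end|>
<|im_start|>assistant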
Data
The training data for Salamandra-VL-7B-2512 represents a major upgrade from v1 (Salamandra-7B-Instruct), with a specific focus on European languages and high-quality visual instructions.
1. Text-Only Instruction Tuning
To enhance the multilingual capabilities of the model, we incorporated a diverse set of high-quality instruction tuning datasets during the visual training stages. The key datasets included are:
- TowerBlocks v0.2
- Aya Dataset
- OpenAssistant 2 (OASST2)
- OpenMathInstruct-2
- OpenR1-Math
- No Robots
- OpenOrca
2. Visual Instruction Tuning
The visual instruction tuning process builds upon the standard LLaVA-NeXT data mixture (BLIP-558K, COCO-118K, MIMIC-IT, etc.) used in v1. In this version (v2), we have significantly enriched the training set with the PixMo dataset collection to improve fine-grained visual understanding, counting, and document processing.
Key New Visual Datasets:
- PixMo Cap & Cap-QA: For detailed captioning and question answering.
- PixMo General: Including `ask-model-anything` and `point-explanations` for better grounding and general visual reasoning.
- PixMo Docs: Specialized subsets for `charts`, `diagrams`, `tables`, and `other` document types.
To maintain robust multilingual performance without compromising visual quality—and without the need for massive multilingual multimodal datasets—we adopted the strategy proposed in Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization. By integrating the text-only multilingual data described above directly into the visual instruction tuning stages, we ensure the model retains its language fidelity across European languages while acquiring state-of-the-art visual capabilities.
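The sketch below illustrates the idea behind this textual-regularization strategy: text-only multilingual instructions are simply interleaved with the visual instruction stream at some ratio. The mixing ratio, sample format, and helper function are hypothetical placeholders, not the actual training pipeline (which is the Llava Onevision framework referenced above).

import random

# Hypothetical illustration of multilingual textual regularization: text-only
# multilingual instructions are mixed into the visual instruction-tuning stream.
TEXT_ONLY_RATIO = 0.2  # assumed fraction of text-only samples in the mixture

def build_mixed_stream(visual_samples, text_only_samples, text_ratio=TEXT_ONLY_RATIO):
    """Yield a stream where roughly `text_ratio` of the samples are text-only."""
    for visual in visual_samples:
        yield visual  # e.g. {"images": [...], "conversations": [...]}
        # Adding a text-only sample with probability r/(1-r) per visual sample
        # makes the expected text-only fraction of the stream equal to r.
        if random.random() < text_ratio / (1.0 - text_ratio):
            yield random.choice(text_only_samples)  # same format, no image attached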
Evaluation
Improvements from v1 to v2
We compared the new v2 model against the previous v1 version using a comprehensive set of benchmarks and a pairwise arena evaluation.
Benchmark Comparison
The table below shows how Salamandra-VL-7B-2512 (v2) performs relative to v1 across standard vision-language benchmarks.
| Benchmark | Salamandra-VL-7B-2506 (v1) | Salamandra-VL-7B-2512 (v2) | Difference |
|---|---|---|---|
| AI2D | 74.51 | 78.34 | +3.83 |
| ChartQA | 68.40 | 70.72 | +2.32 |
| CountBenchQA | 72.69 | 86.65 | +13.96 |
| DocVQA | 68.05 | 71.82 | +3.77 |
| HallusionBench | 21.54 | 17.80 | -3.74 |
| InfoVQA | 48.59 | 45.59 | -3.00 |
| MMBench | 71.13 | 75.16 | +4.03 |
| MMMU | 36.89 | 37.89 | +1.00 |
| MMStar | 48.65 | 50.67 | +2.02 |
| RealWorldQA | 56.99 | 63.40 | +6.41 |
| TextVQA | 68.81 | 71.73 | +2.92 |
Arena Evaluation (Pairwise)
We conducted a pairwise evaluation in which both models generated responses to the same set of prompts in multiple European languages. We used the Aya Vision m-wildvision dataset for the evaluation prompts and Qwen 2.5-VL 32B as a judge (LLM-as-a-Judge) to decide which model provided the better answer.
As shown in the plot, Salamandra-VL-7B-2512 (v2) consistently outperforms or ties with v1 across most languages, demonstrating improved multilingual capabilities and visual understanding.
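For reference, a minimal sketch of the pairwise LLM-as-a-Judge loop is shown below. The judge prompt wording, the callables generate_v1 / generate_v2 / judge, the prompt dictionary keys, and the verdict parsing are hypothetical placeholders, not the exact evaluation harness used.

# Illustrative sketch of the pairwise LLM-as-a-Judge protocol (prompt wording,
# verdict parsing, and callables are assumptions, not the exact harness used).
JUDGE_TEMPLATE = (
    "You are given an image-grounded question and two answers.\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
    "Reply with 'A', 'B', or 'Tie' depending on which answer is better."
)

def pairwise_arena(prompts, generate_v1, generate_v2, judge):
    """Count wins/ties when a judge model compares v1 and v2 answers to the same prompts."""
    scores = {"v1": 0, "v2": 0, "tie": 0}
    for prompt in prompts:
        answer_v1 = generate_v1(prompt)   # response from Salamandra-VL v1
        answer_v2 = generate_v2(prompt)   # response from Salamandra-VL v2
        # In practice, answer positions should also be swapped across runs
        # to control for the judge's position bias.
        verdict = judge(JUDGE_TEMPLATE.format(
            question=prompt["text"], answer_a=answer_v1, answer_b=answer_v2)).strip()
        if verdict.startswith("A"):
            scores["v1"] += 1
        elif verdict.startswith("B"):
            scores["v2"] += 1
        else:
            scores["tie"] += 1
    return scores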
Comparison with Open Models
We benchmarked Salamandra-VL-7B-2512 against other comparable open-source vision-language models of similar size.
| Benchmark | Llava-v1.6-Vicuna-7B | Llama-3-Llava-Next-8B | Idefics3-8B | Salamandra-VL-7B-2512 |
|---|---|---|---|---|
| AI2D | 66.74 | 73.19 | 76.39 | 78.34 |
| BLINK | 19.20 | 39.56 | 48.45 | 31.04 |
| ChartQA | 53.96 | 69.64 | 28.72 | 70.72 |
| CountBenchQA | 52.36 | 57.08 | 58.93 | 86.65 |
| DocVQA | 67.03 | 72.54 | 80.71 | 71.82 |
| HallusionBench | 8.57 | 13.19 | 32.53 | 17.80 |
| InfoVQA | 30.30 | 31.91 | 44.28 | 45.59 |
| MMBench | 65.56 | 72.45 | 74.61 | 75.16 |
| MMMU | 35.94 | 43.61 | 43.50 | 37.89 |
| MMStar | 37.73 | 43.33 | 54.47 | 50.67 |
| OCRBench | 50.60 | 54.70 | 55.30 | 63.70 |
| RealWorldQA | 58.43 | 58.17 | 62.48 | 63.40 |
| TextVQA | 63.72 | 65.57 | 58.62 | 71.73 |
| Average | 46.93 | 53.46 | 55.31 | 58.81 |
Intended Use
Out-of-scope Use
The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
Hardware and Software
Training Framework
The visual instruction-tuned versions were produced with Llava_Onevision.
Compute Infrastructure
All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.
The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x Nvidia Hopper GPUs with 64GB of HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz with 32 cores each (64 cores per node)
- 4x NDR200 interconnect (800Gb/s aggregate bandwidth per node)
- 512GB of main memory (DDR5)
- 460GB of NVMe storage
Ethical Considerations and Limitations
This model is an initial prototype, and we have not yet conducted a thorough evaluation of societal and cognitive biases. In future iterations, we plan to assess potential biases using established benchmarks, following methodologies similar to those applied in previous models.
We acknowledge that bias evaluation is a critical step in responsible model development. Given the ongoing nature of this work, we strongly encourage developers to conduct safety assessments and bias mitigation strategies tailored to their specific applications of the model. Future updates will include more comprehensive analyses as we continue improving this model.
Additional information
Author
The Language Technologies Lab from Barcelona Supercomputing Center.
Contact
For further information, please send an email to [email protected].
Copyright
Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
Funding
This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
Citation
@misc{gonzalezagirre2025salamandratechnicalreport,
title={Salamandra Technical Report},
author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
year={2025},
eprint={2502.08489},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.08489},
}