Model Card
Salamandra-VL-7B-2512 is the latest version of the Salamandra vision model family. This version brings significant improvements in both architecture and training data.
Main Improvements
- Image Encoder: Upgraded to SigLIP 2 Giant (Patch 16).
- LLM: Implements the base version of Salamandra 7B, fine-tuned with the latest instruction data and a specialized focus on European languages using the Aya collection.
- Visual Instruction Tuning: Incorporated PixMo datasets to improve fine-grained visual grounding and counting capabilities.
To visit the model cards of other Salamandra versions, please refer to the Model Index.
This model has been optimized to engage in conversation but has NOT been aligned through RLHF to filter or avoid sensitive topics. As a result, it may generate harmful or inappropriate content. The team is actively working to enhance its performance through further instruction tuning and alignment with RL techniques.
Model Details
Architecture
The model architecture builds upon the Llava Onevision framework, integrating:
- Vision Tower: SigLIP 2 Giant (Patch 16).
- Language Model: A customized version of Salamandra 7B Base, tuned for better multilingual instruction following.
- Projector: 2-layer MLP.
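For illustration only, the sketch below shows how these components are typically wired in a LLaVA-OneVision-style stack: the vision tower produces patch features, the 2-layer MLP projector maps them into the LLM embedding space, and the projected tokens are concatenated with the text embeddings. The hidden sizes, module names, and patch count are assumptions for readability, not the model's actual configuration.

import torch
import torch.nn as nn

# Illustrative sketch only; hidden sizes are assumptions, not the released configuration.
VISION_HIDDEN = 1536   # assumed SigLIP 2 Giant feature width
LLM_HIDDEN = 4096      # assumed Salamandra 7B embedding width

class TwoLayerProjector(nn.Module):
    """2-layer MLP mapping vision-tower patch features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.fc2(self.act(self.fc1(patch_features)))

projector = TwoLayerProjector(VISION_HIDDEN, LLM_HIDDEN)
dummy_patches = torch.randn(1, 576, VISION_HIDDEN)  # arbitrary number of patch tokens
visual_tokens = projector(dummy_patches)            # (1, 576, LLM_HIDDEN)
# The projected visual tokens are concatenated with the text token embeddings
# and fed to the language model as a single sequence.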
Hyperparameters
The full list of hyperparameters can be found here.
Framework
We trained the vision model with the Llava Onevision framework.
How to use
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
import requests
import torch
from PIL import Image

path = "BSC-LT/Salamandra-VL-7B-2511"

# Load the processor (tokenizer + image preprocessing) and the model weights.
processor = AutoProcessor.from_pretrained(path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    path, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda")

# Download an example image.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a single-turn conversation with one image and one text instruction.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe la imagen con el mayor detalle posible."},
        ],
    },
]

# Render the chat template and prepare the model inputs.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

# Generate a response (do_sample=True is required for temperature to take effect).
output = model.generate(**inputs, do_sample=True, temperature=0.7, max_new_tokens=1024)
print(processor.decode(output[0], skip_special_tokens=True))
With this template, each turn is preceded by the <|im_start|> delimiter and the role of the speaker
(user for content supplied by the user, assistant for model responses), and is finished with the <|im_end|> token.
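As an illustration, assuming the processor follows the ChatML-style convention described above, the rendered prompt for the example conversation would look roughly like this (the exact whitespace and the image placeholder token may differ from the template shipped with the processor):

<|im_start|>user
<image>
Describe la imagen con el mayor detalle posible.<|im_end|>
<|im_start|>assistant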
Data
The training data for Salamandra-VL-7B-2512 represents a major upgrade from v1 (Salamandra-7B-Instruct), with a specific focus on European languages and high-quality visual instructions.
1. Text-Only Instruction Tuning
To enhance the multilingual capabilities of the model, we incorporated a diverse set of high-quality instruction tuning datasets during the visual training stages. The key datasets included are:
- TowerBlocks v0.2
- Aya Dataset
- OpenAssistant 2 (OASST2)
- OpenMathInstruct-2
- OpenR1-Math
- No Robots
- OpenOrca
2. Visual Instruction Tuning
The visual instruction tuning process builds upon the standard LLaVA-NeXT data mixture (BLIP-558K, COCO-118K, MIMIC-IT, etc.) used in v1. In this version (v2), we have significantly enriched the training set with the PixMo dataset collection to improve fine-grained visual understanding, counting, and document processing.
Key New Visual Datasets:
- PixMo Cap & Cap-QA: For detailed captioning and question answering.
- PixMo General: Including `ask-model-anything` and `point-explanations` for better grounding and general visual reasoning.
- PixMo Docs: Specialized subsets for `charts`, `diagrams`, `tables`, and `other` document types.
To maintain robust multilingual performance without compromising visual quality—and without the need for massive multilingual multimodal datasets—we adopted the strategy proposed in Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization. By integrating the text-only multilingual data described above directly into the visual instruction tuning stages, we ensure the model retains its language fidelity across European languages while acquiring state-of-the-art visual capabilities.
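The sketch below illustrates the idea behind this textual-regularization strategy: text-only multilingual instructions are simply interleaved with the visual instruction stream at some ratio. The mixing ratio, sample format, and helper function are hypothetical placeholders, not the actual training pipeline (which is the Llava Onevision framework referenced above).

import random

# Hypothetical illustration of multilingual textual regularization: text-only
# multilingual instructions are mixed into the visual instruction-tuning stream.
TEXT_ONLY_RATIO = 0.2  # assumed fraction of text-only samples in the mixture

def build_mixed_stream(visual_samples, text_only_samples, text_ratio=TEXT_ONLY_RATIO):
    """Yield a stream where roughly `text_ratio` of the samples are text-only."""
    for visual in visual_samples:
        yield visual  # e.g. {"images": [...], "conversations": [...]}
        # Adding a text-only sample with probability r/(1-r) per visual sample
        # makes the expected text-only fraction of the stream equal to r.
        if random.random() < text_ratio / (1.0 - text_ratio):
            yield random.choice(text_only_samples)  # same format, no image attached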
Evaluation
Improvements from v1 to v2
We compared the new v2 model against the previous v1 version using a comprehensive set of benchmarks and a pairwise arena evaluation.
Benchmark Comparison
The table below shows how Salamandra-VL-7B-2512 (v2) performs relative to v1 across standard vision-language benchmarks.
| Benchmark | Salamandra-VL-7B-2506 (v1) | Salamandra-VL-7B-2512 (v2) | Difference |
|---|---|---|---|
| AI2D | 74.51 | 78.34 | +3.83 |
| ChartQA | 68.40 | 70.72 | +2.32 |
| CountBenchQA | 72.69 | 86.65 | +13.96 |
| DocVQA | 68.05 | 71.82 | +3.77 |
| HallusionBench | 21.54 | 17.80 | -3.74 |
| InfoVQA | 48.59 | 45.59 | -3.00 |
| MMBench | 71.13 | 75.16 | +4.03 |
| MMMU | 36.89 | 37.89 | +1.00 |
| MMStar | 48.65 | 50.67 | +2.02 |
| RealWorldQA | 56.99 | 63.40 | +6.41 |
| TextVQA | 68.81 | 71.73 | +2.92 |
Arena Evaluation (Pairwise)
We conducted a pairwise evaluation in which both models generated responses to the same set of prompts in multiple European languages. We used the Aya Vision m-wildvision dataset for the evaluation prompts and Qwen 2.5-VL 32B as a judge (LLM-as-a-Judge) to decide which model provided the better answer.
As shown in the plot, Salamandra-VL-7B-2512 (v2) consistently outperforms or ties with v1 across most languages, demonstrating improved multilingual capabilities and visual understanding.
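For reference, a minimal sketch of the pairwise LLM-as-a-Judge loop is shown below. The judge prompt wording, the callables generate_v1 / generate_v2 / judge, the prompt dictionary keys, and the verdict parsing are hypothetical placeholders, not the exact evaluation harness used.

# Illustrative sketch of the pairwise LLM-as-a-Judge protocol (prompt wording,
# verdict parsing, and callables are assumptions, not the exact harness used).
JUDGE_TEMPLATE = (
    "You are given an image-grounded question and two answers.\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
    "Reply with 'A', 'B', or 'Tie' depending on which answer is better."
)

def pairwise_arena(prompts, generate_v1, generate_v2, judge):
    """Count wins/ties when a judge model compares v1 and v2 answers to the same prompts."""
    scores = {"v1": 0, "v2": 0, "tie": 0}
    for prompt in prompts:
        answer_v1 = generate_v1(prompt)   # response from Salamandra-VL v1
        answer_v2 = generate_v2(prompt)   # response from Salamandra-VL v2
        # In practice, answer positions should also be swapped across runs
        # to control for the judge's position bias.
        verdict = judge(JUDGE_TEMPLATE.format(
            question=prompt["text"], answer_a=answer_v1, answer_b=answer_v2)).strip()
        if verdict.startswith("A"):
            scores["v1"] += 1
        elif verdict.startswith("B"):
            scores["v2"] += 1
        else:
            scores["tie"] += 1
    return scores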
Comparison with Open Models
We benchmarked Salamandra-VL-7B-2512 against other comparable open-source vision-language models of similar size.
| Benchmark | Llava-v1.6-Vicuna-7B | Llama-3-Llava-Next-8B | Idefics3-8B | Salamandra-VL-7B-2512 |
|---|---|---|---|---|
| AI2D | 66.74 | 73.19 | 76.39 | 78.34 |
| BLINK | 19.20 | 39.56 | 48.45 | 31.04 |
| ChartQA | 53.96 | 69.64 | 28.72 | 70.72 |
| CountBenchQA | 52.36 | 57.08 | 58.93 | 86.65 |
| DocVQA | 67.03 | 72.54 | 80.71 | 71.82 |
| HallusionBench | 8.57 | 13.19 | 32.53 | 17.80 |
| InfoVQA | 30.30 | 31.91 | 44.28 | 45.59 |
| MMBench | 65.56 | 72.45 | 74.61 | 75.16 |
| MMMU | 35.94 | 43.61 | 43.50 | 37.89 |
| MMStar | 37.73 | 43.33 | 54.47 | 50.67 |
| OCRBench | 50.60 | 54.70 | 55.30 | 63.70 |
| RealWorldQA | 58.43 | 58.17 | 62.48 | 63.40 |
| TextVQA | 63.72 | 65.57 | 58.62 | 71.73 |
| Average | 46.93 | 53.46 | 55.31 | 58.81 |
Intended Use
Out-of-scope Use
The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.
Hardware and Software
Training Framework
The visual instruction-tuned versions were produced with Llava_Onevision.
Compute Infrastructure
All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.
The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x Nvidia Hopper GPUs with 64GB of HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz with 32 cores each (64 cores per node)
- 4x NDR200 interconnect (800Gb/s aggregate bandwidth per node)
- 512GB of main memory (DDR5)
- 460GB of NVMe storage
Ethical Considerations and Limitations
This model is an initial prototype, and we have not yet conducted a thorough evaluation of societal and cognitive biases. In future iterations, we plan to assess potential biases using established benchmarks, following methodologies similar to those applied in previous models.
We acknowledge that bias evaluation is a critical step in responsible model development. Given the ongoing nature of this work, we strongly encourage developers to conduct safety assessments and bias mitigation strategies tailored to their specific applications of the model. Future updates will include more comprehensive analyses as we continue improving this model.
Additional information
Author
The Language Technologies Lab from Barcelona Supercomputing Center.
Contact
For further information, please send an email to [email protected].
Copyright
Copyright(c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.
Funding
This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.
Disclaimer
Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.
The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.
Citation
@misc{gonzalezagirre2025salamandratechnicalreport,
title={Salamandra Technical Report},
author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
year={2025},
eprint={2502.08489},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.08489},
}