Salamandra-VL-7B-2512 Model Card

Salamandra-VL-7B-2512 is the latest version of the Salamandra vision model family. This version brings significant improvements in both architecture and training data.

Main Improvements

  • Image Encoder: Upgraded to SigLIP 2 Giant (Patch 16).
  • LLM: Built on the base version of Salamandra 7B, fine-tuned with the latest instruction data and a specialized focus on European languages using the Aya collection.
  • Visual Instruction Tuning: Incorporated PixMo datasets to improve fine-grained visual grounding and counting capabilities.

To visit the model cards of other Salamandra versions, please refer to the Model Index.

This model has been optimized to engage in conversation but has NOT been aligned through RLHF to filter or avoid sensitive topics. As a result, it may generate harmful or inappropriate content. The team is actively working to enhance its performance through further instruction tuning and alignment with RL techniques.


Model Details

Architecture

The model architecture builds upon the LLaVA-OneVision framework, integrating the following components (a quick way to inspect them is sketched after the list):

  • Vision Tower: SigLIP 2 Giant (Patch 16).
  • Language Model: A customized version of Salamandra 7B Base, tuned for better multilingual instruction following.
  • Projector: A 2-layer MLP.
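
These components are also visible in the published configuration. The snippet below is a minimal sketch, assuming the repository follows the standard llava_onevision configuration layout in transformers; the values in the comments are illustrative:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("BSC-LT/Salamandra-VL-7B-2512")
print(cfg.model_type)                # llava_onevision
print(cfg.vision_config.model_type)  # vision tower (SigLIP 2 Giant, patch 16)
print(cfg.text_config.model_type)    # language model (Salamandra 7B Base)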

Hyperparameters

The full list of hyperparameters can be found here.

Framework

We used the LLaVA-OneVision training framework to train our vision model.


How to use

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration
import torch
import requests
from PIL import Image

path = "BSC-LT/Salamandra-VL-7B-2512"
processor = AutoProcessor.from_pretrained(path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    path, torch_dtype=torch.float16, low_cpu_mem_usage=True
).to("cuda")

# Download an example image (PIL cannot open a URL directly).
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build the conversation; the prompt is in Spanish:
# "Describe the image in as much detail as possible."
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe la imagen con el mayor detalle posible."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

output = model.generate(
    **inputs,
    do_sample=True,       # sampling must be enabled for temperature to take effect
    temperature=0.7,
    max_new_tokens=1024,
)
print(processor.decode(output[0], skip_special_tokens=True))

With this template, each turn is preceded by the <|im_start|> delimiter and the role of the entity (user for content supplied by the user, or assistant for model responses), and is closed with the <|im_end|> token.
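
To inspect the exact prompt string produced by the chat template, you can print it directly. The rendered output in the comments below is only illustrative; the actual image placeholder token and whitespace are defined by the model's chat template:

print(processor.apply_chat_template(conversation, add_generation_prompt=True))
# Illustrative output (exact formatting may differ):
# <|im_start|>user
# <image>
# Describe la imagen con el mayor detalle posible.<|im_end|>
# <|im_start|>assistant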


Data

The training data for Salamandra-VL-7B-2512 represents a major upgrade from v1 (Salamandra-VL-7B-2506, which was built on Salamandra-7B-Instruct), with a specific focus on European languages and high-quality visual instructions.

1. Text-Only Instruction Tuning

To enhance the multilingual capabilities of the model, we incorporated a diverse set of high-quality instruction tuning datasets during the visual training stages. The key datasets included are:

2. Visual Instruction Tuning

The visual instruction tuning process builds upon the standard LLaVA-NeXT data mixture (BLIP-558K, COCO-118K, MIMIC-IT, etc.) used in v1. In this version (v2), we have significantly enriched the training set with the PixMo dataset collection to improve fine-grained visual understanding, counting, and document processing.

Key New Visual Datasets (a loading sketch follows this list):

  • PixMo Cap & Cap-QA: For detailed captioning and question answering.
  • PixMo General: Including ask-model-anything and point-explanations for better grounding and general visual reasoning.
  • PixMo Docs: Specialized subsets for charts, diagrams, tables, and other document types.
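
The PixMo subsets are distributed on the Hugging Face Hub. The snippet below is a minimal loading sketch; the repository id and field names are assumptions, so verify them against the Hub before use:

from datasets import load_dataset

# Repository id is an assumption (AI2's PixMo captioning subset); check the Hub.
pixmo_cap = load_dataset("allenai/pixmo-cap", split="train", streaming=True)
first = next(iter(pixmo_cap))
print(first.keys())  # inspect the available fields (e.g. image reference and caption)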

To maintain robust multilingual performance without compromising visual quality—and without the need for massive multilingual multimodal datasets—we adopted the strategy proposed in Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization. By integrating the text-only multilingual data described above directly into the visual instruction tuning stages, we ensure the model retains its language fidelity across European languages while acquiring state-of-the-art visual capabilities.
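
As a rough illustration of this mixing strategy (not the actual training pipeline; the sample schema, field names, and mixing ratio below are hypothetical):

import random

# Hypothetical records sharing one schema: multimodal visual-instruction
# samples and text-only multilingual instruction samples.
multimodal = [
    {"image": "chart_0001.png", "prompt": "What does the tallest bar represent?", "answer": "..."},
]
text_only = [
    # Spanish text-only instruction ("Summarize the previous text in one sentence.")
    {"image": None, "prompt": "Resume el texto anterior en una frase.", "answer": "..."},
]

def mix_for_visual_sft(mm, txt, text_fraction=0.2, seed=0):
    """Interleave a fraction of text-only samples into the visual instruction set."""
    rng = random.Random(seed)
    n_txt = min(len(txt), int(len(mm) * text_fraction))
    mixture = mm + rng.sample(txt, n_txt)
    rng.shuffle(mixture)
    return mixture

train_samples = mix_for_visual_sft(multimodal, text_only)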


Evaluation

Improvements from v1 to v2

We compared the new v2 model against the previous v1 version using a comprehensive set of benchmarks and a pairwise arena evaluation.

Benchmark Comparison

The table below shows the performance improvement of Salamandra-VL-7B-2512 (v2) over v1 across various standard vision-language benchmarks.

| Benchmark | Salamandra-VL-7B-2506 (v1) | Salamandra-VL-7B-2512 (v2) | Improvement |
|---|---|---|---|
| AI2D | 74.51 | 78.34 | +3.83 |
| ChartQA | 68.40 | 70.72 | +2.32 |
| CountBenchQA | 72.69 | 86.65 | +13.96 |
| DocVQA | 68.05 | 71.82 | +3.77 |
| HallusionBench | 21.54 | 17.80 | -3.74 |
| InfoVQA | 48.59 | 45.59 | -3.00 |
| MMBench | 71.13 | 75.16 | +4.03 |
| MMMU | 36.89 | 37.89 | +1.00 |
| MMStar | 48.65 | 50.67 | +2.02 |
| RealWorldQA | 56.99 | 63.40 | +6.41 |
| TextVQA | 68.81 | 71.73 | +2.92 |

Arena Evaluation (Pairwise)

We conducted a pairwise evaluation where both models generated responses to the same set of prompts in multiple European languages. We used the Aya Vision m-wildvision dataset for the evaluation prompts and Qwen 2.5-VL 32B as a judge (LLM-as-a-Judge) to evaluate the responses and decide which model provided the better answer.
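
As an illustration of this pairwise LLM-as-a-Judge setup (the exact judge prompt and scoring script used in the evaluation are not published here; the function below is a hypothetical sketch):

def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical pairwise judging prompt; not the exact prompt used in this evaluation."""
    return (
        "You are an impartial judge. Given a user question about an image and two "
        "candidate answers, decide which answer is better.\n\n"
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Reply with exactly one of: A, B, or TIE."
    )

# The judge model (Qwen 2.5-VL 32B) receives the image together with this prompt,
# and its verdicts are aggregated per language to produce the plot below.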

Figure: Salamandra Arena plot (pairwise results per language).

As shown in the plot, Salamandra-VL-7B-2512 (v2) consistently outperforms or ties with v1 across most languages, demonstrating improved multilingual capabilities and visual understanding.

Comparison with Open Models

We benchmarked Salamandra-VL-7B-2512 against other comparable open-source vision-language models of similar size.

| Benchmark | Llava-v1.6-Vicuna-7B | Llama-3-Llava-Next-8B | Idefics3-8B | Salamandra-VL-7B-2512 |
|---|---|---|---|---|
| AI2D | 66.74 | 73.19 | 76.39 | 78.34 |
| BLINK | 19.20 | 39.56 | 48.45 | 31.04 |
| ChartQA | 53.96 | 69.64 | 28.72 | 70.72 |
| CountBenchQA | 52.36 | 57.08 | 58.93 | 86.65 |
| DocVQA | 67.03 | 72.54 | 80.71 | 71.82 |
| HallusionBench | 8.57 | 13.19 | 32.53 | 17.80 |
| InfoVQA | 30.30 | 31.91 | 44.28 | 45.59 |
| MMBench | 65.56 | 72.45 | 74.61 | 75.16 |
| MMMU | 35.94 | 43.61 | 43.50 | 37.89 |
| MMStar | 37.73 | 43.33 | 54.47 | 50.67 |
| OCRBench | 50.60 | 54.70 | 55.30 | 63.70 |
| RealWorldQA | 58.43 | 58.17 | 62.48 | 63.40 |
| TextVQA | 63.72 | 65.57 | 58.62 | 71.73 |
| Average | 46.93 | 53.46 | 55.31 | 58.81 |

Intended Use

Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. Any downstream application must comply with current laws and regulations. Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged.


Hardware and Software

Training Framework

The visual instruction-tuned versions were produced with the LLaVA-OneVision framework.

Compute Infrastructure

All models were trained on MareNostrum 5, a pre-exascale EuroHPC supercomputer hosted and operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:

  • 4x NVIDIA Hopper GPUs with 64 GB of HBM2 memory each
  • 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores per node)
  • 4x NDR200 links (800 Gb/s bandwidth per node)
  • 512 GB of main memory (DDR5)
  • 460 GB of local NVMe storage

Ethical Considerations and Limitations

This model is an initial prototype, and we have not yet conducted a thorough evaluation of societal and cognitive biases. In future iterations, we plan to assess potential biases using established benchmarks, following methodologies similar to those applied in previous models.

We acknowledge that bias evaluation is a critical step in responsible model development. Given the ongoing nature of this work, we strongly encourage developers to conduct safety assessments and bias mitigation strategies tailored to their specific applications of the model. Future updates will include more comprehensive analyses as we continue improving this model.


Additional information

Author

The Language Technologies Lab from Barcelona Supercomputing Center.

Contact

For further information, please send an email to [email protected].

Copyright

Copyright (c) 2025 by Language Technologies Lab, Barcelona Supercomputing Center.

Funding

This work has been promoted and financed by the Ministerio para la Transformación Digital y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje.

Disclaimer

Be aware that the model may contain biases or other unintended distortions. When third parties deploy systems or provide services based on this model, or use the model themselves, they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

Citation

@misc{gonzalezagirre2025salamandratechnicalreport,
      title={Salamandra Technical Report}, 
      author={Aitor Gonzalez-Agirre and Marc Pàmies and Joan Llop and Irene Baucells and Severino Da Dalt and Daniel Tamayo and José Javier Saiz and Ferran Espuña and Jaume Prats and Javier Aula-Blasco and Mario Mina and Adrián Rubio and Alexander Shvets and Anna Sallés and Iñaki Lacunza and Iñigo Pikabea and Jorge Palomar and Júlia Falcão and Lucía Tormo and Luis Vasquez-Reina and Montserrat Marimon and Valle Ruíz-Fernández and Marta Villegas},
      year={2025},
      eprint={2502.08489},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08489}, 
}

License

RESEARCH-ONLY RAIL-AMS

Base Model Index

| Model | Base | Instruct |
|---|---|---|
| 2B | Link | Link |
| 7B | Link | Link |
| 40B | Link | WiP |