SigLino: Vision Foundation Models (SigLIP2 + DINOv3)
Vision encoders distilled from DINOv3 and SigLIP2 (MoE & Dense). Stems from the CVPR 2026 AMoE paper.
Accepted at CVPR 2026
This work stems from the CVPR 2026 AMoE paper, which distills multiple teacher models into a Mixture-of-Experts (MoE) vision architecture. We chose the name SigLino (SigLIP2 + DINOv3) for clarity.
Dense variant of SigLino. 30M parameters.
Part of the SigLino model family.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model_id = "tiiuae/siglino-30M"
# trust_remote_code=True lets transformers load the custom model code from the Hub repo
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to("cuda", dtype=torch.bfloat16)
processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt").to("cuda")
# Cast pixel values to bfloat16 to match the model weights
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    outputs = model(**inputs)

# Options: 'siglino' (384d), 'siglip2' (1152d), 'dinov3' (1024d)
patch_features = outputs["patch_features"]["siglino"]  # (Batch, Tokens, 384)
summary_features = outputs["summary_features"]["siglip2"]  # (Batch, 1152)
```
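One common use of the summary features is image-to-image similarity search. The sketch below is illustrative only: the helper `cosine_similarity_matrix` is a hypothetical name, not part of the SigLino code, and random tensors stand in for real `outputs["summary_features"]["siglip2"]` values so the snippet runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity between rows of a (N, D) and rows of b (M, D)."""
    # Upcast to float32 so bfloat16 model outputs do not lose precision here.
    a = F.normalize(a.float(), dim=-1)
    b = F.normalize(b.float(), dim=-1)
    return a @ b.T


# Stand-in tensors shaped like summary features (Batch, 1152); not real model outputs.
torch.manual_seed(0)
queries = torch.randn(2, 1152)
gallery = torch.randn(5, 1152)

sims = cosine_similarity_matrix(queries, gallery)  # shape (2, 5)
best_match = sims.argmax(dim=1)  # index of the closest gallery item per query
```

With real features, `queries` and `gallery` would be stacked `summary_features` tensors from separate forward passes. Whether cosine similarity is the best metric for a given head is an assumption worth checking for your task.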
Model configuration:

| Property | Value |
|---|---|
| Architecture | Dense |
| Parameters | 0.03B |
| Layers | 12 |
| Hidden Dim | 384 |
| FFN Dim | 1536 |
| Patch Size | 16x16 |
| Teachers | DINOv3, SigLIP2 |
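The 16x16 patch size determines how many patch tokens an image produces. The sketch below works through that arithmetic; `patch_grid` is a hypothetical helper, and it assumes non-overlapping patches, an input whose sides are multiples of the patch size, and no extra tokens (such as class or register tokens) in `patch_features`. The processor may also resize images, so check the actual tensor shape before relying on these numbers.

```python
def patch_grid(height: int, width: int, patch_size: int = 16) -> tuple[int, int, int]:
    """Return (rows, cols, total patch tokens) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    rows = height // patch_size
    cols = width // patch_size
    return rows, cols, rows * cols


# Illustrative input size: a 1024x1024 image gives a 64x64 grid, i.e. 4096 tokens.
rows, cols, tokens = patch_grid(1024, 1024)
```

Under the same assumptions, a `(Batch, Tokens, 384)` tensor can be reshaped to `(Batch, rows, cols, 384)` to recover the spatial layout for dense tasks such as segmentation.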
Evaluation results:

| Task | Metric | Score |
|---|---|---|
| kNN (ImageNet) | Acc | 79.0 |
| kNN (6-dataset avg) | Acc | 83.3 |
| Zero-shot cls (ImageNet) | Acc | 65.1 |
| Flickr30K I2T | R@1 | 82.2 |
| MSCOCO I2T | R@1 | 59.7 |
| Pascal VOC (1024) | mIoU | 82.1 |
| Cityscapes (1024) | mIoU | 59.2 |
```bibtex
@article{chaybouti2025amoe,
  title={AMoE: Agglomerative Mixture-of-Experts Vision Foundation Models},
  author={Chaybouti, Sofian and Narayan, Sanath and Dahou, Yasser and Le Khac, Phuc H. and Singh, Ankit and Huynh, Ngoc Dung and Para, Wamiq Reyaz and Kuehne, Hilde and Hacid, Hakim},
  journal={arXiv preprint arXiv:2512.20157},
  year={2025}
}
```