# Qwen3-VL-Embedding-8B GGUF

GGUF quantizations of Qwen/Qwen3-VL-Embedding-8B for efficient CPU inference with llama.cpp.
## Model Description
Qwen3-VL-Embedding-8B is a multimodal embedding model for information retrieval and cross-modal understanding. It supports text, images, screenshots, videos, and mixed multimodal inputs.
Original model specs:
- Parameters: 8B
- Context Length: 32K tokens
- Embedding Dimension: 64-4096 (configurable)
- Languages: 30+
- Input Modalities: Text, Images, Videos
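The configurable embedding dimension (64-4096) is typically used Matryoshka-style: keep the first `k` components of the full vector, then L2-renormalize. The sketch below assumes that convention — verify against the upstream model card before relying on it; the helper name is illustrative.

```python
import math

def truncate_embedding(embedding, dim):
    """Keep the first `dim` components and L2-renormalize.

    Assumes the configurable dimension works by Matryoshka-style
    truncation; check the upstream Qwen3-VL-Embedding docs to confirm.
    """
    vec = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

# Example: shrink a dummy 4096-dim vector to 256 dims
full = [0.01] * 4096
small = truncate_embedding(full, 256)
print(len(small))  # 256
```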
## Available Quantizations
| File | Size | Use Case |
|---|---|---|
| Qwen3-VL-Embedding-8B-F16.gguf | 15GB | Maximum quality, baseline reference |
| Qwen3-VL-Embedding-8B-Q8_0.gguf | 7.5GB | Recommended - minimal quality loss |
| Qwen3-VL-Embedding-8B-Q6_K.gguf | 5.8GB | High quality, good balance |
| Qwen3-VL-Embedding-8B-Q5_K_M.gguf | 5.1GB | Good quality, balanced size |
| Qwen3-VL-Embedding-8B-Q5_K_S.gguf | 5.0GB | Good quality, smaller variant |
| Qwen3-VL-Embedding-8B-Q4_K_M.gguf | 4.4GB | Decent quality, smaller size |
| Qwen3-VL-Embedding-8B-Q4_K_S.gguf | 4.2GB | Decent quality, more compressed |
| Qwen3-VL-Embedding-8B-Q3_K_M.gguf | 3.6GB | Lower quality, significant compression |
| Qwen3-VL-Embedding-8B-Q2_K.gguf | 2.9GB | Lowest quality, maximum compression |
Recommendation: Start with Q8_0 for production use. Use Q4_K_M or Q5_K_M for resource-constrained environments.
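To sanity-check which quantization fits your RAM, file size can be estimated from parameter count and bits per weight. The bits-per-weight figures below are rough assumptions (GGUF quant formats carry per-block scales, and some tensors stay at higher precision), so treat the output as a ballpark, not the exact sizes in the table above:

```python
# Rough size estimate: 8e9 parameters at b bits/weight ~= 8e9 * b / 8 bytes.
# Bits-per-weight values are approximate, not exact format specs.
PARAMS = 8e9
approx_bits = {"F16": 16, "Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7,
               "Q4_K_M": 4.8, "Q3_K_M": 3.9, "Q2_K": 3.4}

for name, bits in approx_bits.items():
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name}: ~{gb:.1f} GB")
```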
## Usage with llama.cpp

### Installation

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```

Note: recent llama.cpp versions build with CMake (the old `make` path has been removed); binaries land in `build/bin/`.
### Download Model

```bash
huggingface-cli download dam2452/Qwen3-VL-Embedding-8B-GGUF \
  Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --local-dir ./models
```
### Run Embedding Server

```bash
./llama-server \
  -m models/Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --embedding \
  --port 8080 \
  --host 0.0.0.0
```
### Generate Embeddings (API)

```bash
curl http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{"content": "Your text or image data here"}'
```
### Generate Embeddings (Python)

```python
import requests

response = requests.post(
    "http://localhost:8080/embedding",
    json={"content": "A woman playing with her dog on a beach"},
)
embedding = response.json()["embedding"]
print(f"Embedding dimension: {len(embedding)}")
```
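For retrieval, embeddings are compared with cosine similarity. The sketch below wraps the server endpoint from the example above and ranks candidate documents against a query; the response shape (`"embedding"` key) matches that example, but some llama.cpp builds return a list of results instead, so adjust if needed.

```python
import math
import requests

EMBED_URL = "http://localhost:8080/embedding"  # the server started above

def embed(text):
    """Fetch one embedding from the llama.cpp server."""
    r = requests.post(EMBED_URL, json={"content": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank(query, docs):
    """Return docs sorted by similarity to the query, best first."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)

# With the server running:
# rank("A woman playing with her dog on a beach",
#      ["Dogs running on the sand", "Quarterly earnings report"])
```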
## Performance
Original model performance on benchmarks:
- MMEB-V2: 77.9 overall score
- MMTEB: 67.88 mean task score
- Retrieval: 81.08
Note: Quantized models may show slightly reduced performance, with Q8_0 typically having less than 1% degradation.
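You can measure the degradation on your own data by running two server instances (one per quant) and comparing the embeddings they produce for the same inputs. The ports below are illustrative assumptions, not defaults:

```python
import math
import requests

def embed(text, port):
    # Assumes two llama-server instances serving different quants,
    # e.g. F16 on 8080 and Q4_K_M on 8081 (ports are illustrative).
    r = requests.post(f"http://localhost:{port}/embedding",
                      json={"content": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def quant_agreement(texts, ref_port=8080, quant_port=8081):
    """Mean cosine similarity between reference and quantized embeddings."""
    sims = [cosine(embed(t, ref_port), embed(t, quant_port)) for t in texts]
    return sum(sims) / len(sims)

# With both servers running:
# quant_agreement(["a beach at sunset", "invoice due dates"])
```

A mean similarity close to 1.0 on a representative sample suggests the quant is safe for your workload.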
## License
Apache 2.0 (inherited from original model)
## Citation

```bibtex
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv},
  year={2026}
}
```