Qwen3-VL-Embedding-8B GGUF

GGUF quantizations of Qwen/Qwen3-VL-Embedding-8B for efficient CPU inference with llama.cpp.

Model Description

Qwen3-VL-Embedding-8B is a multimodal embedding model for information retrieval and cross-modal understanding. It supports text, images, screenshots, videos, and mixed multimodal inputs.

Original model specs:

  • Parameters: 8B
  • Context Length: 32K tokens
  • Embedding Dimension: 64-4096 (configurable; see the truncation sketch after this list)
  • Languages: 30+
  • Input Modalities: Text, Images, Videos
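
Because the embedding dimension is configurable, downstream systems can keep only a leading slice of each vector to save storage. Below is a minimal sketch of the usual truncate-then-renormalize pattern, assuming Matryoshka-style training in which the leading dimensions carry the most information; the 1024 target is an illustrative choice, not a model default.

import numpy as np

def truncate_embedding(vec, dim=1024):
    """Keep the first `dim` components and re-normalize to unit length.

    Assumes leading dimensions are the most informative (Matryoshka-style
    training); `dim` is a hypothetical target, not a model default.
    """
    v = np.asarray(vec, dtype=np.float32)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v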

Available Quantizations

File                                 Size    Use Case
Qwen3-VL-Embedding-8B-F16.gguf       15GB    Maximum quality, baseline reference
Qwen3-VL-Embedding-8B-Q8_0.gguf      7.5GB   Recommended - minimal quality loss
Qwen3-VL-Embedding-8B-Q6_K.gguf      5.8GB   High quality, good balance
Qwen3-VL-Embedding-8B-Q5_K_M.gguf    5.1GB   Good quality, balanced size
Qwen3-VL-Embedding-8B-Q5_K_S.gguf    5.0GB   Good quality, smaller variant
Qwen3-VL-Embedding-8B-Q4_K_M.gguf    4.4GB   Decent quality, smaller size
Qwen3-VL-Embedding-8B-Q4_K_S.gguf    4.2GB   Decent quality, more compressed
Qwen3-VL-Embedding-8B-Q3_K_M.gguf    3.6GB   Lower quality, significant compression
Qwen3-VL-Embedding-8B-Q2_K.gguf      2.9GB   Lowest quality, maximum compression

Recommendation: Start with Q8_0 for production use. Use Q4_K_M or Q5_K_M for resource-constrained environments.
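
As a back-of-the-envelope sizing check, plan for at least the GGUF file size in memory plus headroom for the context (KV cache) and runtime buffers. The helper below is only an illustration; the 1.5 GB overhead figure is an assumption, not a measured value.

def fits_in_ram(gguf_size_gb, ram_gb, overhead_gb=1.5):
    """Rough check: weights occupy roughly the file size in memory,
    plus an assumed overhead for KV cache and buffers."""
    return gguf_size_gb + overhead_gb <= ram_gb

print(fits_in_ram(7.5, 16))  # Q8_0 on a 16 GB machine -> True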

Usage with llama.cpp

Installation

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

Recent llama.cpp versions build with CMake (the legacy Makefile-based build has been removed); the resulting binaries are placed in build/bin.

Download Model

huggingface-cli download dam2452/Qwen3-VL-Embedding-8B-GGUF \
  Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --local-dir ./models
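
The same file can also be fetched from Python with huggingface_hub, equivalent to the CLI call above:

from huggingface_hub import hf_hub_download

# Download the Q8_0 file from this repo into ./models
path = hf_hub_download(
    repo_id="dam2452/Qwen3-VL-Embedding-8B-GGUF",
    filename="Qwen3-VL-Embedding-8B-Q8_0.gguf",
    local_dir="./models",
)
print(path)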

Run Embedding Server

./build/bin/llama-server \
  -m models/Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --embedding \
  --port 8080 \
  --host 0.0.0.0

Note that for image or video inputs, llama.cpp additionally needs the model's multimodal projector file (passed via --mmproj); text-only embedding works with the main GGUF alone.
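
Before sending requests, you can poll the server's /health endpoint to confirm the model has finished loading; llama.cpp returns HTTP 503 while loading and 200 once ready. A minimal sketch:

import time
import requests

# Poll /health until the server reports ready (200) or we give up.
for _ in range(60):
    try:
        if requests.get("http://localhost:8080/health", timeout=2).status_code == 200:
            print("server ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)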

Generate Embeddings (API)

curl http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Your text to embed here"
  }'
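
The server also exposes an OpenAI-compatible /v1/embeddings route; the sketch below assumes the OpenAI embeddings request schema, where "input" may be a string or a list of strings:

import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={"input": "A woman playing with her dog on a beach"},
)
resp.raise_for_status()
# OpenAI-style response: vectors live under data[i]["embedding"]
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))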

Generate Embeddings (Python)

import requests

response = requests.post(
    "http://localhost:8080/embedding",
    json={"content": "A woman playing with her dog on a beach"},
)
response.raise_for_status()

# The response shape varies across llama.cpp versions: either an object
# with an "embedding" key, or a list of such objects. Handle both.
data = response.json()
embedding = data[0]["embedding"] if isinstance(data, list) else data["embedding"]
print(f"Embedding dimension: {len(embedding)}")

Performance

Original model performance on benchmarks:

  • MMEB-V2: 77.9 overall score
  • MMTEB: 67.88 mean task score
  • Retrieval: 81.08

Note: Quantized models may show slightly reduced performance, with Q8_0 typically having less than 1% degradation.
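
One way to sanity-check quantization quality locally is to embed the same text with two quantizations (for example, two server instances on different ports) and compare the vectors. Port assignments below are illustrative:

import numpy as np
import requests

def embed_at(port, text):
    """Same request pattern as the Python example above, against a given port."""
    r = requests.post(f"http://localhost:{port}/embedding", json={"content": text})
    r.raise_for_status()
    data = r.json()
    vec = data[0]["embedding"] if isinstance(data, list) else data["embedding"]
    return np.asarray(vec, dtype=np.float32)

text = "A woman playing with her dog on a beach"
ref = embed_at(8080, text)    # e.g., F16 reference server
quant = embed_at(8081, text)  # e.g., Q4_K_M server
print(np.dot(ref, quant) / (np.linalg.norm(ref) * np.linalg.norm(quant)))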

License

Apache 2.0 (inherited from the original model)

Citation

@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv},
  year={2026}
}
