Qwen3-VL-Embedding-8B GGUF

GGUF quantizations of Qwen/Qwen3-VL-Embedding-8B for efficient CPU inference with llama.cpp.

Model Description

Qwen3-VL-Embedding-8B is a multimodal embedding model for information retrieval and cross-modal understanding. It supports text, images, screenshots, videos, and mixed multimodal inputs.

Original model specs:

  • Parameters: 8B
  • Context Length: 32K tokens
  • Embedding Dimension: 64-4096 (configurable; see the truncation sketch after this list)
  • Languages: 30+
  • Input Modalities: Text, Images, Videos
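
Because the embedding dimension is configurable, downstream systems can keep only a leading slice of each vector to save storage. Below is a minimal sketch of the usual truncate-then-renormalize pattern, assuming Matryoshka-style training in which the leading dimensions carry the most information; the 1024 target is an illustrative choice, not a model default.

import numpy as np

def truncate_embedding(vec, dim=1024):
    """Keep the first `dim` components and re-normalize to unit length.

    Assumes leading dimensions are the most informative (Matryoshka-style
    training); `dim` is a hypothetical target, not a model default.
    """
    v = np.asarray(vec, dtype=np.float32)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v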

Available Quantizations

File                                 Size    Use Case
Qwen3-VL-Embedding-8B-F16.gguf       15GB    Maximum quality, baseline reference
Qwen3-VL-Embedding-8B-Q8_0.gguf      7.5GB   Recommended - minimal quality loss
Qwen3-VL-Embedding-8B-Q6_K.gguf      5.8GB   High quality, good balance
Qwen3-VL-Embedding-8B-Q5_K_M.gguf    5.1GB   Good quality, balanced size
Qwen3-VL-Embedding-8B-Q5_K_S.gguf    5.0GB   Good quality, smaller variant
Qwen3-VL-Embedding-8B-Q4_K_M.gguf    4.4GB   Decent quality, smaller size
Qwen3-VL-Embedding-8B-Q4_K_S.gguf    4.2GB   Decent quality, more compressed
Qwen3-VL-Embedding-8B-Q3_K_M.gguf    3.6GB   Lower quality, significant compression
Qwen3-VL-Embedding-8B-Q2_K.gguf      2.9GB   Lowest quality, maximum compression

Recommendation: Start with Q8_0 for production use. Use Q4_K_M or Q5_K_M for resource-constrained environments.
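
As a back-of-the-envelope sizing check, plan for at least the GGUF file size in memory plus headroom for the context (KV cache) and runtime buffers. The helper below is only an illustration; the 1.5 GB overhead figure is an assumption, not a measured value.

def fits_in_ram(gguf_size_gb, ram_gb, overhead_gb=1.5):
    """Rough check: weights occupy roughly the file size in memory,
    plus an assumed overhead for KV cache and buffers."""
    return gguf_size_gb + overhead_gb <= ram_gb

print(fits_in_ram(7.5, 16))  # Q8_0 on a 16 GB machine -> True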

Usage with llama.cpp

Installation

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

Recent llama.cpp versions build with CMake (the legacy Makefile-based build has been removed); the resulting binaries are placed in build/bin.

Download Model

huggingface-cli download dam2452/Qwen3-VL-Embedding-8B-GGUF \
  Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --local-dir ./models
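
The same file can also be fetched from Python with huggingface_hub, equivalent to the CLI call above:

from huggingface_hub import hf_hub_download

# Download the Q8_0 file from this repo into ./models
path = hf_hub_download(
    repo_id="dam2452/Qwen3-VL-Embedding-8B-GGUF",
    filename="Qwen3-VL-Embedding-8B-Q8_0.gguf",
    local_dir="./models",
)
print(path)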

Run Embedding Server

./build/bin/llama-server \
  -m models/Qwen3-VL-Embedding-8B-Q8_0.gguf \
  --embedding \
  --port 8080 \
  --host 0.0.0.0

Note that for image or video inputs, llama.cpp additionally needs the model's multimodal projector file (passed via --mmproj); text-only embedding works with the main GGUF alone.
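
Before sending requests, you can poll the server's /health endpoint to confirm the model has finished loading; llama.cpp returns HTTP 503 while loading and 200 once ready. A minimal sketch:

import time
import requests

# Poll /health until the server reports ready (200) or we give up.
for _ in range(60):
    try:
        if requests.get("http://localhost:8080/health", timeout=2).status_code == 200:
            print("server ready")
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)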

Generate Embeddings (API)

curl http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{
    "content": "Your text to embed here"
  }'
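
The server also exposes an OpenAI-compatible /v1/embeddings route; the sketch below assumes the OpenAI embeddings request schema, where "input" may be a string or a list of strings:

import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={"input": "A woman playing with her dog on a beach"},
)
resp.raise_for_status()
# OpenAI-style response: vectors live under data[i]["embedding"]
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))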

Generate Embeddings (Python)

import requests

response = requests.post(
    "http://localhost:8080/embedding",
    json={"content": "A woman playing with her dog on a beach"},
)
response.raise_for_status()

# The response shape varies across llama.cpp versions: either an object
# with an "embedding" key, or a list of such objects. Handle both.
data = response.json()
embedding = data[0]["embedding"] if isinstance(data, list) else data["embedding"]
print(f"Embedding dimension: {len(embedding)}")

Performance

Original model performance on benchmarks:

  • MMEB-V2: 77.9 overall score
  • MMTEB: 67.88 mean task score
  • Retrieval: 81.08

Note: Quantized models may show slightly reduced performance, with Q8_0 typically having less than 1% degradation.
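
One way to sanity-check quantization quality locally is to embed the same text with two quantizations (for example, two server instances on different ports) and compare the vectors. Port assignments below are illustrative:

import numpy as np
import requests

def embed_at(port, text):
    """Same request pattern as the Python example above, against a given port."""
    r = requests.post(f"http://localhost:{port}/embedding", json={"content": text})
    r.raise_for_status()
    data = r.json()
    vec = data[0]["embedding"] if isinstance(data, list) else data["embedding"]
    return np.asarray(vec, dtype=np.float32)

text = "A woman playing with her dog on a beach"
ref = embed_at(8080, text)    # e.g., F16 reference server
quant = embed_at(8081, text)  # e.g., Q4_K_M server
print(np.dot(ref, quant) / (np.linalg.norm(ref) * np.linalg.norm(quant)))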

License

Apache 2.0 (inherited from the original model)

Citation

@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv},
  year={2026}
}
