Jina AI: Your Search Foundation, Supercharged!

jina-embeddings-v5-text-small-text-matching: Text-Matching-Targeted Embedding Distillation

Elastic Inference Service | ArXiv | Release Note | Blog

Model Overview

jina-embeddings-v5-text Architecture

`jina-embeddings-v5-text-small-text-matching` is a compact, high-performance text embedding model designed for text-matching.

It is part of the jina-embeddings-v5-text model family, which also includes jina-embeddings-v5-text-nano, a smaller model for more resource-constrained use cases.

Trained using a novel approach that combines distillation with task-specific contrastive losses, jina-embeddings-v5-text-small-text-matching outperforms existing state-of-the-art models of similar size across diverse embedding benchmarks.

| Feature | Value |
|---|---|
| Parameters | 677M |
| Supported Tasks | text-matching |
| Max Sequence Length | 32768 |
| Embedding Dimension | 1024 |
| Matryoshka Dimensions | 32, 64, 128, 256, 512, 768, 1024 |
| Pooling Strategy | Last-token pooling |
| Base Model | jinaai/jina-embeddings-v5-text-small |
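Because the model is Matryoshka-trained, a full 1024-dimensional embedding can be truncated to any of the listed dimensions and re-normalized, trading a little accuracy for storage. A minimal sketch with numpy (the 1024-d vector here is a random stand-in for a real model output):

```python
import numpy as np

def truncate_matryoshka(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embedding[:dim]
    return truncated / np.linalg.norm(truncated)

# Random stand-in for a real 1024-d model embedding
full = np.random.default_rng(0).normal(size=1024)
small = truncate_matryoshka(full, 256)
print(small.shape)  # (256,)
```

Downstream cosine similarities on the truncated vectors then work exactly as with the full-size embeddings.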

[Figure: combined v5 benchmark results]

Training and Evaluation

For training details and evaluation results, see our technical report.

Usage

Requirements

The following Python packages are required:

  • transformers>=5.1.0
  • torch>=2.8.0
  • peft>=0.15.2
  • vllm>=0.15.1

Optional / Recommended

  • flash-attention: Recommended for faster and more memory-efficient inference, but not required.
  • sentence-transformers: If you want to use the model via the sentence-transformers interface, install this package as well.
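The version pins above can be installed in one step (the optional extras are the PyPI packages `flash-attn` and `sentence-transformers`):

```shell
pip install "transformers>=5.1.0" "torch>=2.8.0" "peft>=0.15.2" "vllm>=0.15.1"
# Optional extras
pip install flash-attn sentence-transformers
```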
via Elastic Inference Service

The fastest way to use v5-text in production. Elastic Inference Service (EIS) provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment.

PUT _inference/text_embedding/jina-v5
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small"
  }
}
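Once the endpoint is created, embeddings can be generated through the Elasticsearch perform-inference API; a sketch, assuming the endpoint id `jina-v5` from the request above:

```
POST _inference/text_embedding/jina-v5
{
  "input": ["A beautiful sunset over the beach"]
}
```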

See the Elastic Inference Service documentation for setup details.

via sentence-transformers
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-text-matching",
    model_kwargs={"dtype": torch.bfloat16},  # Recommended for GPUs
    config_kwargs={"_attn_implementation": "flash_attention_2"},  # Recommended but optional
)
# Optional: set truncate_dim in encode() to control embedding size

texts = [
    "A beautiful sunset over the beach",  # English
    "غروب جميل على الشاطئ",  # Arabic
    "海滩上美丽的日落",  # Chinese
    "Un beau coucher de soleil sur la plage",  # French
    "Ein wunderschöner Sonnenuntergang am Strand",  # German
    "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία",  # Greek
    "समुद्र तट पर एक खूबसूरत सूर्यास्त",  # Hindi
    "Un bellissimo tramonto sulla spiaggia",  # Italian
    "浜辺に沈む美しい夕日",  # Japanese
    "해변 위로 아름다운 일몰",  # Korean
]

# Encode texts
embeddings = model.encode(texts)
print(embeddings.shape)
# (10, 1024)

similarity = model.similarity(embeddings[0], embeddings[1:])
print(similarity)
# tensor([[0.7833, 0.8926, 0.9333, 0.9421, 0.7588, 0.9068, 0.9301, 0.8521, 0.8768]])
via vLLM
from vllm import LLM
from vllm.config.pooler import PoolerConfig

# Initialize model
name = "jinaai/jina-embeddings-v5-text-small-text-matching"
model = LLM(
    model=name,
    dtype="float16",
    runner="pooling",
    pooler_config=PoolerConfig(seq_pooling_type="LAST", normalize=True)
)

# Create text prompts
document1 = "Overview of climate change impacts on coastal cities"
document1_prompt = f"Document: {document1}"

document2 = "The impacts of climate change on large cities"
document2_prompt = f"Document: {document2}"

# Encode all prompts
prompts = [document1_prompt, document2_prompt]
outputs = model.encode(prompts, pooling_task="embed")

embed_document1 = outputs[0].outputs.data
embed_document2 = outputs[1].outputs.data
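Because the pooler is configured with `normalize=True`, similarity between the two embeddings reduces to a dot product. A self-contained illustration (the two unit vectors are toy stand-ins for `embed_document1` and `embed_document2`):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """For L2-normalized vectors, cosine similarity is just the dot product."""
    return float(np.dot(a, b))

# Toy stand-ins for the normalized vLLM embeddings
a = np.array([0.6, 0.8])
b = np.array([0.8, 0.6])
print(cosine_similarity(a, b))  # ~0.96
```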
via Text Embeddings Inference
  • Via Docker on CPU:
    docker run -p 8080:80 \
      ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
      --model-id jinaai/jina-embeddings-v5-text-small-text-matching \
      --dtype float32 --pooling last-token
    
  • Via Docker on NVIDIA GPU (Turing, Ampere, Ada Lovelace, Hopper or Blackwell):
    docker run --gpus all --shm-size 1g -p 8080:80 \
      ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
      --model-id jinaai/jina-embeddings-v5-text-small-text-matching \
      --dtype float16 --pooling last-token
    

Alternatively, you can run the server with cargo; see the Text Embeddings Inference documentation for more information.

Send a request to /v1/embeddings to generate embeddings via the OpenAI Embeddings API:

curl -X POST http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "jinaai/jina-embeddings-v5-text-small-text-matching",
    "input": [
      "Document: The impacts of climate change on coastal cities are significant..."
    ]
  }'

Alternatively, use the native Text Embeddings Inference API, which applies the task prompt for you via `prompt_name` so you don't have to format the inputs manually:

curl -X POST http://127.0.0.1:8080/embed \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": "Overview of climate change impacts on coastal cities",
    "prompt_name": "document"
  }'
via llama.cpp (GGUF)

After installing llama.cpp, you can run llama-server to host the embedding model as an OpenAI-compatible HTTP server, selecting the model version via the tag:
llama-server -hf jinaai/jina-embeddings-v5-text-small-text-matching:F16 --embedding --pooling last -ub 32768

Client:

curl -X POST "http://127.0.0.1:8080/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [
      "Document: A beautiful sunset over the beach",
      "Document: Un beau coucher de soleil sur la plage",
      "Document: 海滩上美丽的日落",
      "Document: 浜辺に沈む美しい夕日",
      "Document: Golden sunlight melts into the horizon, painting waves in warm amber and rose, while the sky whispers goodnight to the quiet, endless sea."
    ]
  }'
via Optimum (ONNX)

You can run the ONNX-optimized version of the model locally using Hugging Face's optimum library. Make sure you have the required dependencies installed (e.g., pip install optimum[onnxruntime] transformers torch):

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

model_id = "jinaai/jina-embeddings-v5-text-small-text-matching"

# 1. Load tokenizer and ONNX model
# We specify the subfolder 'onnx' where the weights are located
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = ORTModelForFeatureExtraction.from_pretrained(
    model_id,
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",  # Or "CUDAExecutionProvider" for GPU
    trust_remote_code=True,
)

# 2. Prepare input
texts = ["Document: How do I use Jina ONNX models?", "Document: Information about semantic matching."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")


# 3. Inference
with torch.no_grad():
    outputs = model(**inputs)

# 4. Pooling (crucial for jina-v5)
# Jina-v5 uses LAST-TOKEN pooling.
# We take the hidden state of the last non-padding token.
last_hidden_state = outputs.last_hidden_state
# Find the indices of the last token (usually the end of the sequence)
sequence_lengths = inputs.attention_mask.sum(dim=1) - 1
embeddings = last_hidden_state[torch.arange(last_hidden_state.size(0)), sequence_lengths]

print('embeddings shape:', embeddings.shape)
print('embeddings:', embeddings)
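The pooling step can be sanity-checked on a toy batch without loading the model: with right padding, the index of the last non-padding token is `attention_mask.sum(dim=1) - 1`. A sketch with illustrative shapes (batch 2, sequence length 4, hidden size 3), plus the L2 normalization conventionally applied before cosine similarity:

```python
import torch

# Toy batch: 2 right-padded sequences, max length 4, hidden size 3
last_hidden_state = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 0],   # real length 3
                               [1, 1, 1, 1]])  # real length 4

# Index of the last non-padding token per sequence
sequence_lengths = attention_mask.sum(dim=1) - 1  # tensor([2, 3])
pooled = last_hidden_state[torch.arange(2), sequence_lengths]

# L2-normalize so that dot products equal cosine similarities
pooled = torch.nn.functional.normalize(pooled, dim=-1)
print(pooled.shape)  # torch.Size([2, 3])
```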

License

The model is licensed under CC BY-NC 4.0. For commercial use, please contact us.

Citation

If you find jina-embeddings-v5-text-small-text-matching useful in your research, please cite the following paper:

@misc{akram2026jinaembeddingsv5texttasktargetedembeddingdistillation,
      title={jina-embeddings-v5-text: Task-Targeted Embedding Distillation}, 
      author={Mohammad Kalim Akram and Saba Sturua and Nastia Havriushenko and Quentin Herreros and Michael Günther and Maximilian Werk and Han Xiao},
      year={2026},
      eprint={2602.15547},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.15547}, 
}