# MemTok-Qwen2.5-Omni-3B

This model uses Memory Tokens (MemTok) to compress multi-vector audiovisual video representations for efficient ColBERT-style late-interaction retrieval. The weights are initialized from the thinker of Qwen2.5-Omni-3B-Instruct, finetuned with bidirectional attention on RankVideo-Dataset, and evaluated on MultiVENT 2.0 for audiovisual text-to-video retrieval.

MemTok compresses video and audio token vectors into a fixed budget of 64 vectors via learnable memory tokens that aggregate document information through attention.


## Method Overview

MemTok appends a set of m learnable memory tokens to the document token sequence. The concatenated sequence is encoded with a bidirectional transformer; after self-attention, each memory token has attended over the full document. The final hidden states of the m memory tokens form the compressed multi-vector representation used for ColBERT-style MaxSim retrieval.

*(Figure: method overview)*
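The mechanism above can be sketched in a few lines of PyTorch. This toy encoder is illustrative only (small dimensions and a generic `nn.TransformerEncoder` instead of the Qwen2.5-Omni thinker), but it shows the core idea: append m learnable memory tokens to the document sequence, encode bidirectionally, and keep the memory slots' final hidden states as the compressed multi-vector representation.

```python
import torch
import torch.nn as nn

# Toy sizes for illustration; the released model uses the Qwen2.5-Omni
# thinker (hidden dim 2048) and 64 memory tokens.
HIDDEN_DIM = 64
NUM_MEM_TOKENS = 8

class ToyMemTokEncoder(nn.Module):
    """Minimal MemTok sketch: append m learnable memory tokens to the
    document sequence, encode bidirectionally, keep only the memory slots."""

    def __init__(self, hidden_dim: int, num_mem_tokens: int) -> None:
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_mem_tokens, hidden_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.num_mem_tokens = num_mem_tokens

    def forward(self, doc_tokens: torch.Tensor) -> torch.Tensor:
        # doc_tokens: (batch, doc_len, hidden_dim)
        batch = doc_tokens.size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)
        seq = torch.cat([doc_tokens, mem], dim=1)         # append memory tokens
        hidden = self.encoder(seq)                        # bidirectional self-attention
        compressed = hidden[:, -self.num_mem_tokens:, :]  # keep memory slots only
        return torch.nn.functional.normalize(compressed, dim=-1)  # L2-normalize

encoder = ToyMemTokEncoder(HIDDEN_DIM, NUM_MEM_TOKENS)
doc = torch.randn(1, 120, HIDDEN_DIM)  # a 120-token "document"
vecs = encoder(doc)
print(vecs.shape)  # torch.Size([1, 8, 64]): fixed budget regardless of document length
```

Note that the output size depends only on the number of memory tokens, not on the document length, which is what makes the vector budget fixed.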

## Results on MultiVENT 2.0

| Method | Tokens | R@10 | nDCG@10 |
|---|---|---|---|
| SeqResize | 64 | 41.1 | 38.5 |
| MemTok (this model) | 64 | 48.7 | 44.8 |
| H-Pool | 64 | 49.2 | 46.5 |
| AGC | 64 | 49.6 | 46.3 |

## Model Details

| | |
|---|---|
| Initial weights | Qwen2.5-Omni-3B-Instruct (thinker) |
| Architecture | Qwen2.5-Omni (thinker) with bidirectional attention |
| Hidden dimension | 2048 |
| Compression method | MemTok (memory tokens) |
| Memory tokens | 64 learned tokens (`<|mem0|>` – `<|mem63|>`) appended to the document |
| Default budget | 64 vectors per document |
| Scoring | ColBERT-style MaxSim (late interaction) |
| Normalization | L2-normalized embeddings |
| Query prefix | `"Query: "` |
| Passage prefix | `"Passage: "` |
| Precision | bfloat16 |
| Training video frames | 24 |
| Audio sampling rate | 4 kHz |
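Scoring is ColBERT-style MaxSim over the L2-normalized vectors: each query vector takes its maximum similarity over the document's vectors, and these maxima are summed. A minimal sketch in plain PyTorch (this is an illustration of the scoring rule, not the repo's `compute_similarity` implementation):

```python
import torch

def maxsim_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction.

    q: (num_query_tokens, dim) and d: (num_doc_tokens, dim), both assumed
    L2-normalized so dot products are cosine similarities.
    """
    sim = q @ d.T                        # (nq, nd) pairwise similarities
    return sim.max(dim=1).values.sum()   # best doc match per query token, summed

# Random stand-ins for a 5-token query and a 64-vector MemTok document.
q = torch.nn.functional.normalize(torch.randn(5, 2048), dim=-1)
d = torch.nn.functional.normalize(torch.randn(64, 2048), dim=-1)
score = maxsim_score(q, d)
print(score)
```

Because every vector is unit-norm, each per-token maximum is at most 1, so the score is bounded by the number of query tokens.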

## Usage

```python
import torch
from transformers import AutoProcessor
from qwen_omni_utils import process_mm_info

from src.arguments import ModelArguments
from src.encoder.multivec_encoder import MultiVecEncoder
from src.models.qwen2_5_omni_embed.qwen2_5_omni_embed import Qwen2_5OmniForEmbedding
from src.utils import get_appending_token_strings

MODEL_ID = "PLACEHOLDER"
VIDEO_PATH = "PLACEHOLDER"
AUDIO_PATH = "PLACEHOLDER"
NUM_MEMORY_TOKENS = 64
APPENDING_SUFFIX = "".join(get_appending_token_strings(NUM_MEMORY_TOKENS))

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="memory",
    normalize=True,
    num_appending_token=NUM_MEMORY_TOKENS,
    use_parametric_appending_tokens=True,
    attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = MultiVecEncoder.load(
    Qwen2_5OmniForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode a video+audio document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "video", "video": VIDEO_PATH, "nframes": 24, "max_pixels": 75264, "min_pixels": 65856},
            {"type": "audio", "audio": AUDIO_PATH},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
text += APPENDING_SUFFIX  # append the <|mem0|>..<|mem63|> memory tokens to the document
audio_inputs, image_inputs, video_inputs = process_mm_info([passage_messages], use_audio_in_video=False)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, audio=audio_inputs, padding=True, return_tensors="pt",
).to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
        print(doc_embeddings.shape)
        # doc_embeddings: (1, 64, 2048) — 64 MemTok vectors

# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: a person is cooking"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
        print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
```

## Command line usage

For running inference and evaluation from the command line, see the Quick Start section.

## Citation

```bibtex
@misc{qin2026multivectorindexcompressionmodality,
      title={Multi-Vector Index Compression in Any Modality},
      author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
      year={2026},
      eprint={2602.21202},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.21202},
}
```
