SeqResize-Qwen2.5-Omni-3B

This model uses SeqResize (sequence resizing) to compress multi-vector audiovisual video representations for ColBERT-style late-interaction retrieval. Model weights are initialized from Qwen2.5-Omni-3B-Instruct (thinker part only) and fine-tuned with bidirectional attention on RankVideo-Dataset for audiovisual text-to-video retrieval. The model is evaluated on MultiVENT 2.0.

SeqResize compresses video and audio token vectors into a fixed budget of 64 vectors through projection along the sequence dimension.

Method Overview

SeqResize is a simple compression baseline: an MLP (or linear) layer projects the sequence dimension from a fixed input length to a fixed output length. Variable-length token sequences are trimmed or padded to the input length, then projected to the target number of vectors for ColBERT-style MaxSim retrieval. It is not the main method in our paper; we include it as a baseline.
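The trim-or-pad step and the projection along the sequence axis can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the repository's implementation: zero padding, trimming from the end, the ReLU activation, and the function names `trim_or_pad` and `seq_resize` are all assumed here.

```python
import numpy as np


def trim_or_pad(tokens: np.ndarray, input_len: int) -> np.ndarray:
    """Force a (seq_len, dim) array to exactly (input_len, dim)."""
    seq_len, dim = tokens.shape
    if seq_len >= input_len:
        return tokens[:input_len]  # trim: drop tokens past the fixed length (assumed)
    padding = np.zeros((input_len - seq_len, dim), dtype=tokens.dtype)
    return np.concatenate([tokens, padding], axis=0)  # pad with zero vectors (assumed)


def seq_resize(tokens, w_in, b_in, w_out, b_out):
    """Two-layer MLP applied along the sequence axis, not the feature axis.

    tokens: (input_len, dim); w_in: (hidden, input_len); w_out: (budget, hidden).
    Returns an array of shape (budget, dim).
    """
    columns = tokens.T  # (dim, input_len): each feature channel is one MLP input
    hidden = np.maximum(columns @ w_in.T + b_in, 0.0)  # ReLU is an assumed activation
    return (hidden @ w_out.T + b_out).T  # back to (budget, dim)


# Toy sizes; the released checkpoint uses 1536 -> 384 -> 64 with hidden dimension 2048.
rng = np.random.default_rng(0)
input_len, hidden_size, budget, dim = 12, 6, 4, 8
w_in = rng.normal(size=(hidden_size, input_len))
b_in = np.zeros(hidden_size)
w_out = rng.normal(size=(budget, hidden_size))
b_out = np.zeros(budget)

short_doc = rng.normal(size=(5, dim))   # fewer tokens than input_len: padded
long_doc = rng.normal(size=(30, dim))   # more tokens than input_len: trimmed
compressed_short = seq_resize(trim_or_pad(short_doc, input_len), w_in, b_in, w_out, b_out)
compressed_long = seq_resize(trim_or_pad(long_doc, input_len), w_in, b_in, w_out, b_out)
print(compressed_short.shape, compressed_long.shape)
```

Both documents come out with the same number of vectors regardless of their original length, which is what makes the per-document budget fixed.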

Results on MultiVENT 2.0

Method	Tokens	R@10	nDCG@10
SeqResize (this model)	64	41.1	38.5
MemTok	64	48.7	44.8
H-Pool	64	49.2	46.5
AGC	64	49.6	46.3
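For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for a single query. It assumes binary relevance and the standard log2 discount; the paper's evaluation script may use graded judgments or a different convention. The table values appear to be percentages averaged over queries. The document IDs are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-gain nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets 1/log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    ideal_dcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0


ranked = ["v7", "v2", "v9", "v4"]  # hypothetical system output, best first
relevant = {"v2", "v4", "v5"}      # hypothetical relevance judgments
print(recall_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```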

Model Details


Initial weights	Qwen2.5-Omni-3B-Instruct (thinker)
Architecture	Qwen2.5-Omni (thinker) with bidirectional attention
Hidden dimension	2048
Compression method	SeqResize (learned sequence projection)
Resizer input size	1536 (fixed sequence length before projection)
Resizer output size	64 (budget)
Resizer hidden size	384 (MLP bottleneck)
Default budget	64 vectors per document
Scoring	ColBERT-style MaxSim (late interaction)
Normalization	L2-normalized embeddings
Query prefix	`"Query: "`
Passage prefix	`"Passage: "`
Precision	bfloat16
Training video frames	24
Audio sampling rate	4 kHz
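The scoring and normalization rows can be illustrated with a small NumPy sketch of MaxSim over L2-normalized vectors. It assumes the original ColBERT formulation (sum over query vectors of the maximum similarity to any document vector); the repository's `compute_similarity` may differ in details such as averaging or mask handling. The helper names `l2_normalize` and `maxsim_score` are hypothetical.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each vector (last axis) to unit length."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.maximum(norms, eps)


def maxsim_score(query_vecs, doc_vecs, query_mask=None, doc_mask=None) -> float:
    """Late-interaction score: sum over query vectors of the best document match.

    query_vecs: (num_query, dim); doc_vecs: (num_doc, dim).
    Masks are boolean arrays marking real (non-padding) vectors.
    """
    sims = l2_normalize(query_vecs) @ l2_normalize(doc_vecs).T  # cosine similarities
    if doc_mask is not None:
        sims = np.where(doc_mask[None, :], sims, -np.inf)  # padding never wins the max
    best_per_query = sims.max(axis=1)
    if query_mask is not None:
        best_per_query = np.where(query_mask, best_per_query, 0.0)  # skip padded queries
    return float(best_per_query.sum())


rng = np.random.default_rng(0)
query = rng.normal(size=(5, 16))   # 5 query token vectors (toy dimension 16)
doc = rng.normal(size=(64, 16))    # 64 compressed document vectors
print(maxsim_score(query, doc))
```

Because every vector has unit length, each per-query maximum is a cosine similarity, so the score is bounded above by the number of query vectors.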

Usage

Important: When loading this model, you must set resizer_input_size, resizer_output_size, and resizer_hidden_size to match the trained checkpoint (1536, 64, and 384 for this release, as listed under Model Details). The extra_encoder_state.safetensors file must also be present in the model directory so that the sequence resizer weights are loaded.

import torch
from transformers import AutoProcessor
from qwen_omni_utils import process_mm_info

from src.arguments import ModelArguments
from src.encoder.resize_encoder import SequenceResizerEncoder
from src.models.qwen2_5_omni_embed.qwen2_5_omni_embed import Qwen2_5OmniForEmbedding

MODEL_ID = "PLACEHOLDER"
VIDEO_PATH = "PLACEHOLDER"
AUDIO_PATH = "PLACEHOLDER"
RESIZER_INPUT_SIZE = 1536
RESIZER_OUTPUT_SIZE = 64
RESIZER_HIDDEN_SIZE = 384

# --- Setup ---
model_args = ModelArguments(
    model_name_or_path=MODEL_ID,
    pooling="resize",
    normalize=True,
    resizer_input_size=RESIZER_INPUT_SIZE,
    resizer_output_size=RESIZER_OUTPUT_SIZE,
    resizer_hidden_size=RESIZER_HIDDEN_SIZE,
    attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = SequenceResizerEncoder.load(
    Qwen2_5OmniForEmbedding,
    model_args,
    attn_implementation=model_args.attn_implementation,
    dtype=torch.bfloat16,
)
model = model.to("cuda").eval()

# --- Encode a video+audio document ---
passage_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Passage: "},
            {"type": "video", "video": VIDEO_PATH, "nframes": 24, "max_pixels": 75264, "min_pixels": 65856},
            {"type": "audio", "audio": AUDIO_PATH},
        ],
    }
]
text = processor.apply_chat_template(passage_messages, tokenize=False, add_generation_prompt=False)
audio_inputs, image_inputs, video_inputs = process_mm_info([passage_messages], use_audio_in_video=False)
passage_inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, audio=audio_inputs, padding=True, return_tensors="pt",
).to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        doc_embeddings, doc_mask = model.encode(passage_inputs, is_query=False)
        print(doc_embeddings.shape)
        # doc_embeddings: (1, 64, 2048) — 64 compressed vectors

# --- Encode a text query ---
query_messages = [{"role": "user", "content": [{"type": "text", "text": "Query: a person is cooking"}]}]
query_text = processor.apply_chat_template(query_messages, tokenize=False, add_generation_prompt=False)
query_inputs = processor(text=[query_text], padding=True, return_tensors="pt").to("cuda")

with torch.amp.autocast(device_type="cuda", dtype=torch.bfloat16):
    with torch.inference_mode():
        query_embeddings, query_mask = model.encode(query_inputs, is_query=True)
        print(query_embeddings.shape)

# --- ColBERT MaxSim scoring ---
score = model.compute_similarity(query_embeddings, doc_embeddings, query_mask, doc_mask)
print(f"Similarity score: {score.item():.4f}")
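The document embedding printed above has shape (1, 64, 2048), so the index footprint per document is fixed. The arithmetic can be sketched as below, assuming the vectors are stored as bfloat16 (2 bytes per value) with no further quantization or index overhead; `index_bytes` is a hypothetical helper, not part of the repository.

```python
def index_bytes(num_docs: int, vectors_per_doc: int = 64, dim: int = 2048,
                bytes_per_value: int = 2) -> int:
    """Raw storage for multi-vector embeddings (assumes 2-byte bfloat16 values)."""
    return num_docs * vectors_per_doc * dim * bytes_per_value


per_doc = index_bytes(1)
print(per_doc)  # 262144 bytes, i.e. 256 KiB per document

# Relative to storing every position of the 1536-long resizer input instead:
uncompressed = index_bytes(1, vectors_per_doc=1536)
print(uncompressed // per_doc)  # 24
```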

Command line usage

For running inference and evaluation from the command line, see the Quick Start section.

Citation

@misc{qin2026multivectorindexcompressionmodality,
      title={Multi-Vector Index Compression in Any Modality}, 
      author={Hanxiang Qin and Alexander Martin and Rohan Jha and Chunsheng Zuo and Reno Kriz and Benjamin Van Durme},
      year={2026},
      eprint={2602.21202},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2602.21202}, 
}
