ColQwen3 8B - VetCoders MLX Edition

Visual document retrieval model with ColBERT-style late interaction (MaxSim scoring), optimized for Apple Silicon via MLX.

Created by M&K (c)2025 The LibraxisAI Team

Model Description

ColQwen3-VetCoders-MLX is a visual document retrieval model converted to Apple MLX format. It produces multi-vector embeddings for both document images and text queries, enabling precise visual document search using late interaction (MaxSim) scoring.

Key Features

  • Visual Document Retrieval - Find relevant pages in PDF documents using image understanding
  • Late Interaction Ranking - ColBERT-style MaxSim scoring for precision
  • Multi-modal Embeddings - Embed both images and text queries into shared 320-dim space
  • Apple Silicon Native - Optimized for M1/M2/M3/M4 via MLX framework

Architecture

Input (Image or Text)
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Qwen3-VL Vision Encoder   β”‚  ← For images: extract visual features
β”‚   (frozen ViT patches)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Qwen3 Language Model      β”‚  ← Multimodal token processing
β”‚   (7B parameters)           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Projection Layer          β”‚  ← Project to 320-dim embedding space
β”‚   (4096 β†’ 320)              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
Multi-vector embeddings [N, 320]
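
The final projection is a single linear map from the 4096-dim language-model hidden states to the 320-dim embedding space. A minimal MLX sketch of that step, assuming the weight lives under the key embedding_proj_layer.weight in the separate projection file (the key name and the L2-normalization are assumptions, not taken from the conversion code):

import mlx.core as mx

# Load the separately shipped projection weights (see the Files section below)
proj = mx.load("colqwen3_projection.safetensors")
W = proj["embedding_proj_layer.weight"]   # assumed key; expected shape [320, 4096]

hidden = mx.random.normal((128, 4096))    # stand-in for per-token hidden states from the LM
vectors = hidden @ W.T                    # [128, 320] multi-vector embedding
vectors = vectors / mx.linalg.norm(vectors, axis=-1, keepdims=True)  # per-vector L2 norm, typical for MaxSim scoring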

Usage

Installation

pip install mlx mlx-vlm safetensors pillow

Loading the Model

from colqwen3_embedder import ColQwen3Embedder

# Initialize embedder (uses env vars or default paths)
embedder = ColQwen3Embedder()
embedder.load()

# Or specify paths directly
embedder = ColQwen3Embedder(
    model_path="LibraxisAI/colqwen3-8b-vetcoders-mlx",
    projection_path="path/to/projection.safetensors"
)

Embedding Documents

# Embed a document image
doc_embedding = embedder.embed_image("document_page.png")
# Returns: EmbeddingResult with shape [num_patches, 320]

# Embed a text query
query_embedding = embedder.embed_text("What is the treatment protocol?")
# Returns: EmbeddingResult with shape [num_tokens, 320]

Scoring Relevance

# MaxSim scoring for retrieval
score = embedder.maxsim_score(query_embedding, doc_embedding)
# Higher score = more relevant document
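
For reference, MaxSim takes each query vector's maximum similarity over all document vectors and sums those maxima. A minimal sketch of that computation (the embedder's maxsim_score may differ in details such as normalization):

import mlx.core as mx

def maxsim(query, doc):
    # query: [num_query_tokens, 320], doc: [num_doc_patches, 320]
    sims = query @ doc.T                         # pairwise similarity matrix [Q, D]
    return mx.sum(mx.max(sims, axis=-1)).item()  # best document match per query token, summed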

Batch Processing

# Process multiple documents
documents = ["page1.png", "page2.png", "page3.png"]
doc_embeddings = [embedder.embed_image(doc) for doc in documents]

# Score all documents against query
scores = [embedder.maxsim_score(query_embedding, doc) for doc in doc_embeddings]

# Get top matches
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
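
For example, to print the best-matching pages first:

for path, score in ranked[:3]:
    print(f"{score:.3f}  {path}")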

Technical Details

Base Model

Converted from tomoro-ai/Colqwen3-8B-base, which was trained on:

  • vidore/colpali_train_set
  • Additional document understanding datasets

Weight Mapping

The original Tomoro weights are remapped to an MLX-compatible structure (see the sketch after this list):

  • vlm.model.language_model.* β†’ language_model.model.*
  • vlm.model.visual.* β†’ vision_tower.*
  • embedding_proj_layer.* β†’ saved separately as projection weights
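
A hypothetical remapping helper that illustrates these rules (not the actual conversion script; prefix handling is an assumption):

def remap_key(key: str):
    """Map an original Tomoro weight name to its MLX-side name; None means 'handled separately'."""
    if key.startswith("vlm.model.language_model."):
        return key.replace("vlm.model.language_model.", "language_model.model.", 1)
    if key.startswith("vlm.model.visual."):
        return key.replace("vlm.model.visual.", "vision_tower.", 1)
    if key.startswith("embedding_proj_layer."):
        return None  # written to the separate projection safetensors file
    return key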

Embedding Details

  • Dimension: 320 (projected from 4096)
  • Image tokens: Variable based on image resolution (patches)
  • Text tokens: Variable based on query length
  • Scoring: MaxSim (maximum similarity) late interaction

Performance

Tested on Apple Silicon:

Device             Image Embedding   Text Embedding   Memory
M3 Max (128GB)     ~1.2s             ~0.3s            ~17GB
M3 Ultra (512GB)   ~0.8s             ~0.2s            ~17GB
M2 Ultra (192GB)   ~1.5s             ~0.4s            ~17GB

Files

colqwen3-8b-vetcoders-mlx/
β”œβ”€β”€ config.json                    # Model configuration
β”œβ”€β”€ model-00001-of-00007.safetensors  # Model weights (sharded)
β”œβ”€β”€ model-00002-of-00007.safetensors
β”œβ”€β”€ ...
β”œβ”€β”€ model.safetensors.index.json   # Weight index
β”œβ”€β”€ tokenizer.json                 # Tokenizer
β”œβ”€β”€ tokenizer_config.json
β”œβ”€β”€ preprocessor_config.json       # Image preprocessor
└── video_preprocessor_config.json

Projection weights (separate file):

colqwen3_projection.safetensors    # 4096β†’320 projection layer

Limitations

  • Requires Apple Silicon (M1/M2/M3/M4) for MLX acceleration
  • Large memory footprint (~17GB for inference)
  • Optimized for document images, not general photos

Citation

@misc{colqwen3-vetcoders-mlx,
  author = {LibraxisAI Team},
  title = {ColQwen3 8B - VetCoders MLX Edition},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/LibraxisAI/colqwen3-8b-vetcoders-mlx}}
}

License

Apache 2.0


Created by M&K (c)2025 The LibraxisAI Team. Co-Authored-By: Maciej & Klaudiusz
