ColQwen3 8B - VetCoders MLX Edition

Visual document retrieval model with ColBERT-style late interaction (MaxSim scoring), optimized for Apple Silicon via MLX.

Created by M&K (c)2025 The LibraxisAI Team

Model Description

ColQwen3-VetCoders-MLX is a visual document retrieval model converted to Apple MLX format. It produces multi-vector embeddings for both document images and text queries, enabling precise visual document search using late interaction (MaxSim) scoring.

Key Features

  • Visual Document Retrieval - Find relevant pages in PDF documents using image understanding
  • Late Interaction Ranking - ColBERT-style MaxSim scoring for precision
  • Multi-modal Embeddings - Embed both images and text queries into shared 320-dim space
  • Apple Silicon Native - Optimized for M1/M2/M3/M4 via MLX framework

Architecture

Input (Image or Text)
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Qwen3-VL Vision Encoder   β”‚  ← For images: extract visual features
β”‚   (frozen ViT patches)      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Qwen3 Language Model      β”‚  ← Multimodal token processing
β”‚   (7B parameters)           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Projection Layer          β”‚  ← Project to 320-dim embedding space
β”‚   (4096 β†’ 320)              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         ↓
Multi-vector embeddings [N, 320]
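
The final projection is a single linear map from the 4096-dim language-model hidden states to the 320-dim embedding space. A minimal MLX sketch of that step, assuming the weight lives under the key embedding_proj_layer.weight in the separate projection file (the key name and the L2-normalization are assumptions, not taken from the conversion code):

import mlx.core as mx

# Load the separately shipped projection weights (see the Files section below)
proj = mx.load("colqwen3_projection.safetensors")
W = proj["embedding_proj_layer.weight"]   # assumed key; expected shape [320, 4096]

hidden = mx.random.normal((128, 4096))    # stand-in for per-token hidden states from the LM
vectors = hidden @ W.T                    # [128, 320] multi-vector embedding
vectors = vectors / mx.linalg.norm(vectors, axis=-1, keepdims=True)  # per-vector L2 norm, typical for MaxSim scoring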

Usage

Installation

pip install mlx mlx-vlm safetensors pillow

Loading the Model

from colqwen3_embedder import ColQwen3Embedder

# Initialize embedder (uses env vars or default paths)
embedder = ColQwen3Embedder()
embedder.load()

# Or specify paths directly
embedder = ColQwen3Embedder(
    model_path="LibraxisAI/colqwen3-8b-vetcoders-mlx",
    projection_path="path/to/projection.safetensors"
)

Embedding Documents

# Embed a document image
doc_embedding = embedder.embed_image("document_page.png")
# Returns: EmbeddingResult with shape [num_patches, 320]

# Embed a text query
query_embedding = embedder.embed_text("What is the treatment protocol?")
# Returns: EmbeddingResult with shape [num_tokens, 320]

Scoring Relevance

# MaxSim scoring for retrieval
score = embedder.maxsim_score(query_embedding, doc_embedding)
# Higher score = more relevant document
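
For reference, MaxSim takes each query vector's maximum similarity over all document vectors and sums those maxima. A minimal sketch of that computation (the embedder's maxsim_score may differ in details such as normalization):

import mlx.core as mx

def maxsim(query, doc):
    # query: [num_query_tokens, 320], doc: [num_doc_patches, 320]
    sims = query @ doc.T                         # pairwise similarity matrix [Q, D]
    return mx.sum(mx.max(sims, axis=-1)).item()  # best document match per query token, summed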

Batch Processing

# Process multiple documents
documents = ["page1.png", "page2.png", "page3.png"]
doc_embeddings = [embedder.embed_image(doc) for doc in documents]

# Score all documents against query
scores = [embedder.maxsim_score(query_embedding, doc) for doc in doc_embeddings]

# Get top matches
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
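
For example, to print the best-matching pages first:

for path, score in ranked[:3]:
    print(f"{score:.3f}  {path}")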

Technical Details

Base Model

Converted from tomoro-ai/Colqwen3-8B-base, which was trained on:

  • vidore/colpali_train_set
  • Additional document understanding datasets

Weight Mapping

The original Tomoro weights are remapped to an MLX-compatible structure (see the sketch after this list):

  • vlm.model.language_model.* β†’ language_model.model.*
  • vlm.model.visual.* β†’ vision_tower.*
  • embedding_proj_layer.* β†’ saved separately as projection weights
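
A hypothetical remapping helper that illustrates these rules (not the actual conversion script; prefix handling is an assumption):

def remap_key(key: str):
    """Map an original Tomoro weight name to its MLX-side name; None means 'handled separately'."""
    if key.startswith("vlm.model.language_model."):
        return key.replace("vlm.model.language_model.", "language_model.model.", 1)
    if key.startswith("vlm.model.visual."):
        return key.replace("vlm.model.visual.", "vision_tower.", 1)
    if key.startswith("embedding_proj_layer."):
        return None  # written to the separate projection safetensors file
    return key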

Embedding Details

  • Dimension: 320 (projected from 4096)
  • Image tokens: Variable based on image resolution (patches)
  • Text tokens: Variable based on query length
  • Scoring: MaxSim (maximum similarity) late interaction

Performance

Tested on Apple Silicon:

Device             Image Embedding   Text Embedding   Memory
M3 Max (128GB)     ~1.2s             ~0.3s            ~17GB
M3 Ultra (512GB)   ~0.8s             ~0.2s            ~17GB
M2 Ultra (192GB)   ~1.5s             ~0.4s            ~17GB

Files

colqwen3-8b-vetcoders-mlx/
β”œβ”€β”€ config.json                    # Model configuration
β”œβ”€β”€ model-00001-of-00007.safetensors  # Model weights (sharded)
β”œβ”€β”€ model-00002-of-00007.safetensors
β”œβ”€β”€ ...
β”œβ”€β”€ model.safetensors.index.json   # Weight index
β”œβ”€β”€ tokenizer.json                 # Tokenizer
β”œβ”€β”€ tokenizer_config.json
β”œβ”€β”€ preprocessor_config.json       # Image preprocessor
└── video_preprocessor_config.json

Projection weights (separate file):

colqwen3_projection.safetensors    # 4096β†’320 projection layer

Limitations

  • Requires Apple Silicon (M1/M2/M3/M4) for MLX acceleration
  • Large memory footprint (~17GB for inference)
  • Optimized for document images, not general photos

Citation

@misc{colqwen3-vetcoders-mlx,
  author = {LibraxisAI Team},
  title = {ColQwen3 8B - VetCoders MLX Edition},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/LibraxisAI/colqwen3-8b-vetcoders-mlx}}
}

License

Apache 2.0


Created by M&K (c)2025 The LibraxisAI Team. Co-Authored-By: Maciej & Klaudiusz
