Molmo2-4B-FP8

FP8 quantized version of allenai/Molmo2-4B for efficient inference.

Model Details

Property         Value
Base Model       allenai/Molmo2-4B
Quantization     FP8 (8-bit floating point)
Format           compressed-tensors
Model Size       ~8 GB (vs. ~16 GB original)
Vision Backbone  Full precision (not quantized)

Quantization Details

  • Method: FP8 quantization using llmcompressor
  • Target Layers: Linear layers (excluding vision backbone, lm_head, mlp.gate)
  • Precision: 8-bit symmetric floating point for both weights and activations

Usage with vLLM

Important: This model requires a custom vLLM build with FP8 quantized weight mapping support for Molmo2.

Step 1: Start Docker Container

docker run -it --gpus all \
  --entrypoint /bin/bash \
  -e SETUPTOOLS_SCM_PRETEND_VERSION=0.9.0 \
  -v /path/to/your/models:/workspace/models \
  -p 8000:8000 \
  vllm/vllm-openai:latest

Step 2: Build Custom vLLM

Inside the container:

git clone https://github.com/George-Polya/vllm.git -b dev/molmo2-quantize
cd vllm
pip install --no-build-isolation -e .
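
Because the vllm/vllm-openai image already ships with a stock vLLM install, it is worth confirming that Python now resolves the editable build from the cloned repository:

python -c "import vllm; print(vllm.__version__, vllm.__file__)"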

Step 3: Serve the Model

vllm serve /workspace/models/Molmo2-4B-FP8 \
  --served-model-name Molmo2-4B-FP8 \
  --trust-remote-code \
  --max-model-len 4096 \
  --max-num-batched-tokens 8192
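
Once the server is up, a quick request to the models endpoint confirms it exposes the name set by --served-model-name (the same name the client below uses):

curl http://localhost:8000/v1/models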

Step 4: Query the Model

from openai import OpenAI
import base64

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With image URL
response = client.chat.completions.create(
    model="Molmo2-4B-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
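
The base64 import above is there for local images: the OpenAI-compatible API also accepts images inlined as data URLs. Continuing with the same client, and using a placeholder file path:

# With a local image, encoded as a base64 data URL
with open("/path/to/local/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Molmo2-4B-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Describe this image."}
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)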

Why Custom vLLM Build?

The official vLLM does not yet support FP8 quantized weight loading for Molmo2's vision backbone. The custom branch adds:

  1. prefix parameter to vision layers for proper weight name mapping
  2. Extended hf_to_vllm_mapper patterns for quantized weight names

See: George-Polya/vllm@dev/molmo2-quantize
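
For orientation only, this is the general shape of the vLLM weight-name mapper that item 2 refers to. The prefix rule below is purely illustrative, not the actual Molmo2 mapping; see the branch itself for the real patterns:

from vllm.model_executor.models.utils import WeightsMapper

# Illustrative sketch: remap checkpoint weight names produced by the
# compressed-tensors export to the module names the vLLM model expects.
hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={
        # hypothetical rule: route plain language-model names into the
        # language_model submodule used inside vLLM
        "model.": "language_model.model.",
    },
)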

Quantization Recipe

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: ['re:.*lm_head', 're:.*vision_backbone.*', 're:.*mlp.gate$']
      scheme: FP8
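
A minimal sketch of how a recipe like this is applied with llmcompressor's one-shot API. The loading class and save path are assumptions, and the static FP8 scheme above normally needs a small calibration pass for activation scales, which is omitted here:

from transformers import AutoModelForCausalLM, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "allenai/Molmo2-4B"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Same recipe as the YAML above, expressed as a modifier object
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8",
    ignore=["re:.*lm_head", "re:.*vision_backbone.*", "re:.*mlp.gate$"],
)

# Calibration dataset arguments are omitted in this sketch
oneshot(model=model, recipe=recipe)

model.save_pretrained("Molmo2-4B-FP8", save_compressed=True)
processor.save_pretrained("Molmo2-4B-FP8")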

Limitations

  • Vision backbone remains in full precision to preserve image understanding quality
  • Requires custom vLLM build (not compatible with stock vLLM)
  • FP8 requires hardware support (NVIDIA Ada Lovelace / Hopper or newer); a quick capability check is shown below
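
If you are unsure whether a GPU has native FP8 support, its CUDA compute capability tells you (8.9 or higher covers Ada Lovelace, Hopper, and later):

import torch

major, minor = torch.cuda.get_device_capability()
print("Native FP8 support:", (major, minor) >= (8, 9))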

License

This model inherits the Apache 2.0 license from the base model.
