# Qwen3.5-9B-heretic-NVFP4A16
Static NVFP4A16 (weight-only FP4) quantization of coder3101/Qwen3.5-9B-heretic (abliterated via Heretic).
**Important:** This is a static, weight-only quantization using round-to-nearest (RTN) — no calibration data, no activation quantization. Weights are FP4 (E2M1); activations stay BF16 at runtime. This means:
- The Marlin kernel dequantizes FP4 weights → BF16 before compute (BF16xBF16 matmul)
- Native Blackwell FP4 tensor cores are NOT used — those require W4A4 (both weights and activations in FP4)
- A W4A4 model with calibrated activation scales (`input_global_scale`) is needed for native FP4xFP4 acceleration
- Quality may be slightly lower than with a calibration-aware quantization (GPTQ, etc.)
This model is useful for validating the NVFP4 pipeline and for inference where memory savings matter more than peak compute throughput.
## Quantization Details
| Parameter | Value |
|---|---|
| Method | NVFP4A16 — static, weight-only FP4, RTN (round-to-nearest) |
| Weight dtype | E2M1 (4-bit float) |
| Block scale | FP8 E4M3, one per 16 elements |
| Global scale | FP32, one per tensor |
| Format | nvfp4-pack-quantized (compressed-tensors) |
| Calibration | None — pure RTN, no forward passes |
| Original size | 18.8 GB (BF16) |
| Quantized size | 11.2 GB (59.5%) |
| Tensors quantized | 128 |
| Tensors kept BF16 | 632 |
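The RTN scheme in the table can be sketched numerically. Below is a minimal NumPy illustration of weight-only FP4 round-to-nearest with one scale per 16-element block; it deliberately omits the FP8 E4M3 quantization of the block scales and the per-tensor FP32 global scale, and all names are illustrative, not taken from the actual quantizer:

```python
import numpy as np

# Representable E2M1 (FP4) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_dequant_fp4(w, block_size=16):
    """RTN weight quantization: scale each block so its max |w| maps to 6.0,
    snap every element to the nearest signed E2M1 value, then dequantize."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / E2M1[-1]
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero blocks
    scaled = np.abs(blocks) / scale
    nearest = E2M1[np.argmin(np.abs(scaled[:, :, None] - E2M1), axis=-1)]
    return (np.sign(blocks) * nearest * scale).reshape(w.shape)
```

In the real pipeline the block scales are themselves stored as FP8 E4M3 and rescaled by the FP32 global scale, which adds a second error term this sketch ignores.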
### What's quantized
All standard linear projections: attention (q/k/v/o_proj), MLP (gate/up/down_proj).
### What's kept in BF16
- All GatedDeltaNet (linear attention) layers — more sensitive at FP4 than FP8
- Embeddings, `lm_head`, all norms
- Vision tower (entire)
- MTP head
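The split above (128 tensors quantized, 632 kept in BF16) boils down to a name filter over the model's linear layers. A hypothetical sketch, assuming suffix-style module names — the exact Qwen3.5 module paths may differ:

```python
# Suffixes of the linear projections that receive FP4 weights
QUANT_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj")
# Substrings marking tensors kept in BF16 (names are illustrative)
SKIP_MARKERS = ("linear_attn", "visual", "embed", "lm_head", "norm", "mtp")

def should_quantize(module_name: str) -> bool:
    """Quantize only standard attention/MLP projections; keep the sensitive
    parts (GatedDeltaNet, vision tower, embeddings, norms, MTP) in BF16."""
    if any(marker in module_name for marker in SKIP_MARKERS):
        return False
    return module_name.endswith(QUANT_SUFFIXES)
```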
### Quantization accuracy (roundtrip)
| Metric | Value |
|---|---|
| Max absolute error | 0.058 |
| Mean absolute error | 0.0008 |
## Usage with vLLM
```bash
vllm serve nivvis/Qwen3.5-9B-heretic-NVFP4A16 \
  --max-num-seqs 32 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
```
Or from Python:

```python
from vllm import LLM, SamplingParams

model = LLM("nivvis/Qwen3.5-9B-heretic-NVFP4A16", trust_remote_code=True)
outputs = model.generate(["Hello"], SamplingParams(max_tokens=64))
```
## Performance
| Metric | Value |
|---|---|
| VRAM | 10.58 GiB |
| Decode throughput | ~140 t/s (single request, not concurrent) |
| Hardware tested | NVIDIA RTX PRO 6000 Blackwell Max-Q |
| Kernel | Marlin FP4 (weight dequant → BF16 compute) |
| vLLM version | 0.17.0rc1 |
## Why not native FP4 tensor cores?
Native Blackwell FP4 tensor cores perform FP4xFP4 matmuls — both weights and activations must be FP4. This model is W4A16 (FP4 weights, BF16 activations), so the compute path is:
1. The Marlin kernel dequantizes FP4 weights → BF16
2. Standard BF16xBF16 matmul
To hit native FP4 tensor cores, a W4A4 model is needed. That requires calibration-aware quantization (forward passes to compute per-tensor activation scales). W4A4 quantization is a planned next step.
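The difference between the two paths can be made concrete with a toy NumPy sketch. The `fp4` helper just snaps values to the E2M1 grid (pure numerics, not a kernel), and `act_scale` stands in for the calibrated `input_global_scale` a W4A4 model would ship:

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4(x, scale):
    """Toy FP4: snap |x|/scale to the nearest E2M1 value."""
    s = np.abs(x) / scale
    return np.sign(x) * E2M1[np.argmin(np.abs(s[..., None] - E2M1), axis=-1)] * scale

w = np.random.default_rng(1).normal(size=(16, 16))
a = np.random.default_rng(2).normal(size=(4, 16))
w_q = fp4(w, np.abs(w).max() / 6.0)  # FP4 weights, already dequantized here

# W4A16 (this model): full-precision activations hit dequantized weights
y_w4a16 = a @ w_q

# W4A4 (native tensor cores): activations must be FP4 too, which needs
# an activation scale estimated from calibration forward passes
act_scale = np.abs(a).max() / 6.0  # a calibration pass would estimate this
y_w4a4 = fp4(a, act_scale) @ w_q
```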
## Known Issues
- SGLang 0.5.6.post2: Cannot serve this model due to a bug in compressed-tensors scheme dispatch (`input_quant` is None for weight-only quant). Fixed on SGLang main branch.
- No MTP: Heretic abliteration strips MTP weights. Do not use speculation flags.
- Thinking model: Very verbose chain-of-thought. Use a system prompt like "Do not use a thinking block" or set a high `max_tokens` (8000+).
## Quantization Tooling
Quantized with a custom NVFP4A16 quantizer. llm-compressor does not yet support Qwen3.5 models: Qwen3.5 requires transformers 5.x, while llm-compressor is pinned to 4.x (issue #2369).
## Credits
- Base model: Qwen/Qwen3.5-9B by Qwen Team
- Abliteration: coder3101/Qwen3.5-9B-heretic using Heretic
- NVFP4A16 quantization: nivvis