# Qwen3.5-9B-heretic-NVFP4A16
Static NVFP4A16 (weight-only FP4) quantization of coder3101/Qwen3.5-9B-heretic (abliterated via Heretic).
**Important:** This is a static, weight-only quantization using round-to-nearest (RTN) — no calibration data, no activation quantization. Weights are FP4 (E2M1); activations stay BF16 at runtime. This means:
- The Marlin kernel dequantizes FP4 weights → BF16 before compute (BF16xBF16 matmul)
- Native Blackwell FP4 tensor cores are NOT used — those require W4A4 (both weights and activations in FP4)
- A W4A4 model with calibrated activation scales (`input_global_scale`) is needed for native FP4xFP4 acceleration
- Quality may be slightly lower than with a calibration-aware quantization (GPTQ, etc.)
This model is useful for validating the NVFP4 pipeline and for inference where memory savings matter more than peak compute throughput.
## Quantization Details
| Parameter | Value |
|---|---|
| Method | NVFP4A16 — static, weight-only FP4, RTN (round-to-nearest) |
| Weight dtype | E2M1 (4-bit float) |
| Block scale | FP8 E4M3, one per 16 elements |
| Global scale | FP32, one per tensor |
| Format | nvfp4-pack-quantized (compressed-tensors) |
| Calibration | None — pure RTN, no forward passes |
| Original size | 18.8 GB (BF16) |
| Quantized size | 11.2 GB (59.5%) |
| Tensors quantized | 128 |
| Tensors kept BF16 | 632 |
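The RTN scheme in the table can be sketched numerically. Below is a minimal NumPy illustration of weight-only FP4 round-to-nearest with one scale per 16-element block; it deliberately omits the FP8 E4M3 quantization of the block scales and the per-tensor FP32 global scale, and all names are illustrative, not taken from the actual quantizer:

```python
import numpy as np

# Representable E2M1 (FP4) magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quant_dequant_fp4(w, block_size=16):
    """RTN weight quantization: scale each block so its max |w| maps to 6.0,
    snap every element to the nearest signed E2M1 value, then dequantize."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / E2M1[-1]
    scale[scale == 0] = 1.0  # avoid division by zero on all-zero blocks
    scaled = np.abs(blocks) / scale
    nearest = E2M1[np.argmin(np.abs(scaled[:, :, None] - E2M1), axis=-1)]
    return (np.sign(blocks) * nearest * scale).reshape(w.shape)
```

In the real pipeline the block scales are themselves stored as FP8 E4M3 and rescaled by the FP32 global scale, which adds a second error term this sketch ignores.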
### What's quantized
All standard linear projections: attention (q/k/v/o_proj), MLP (gate/up/down_proj).
### What's kept in BF16
- All GatedDeltaNet (linear attention) layers — more sensitive at FP4 than FP8
- Embeddings, `lm_head`, all norms
- Vision tower (entire)
- MTP head
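The split above (128 tensors quantized, 632 kept in BF16) boils down to a name filter over the model's linear layers. A hypothetical sketch, assuming suffix-style module names — the exact Qwen3.5 module paths may differ:

```python
# Suffixes of the linear projections that receive FP4 weights
QUANT_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj")
# Substrings marking tensors kept in BF16 (names are illustrative)
SKIP_MARKERS = ("linear_attn", "visual", "embed", "lm_head", "norm", "mtp")

def should_quantize(module_name: str) -> bool:
    """Quantize only standard attention/MLP projections; keep the sensitive
    parts (GatedDeltaNet, vision tower, embeddings, norms, MTP) in BF16."""
    if any(marker in module_name for marker in SKIP_MARKERS):
        return False
    return module_name.endswith(QUANT_SUFFIXES)
```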
### Quantization accuracy (roundtrip)
| Metric | Value |
|---|---|
| Max absolute error | 0.058 |
| Mean absolute error | 0.0008 |
## Usage with vLLM
```bash
vllm serve nivvis/Qwen3.5-9B-heretic-NVFP4A16 \
  --max-num-seqs 32 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code
```
Or from Python:

```python
from vllm import LLM, SamplingParams

model = LLM("nivvis/Qwen3.5-9B-heretic-NVFP4A16", trust_remote_code=True)
outputs = model.generate(["Hello"], SamplingParams(max_tokens=64))
```
## Performance
| Metric | Value |
|---|---|
| VRAM | 10.58 GiB |
| Decode throughput | ~140 t/s (single request, not concurrent) |
| Hardware tested | NVIDIA RTX PRO 6000 Blackwell Max-Q |
| Kernel | Marlin FP4 (weight dequant → BF16 compute) |
| vLLM version | 0.17.0rc1 |
## Why not native FP4 tensor cores?
Native Blackwell FP4 tensor cores perform FP4xFP4 matmuls — both weights and activations must be FP4. This model is W4A16 (FP4 weights, BF16 activations), so the compute path is:
1. The Marlin kernel dequantizes FP4 weights → BF16
2. Standard BF16xBF16 matmul
To hit native FP4 tensor cores, a W4A4 model is needed. That requires calibration-aware quantization (forward passes to compute per-tensor activation scales). W4A4 quantization is a planned next step.
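The difference between the two paths can be made concrete with a toy NumPy sketch. The `fp4` helper just snaps values to the E2M1 grid (pure numerics, not a kernel), and `act_scale` stands in for the calibrated `input_global_scale` a W4A4 model would ship:

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4(x, scale):
    """Toy FP4: snap |x|/scale to the nearest E2M1 value."""
    s = np.abs(x) / scale
    return np.sign(x) * E2M1[np.argmin(np.abs(s[..., None] - E2M1), axis=-1)] * scale

w = np.random.default_rng(1).normal(size=(16, 16))
a = np.random.default_rng(2).normal(size=(4, 16))
w_q = fp4(w, np.abs(w).max() / 6.0)  # FP4 weights, already dequantized here

# W4A16 (this model): full-precision activations hit dequantized weights
y_w4a16 = a @ w_q

# W4A4 (native tensor cores): activations must be FP4 too, which needs
# an activation scale estimated from calibration forward passes
act_scale = np.abs(a).max() / 6.0  # a calibration pass would estimate this
y_w4a4 = fp4(a, act_scale) @ w_q
```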
## Known Issues
- SGLang 0.5.6.post2: Cannot serve this model due to a bug in compressed-tensors scheme dispatch (`input_quant` is None for weight-only quant). Fixed on SGLang main branch.
- No MTP: Heretic abliteration strips MTP weights. Do not use speculation flags.
- Thinking model: Very verbose chain-of-thought. Use a system prompt like "Do not use a thinking block" or set a high `max_tokens` (8000+).
## Quantization Tooling
Quantized with a custom NVFP4A16 quantizer. llm-compressor does not yet support Qwen3.5 models: Qwen3.5 requires transformers 5.x, while llm-compressor is pinned to 4.x (issue #2369).
## Credits
- Base model: Qwen/Qwen3.5-9B by Qwen Team
- Abliteration: coder3101/Qwen3.5-9B-heretic using Heretic
- NVFP4A16 quantization: nivvis