Qwen3.5-9B-heretic-NVFP4A16

Static NVFP4A16 (weight-only FP4) quantization of coder3101/Qwen3.5-9B-heretic (abliterated via Heretic).

Important: This is a static, weight-only quantization using round-to-nearest (RTN) — no calibration data, no activation quantization. Weights are FP4 (E2M1), activations stay BF16 at runtime. This means:

  • The Marlin kernel dequantizes FP4 weights → BF16 before compute (BF16xBF16 matmul)
  • Native Blackwell FP4 tensor cores are NOT used — those require W4A4 (both weights and activations in FP4)
  • A W4A4 model with calibrated activation scales (input_global_scale) is needed for native FP4xFP4 acceleration
  • Quality may be slightly lower than a calibration-aware quantization (GPTQ, etc.)

This model is useful for validating the NVFP4 pipeline and for inference where memory savings matter more than peak compute throughput.

Quantization Details

| Parameter | Value |
|---|---|
| Method | NVFP4A16 — static, weight-only FP4, RTN (round-to-nearest) |
| Weight dtype | E2M1 (4-bit float) |
| Block scale | FP8 E4M3, one per 16 elements |
| Global scale | FP32, one per tensor |
| Format | `nvfp4-pack-quantized` (compressed-tensors) |
| Calibration | None — pure RTN, no forward passes |
| Original size | 18.8 GB (BF16) |
| Quantized size | 11.2 GB (59.5% of original) |
| Tensors quantized | 128 |
| Tensors kept in BF16 | 632 |
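The block-scale layout above works out to roughly 4.5 bits per quantized element. A quick back-of-envelope check (plain arithmetic; assumes the per-16-element FP8 scales are the only meaningful overhead, since the per-tensor FP32 global scale is negligible):

```python
# 4-bit E2M1 weight per element, plus one 8-bit FP8 (E4M3) block scale
# shared by every 16 elements.
bits_per_weight = 4 + 8 / 16              # 4.5 bits per quantized element
compression = bits_per_weight / 16        # vs. 16-bit BF16 -> ~0.28x

# The overall model ratio (11.2 GB / 18.8 GB ~ 0.60) is much higher than
# 0.28 because many tensors (GatedDeltaNet, embeddings, vision tower,
# norms, MTP head) stay in BF16.
print(bits_per_weight, compression)
```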

What's quantized

All standard linear projections: attention (q/k/v/o_proj), MLP (gate/up/down_proj).

What's kept in BF16

  • All GatedDeltaNet (linear attention) layers — more sensitive at FP4 than FP8
  • Embeddings, lm_head, all norms
  • Vision tower (entire)
  • MTP head
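The quantize/skip split above can be expressed as a simple name filter. A hypothetical sketch — the marker strings (`linear_attn`, `visual`, etc.) are illustrative guesses at module names, not taken from the actual quantizer:

```python
# Illustrative quantize/skip decision implied by the lists above.
# Module-name markers are assumptions; real names come from the state dict.
QUANT_SUFFIXES = ("q_proj", "k_proj", "v_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj")
SKIP_MARKERS = ("linear_attn", "embed_tokens", "lm_head", "norm",
                "visual", "mtp")

def should_quantize(name: str) -> bool:
    # GatedDeltaNet, embeddings, lm_head, norms, vision tower, MTP stay BF16
    if any(marker in name for marker in SKIP_MARKERS):
        return False
    # only standard attention/MLP linear projections are quantized
    return name.endswith(QUANT_SUFFIXES)

print(should_quantize("model.layers.0.self_attn.q_proj"))
print(should_quantize("model.layers.1.linear_attn.in_proj"))
```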

Quantization accuracy (roundtrip)

| Metric | Value |
|---|---|
| Max absolute error | 0.058 |
| Mean absolute error | 0.0008 |
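A round-trip error of this kind can be reproduced with a minimal RTN sketch — not the actual quantizer, just the same idea: scale each 16-element block so its absmax maps to E2M1's maximum of 6.0, snap to the nearest grid point, and dequantize. (For brevity the block scales stay in float here; the real format stores them as FP8 E4M3 plus a per-tensor FP32 global scale.)

```python
import numpy as np

# Non-negative E2M1 grid; signs handled separately.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_roundtrip(w: np.ndarray) -> np.ndarray:
    blocks = w.reshape(-1, 16)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 6.0  # one per block
    scales[scales == 0] = 1.0
    scaled = blocks / scales
    # round-to-nearest onto the signed E2M1 grid
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return (q * scales).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(1024, 16)).astype(np.float32)
err = np.abs(nvfp4_roundtrip(w) - w)
print(f"max={err.max():.4f} mean={err.mean():.6f}")
```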

Usage with vLLM

```bash
vllm serve nivvis/Qwen3.5-9B-heretic-NVFP4A16 \
    --max-num-seqs 32 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.85 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3 \
    --trust-remote-code
```

Or via the Python API:

```python
from vllm import LLM

model = LLM("nivvis/Qwen3.5-9B-heretic-NVFP4A16", trust_remote_code=True)
```

Performance

| Metric | Value |
|---|---|
| VRAM | 10.58 GiB |
| Decode throughput | ~140 t/s (single request, no concurrency) |
| Hardware tested | NVIDIA RTX PRO 6000 Blackwell Max-Q |
| Kernel | Marlin FP4 (weight dequant → BF16 compute) |
| vLLM version | 0.17.0rc1 |

Why not native FP4 tensor cores?

Native Blackwell FP4 tensor cores perform FP4xFP4 matmuls — both weights and activations must be FP4. This model is W4A16 (FP4 weights, BF16 activations), so the compute path is:

  1. Marlin kernel dequantizes FP4 weights → BF16
  2. Standard BF16xBF16 matmul

To hit native FP4 tensor cores, a W4A4 model is needed. That requires calibration-aware quantization (forward passes to compute per-tensor activation scales). W4A4 quantization is a planned next step.
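The two-step W4A16 path can be sketched conceptually (this emulates the data flow, not the Marlin kernel itself; the lookup-table decode and float32 compute stand in for the packed-FP4 and BF16 paths):

```python
import numpy as np

# All 15 signed E2M1 values (zero has one representation here for simplicity).
SIGNED_E2M1 = np.array([-6, -4, -3, -2, -1.5, -1, -0.5, 0,
                        0.5, 1, 1.5, 2, 3, 4, 6], dtype=np.float32)

def w4a16_matmul(x: np.ndarray, codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """x: activations (batch, in); codes: FP4 indices (out, in);
    scales: one block scale per 16 input elements (out, in // 16)."""
    # Step 1 (what Marlin does): dequantize FP4 weights -> wide dtype
    w = SIGNED_E2M1[codes] * np.repeat(scales, 16, axis=1)
    # Step 2: a standard wide matmul (BF16xBF16 on GPU; float32 here)
    return x @ w.T

codes = np.full((4, 16), 9)                  # every weight decodes to +1.0
scales = np.ones((4, 1), dtype=np.float32)
x = np.ones((2, 16), dtype=np.float32)
y = w4a16_matmul(x, codes, scales)           # each output = sum of 16 ones
print(y)
```

A W4A4 path would instead quantize `x` to FP4 with calibrated scales before the matmul, so both operands could feed the FP4 tensor cores directly.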

Known Issues

  • SGLang 0.5.6.post2: Cannot serve this model due to a bug in compressed-tensors scheme dispatch (input_quant is None for weight-only quant). Fixed on SGLang main branch.
  • No MTP: Heretic abliteration strips MTP weights. Do not use speculation flags.
  • Thinking model: Very verbose chain-of-thought. Use a system prompt like "Do not use a thinking block" or set high max_tokens (8000+).

Quantization Tooling

Quantized with a custom NVFP4A16 quantizer. llm-compressor does not support Qwen3.5 models (requires transformers 5.x, llm-compressor is pinned to 4.x — issue #2369).

Credits
