RTX 5090 + Qwen 30B MoE @ 135 tok/s in NVFP4 - Full guide with C++ patches
#24 by JohnTdi
Tutorial | Guide
Spent 4 days getting NVFP4 working on consumer Blackwell. TensorRT-LLM 1.2.0rc4 has critical bugs that prevent loading managed weights for FP4 models: the allocator requests 2x the needed VRAM, and a type check rejects the INT8-packed FP4 weights.
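To give an intuition for the type-mismatch half of the bug, here is an illustrative sketch (my own simplification, NOT TensorRT-LLM's actual code): NVFP4 packs two 4-bit weights per byte, so a tensor with n logical elements arrives as an n/2-byte buffer typed as INT8, and a loader that validates the buffer against n elements of the declared dtype sees half the expected size.

```python
# Illustrative sketch only -- not TensorRT-LLM's real validation logic.
# NVFP4 stores two 4-bit weights per byte, so n logical weights occupy
# n/2 bytes in an INT8-typed buffer.

def naive_size_check(buf: bytes, logical_elems: int) -> bool:
    # The kind of check that wrongly rejects packed FP4 weights.
    return len(buf) == logical_elems

def packed_fp4_size_check(buf: bytes, logical_elems: int) -> bool:
    # Packed-aware variant: two FP4 values per stored byte.
    return len(buf) * 2 == logical_elems

n = 4096                                  # hypothetical weight count
packed = bytes(n // 2)                    # FP4 pairs packed into bytes
print(naive_size_check(packed, n))        # False -> load rejected
print(packed_fp4_size_check(packed, n))   # True  -> load accepted
```

The real patch lives in the C++ runtime (see the repo below); this just shows why a dtype/size check written for unpacked tensors fails on packed FP4 buffers.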
Results on an RTX 5090 (32 GB):

| Metric | Value |
|---|---|
| Throughput | ~135 tokens/s |
| TTFT | ~15 ms |
| VRAM | 24.1 GB |
| Model | Qwen 3 30B MoE (A3B) |
Why so fast? Qwen 3 30B is a MoE model: only ~2.4B parameters are active per token. Combined with Blackwell's native FP4 tensor cores, you get roughly 7B-class speed with 30B-class knowledge.
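The arithmetic behind that claim can be sketched quickly (figures are the approximations from this post, not measurements):

```python
# Back-of-envelope numbers for the MoE + FP4 speed claim.
total_params = 30e9        # Qwen 3 30B MoE, total parameters (approx.)
active_params = 2.4e9      # ~2.4B parameters routed/active per token
fp4_bits = 4               # NVFP4: 4 bits per weight

weights_gib = total_params * fp4_bits / 8 / 2**30
print(f"FP4 weight footprint: ~{weights_gib:.1f} GiB")          # ~14.0 GiB
print(f"Active per token: {active_params / total_params:.0%}")  # 8%
```

So all 30B weights fit comfortably in the 5090's 32 GB at 4 bits (the measured 24.1 GB total includes KV cache, activations, and scales on top of the weights), while each decode step only touches ~8% of them.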
What's in the guide:
- SWAP trick for quantization (64 GB RAM + 64 GB swap is enough)
- `--fast_build` flags to avoid compiler OOM
- C++ runtime patch to fix the allocator bug and type mismatch
- Open WebUI integration fix

Full tutorial + patches: https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix
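A rough sizing sketch of why the swap trick is needed for the quantization step (my own back-of-envelope assumptions, not measured numbers): the quantizer has to hold the full-precision source checkpoint in host memory while writing the FP4 output.

```python
# Why 64 GB RAM alone is tight for quantizing a ~30B model, but
# 64 GB RAM + 64 GB swap is enough (assumes a BF16 source checkpoint).
params = 30e9
bf16_gib = params * 2 / 2**30     # BF16 source weights held in host RAM
fp4_gib = params * 0.5 / 2**30    # NVFP4 output buffer
print(f"BF16 source: ~{bf16_gib:.0f} GiB")            # ~56 GiB
print(f"FP4 output:  ~{fp4_gib:.0f} GiB")             # ~14 GiB
print(f"Together:    ~{bf16_gib + fp4_gib:.0f} GiB")  # ~70 GiB > 64 GB RAM
```

With process overhead and any intermediate copies on top of that ~70 GiB, 64 GB of physical RAM alone won't cut it, but it fits easily once swap doubles the addressable space.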
This was a massive amount of pain, so I'm hoping to save others the trouble.