RTX 5090 + Qwen 30B MoE @ 135 tok/s in NVFP4 - Full guide with C++ patches
#24 by JohnTdi
Tutorial | Guide
Spent 4 days getting NVFP4 working on consumer Blackwell. TensorRT-LLM 1.2.0rc4 has critical bugs that prevent loading managed weights for FP4 models: the allocator requests 2x the needed VRAM, and a type check rejects the INT8-packed FP4 weights.
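To give an intuition for the type-mismatch half of the bug, here is an illustrative sketch (my own simplification, NOT TensorRT-LLM's actual code): NVFP4 packs two 4-bit weights per byte, so a tensor with n logical elements arrives as an n/2-byte buffer typed as INT8, and a loader that validates the buffer against n elements of the declared dtype sees half the expected size.

```python
# Illustrative sketch only -- not TensorRT-LLM's real validation logic.
# NVFP4 stores two 4-bit weights per byte, so n logical weights occupy
# n/2 bytes in an INT8-typed buffer.

def naive_size_check(buf: bytes, logical_elems: int) -> bool:
    # The kind of check that wrongly rejects packed FP4 weights.
    return len(buf) == logical_elems

def packed_fp4_size_check(buf: bytes, logical_elems: int) -> bool:
    # Packed-aware variant: two FP4 values per stored byte.
    return len(buf) * 2 == logical_elems

n = 4096                                  # hypothetical weight count
packed = bytes(n // 2)                    # FP4 pairs packed into bytes
print(naive_size_check(packed, n))        # False -> load rejected
print(packed_fp4_size_check(packed, n))   # True  -> load accepted
```

The real patch lives in the C++ runtime (see the repo below); this just shows why a dtype/size check written for unpacked tensors fails on packed FP4 buffers.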
Results on an RTX 5090 (32 GB):

| Metric | Value |
|---|---|
| Throughput | ~135 tokens/s |
| TTFT | ~15 ms |
| VRAM | 24.1 GB |
| Model | Qwen 3 30B MoE (A3B) |
Why so fast? Qwen 3 30B is a MoE model: only ~2.4B parameters are active per token. Combined with Blackwell's native FP4 tensor cores, you get roughly 7B-class speed with 30B-class knowledge.
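The arithmetic behind that claim can be sketched quickly (figures are the approximations from this post, not measurements):

```python
# Back-of-envelope numbers for the MoE + FP4 speed claim.
total_params = 30e9        # Qwen 3 30B MoE, total parameters (approx.)
active_params = 2.4e9      # ~2.4B parameters routed/active per token
fp4_bits = 4               # NVFP4: 4 bits per weight

weights_gib = total_params * fp4_bits / 8 / 2**30
print(f"FP4 weight footprint: ~{weights_gib:.1f} GiB")          # ~14.0 GiB
print(f"Active per token: {active_params / total_params:.0%}")  # 8%
```

So all 30B weights fit comfortably in the 5090's 32 GB at 4 bits (the measured 24.1 GB total includes KV cache, activations, and scales on top of the weights), while each decode step only touches ~8% of them.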
What's in the guide:
- SWAP trick for quantization (64 GB RAM + 64 GB swap is enough)
- `--fast_build` flags to avoid compiler OOM
- C++ runtime patch to fix the allocator bug and type mismatch
- Open WebUI integration fix

Full tutorial + patches: https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix
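A rough sizing sketch of why the swap trick is needed for the quantization step (my own back-of-envelope assumptions, not measured numbers): the quantizer has to hold the full-precision source checkpoint in host memory while writing the FP4 output.

```python
# Why 64 GB RAM alone is tight for quantizing a ~30B model, but
# 64 GB RAM + 64 GB swap is enough (assumes a BF16 source checkpoint).
params = 30e9
bf16_gib = params * 2 / 2**30     # BF16 source weights held in host RAM
fp4_gib = params * 0.5 / 2**30    # NVFP4 output buffer
print(f"BF16 source: ~{bf16_gib:.0f} GiB")            # ~56 GiB
print(f"FP4 output:  ~{fp4_gib:.0f} GiB")             # ~14 GiB
print(f"Together:    ~{bf16_gib + fp4_gib:.0f} GiB")  # ~70 GiB > 64 GB RAM
```

With process overhead and any intermediate copies on top of that ~70 GiB, 64 GB of physical RAM alone won't cut it, but it fits easily once swap doubles the addressable space.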
This was a massive amount of pain, so I'm hoping to save others the trouble.