Memory Requirements to run `Qwen/Qwen3.5-397B-A17B`
Hey all,
Below is the visual output of `hf-mem` showing the estimated memory required to load Qwen/Qwen3.5-397B-A17B and run inference, including the KV cache estimate.
```shell
uvx hf-mem --model-id Qwen/Qwen3.5-397B-A17B --experimental --kv-cache-dtype fp8
```
Let me know if that's useful! 🤗
Hey @saireddy, I just did! But note that https://github.com/alvarobartt/hf-mem is open source, so you can run those yourself, e.g. `uvx hf-mem --model-id Qwen/Qwen3.5-397B-A17B-FP8 --experimental --kv-cache-dtype fp8`. Let me know if you run into any issues 🤗
https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8/discussions/6
15 full-attention layers with GQA2 (head_dim=256) and FP8 KV should use ~15 kB/token of KV cache, but your result is 30 kB/token.
I checked your code (https://github.com/alvarobartt/hf-mem) and found that it wrongly assumes two things:
- All hidden layers are assumed to be full attention, while Qwen3.5-397B-A17B actually has only 15 full-attention layers out of 60 layers in total. This multiplies the result by 4.
- `head_dim = hidden_size // num_attention_heads` is assumed, while Qwen3.5-397B-A17B actually uses `head_dim = 256`, `hidden_size = 4096`, and `num_attention_heads = 32` (so the formula would give 128). This multiplies the result by 0.5.
:D
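The mismatch can be reproduced with a back-of-the-envelope calculation. A minimal sketch, assuming the layer counts, GQA2 KV heads, and head size quoted above (the function is illustrative, not `hf-mem`'s actual API):

```python
def kv_cache_bytes_per_token(num_attn_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int) -> int:
    # Each attention layer stores one K and one V vector per KV head per token.
    return num_attn_layers * 2 * num_kv_heads * head_dim * dtype_bytes

# Corrected estimate: 15 full-attention layers, GQA2, head_dim=256, FP8 (1 byte).
correct = kv_cache_bytes_per_token(15, 2, 256, 1)  # 15360 bytes = 15 KiB/token

# Buggy estimate: all 60 layers, head_dim = 4096 // 32 = 128.
buggy = kv_cache_bytes_per_token(60, 2, 128, 1)    # 30720 bytes = 30 KiB/token

print(correct, buggy)  # the two errors combine to a 2x overestimate
```

The ×4 and ×0.5 errors compound to exactly the factor of 2 between the reported 30 kB/token and the expected 15 kB/token.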
Yes @YouJiacheng there are some known issues, thanks for reporting those clearly! Would you mind opening an issue in https://github.com/alvarobartt/hf-mem/issues?
Thanks for taking the time to respond! 🤗
Hey again @YouJiacheng, thanks a lot for the report! I've fixed it and mentioned you in the release notes at https://github.com/alvarobartt/hf-mem/releases/tag/0.5.0. See the fixed output below (still under the `--experimental` flag) 🤗
```shell
uvx hf-mem --model-id Qwen/Qwen3.5-397B-A17B --experimental --kv-cache-dtype fp8
```

