Memory Requirements to run `Qwen/Qwen3.5-397B-A17B`

#20
by alvarobartt - opened

Hey all,

See below the visual output of hf-mem with the estimated memory required to load Qwen/Qwen3.5-397B-A17B and run inference, including the KV-cache estimate.

```shell
uvx hf-mem --model-id Qwen/Qwen3.5-397B-A17B --experimental --kv-cache-dtype fp8
```
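As a rough sanity check on the tool's output, the weight memory alone can be approximated from the parameter count. The sketch below assumes 397B parameters stored uniformly in bf16 (2 bytes each); hf-mem reads the real per-tensor dtypes from the checkpoint, so this is only a back-of-the-envelope figure:

```python
# Back-of-the-envelope weight memory for a 397B-parameter model,
# assuming every parameter is bf16 (2 bytes). The real checkpoint may
# mix dtypes, which hf-mem accounts for and this sketch does not.
total_params = 397e9
bytes_per_param = 2  # bf16

weight_gib = total_params * bytes_per_param / 1024**3
print(f"~{weight_gib:.0f} GiB for weights alone")  # ~739 GiB
```

Anything on top of that (KV cache, activations, runtime buffers) comes in addition, which is why the KV-cache estimate matters for long contexts.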


Let me know if that's useful! 🤗

@alvarobartt Can you please help us by running the same for the FP8 variant?

Hey @saireddy, I just did! But note that https://github.com/alvarobartt/hf-mem is open source, so you can run it yourself, e.g. `uvx hf-mem --model-id Qwen/Qwen3.5-397B-A17B-FP8 --experimental --kv-cache-dtype fp8`. Let me know if you run into any issues 🤗

https://huggingface.co/Qwen/Qwen3.5-397B-A17B-FP8/discussions/6

15 full-attention layers with GQA (2 KV heads, head_dim 256) and an FP8 KV cache should use 15 kB/token, but your result is 30 kB/token.
I checked your code (https://github.com/alvarobartt/hf-mem) and found that it makes two incorrect assumptions:

  1. All hidden layers are assumed to be full attention, while Qwen3.5-397B-A17B actually has only 15 full-attention layers out of 60 in total. This multiplies the result by 4.
  2. head_dim is assumed to be hidden_size // num_attention_heads, while Qwen3.5-397B-A17B actually sets head_dim = 256 explicitly (with hidden_size = 4096 and num_attention_heads = 32, the derived value would be 128). This multiplies the result by 0.5.
    :D
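The two corrections above combine to ×4 × ×0.5 = ×2, which is exactly the gap between the reported 30 kB/token and the expected 15 kB/token. A minimal sketch of the per-token KV-cache estimate, using the config values quoted in this thread (taken as assumptions here, not read from the actual model config):

```python
# Per-token KV-cache size for Qwen3.5-397B-A17B, using the values
# quoted in this discussion (assumed, not loaded from the config):
full_attention_layers = 15  # only 15 of the 60 hidden layers are full attention
num_kv_heads = 2            # GQA with 2 KV heads
head_dim = 256              # set explicitly, NOT hidden_size // num_attention_heads
bytes_per_value = 1         # FP8 KV cache

# Each token stores one K and one V vector per KV head per attention layer.
kv_bytes_per_token = (
    full_attention_layers * num_kv_heads * head_dim * 2 * bytes_per_value
)
print(kv_bytes_per_token)  # 15360 bytes ≈ 15 kB/token
```

Applying the two buggy assumptions (60 layers, head_dim 128) to the same formula yields 30,720 bytes, matching the 30 kB/token the tool originally reported.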

Yes @YouJiacheng there are some known issues, thanks for reporting those clearly! Would you mind opening an issue in https://github.com/alvarobartt/hf-mem/issues?

Thanks for taking the time to respond! 🤗

Hey again @YouJiacheng, thanks a lot for the report! I've already fixed both issues and mentioned you in the release notes at https://github.com/alvarobartt/hf-mem/releases/tag/0.5.0. See the corrected output below (still behind the --experimental flag) 🤗

```shell
uvx hf-mem --model-id Qwen/Qwen3.5-397B-A17B --experimental --kv-cache-dtype fp8
```

[Screenshot: corrected hf-mem output, 2026-03-13]
