EPYC, RTX 5090 vs RTX 6000

Opened by sousekd

Sharing some benchmarks of IQ5_K on EPYC 9355, various context sizes:

Single RTX 5090, 32K context (max to fit @ f16)

    ./llama-sweep-bench \
        --model "$MODEL_PATH" \
        --no-mmap \
        -b 4096 -ub 4096 \
        -ctk f16 -ctv f16 -c 32768 \
        -ngl 999 -ncmoe 999 \
        --grouped-expert-routing \
        --merge-qkv \
        --threads 24 \
        --threads-batch 32 \
        --warmup-batch \
        -n 256
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 4096 | 256 | 0 | 8.678 | 471.99 | 12.459 | 20.55 |
| 4096 | 256 | 4096 | 9.077 | 451.26 | 12.866 | 19.90 |
| 4096 | 256 | 8192 | 9.423 | 434.69 | 13.107 | 19.53 |
| 4096 | 256 | 12288 | 9.790 | 418.38 | 13.481 | 18.99 |
| 4096 | 256 | 16384 | 10.121 | 404.69 | 14.077 | 18.19 |
| 4096 | 256 | 20480 | 10.396 | 393.99 | 14.084 | 18.18 |
| 4096 | 256 | 24576 | 10.858 | 377.25 | 14.613 | 17.52 |
| 4096 | 256 | 28672 | 11.392 | 359.55 | 14.764 | 17.34 |

Single RTX 5090, 64K context (max to fit @ q8_0)

    ./llama-sweep-bench \
        --model "$MODEL_PATH" \
        --no-mmap \
        -b 1024 -ub 1024 \
        -ctk q8_0 -ctv q8_0 -c 65536 \
        -ngl 999 -ncmoe 999 \
        --grouped-expert-routing \
        --merge-qkv \
        --threads 24 \
        --threads-batch 32 \
        --warmup-batch \
        -n 256
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 1024 | 256 | 0 | 6.536 | 156.67 | 12.658 | 20.22 |
| 1024 | 256 | 8192 | 6.633 | 154.38 | 13.775 | 18.58 |
| 1024 | 256 | 16384 | 6.845 | 149.61 | 14.822 | 17.27 |
| 1024 | 256 | 24576 | 7.000 | 146.29 | 16.269 | 15.74 |
| 1024 | 256 | 32768 | 6.948 | 147.39 | 17.989 | 14.23 |
| 1024 | 256 | 40960 | 7.314 | 140.01 | 19.740 | 12.97 |
| 1024 | 256 | 49152 | 7.642 | 134.00 | 21.392 | 11.97 |
| 1024 | 256 | 57344 | 8.001 | 127.98 | 22.845 | 11.21 |
| 1024 | 256 | 64512 | 8.139 | 125.81 | 25.354 | 10.10 |

Single RTX 6000, 192K context (max supported)

    ./llama-sweep-bench \
        --model "$MODEL_PATH" \
        --no-mmap \
        -b 8192 -ub 8192 \
        -ctk f16 -ctv f16 -c 196608 \
        -ngl 999 -ncmoe 999 \
        --grouped-expert-routing \
        --merge-qkv \
        --threads 24 \
        --threads-batch 32 \
        --warmup-batch \
        -n 256
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 256 | 0 | 11.294 | 725.34 | 12.656 | 20.23 |
| 8192 | 256 | 16384 | 13.643 | 600.43 | 14.151 | 18.09 |
| 8192 | 256 | 32768 | 16.848 | 486.22 | 15.492 | 16.53 |
| 8192 | 256 | 49152 | 21.804 | 375.71 | 17.151 | 14.93 |
| 8192 | 256 | 65536 | 26.748 | 306.27 | 18.837 | 13.59 |
| 8192 | 256 | 81920 | 31.093 | 263.47 | 26.243 | 9.75 |
| 8192 | 256 | 98304 | 35.944 | 227.91 | 31.187 | 8.21 |
| 8192 | 256 | 114688 | 40.928 | 200.15 | 33.786 | 7.58 |
| 8192 | 256 | 131072 | 45.776 | 178.96 | 36.925 | 6.93 |
| 8192 | 256 | 147456 | 49.368 | 165.94 | 39.823 | 6.43 |
| 8192 | 256 | 163840 | 54.102 | 151.42 | 42.650 | 6.00 |
| 8192 | 256 | 180224 | 58.958 | 138.95 | 45.294 | 5.65 |

This is running in a VM, with GPUs and CPU power-limited, so better results are achievable on the same hardware.
The significant drop in speed around 82K is interesting. At 128K it runs at about half the speed of Kimi-K2.
But honestly, I haven't seen a model yet that works well with context sizes above 64K or so, so it doesn't matter much 😀.

Thank you for the quants, @ubergarm.

@sousekd

Thanks for the benchmarks here on the new GLM-4.7 comparing with Kimi-K2-Thinking over here: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/5

I didn't graph the comparisons, which would make the tea leaves easier to read, but it seems Kimi-K2-Thinking has faster TG as context depth gets deeper. Pretty interesting, as Kimi is a 1000B model vs the GLM's 358B! Maybe it has something to do with Kimi's MLA-style attention?

A few thoughts about your comments over there:

> Kimi-K2 Thinking Q4_X with max supported context on RTX PRO 6000. It only fills 50% of VRAM (47 GB), while GLM 4.7 IQ5_K takes it all.

Yeah, my guess is that MLA is much more VRAM-efficient at storing long context. If you don't need all that extra context length, offloading more routed experts into VRAM will likely boost your speed, making Kimi even faster at TG relatively. (My model card for GLM-4.7 has some newer examples.)
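
For example, starting from the GLM command above, the knob is -ncmoe (how many layers keep their routed experts in system RAM); lower it step by step until VRAM is nearly full. The 40 below is just a guess to illustrate, not a tuned value:

    # Sketch only: same invocation as above, but fewer MoE layers kept on the CPU,
    # so more routed experts land in VRAM. Tune the -ncmoe value to your VRAM budget.
    ./llama-sweep-bench \
        --model "$MODEL_PATH" \
        --no-mmap \
        -b 4096 -ub 4096 \
        -ctk f16 -ctv f16 -c 32768 \
        -ngl 999 -ncmoe 40 \
        --grouped-expert-routing \
        --merge-qkv \
        --threads 24 --threads-batch 32 \
        --warmup-batch -n 256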

> This is without --grouped-expert-routing --merge-qkv, which I'm not quite sure what they are about:

I believe -ger (grouped expert routing) isn't available for all models, but it does work with GLM and can give a small improvement. Similarly, on a single GPU you can get a ~1-2% speed-up with --merge-qkv; my understanding is that both are small kernel fusions that reduce overhead.
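
If you want to quantify it on your box, the simplest check is to re-run the same sweep-bench line with and without those two flags, e.g.:

    # A: baseline without the fused kernels
    ./llama-sweep-bench --model "$MODEL_PATH" --no-mmap \
        -b 4096 -ub 4096 -ctk f16 -ctv f16 -c 32768 \
        -ngl 999 -ncmoe 999 \
        --threads 24 --threads-batch 32 --warmup-batch -n 256

    # B: with grouped expert routing and merged QKV
    ./llama-sweep-bench --model "$MODEL_PATH" --no-mmap \
        -b 4096 -ub 4096 -ctk f16 -ctv f16 -c 32768 \
        -ngl 999 -ncmoe 999 --grouped-expert-routing --merge-qkv \
        --threads 24 --threads-batch 32 --warmup-batch -n 256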

It has been tricky to keep up with all of ik's hard work lately haha... Especially if you have 2x GPUs and NCCL installed, you can get some great improvements with -sm graph, which in my early impressions is becoming competitive with vLLM, at least for single-user generation. It doesn't help for 1x GPU though, as that path is already optimized.
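
For reference, on a 2x GPU box it is basically just adding the split-mode flag to the same command (a sketch; assumes a build with NCCL support):

    # Sketch: two-GPU run using the graph split mode (needs an NCCL-enabled build).
    ./llama-sweep-bench --model "$MODEL_PATH" --no-mmap \
        -b 4096 -ub 4096 -ctk f16 -ctv f16 -c 32768 \
        -ngl 999 -ncmoe 999 -sm graph \
        --grouped-expert-routing \
        --threads 24 --threads-batch 32 --warmup-batch -n 256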

> The significant drop in speed around 82K is interesting.

Yeah, wonder what that is about; perhaps some kind of power limiting kicking in, or temperature throttling, or something? Have you checked out LACT for undervolting your GPUs instead of simple power limiting? It ends up giving roughly the same performance at lower power usage too! (Holler if you need a link about LACT; I also have some discussion of it in my recent talk here, with a timestamp listed in the text: https://blog.aifoundry.org/p/adventures-in-model-quantization)
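
One quick way to check is to log the GPU's throttle reasons while the bench runs, e.g. with stock nvidia-smi from another terminal (just a sketch):

    # Log temperature, power draw, SM clock, and active throttle reasons once per second.
    nvidia-smi \
        --query-gpu=timestamp,temperature.gpu,power.draw,clocks.sm,clocks_throttle_reasons.active \
        --format=csv -l 1 | tee gpu_throttle_log.csv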

Cheers and happy new years!

Thank you for your insights, @ubergarm .

> Yeah, wonder what that is about; perhaps some kind of power limiting kicking in, or temperature throttling, or something?

That was my first thought too — I do have fairly aggressive throttling configured when RAM temps get too high. So I monitored temps and re-ran the bench three times, and it doesn’t seem to be the culprit.
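
For what it's worth, I just watch the BMC DIMM sensors during the runs, roughly like this (sensor names vary per board, so this is only a sketch):

    # Poll DIMM temperature sensors from the BMC every few seconds during a bench run.
    # On boards that expose DIMM sensors in-band, `sensors` from lm-sensors works too.
    watch -n 5 'ipmitool sdr type Temperature | grep -i dimm'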

> Have you checked out LACT for undervolting your GPUs instead of simple power limiting? It ends up giving roughly the same performance at lower power usage too! (Holler if you need a link about LACT; I also have some discussion of it in my recent talk here, with a timestamp listed in the text: https://blog.aifoundry.org/p/adventures-in-model-quantization)

Not yet — I’ve been a bit scared to touch undervolting 😅. But it’s on my "TODO (eventually)" list. Thanks for the link — I’ll watch it now!

> Cheers and happy new years!

You too! And thank you for your work.

Here is Q8_0 for comparison. It is significantly slower at low context, but the difference becomes marginal later. It doesn't fit 192K at -ub 8192 -ctk f16 into 96 GB though (one untested idea for squeezing it in is sketched below the table):

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 256 | 0 | 13.499 | 606.85 | 15.182 | 16.86 |
| 8192 | 256 | 16384 | 15.628 | 524.18 | 16.611 | 15.41 |
| 8192 | 256 | 32768 | 18.561 | 441.36 | 18.094 | 14.15 |
| 8192 | 256 | 49152 | 23.677 | 345.99 | 19.716 | 12.98 |
| 8192 | 256 | 65536 | 28.509 | 287.34 | 21.286 | 12.03 |
| 8192 | 256 | 81920 | 32.778 | 249.92 | 29.586 | 8.65 |
| 8192 | 256 | 98304 | 37.355 | 219.30 | 33.977 | 7.53 |
| 8192 | 256 | 114688 | 42.326 | 193.55 | 39.272 | 6.52 |
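
If I wanted to squeeze the full 192K into 96 GB with Q8_0, the obvious knobs would be a smaller -ub and/or a q8_0 KV cache; something like this (untested sketch, not what I ran above):

    # Untested sketch: trade some PP speed for KV-cache headroom by shrinking the
    # micro-batch and quantizing the KV cache.
    ./llama-sweep-bench --model "$MODEL_PATH" --no-mmap \
        -b 4096 -ub 4096 \
        -ctk q8_0 -ctv q8_0 -c 196608 \
        -ngl 999 -ncmoe 999 \
        --grouped-expert-routing --merge-qkv \
        --threads 24 --threads-batch 32 --warmup-batch -n 256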

Ah, this is tempting me to run GLM 4.7 on the RTX 5090 and see how it holds up. I've been positive about MiniMax M2.1 since it's smaller and more generous on memory. Different fine-tuner, though.

> Have you checked out LACT for undervolting your GPUs instead of simple power limiting? It ends up giving roughly the same performance at lower power usage too! (Holler if you need a link about LACT; I also have some discussion of it in my recent talk here, with a timestamp listed in the text: https://blog.aifoundry.org/p/adventures-in-model-quantization)

I have LACT, and AFAIK there's no direct undervolting capability on Linux, as it has been disabled in the NVIDIA drivers for Blackwell. I was adamant about getting it done.

That said, offsetting the core clock by +50 and dropping power consumption a touch (400 W) substantially lowers the temps, which makes up for the small performance loss.

It works well on fully-in-VRAM MoE models, which virtually hit full core speed, whereas dense models will be visibly slower.
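
For reference, the power-cap half of that can also be done with plain nvidia-smi; the clock offset is the part that needs LACT or similar tooling:

    # Keep the driver loaded so the setting sticks while no client is attached.
    sudo nvidia-smi -pm 1
    # Cap board power at 400 W (resets on reboot).
    sudo nvidia-smi -pl 400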

Is the temperature on RAM significant on offloaded models? Never checked their temps. Nuts.

> Is the temperature on RAM significant on offloaded models? Never checked their temps. Nuts.

Yeah — it can be. Offloaded models put a lot of pressure on the memory subsystem. On my machine, I couldn’t finish a single sweep-bench run until I added a DRAM heatsink. Consumer DIMMs often ship with heatsinks these days, but server RAM typically doesn’t — presumably because it’s designed for chassis airflow with turbine-style fans.

Thanks for the heads-up. Big blind spot on my side. I might have been lucky, since I always keep the room cool and have plenty of airflow kicking in early.

Never seen any particular signs. What caused your bench not to finish? Locking up?

Mine don't have heatsinks. I suppose I need to stress-test them and potentially get some.

@wonderfuldestruction

It was a few months back, but I remember watching RAM temps climb to ~97°C, everything in red, and me not shutting it down… until the machine did it itself 🙂.

After booting again, one DRAM module didn’t show up. I shuffled the modules around the next day, pretty sure I had just lost a 64 GB stick and I just needed to figure out which one — but the machine eventually recognized all of them. And it actually recovered: I stress-tested it thoroughly afterwards.

That experience forced me to find the “memory frequency throttling” screen in BIOS and put sensible limits there. Once I did, every sweep-bench was crippled by throttling. The heatsinks I installed later fixed all of that.

https://cdna.pcpartpicker.com/static/forever/images/userbuild/510756.1b9ce15bb0a74a45db0b0d8c33e81959.1600.jpg
