EPYC, RTX 5090 vs RTX 6000
Sharing some benchmarks of the GLM-4.7 IQ5_K quant on an EPYC 9355, at various context sizes:
Single RTX 5090, 32K context (max that fits with f16 KV cache)
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-b 4096 -ub 4096 \
-ctk f16 -ctv f16 -c 32768 \
-ngl 999 -ncmoe 999 \
--grouped-expert-routing \
--merge-qkv \
--threads 24 \
--threads-batch 32 \
--warmup-batch \
-n 256
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 256 | 0 | 8.678 | 471.99 | 12.459 | 20.55 |
| 4096 | 256 | 4096 | 9.077 | 451.26 | 12.866 | 19.90 |
| 4096 | 256 | 8192 | 9.423 | 434.69 | 13.107 | 19.53 |
| 4096 | 256 | 12288 | 9.790 | 418.38 | 13.481 | 18.99 |
| 4096 | 256 | 16384 | 10.121 | 404.69 | 14.077 | 18.19 |
| 4096 | 256 | 20480 | 10.396 | 393.99 | 14.084 | 18.18 |
| 4096 | 256 | 24576 | 10.858 | 377.25 | 14.613 | 17.52 |
| 4096 | 256 | 28672 | 11.392 | 359.55 | 14.764 | 17.34 |
Single RTX 5090, 64K context (max that fits with q8_0 KV cache)
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-b 1024 -ub 1024 \
-ctk q8_0 -ctv q8_0 -c 65536 \
-ngl 999 -ncmoe 999 \
--grouped-expert-routing \
--merge-qkv \
--threads 24 \
--threads-batch 32 \
--warmup-batch \
-n 256
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 1024 | 256 | 0 | 6.536 | 156.67 | 12.658 | 20.22 |
| 1024 | 256 | 8192 | 6.633 | 154.38 | 13.775 | 18.58 |
| 1024 | 256 | 16384 | 6.845 | 149.61 | 14.822 | 17.27 |
| 1024 | 256 | 24576 | 7.000 | 146.29 | 16.269 | 15.74 |
| 1024 | 256 | 32768 | 6.948 | 147.39 | 17.989 | 14.23 |
| 1024 | 256 | 40960 | 7.314 | 140.01 | 19.740 | 12.97 |
| 1024 | 256 | 49152 | 7.642 | 134.00 | 21.392 | 11.97 |
| 1024 | 256 | 57344 | 8.001 | 127.98 | 22.845 | 11.21 |
| 1024 | 256 | 64512 | 8.139 | 125.81 | 25.354 | 10.10 |
Single RTX 6000, 192K context (max supported)
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-b 8192 -ub 8192 \
-ctk f16 -ctv f16 -c 196608 \
-ngl 999 -ncmoe 999 \
--grouped-expert-routing \
--merge-qkv \
--threads 24 \
--threads-batch 32 \
--warmup-batch \
-n 256
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 256 | 0 | 11.294 | 725.34 | 12.656 | 20.23 |
| 8192 | 256 | 16384 | 13.643 | 600.43 | 14.151 | 18.09 |
| 8192 | 256 | 32768 | 16.848 | 486.22 | 15.492 | 16.53 |
| 8192 | 256 | 49152 | 21.804 | 375.71 | 17.151 | 14.93 |
| 8192 | 256 | 65536 | 26.748 | 306.27 | 18.837 | 13.59 |
| 8192 | 256 | 81920 | 31.093 | 263.47 | 26.243 | 9.75 |
| 8192 | 256 | 98304 | 35.944 | 227.91 | 31.187 | 8.21 |
| 8192 | 256 | 114688 | 40.928 | 200.15 | 33.786 | 7.58 |
| 8192 | 256 | 131072 | 45.776 | 178.96 | 36.925 | 6.93 |
| 8192 | 256 | 147456 | 49.368 | 165.94 | 39.823 | 6.43 |
| 8192 | 256 | 163840 | 54.102 | 151.42 | 42.650 | 6.00 |
| 8192 | 256 | 180224 | 58.958 | 138.95 | 45.294 | 5.65 |
This is running in a VM, with the GPUs and CPU power-limited, so better results are achievable on the same hardware.
The significant drop in speed around 82K is interesting. At 128K it runs at half the speed of Kimi-K2.
But honestly, I haven't seen a model yet that works well with context sizes above 64K or so, so it doesn't matter much 😀.
Thank you for the quants @ubergarm .
Thanks for the benchmarks here on the new GLM-4.7, compared against Kimi-K2-Thinking over here: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/5
I didn't graph the comparisons, which would make the tea leaves easier to read, but it seems like Kimi-K2-Thinking has faster TG as context depth gets deeper. Pretty interesting, as Kimi is a 1000B model vs the GLM's 358B! Maybe it has something to do with the MLA-style attention on Kimi?
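If anyone wants to eyeball it without a proper graph, something like this pulls N_KV and S_TG out of two saved sweep-bench tables (the filenames are made up, and gnuplot is just one quick option for a look):

```bash
# Extract N_KV and S_TG from the markdown tables (awk fields $4 and $8 with '|' as
# the separator); data rows are the ones that start with "| <digit>".
awk -F'|' '/^\| *[0-9]/ {print $4+0, $8+0}' glm47_sweep.md  > glm.dat
awk -F'|' '/^\| *[0-9]/ {print $4+0, $8+0}' kimi_k2_sweep.md > kimi.dat
# Quick terminal plot of TG speed vs. context depth for both models.
gnuplot -e "set term dumb; plot 'glm.dat' w l t 'GLM-4.7', 'kimi.dat' w l t 'Kimi-K2'"
```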
A few thoughts about your comments over there:
Kimi-K2 Thinking Q4_X with max supported context on RTX PRO 6000. It only fills 50% of VRAM (47 GB), while GLM 4.7 IQ5_K takes it all.
Yeah, my guess is that MLA is much more efficient on VRAM for storing long context. If you don't need all that extra context length, offloading more routed experts into VRAM will likely boost your speed, making Kimi even faster at TG, relatively speaking. (My model card for GLM-4.7 has some newer examples.)
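For example, something along these lines (a sketch only: $KIMI_MODEL_PATH and -ncmoe 40 are made-up placeholders, and the right value depends on how much VRAM is actually free):

```bash
# Illustrative trade-off: spend the KV-cache headroom on routed experts instead of
# context by keeping fewer MoE layers' experts on the CPU (-ncmoe 40 is a made-up
# starting point; raise it if you OOM, lower it while VRAM allows).
./llama-sweep-bench --model "$KIMI_MODEL_PATH" --no-mmap \
  -b 4096 -ub 4096 -c 65536 \
  -ngl 999 -ncmoe 40 \
  --threads 24 --threads-batch 32 --warmup-batch -n 256
```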
This is without --grouped-expert-routing --merge-qkv; I'm not quite sure what those are about:
I believe -ger (grouped expert routing) isn't for all models, but it does work with GLM and can give a small improvement. Similarly, even when using a single GPU you can get something like a 1-2% speed-up with --merge-qkv; my understanding is that both are small kernel fusions that reduce overhead.
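If you want to see what they buy on your own setup, the simplest check is an A/B run of the same sweep; here is a sketch based on the 32K configuration above:

```bash
# Same 32K run as above, once without and once with the two flags, everything else
# held constant. $EXTRA is deliberately left unquoted so the two flags word-split.
for EXTRA in "" "--grouped-expert-routing --merge-qkv"; do
  ./llama-sweep-bench --model "$MODEL_PATH" --no-mmap \
    -b 4096 -ub 4096 -ctk f16 -ctv f16 -c 32768 \
    -ngl 999 -ncmoe 999 $EXTRA \
    --threads 24 --threads-batch 32 --warmup-batch -n 256
done
```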
It has been tricky to keep up with all of ik's hard work lately haha... Especially if you have 2x GPUs and NCCL installed, you can get some great improvements with -sm graph, which is becoming competitive with vLLM in my early impressions, at least for single-user generation. It doesn't help for 1x GPU though, as that path is already optimized.
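For reference, a rough sketch of what the 2x GPU invocation looks like (this assumes a build with NCCL as mentioned, and the rest of the settings are simply carried over from the runs above):

```bash
# Hypothetical two-GPU run using the graph split mode; offload values are illustrative.
CUDA_VISIBLE_DEVICES=0,1 ./llama-sweep-bench --model "$MODEL_PATH" --no-mmap \
  -b 4096 -ub 4096 -c 32768 \
  -ngl 999 -ncmoe 999 -sm graph \
  --threads 24 --threads-batch 32 --warmup-batch -n 256
```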
The significant drop in speed around 82K is interesting.
Yeah, I wonder what that is about; perhaps some kind of power limiting kicking in, or temperature throttling, or something? Have you checked out LACT for undervolting your GPUs instead of simple power limiting? It ends up giving roughly the same performance at lower power usage too! (Holler if you need a link about LACT, or I have some discussion of it in my recent talk here, with the timestamp listed in the text: https://blog.aifoundry.org/p/adventures-in-model-quantization)
Cheers and happy new years!
Thank you for your insights, @ubergarm .
Yeah, I wonder what that is about; perhaps some kind of power limiting kicking in, or temperature throttling, or something?
That was my first thought too — I do have fairly aggressive throttling configured when RAM temps get too high. So I monitored temps and re-ran the bench three times, and it doesn’t seem to be the culprit.
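In case it's useful to anyone else, this is roughly the kind of logging I mean (the ipmitool sensor names are board-specific, so the grep is only an assumption):

```bash
# Log GPU power/clocks/temperature once a second during the sweep, plus DIMM temps
# over IPMI every few seconds (adjust the grep to whatever your BMC calls the sensors).
nvidia-smi --query-gpu=timestamp,power.draw,clocks.sm,temperature.gpu \
  --format=csv -l 1 > gpu_log.csv &
watch -n 5 'ipmitool sdr type Temperature | grep -i dimm'
```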
Have you checked out LACT for undervolting your GPUs instead of simple power limiting? It ends up giving roughly the same performance at lower power usage too! (Holler if you need a link about LACT, or I have some discussion of it in my recent talk here, with the timestamp listed in the text: https://blog.aifoundry.org/p/adventures-in-model-quantization)
Not yet — I’ve been a bit scared to touch undervolting 😅. But it’s on my "TODO (eventually)" list. Thanks for the link — I’ll watch it now!
Cheers and happy new years!
You too! And thank you for your work.
Here is Q8_0 for comparison. It is significantly slower at low context, but the difference is marginal later on. It doesn't fit 192K at -ub 8192 -ctk f16 into 96 GB, though:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 256 | 0 | 13.499 | 606.85 | 15.182 | 16.86 |
| 8192 | 256 | 16384 | 15.628 | 524.18 | 16.611 | 15.41 |
| 8192 | 256 | 32768 | 18.561 | 441.36 | 18.094 | 14.15 |
| 8192 | 256 | 49152 | 23.677 | 345.99 | 19.716 | 12.98 |
| 8192 | 256 | 65536 | 28.509 | 287.34 | 21.286 | 12.03 |
| 8192 | 256 | 81920 | 32.778 | 249.92 | 29.586 | 8.65 |
| 8192 | 256 | 98304 | 37.355 | 219.30 | 33.977 | 7.53 |
| 8192 | 256 | 114688 | 42.326 | 193.55 | 39.272 | 6.52 |
Ah, this is tempting me to run GLM 4.7 on the RTX 5090 and see how it holds up. I've had good results with MiniMax M2.1, since it's smaller and more generous on memory. Different fine-tuner, too.
Have you checked out LACT for undervolting your GPUs instead of simple power limiting? It ends up giving roughly the same performance at lower power usage too! (Holler if you need a link about LACT, or I have some discussion of it in my recent talk here, with the timestamp listed in the text: https://blog.aifoundry.org/p/adventures-in-model-quantization)
I have LACT, and AFAIK there's no direct undervolting capability on Linux, as it has been disabled in the NVIDIA drivers for Blackwell. I was adamant about getting it done.
That said, offsetting the core clock by +50 and lowering power consumption a touch (400 W) substantially drops the temps, which makes up for the small performance loss.
It works well on fully-in-VRAM MoE models, which still hit virtually full core speed, whereas dense models will be visibly slower.
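For what it's worth, the power-cap half of that can be done with plain nvidia-smi (a sketch using the 400 W value from above); the clock offset itself I still set through LACT, since I'm not aware of an nvidia-smi knob for it:

```bash
# Cap the card at 400 W and verify; the +50 core clock offset is applied in LACT.
sudo nvidia-smi -i 0 -pl 400
nvidia-smi -q -d POWER,CLOCK   # confirm the new limit and check current clocks
```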
Is the temperature on RAM significant on offloaded models? Never checked their temps. Nuts.
Is the temperature on RAM significant on offloaded models? Never checked their temps. Nuts.
Yeah — it can be. Offloaded models put a lot of pressure on the memory subsystem. On my machine, I couldn’t finish a single sweep-bench run until I added a DRAM heatsink. Consumer DIMMs often ship with heatsinks these days, but server RAM typically doesn’t — presumably because it’s designed for chassis airflow with turbine-style fans.
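If you want to check yours before buying heatsinks: DDR5 DIMMs have an on-die temperature sensor that lm-sensors can usually read once the right hwmon driver is loaded (the sketch below assumes a recent kernel with the spd5118 driver; on DDR4 the jc42 driver plays the same role, and a BMC will often expose the same readings via ipmitool):

```bash
# Check whether the DIMM temperature sensors are visible to the OS.
sudo modprobe spd5118
sensors | grep -i -A 3 spd5118
```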
Thanks for the heads-up. Big blind spot on my side. I might have been lucky, since I always keep the room cool and have plenty of airflow going.
Never seen any particular signs. What caused your bench not to finish? Locking up?
Mine don't have heatsinks. I suppose I need to stress-test them and prepare myself to possibly get some.
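Probably something like this as a first pass, a sketch only (stress-ng's memrate stressor as a stand-in for the bandwidth-heavy offload workload, with temps watched alongside):

```bash
# Hammer memory bandwidth for a while and watch the temps; worker count, buffer size
# and duration are arbitrary starting points.
stress-ng --memrate 16 --memrate-bytes 1G --timeout 10m &
watch -n 5 sensors
```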
It was a few months back, but I remember watching RAM temps climb to ~97°C, everything in red, and me not shutting it down… until the machine did it itself 🙂.
After booting again, one DRAM module didn't show up. I shuffled the modules around the next day, pretty sure I had lost a 64 GB stick and just needed to figure out which one, but the machine eventually recognized all of them. And it actually recovered: I stress-tested it thoroughly afterwards.
That experience forced me to find the “memory frequency throttling” screen in BIOS and put sensible limits there. Once I did, every sweep-bench was crippled by throttling. The heatsinks I installed later fixed all of that.