(IQ2_XXS) Initial Performance Results Inside 「96GB VRAM / 256GB DDR5」
What's better than one @ubergarm ? how about an @ubergarm2 !? lol
This model, along with the LongCat family, has been on my radar; glad we finally have some GGUF support. Will update this post once the model finishes downloading and I get some testing done! (probably in an hour or two)
As always, specs for clarity:
1x 4090 / 3x 3090 (~900 GB/s combined VRAM bandwidth)
Intel QYFS Sapphire Rapids
ASUS W790 Sage with 256GB DDR5 (350 GB/s bandwidth)
revised (current) launch command:
/home/phone/Documents/ik_llama.cpp/build/bin/llama-server \
--model /home/phone/Downloads/LocalModels/Ling-1T-smol-IQ2_XXS-00001-of-00006.gguf \
--alias ubergarm/Ling-1T-smol-IQ2_XXS \
--ctx-size 20000 \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
-fa -fmoe -ger \
-ngl 99 \
-ot "blk\.(0|1|2|3|4|5|6|7)\.ffn_.*=CUDA0" \
-ot "blk\.(8|9|10|11|12)\.ffn_.*=CUDA1" \
-ot "blk\.(13|14|15|16|17)\.ffn_.*=CUDA2" \
-ot "blk\.(18|19|20|21|22)\.ffn_.*=CUDA3" \
-ot exps=CPU \
--parallel 1 \
--threads 48 \
--threads-batch 56 \
--host 0.0.0.0 \
--port 8081 \
--no-mmap
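For anyone adapting the `-ot` overrides above to a different GPU count or layer split, the patterns are just regex alternations of block indices. A minimal sketch (plain Python, not part of ik_llama.cpp; `ot_pattern` is a hypothetical helper name) that generates the same strings used in the launch command:

```python
# Sketch: build -ot tensor-override regexes that pin contiguous ranges of
# FFN tensor blocks to specific CUDA devices, as in the launch command above.
def ot_pattern(blocks, device):
    """Return an override like 'blk\\.(0|1|2)\\.ffn_.*=CUDA0'."""
    alternation = "|".join(str(b) for b in blocks)
    return f"blk\\.({alternation})\\.ffn_.*={device}"

# Layer ranges mirroring the launch command: 8 blocks on the 4090,
# 5 each on the three 3090s; remaining experts ('exps') stay on CPU.
assignments = {
    "CUDA0": range(0, 8),
    "CUDA1": range(8, 13),
    "CUDA2": range(13, 18),
    "CUDA3": range(18, 23),
}

for device, blocks in assignments.items():
    print(f'-ot "{ot_pattern(blocks, device)}"')
```

Shifting a boundary in `assignments` (e.g. `range(0, 9)` for CUDA0) is then a one-line change instead of hand-editing alternations.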
edit 1; WE ARE LIVE
Hugging Face's servers seem to be weird lately; part 4 had to be redownloaded a few times because the download kept ending prematurely... oh well, it's finally loaded up in ik_llama.cpp, time to test!
I'm impressed... ~50B active parameters, and from a fresh chat it's around the same speed as DeepSeek, which has 37B active.
Token falloff at long context is minimal: a fresh conversation starts at 10.6 t/s generation, and after a 10k context fill it only dropped by 1 token per second... Each GPU also has room for 1 or 2 more layers to offload; I was generous with my initial script to allow for overhead and no crashes.
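For scale, that falloff works out to under a 10% slowdown, a quick back-of-envelope check:

```python
# Back-of-envelope check of the generation-speed falloff reported above.
fresh_tps = 10.6              # t/s in a fresh conversation
filled_tps = fresh_tps - 1.0  # ~1 t/s slower after the 10k context fill
drop_pct = (fresh_tps - filled_tps) / fresh_tps * 100
print(f"{drop_pct:.1f}% slowdown at 10k context")  # ~9.4%
```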
Initial "who are you" prompt:
10k Context fill + 1.5k token generation results:
Impressed. I'll keep playing around with it to see just how the 'vibes' are, but overall performance is solid with good generation integrity at longer contexts.
edit 2; forgot to add the '-ub 4096 -b 4096' parameters for my little PP (prompt processing) boost! Doing so has doubled it.
final edit for this post: I shoved as many layers as I could into each GPU, getting a solid 1 t/s generation increase across the board. Time to compare it against Kimi! (updated launch script above to reflect this change)
I noticed that this quant (IQ2_XXS) seems to repeat itself a lot in its responses during casual conversation, especially in other languages. I'm unsure if it's the model, ik_llama.cpp, or just the crushed first 4 dense layers.
For instance, starting off strong, I say "good morning" in Japanese and it replies as expected. I then tell it I'm going to go buy breakfast; it started its reply with "good morning!" again, then proceeded to ask what breakfast I had in mind... I sent my response and it said good morning AGAIN, then proceeded with a normal reply lmfao
I noticed the IQ2_KS has q8_0 for the first 4 layers; I'll redownload and test that one out tonight... the file size isn't too much different, idk why I didn't just start with that one! lol
slightly off topic, but I'm really wishing I went with 512GB of RAM when building this PC, just to run slightly better 1T model quants. RAM prices are up now and I'm stuck with the old 256GB / 96GB combo until the landscape changes. Hell, my fiancée and I went to Micro Center last night and they had an RTX 6000 96GB card in stock. The look she gave me when I was considering buying it for $9k was enough to kill a man. Hopefully when next gen drops, data centers will offload their current GPU supply and we will have more budget-friendly access to these parts.
Yeah, that IQ2_XXS is really just for testing mainline llama.cpp PR https://github.com/ggml-org/llama.cpp/pull/16063 ... feel free to try that and see if it behaves any differently, as that would help folks figure out whether it is implementation or model/chat-template related.
Also, try it again without -ger to see if it gives similar responses, as afaik you are the first person to test -ger with hybrid CPU+GPU for Ling-1T. That PR is very fresh: https://github.com/ikawrakow/ik_llama.cpp/pull/838
And finally, possibly try with --jinja, though tbh I dunno exactly how that affects the chat-template business 😅 as it has changed so rapidly in the past 6 weeks across both ik and mainline...
Good eyes, yes: for the IQ2_KS I juiced the first 4 attn layers to q8_0, as they were reporting higher importance in the llama-imatrix --layer-similarity scores, so I figured why not boost them a bit while keeping the rest smaller to crunch down that A50B and the VRAM requirement enough to fit some kv-cache on <24GB GPU rigs.
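The exact metric behind --layer-similarity lives in ik_llama.cpp's llama-imatrix tool; as an illustration only (an assumption, not the tool's actual code), one plausible notion is cosine similarity between a layer's input and output hidden states, where a low score suggests the layer transforms the representation heavily and so deserves higher-bit quantization:

```python
# Illustration only: cosine similarity between toy hidden states entering
# and leaving a layer. A low score would mark the layer as more "important"
# and a candidate to keep at q8_0 rather than a crushed 2-bit quant.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

layer_in  = [0.2, -1.1, 0.7, 0.3]   # toy hidden state entering a layer
layer_out = [1.5,  0.4, -0.9, 0.8]  # toy hidden state leaving it

similarity = cosine(layer_in, layer_out)
print(f"layer similarity: {similarity:.3f}")  # low value -> boost its bits
```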
256GB RAM + 96GB VRAM is still a pretty sweet rig, and a lot of sub-1T models are pretty good, as you know; GLM-4.6 still seems pretty hot right now in that size category. Life gives us so many opportunities to appreciate what we do have, it seems haha... hugs
Thanks for always keeping it real and bringing curiosity and authenticity while sharing your experiences!
Well, it turns out the IQ2_KS is actually running quicker than the XXS, at around 13 t/s generation!
This morning I had my fiancée talk to the model in Chinese; since it's a Chinese model, I wanted her to test it and assess its native-tongue ability... She was very impressed, said the model was very descriptive and knowledgeable about not just recent data but even some older, more obscure things that only older Wuhan residents would really know. I think this model is great so far; thanks again for the quick and reliable quants!


