Benchmark suggestion

#2 by FlareRebellion

Maybe I'm alone in this, but I lack the skill and means to test this myself, so here's my thought:

If we compare

https://huggingface.co/unsloth/GLM-4.6-GGUF
and
https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF

in terms of model size, we can see that each quant of the full model has a pruned (REAP) counterpart of roughly the same size at one step higher bit precision (e.g. Q3_K_XL of the full model ≈ Q4_K_XL of the REAP model in size), and so on. Now here's the question: which of the two has better accuracy? I have no clue, but I can't help wondering: is more aggressive quantization better or worse than pruning plus a milder quant?
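For reference, here is a rough way to produce that size pairing automatically. This is an untested sketch: the quant-tag regex is just a guess at unsloth's file naming, and it simply sums the .gguf shard sizes per quant.

```python
import re
from collections import defaultdict
from huggingface_hub import HfApi

def quant_sizes(repo_id):
    """Sum .gguf shard sizes per quant tag (Q*/IQ*) found in the file path."""
    api = HfApi()
    sizes = defaultdict(int)
    for entry in api.list_repo_tree(repo_id, recursive=True):
        size = getattr(entry, "size", None)  # folders have no size attribute
        if size and entry.path.endswith(".gguf"):
            m = re.search(r"(IQ|Q)\d_[A-Z0-9_]+", entry.path)
            if m:
                sizes[m.group(0)] += size
    return dict(sizes)

full = quant_sizes("unsloth/GLM-4.6-GGUF")
reap = quant_sizes("unsloth/GLM-4.6-REAP-268B-A32B-GGUF")

# for each full-model quant, show the REAP quant closest to it in size
for quant, size in sorted(full.items(), key=lambda kv: kv[1]):
    near_q, near_s = min(reap.items(), key=lambda kv: abs(kv[1] - size))
    print(f"{quant:>10} {size / 1e9:7.1f} GB  ~  REAP {near_q:>10} {near_s / 1e9:7.1f} GB")
```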

You are not alone. I would also like to have an optimal progression of (REAP%, Quant) pairs.

I think a volunteer with lots of multi-GPU horsepower is needed. The current REAP code wants the full model in VRAM! For GLM-4.6 the BF16 weights alone are over 700 GB, i.e. at least 6x H200 (141 GB each).

PS: In your comparison above, I think Q4_K_XL REAP comes out ahead. REAP at 25% costs only one or two percent in quality, while low quants lose quality roughly exponentially as the bitrate drops (i.e. cutting the bitrate by 25% costs far more than 25% in quality).
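To make that "optimal progression of (REAP%, Quant) pairs" concrete: once someone has measured a quality score and a file size for each configuration, the interesting ones are just the Pareto front (no other configuration is both smaller and better). A minimal sketch, with the labels in the usage comment purely as examples:

```python
def pareto_front(points):
    """points: iterable of (label, size_gb, score), higher score is better.
    Returns the configurations that no smaller-or-equal-size config beats."""
    front = []
    # sort by size; on ties, put the better score first so the worse one is dropped
    for label, size, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if not front or score > front[-1][2]:
            front.append((label, size, score))
    return front

# Usage once real measurements exist (score could be MMLU, or negated perplexity):
#   front = pareto_front([
#       ("full Q3_K_XL",   size_gb, score),
#       ("REAP25 Q4_K_XL", size_gb, score),
#       ...
#   ])
```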

I coded a prune variant that can read the observer data from disk instead of from vLLM.
https://github.com/708-145/mergekit/blob/clown/prune.py

What's missing is the part where llama.cpp writes out the observer data. Then the data could be collected with RAM offloading or even CPU-only inference.
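Purely to illustrate the intended flow (the file layout, array names, and saliency formula below are my own placeholders, not what prune.py, REAP, or llama.cpp actually use): once llama.cpp could dump per-layer, per-expert routing statistics to disk, an offline script could rank experts and decide which ones to drop without ever loading the model into VRAM.

```python
# Illustrative only: assumes one .npz per MoE layer was dumped during CPU
# inference, each holding per-expert "router_weight_sum", "output_norm_sum"
# and "token_count" arrays. All names here are assumptions for the sketch.
import glob
import os
import numpy as np

def load_observer(dump_dir):
    """Load hypothetical per-layer expert statistics from disk."""
    layers = []
    for path in sorted(glob.glob(os.path.join(dump_dir, "layer_*.npz"))):
        with np.load(path) as f:
            layers.append({
                "router_weight_sum": f["router_weight_sum"],  # shape [n_experts]
                "output_norm_sum": f["output_norm_sum"],      # shape [n_experts]
                "token_count": f["token_count"],              # shape [n_experts]
            })
    return layers

def experts_to_drop(stats, keep_ratio=0.75):
    """Rank experts by a crude saliency proxy and return the least useful indices."""
    tokens = np.maximum(stats["token_count"], 1)
    saliency = (stats["router_weight_sum"] / tokens) * (stats["output_norm_sum"] / tokens)
    n_drop = len(saliency) - int(round(len(saliency) * keep_ratio))
    return sorted(np.argsort(saliency)[:n_drop].tolist())

for i, stats in enumerate(load_observer("observer_dump")):
    print(f"layer {i}: drop experts {experts_to_drop(stats, keep_ratio=0.75)}")
```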
