Benchmark suggestion
Maybe I'm alone in this, but I lack the skills and resources to test it myself, so here's my thought:
If we compare
https://huggingface.co/unsloth/GLM-4.6-GGUF
and
https://huggingface.co/unsloth/GLM-4.6-REAP-268B-A32B-GGUF
with regard to model sizes, we can see that the pruned (REAP) model always has a corresponding quant of about the same size at one more bit of precision (e.g., Q3_K_XL of the full model is roughly the same size as Q4_K_XL of the REAP model), and so on. Now here's the question: which of two such similarly sized files has better accuracy? No clue, but I can't help but wonder... is more aggressive quantization better or worse than pruning plus a milder quant?
You are not alone. I would also like to have an optimal progression of (REAP%, Quant) pairs.
I think a volunteer with lots of multi-GPU horsepower is needed. The current REAP implementation wants the full model in VRAM! For GLM-4.6 that is roughly 355B parameters at BF16, i.e. over 700 GB, or about 6x H200.
PS: In your comparison above I think Q4_K_XL REAP comes out ahead. REAP at 25% costs only one or two percent in quality, while low quants lose quality much faster than linearly with bitrate (a 25% cut in bitrate costs far more than 25% in quality).
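For anyone who wants to check this without the full-model VRAM, here is a minimal sketch of how the two released GGUFs could be compared with llama.cpp's perplexity tool (recent builds ship it as `llama-perplexity`). The file names and the eval text are placeholders, and lower perplexity is only a rough proxy for accuracy, not a full benchmark:

```python
# Hypothetical helper: run llama.cpp's perplexity tool on two similarly sized
# GGUFs (full model at a lower quant vs. REAP model at a higher quant).
# Paths and file names below are placeholders, not the exact repo file names.
import subprocess

MODELS = {
    "GLM-4.6 Q3_K_XL (full)":    "GLM-4.6-Q3_K_XL.gguf",
    "GLM-4.6 REAP-268B Q4_K_XL": "GLM-4.6-REAP-268B-A32B-Q4_K_XL.gguf",
}
EVAL_TEXT = "wiki.test.raw"  # any held-out plain-text corpus

for name, path in MODELS.items():
    print(f"=== {name} ===")
    # llama-perplexity prints per-chunk and final PPL; --chunks limits runtime.
    subprocess.run(
        ["llama-perplexity", "-m", path, "-f", EVAL_TEXT, "--chunks", "64"],
        check=True,
    )
```

Running both on the same text and comparing the final PPL values would at least tell us which side of the (REAP%, Quant) trade-off wins at equal file size.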
I coded a prune variant that can read the observer data from disk instead of from vLLM.
https://github.com/708-145/mergekit/blob/clown/prune.py
What's missing is the part where llama.cpp writes out the observer data. Then the data could be collected with RAM offloading or even complete CPU inference.
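To make the handoff concrete, here is a rough sketch of what a disk-based observer dump and loader could look like. The file layout, key names, and the saliency formula are my assumptions for illustration, not what prune.py or llama.cpp actually do, since the dump side doesn't exist yet:

```python
# Assumed on-disk observer format: one .npz per MoE layer, containing
#   gate_sums:  per-expert sum of router gate weights over the calibration set
#   act_norms:  per-expert mean norm of expert outputs
# A llama.cpp-side dumper would have to write these during a calibration run
# (which could then happen with RAM offloading or pure CPU inference).
import numpy as np
from pathlib import Path

def load_keep_lists(observer_dir: str, keep_fraction: float = 0.75):
    """Return {layer_index: kept expert indices}, dropping the least salient experts."""
    keep = {}
    for f in sorted(Path(observer_dir).glob("layer_*.npz")):
        layer = int(f.stem.split("_")[1])
        data = np.load(f)
        # REAP-style saliency guess: router mass times expert output magnitude.
        saliency = data["gate_sums"] * data["act_norms"]
        n_keep = max(1, int(round(len(saliency) * keep_fraction)))
        keep[layer] = np.argsort(saliency)[::-1][:n_keep].tolist()
    return keep

if __name__ == "__main__":
    # e.g. observer data dumped by a (not yet existing) llama.cpp pass
    print(load_keep_lists("observer_dump", keep_fraction=0.75))
```

If llama.cpp could emit something in this spirit during a normal forward pass over calibration text, the pruning step itself would no longer need the full model in VRAM.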