Problems with logical reasoning performance of GLM-4.7-Flash

#35
by sszymczyk - opened

I'm currently evaluating this model with my lineage-bench benchmark. I started with lineage graphs of 8 nodes. Other models of similar size (qwen3-32b or olmo-3-32b-think) achieve over 90% accuracy at this difficulty level. Unfortunately, GLM-4.7-Flash reaches only around 60% on lineage-8. I tested the model via OpenRouter (z-ai and phala providers), via ZenMux, and locally (in sglang); accuracy was similar in all cases.

Then I reduced the problem to very simple lineage graphs of size 4 (only 4 nodes). Even at this reduced difficulty level, GLM-4.7-Flash still makes reasoning errors (accuracy around 80%).
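For readers unfamiliar with the task: a size-4 lineage problem boils down to chaining a handful of parent-of relations. Here is a rough sketch of that kind of question (the names, helper function, and prompt shape are hypothetical illustrations, not the actual lineage-bench format):

```python
# Hypothetical 4-node lineage graph: a simple parent-of chain.
# Ancestry is just the transitive closure of this relation.
parent_of = {"Alice": "Bob", "Bob": "Carol", "Carol": "Dave"}

def is_ancestor(a, b):
    """Return True if `a` is an ancestor of `b` under parent_of."""
    cur = parent_of.get(a)
    while cur is not None:
        if cur == b:
            return True
        cur = parent_of.get(cur)
    return False

# Even with only 4 nodes, a correct answer requires chaining
# two or three relations -- the step the model gets wrong
# in roughly 20% of cases.
print(is_ancestor("Alice", "Dave"))  # True
print(is_ancestor("Dave", "Alice"))  # False
```

This is about as simple as multi-hop relational reasoning gets, which is why the ~80% accuracy at this size is surprising.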

Note that I checked the model with the recommended sampling parameters (temp 1.0, top-p 0.95), with reduced temperature (temp 0.7, top-p 1.0), and even at temperature close to zero. The model makes logical reasoning errors in all cases.
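In OpenAI-compatible request terms, the three configurations I swept look roughly like this (the model id and near-zero temperature value below are placeholders, not necessarily the exact strings used against each provider):

```python
# The three sampling configurations tried, as OpenAI-compatible
# chat-completion payload fragments. Model id is a placeholder.
settings = [
    {"temperature": 1.0, "top_p": 0.95},   # recommended defaults
    {"temperature": 0.7, "top_p": 1.0},    # reduced temperature
    {"temperature": 0.01, "top_p": 1.0},   # near-greedy decoding
]

def make_payload(question, sampling):
    """Build a chat-completion request body for one run."""
    return {
        "model": "glm-4.7-flash",  # placeholder model id
        "messages": [{"role": "user", "content": question}],
        **sampling,
    }

payloads = [make_payload("Is Alice an ancestor of Dave?", s) for s in settings]
```

Since the error rate barely moves across all three, the failures look systematic rather than a sampling artifact.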

Is this expected? With such high benchmark results, the model should have excellent reasoning capabilities. Is there any other recommended reference implementation I could test?

The model's reasoning ability is very weak; I tested many cases as well.