Problems with logical reasoning performance of GLM-4.7-Flash

#35
by sszymczyk - opened

I'm currently evaluating this model with my lineage-bench benchmark. I started with lineage graphs of 8 nodes. Other models of similar size (qwen3-32b or olmo-3-32b-think) achieve over 90% accuracy at this difficulty level. Unfortunately, GLM-4.7-Flash reaches only around 60% on lineage-8. I tested the model via OpenRouter (z-ai and phala providers), via ZenMux, and locally (in sglang); accuracy was similar in all cases.

Then I reduced the problem to very simple lineage graphs of size 4 (only 4 nodes). Even at this reduced difficulty level, GLM-4.7-Flash still makes reasoning errors (accuracy around 80%).
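For readers unfamiliar with the task: a size-4 lineage problem boils down to chaining a handful of parent-of relations. Here is a rough sketch of that kind of question (the names, helper function, and prompt shape are hypothetical illustrations, not the actual lineage-bench format):

```python
# Hypothetical 4-node lineage graph: a simple parent-of chain.
# Ancestry is just the transitive closure of this relation.
parent_of = {"Alice": "Bob", "Bob": "Carol", "Carol": "Dave"}

def is_ancestor(a, b):
    """Return True if `a` is an ancestor of `b` under parent_of."""
    cur = parent_of.get(a)
    while cur is not None:
        if cur == b:
            return True
        cur = parent_of.get(cur)
    return False

# Even with only 4 nodes, a correct answer requires chaining
# two or three relations -- the step the model gets wrong
# in roughly 20% of cases.
print(is_ancestor("Alice", "Dave"))  # True
print(is_ancestor("Dave", "Alice"))  # False
```

This is about as simple as multi-hop relational reasoning gets, which is why the ~80% accuracy at this size is surprising.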

Note that I checked the model with the recommended sampling parameters (temp 1.0, top-p 0.95), with reduced temperature (temp 0.7, top-p 1.0), and even at temperature close to zero. The model makes logical reasoning errors in all cases.
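In OpenAI-compatible request terms, the three configurations I swept look roughly like this (the model id and near-zero temperature value below are placeholders, not necessarily the exact strings used against each provider):

```python
# The three sampling configurations tried, as OpenAI-compatible
# chat-completion payload fragments. Model id is a placeholder.
settings = [
    {"temperature": 1.0, "top_p": 0.95},   # recommended defaults
    {"temperature": 0.7, "top_p": 1.0},    # reduced temperature
    {"temperature": 0.01, "top_p": 1.0},   # near-greedy decoding
]

def make_payload(question, sampling):
    """Build a chat-completion request body for one run."""
    return {
        "model": "glm-4.7-flash",  # placeholder model id
        "messages": [{"role": "user", "content": question}],
        **sampling,
    }

payloads = [make_payload("Is Alice an ancestor of Dave?", s) for s in settings]
```

Since the error rate barely moves across all three, the failures look systematic rather than a sampling artifact.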

Is this expected? With such high benchmark results, the model should have excellent reasoning capabilities. Is there any other recommended reference implementation I could test?

The model's reasoning ability is very weak; I tested many cases as well.