GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx
This is a custom Deckard (qx) quant that keeps select attention paths, the embeddings, and the output head at 8 bits, with the data stores at 6 bits.
The same quant method can be found in the Qwen3 series as qx86x, and as qx86n in Qwen3-Next.
It usually outperforms the BF16 original by effectively focusing cognition and reducing perplexity, while shrinking the model to less than half its size.
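For intuition, below is a minimal sketch of how a mixed-precision recipe of this shape can be expressed with mlx-lm's quant_predicate hook. The layer-name patterns and group sizes are illustrative assumptions, not the actual Deckard (qx) recipe.

```python
# Illustrative sketch only: approximates the qx86g idea with mlx-lm's
# mixed-quantization hook. Layer-name patterns and group sizes below are
# assumptions, not the published Deckard (qx) recipe.
from mlx_lm import convert

def qx86_like_predicate(path, module, config):
    # Skip layers that cannot be quantized at all.
    if not hasattr(module, "to_quantized"):
        return False
    # Select attention paths, embeddings, and the head stay at 8 bits.
    if any(key in path for key in ("self_attn", "embed_tokens", "lm_head")):
        return {"bits": 8, "group_size": 32}
    # Everything else (the data stores) drops to 6 bits.
    return {"bits": 6, "group_size": 64}

convert(
    "cerebras/GLM-4.5-Air-REAP-82B-A12B",
    mlx_path="GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx",
    quantize=True,
    quant_predicate=qx86_like_predicate,
)
```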
Test suites for GLM are very slow; it would take me a week to produce numbers, while blocking the hardware I use for other work. That's why I would appreciate feedback and likes: they ensure the model stays in the collection if it is really good.
Perplexity: 7.017 ± 0.063
Peak memory: 80.44 GB
-G
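For context, perplexity is the exponentiated mean per-token cross-entropy. Below is a minimal sketch of scoring a toy snippet with mlx; the figure above was measured on a proper held-out set, and the snippet text here is arbitrary.

```python
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("nightmedia/GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx")

# Score a short text: predict each token from the tokens before it.
tokens = mx.array(tokenizer.encode("The quick brown fox jumps over the lazy dog."))
logits = model(tokens[None, :-1])  # shape: (1, seq-1, vocab)
loss = nn.losses.cross_entropy(logits[0], tokens[1:]).mean()
print(f"Perplexity on this snippet: {mx.exp(loss).item():.3f}")
```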
This model, GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx, was converted to MLX format from cerebras/GLM-4.5-Air-REAP-82B-A12B using mlx-lm version 0.28.3.
Use with mlx
```bash
pip install mlx-lm
```
```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is available.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
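For longer replies, generate accepts a max_tokens limit, and recent mlx-lm versions also expose stream_generate for incremental output. A small sketch, assuming a recent mlx-lm where streamed chunks carry a .text field:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("nightmedia/GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx")

# Print tokens as they are generated instead of waiting for the full reply.
for chunk in stream_generate(model, tokenizer, prompt="hello", max_tokens=256):
    print(chunk.text, end="", flush=True)
print()
```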
Model tree for nightmedia/GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx
- Base model: zai-org/GLM-4.5-Air