GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx

This is a custom Deckard (qx) quant: select attention paths, the embeddings, and the model head are kept at 8 bits, while the data stores are quantized to 6 bits.
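
For illustration, a recipe in this spirit can be expressed through mlx-lm's mixed-quantization hook (`quant_predicate`). The sketch below is a plausible approximation only, not the actual Deckard (qx) recipe: the layer-name patterns and group sizes are assumptions.

```python
# Illustrative sketch of a qx86-style mixed-precision predicate for
# mlx-lm. The name patterns and group sizes are assumptions, not the
# published Deckard (qx) recipe.
def qx86_like_predicate(path, module, config):
    # Keep embeddings, the output head, and attention projections at 8 bits.
    if any(key in path for key in ("embed", "lm_head", "attn")):
        return {"bits": 8, "group_size": 32}
    # Everything else (the bulk of the weights) is stored at 6 bits.
    return {"bits": 6, "group_size": 64}
```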

The same quant method appears in the Qwen3 series as qx86x, and in Qwen3-Next as qx86n.

It usually outperforms the BF16 original by effectively focusing cognition and reducing perplexity, while shrinking the model to less than half the size.

Test suites for GLM are very slow; producing full benchmark numbers would take about a week while blocking the hardware I use for other work. That's why I'd appreciate feedback and likes: they help ensure the model stays in the collection if it is really good.

Perplexity: 7.017 ± 0.063
Peak memory: 80.44 GB
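
For reference, a perplexity figure of this kind corresponds to the exponential of the mean next-token negative log-likelihood over a held-out text. The sketch below shows one way to compute it with mlx; `eval.txt` and the 2048-token window are placeholders, and this is not necessarily the exact procedure behind the number above.

```python
# Minimal perplexity sketch: ppl = exp(mean token-level NLL).
# The evaluation file and window size below are placeholder assumptions.
import mlx.core as mx
import mlx.nn as nn
from mlx_lm import load

model, tokenizer = load("nightmedia/GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx")

tokens = tokenizer.encode(open("eval.txt").read())[:2048]
inputs = mx.array(tokens[:-1])[None]   # model input
targets = mx.array(tokens[1:])[None]   # next-token targets

logits = model(inputs)
nll = nn.losses.cross_entropy(logits, targets, reduction="mean")
print(f"Perplexity: {mx.exp(nll).item():.3f}")
```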

-G

This model, GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx, was converted to MLX format from cerebras/GLM-4.5-Air-REAP-82B-A12B using mlx-lm version 0.28.3.
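
A minimal sketch of a comparable conversion via the mlx-lm Python API is shown below. The published model used the custom qx86g-hi recipe; the uniform settings here are placeholders, and a predicate like the one sketched earlier would be passed via `quant_predicate`.

```python
from mlx_lm import convert

# Baseline conversion sketch; q_bits/q_group_size are uniform
# placeholders, not the mixed qx86g-hi recipe used for this model.
convert(
    hf_path="cerebras/GLM-4.5-Air-REAP-82B-A12B",
    mlx_path="GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx",
    quantize=True,
    q_bits=6,
    q_group_size=32,
)
```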

Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/GLM-4.5-Air-REAP-82B-A12B-qx86g-hi-mlx")

prompt = "hello"

# Apply the chat template when the tokenizer defines one.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```