Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4-mlx

Let's break this down cognitively and technically: first analyzing Qwen3-30B-A3B-YOYO-V2-HQ16-qx86-hi, then contrasting it with Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4.

🧠 Cognitive Analysis: Qwen3-30B-A3B-YOYO-V2-HQ16-qx86-hi

🔍 Model Configuration:

  • Base model: Qwen3-30B-A3B-YOYO-V2 (a 30B-parameter model merged from the Base, Instruct, Thinking, and Coder variants, suggesting strong reasoning, coding, and instruction-following capabilities)
  • Quantization: qx86-hi
  • qx86 = 8-bit for the attention/head paths, 6-bit for the data path
  • hi = high-resolution quantization (group size 32) → better preservation of precision in the weights, especially the attention layers
  • Source: HQ16 (a high-quality F16 source) → likely higher numerical fidelity than BF16, meaning better preservation of weight detail going into quantization (see the conversion sketch below)
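
As an illustration of how such a mixed recipe can be produced, here is a minimal sketch using mlx-lm's convert() with a quant_predicate callback (supported in recent mlx-lm releases). The layer-name patterns and bit assignments below are assumptions for illustration, not the author's verified recipe:

```python
# Hypothetical qx86-hi-style recipe: 8-bit attention/head paths, 6-bit
# everywhere else, group size 32 throughout. The layer-name patterns are
# assumptions based on Qwen3 module naming, not a confirmed recipe.
from mlx_lm import convert

def qx86_hi_predicate(path, module, config):
    # path is the dotted module path, e.g. "model.layers.0.self_attn.q_proj"
    if "self_attn" in path or "lm_head" in path or "embed_tokens" in path:
        return {"bits": 8, "group_size": 32}
    return {"bits": 6, "group_size": 32}

convert(
    hf_path="DavidAU/Qwen3-30B-A3B-YOYO-V2-HQ16",
    mlx_path="Qwen3-30B-A3B-YOYO-V2-HQ16-qx86-hi-mlx",
    quantize=True,
    quant_predicate=qx86_hi_predicate,
)
```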

📊 Performance Metrics:

arc_challenge:  0.531
arc_easy:       0.688
boolq:          0.883
hellaswag:      0.684
openbookqa:     0.452
piqa:           0.784
winogrande:     0.644

🧩 Cognitive Interpretation:

✅ Strengths:

  • Strong reasoning: arc_challenge (0.531) and arc_easy (0.688) suggest solid general reasoning ability, especially on ARC, which requires commonsense and multi-step logic.
  • Excellent factual QA: boolq (0.883) is very high, indicating a strong ability to answer yes/no questions grounded in factual knowledge.
  • Good commonsense: hellaswag (0.684) is decent though not top-tier; the model can infer plausible completions but may miss subtle linguistic nuance.
  • Strong contextual reasoning: winogrande (0.644) is above average, indicating decent ability to resolve ambiguous pronouns and other context-dependent references.
  • Moderate reading-based QA: openbookqa (0.452) is surprisingly low; see the weaknesses below.

⚠️ Weaknesses:

  • Low openbookqa is the one red flag: it suggests the model lacks deep reading comprehension or struggles with dense, paragraph-based reasoning, possibly due to quantization loss in the attention layers or a lack of fine-tuning on such tasks.
  • piqa (0.784) is decent but not exceptional: the model handles physical commonsense reasonably well, though perhaps not at the level of top-tier models.

🧠 Cognitive Profile:

  • "Human-like" behavior: per Deckard's design goals, this model exhibits natural, conversational, context-aware responses.
  • Balanced reasoning + QA: excels at factual yes/no questions, commonsense inference, and basic logic puzzles.
  • Not optimized for dense reading comprehension: struggles on openbookqa, which requires deep understanding of paragraphs and inference chains.

🧠 Cognitive Analysis: Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4

🔍 Model Configuration:

  • Same base model: Qwen3-30B-A3B-YOYO-V2
  • Quantization: mxfp4 (microscaling FP4: 4-bit floating-point weights stored in blocks of 32, each block sharing a single power-of-two scale; see the decode sketch below)
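
To make the format concrete, here is an illustrative decode of a single MXFP4 block following the OCP microscaling spec (E2M1 elements, one shared E8M0 scale per 32 values). This is a conceptual sketch, not mlx's internal kernel:

```python
import numpy as np

# The eight representable E2M1 magnitudes; bit 3 of each 4-bit code is the sign.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def decode_mxfp4_block(codes: np.ndarray, scale_e8m0: int) -> np.ndarray:
    """codes: 32 uint8 values in [0, 15]; scale_e8m0: shared exponent byte."""
    sign = np.where(codes & 0x8, -1.0, 1.0)
    magnitude = E2M1[codes & 0x7]
    return sign * magnitude * 2.0 ** (int(scale_e8m0) - 127)  # E8M0 bias = 127

# Example: decode one block with a shared scale of 2**-2
block = np.random.randint(0, 16, size=32, dtype=np.uint8)
print(decode_mxfp4_block(block, 125))
```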

📊 Performance Metrics:

arc_challenge:  0.498
arc_easy:       0.633
boolq:          0.881
hellaswag:      0.673
openbookqa:     0.436
piqa:           0.771
winogrande:     0.635

🧩 Cognitive Interpretation:

✅ Strengths:

  • Strong factual QA: boolq (0.881) is nearly as good as qx86-hi, suggesting FP4 weights remain effective for factual recall.
  • Good commonsense: hellaswag (0.673), slightly lower than qx86-hi but still decent.
  • Decent reasoning: arc_easy (0.633), lower than qx86-hi but still above average.

⚠️ Weaknesses:

  • Lower reasoning: arc_challenge (0.498) is significantly below qx86-hi, suggesting FP4 quantization hurts complex, multi-step reasoning.
  • Worse openbookqa: 0.436, below even qx86-hi's weak score, indicating FP4 is less effective for dense reading comprehension.
  • Lower winogrande: 0.635, under qx86-hi, suggesting FP4 also costs some context-aware reasoning.

🧠 Cognitive Profile:

  • "Efficient" model: designed for a lower memory footprint, with FP4 weights and higher-precision activations.
  • Strong factual QA, but weaker at complex reasoning, commonsense inference, and dense reading comprehension.
  • Less "human-like": likely flatter and less context-aware than qx86-hi, due to quantization artifacts.

🆚 Comparison: qx86-hi vs mxfp4

Metric         qx86-hi  mxfp4
arc_challenge  0.531    0.498
arc_easy       0.688    0.633
boolq          0.883    0.881
hellaswag      0.684    0.673
openbookqa     0.452    0.436
piqa           0.784    0.771
winogrande     0.644    0.635
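
To make the gaps concrete, a few lines of Python (scores copied from the table above) compute the per-benchmark deltas and their mean:

```python
# Scores copied from the table above.
qx86_hi = {"arc_challenge": 0.531, "arc_easy": 0.688, "boolq": 0.883,
           "hellaswag": 0.684, "openbookqa": 0.452, "piqa": 0.784,
           "winogrande": 0.644}
mxfp4   = {"arc_challenge": 0.498, "arc_easy": 0.633, "boolq": 0.881,
           "hellaswag": 0.673, "openbookqa": 0.436, "piqa": 0.771,
           "winogrande": 0.635}

deltas = {k: qx86_hi[k] - mxfp4[k] for k in qx86_hi}
for task, d in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{task:14s} {d:+.3f}")
print(f"mean gap: {sum(deltas.values()) / len(deltas):+.3f}")
```

The mean gap works out to about +0.020 in qx86-hi's favor, with arc_easy and arc_challenge contributing the most.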

📈 Key Takeaways:

  • qx86-hi leads on every metric, especially reasoning (arc_challenge, winogrande) and commonsense (hellaswag).
  • mxfp4 is slightly weaker overall but remains strong on factual QA (boolq).
  • qx86-hi is more "human-like": better at reasoning, inference, and context-aware tasks.
  • mxfp4 is more efficient: a smaller memory footprint, at the cost of some reasoning ability.

🎯 Final Recommendations

For 32GB Mac:

  • ✅ Use Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4
  • Best balance of performance and VRAM efficiency.
  • Strong factual QA, decent commonsense reasoning.
  • Less likely to cause VRAM issues; see the back-of-envelope estimate below.
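
Roughly how do these fit in 32GB? A back-of-envelope sanity check, under stated assumptions (mxfp4 stores 4-bit weights plus one 8-bit scale per 32-weight block, ~4.25 effective bits; qx86 is assumed to average ~6.5 bits once the 8-bit attention paths and group-32 scales are included; KV cache and activations are not counted):

```python
# Back-of-envelope weight-memory estimate; the effective bits-per-weight
# values are assumptions, not measured footprints.
params = 31e9  # reported parameter count

def gb(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

print(f"F16:     ~{gb(16.0):.0f} GB")
print(f"qx86-hi: ~{gb(6.5):.0f} GB  (assumed ~6.5 effective bits)")
print(f"mxfp4:   ~{gb(4.25):.0f} GB  (4-bit weights + shared block scales)")
```

On those assumptions, mxfp4 (~17 GB of weights) fits a 32GB Mac with headroom for context, while qx86-hi (~25 GB) leaves much less room.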

For higher performance (if VRAM allows):

  • ✅ Use Qwen3-30B-A3B-YOYO-V2-HQ16-qx86-hi
  • Superior reasoning and commonsense inference.
  • More "human-like" behavior.

🧭 Conclusion

  • qx86-hi is cognitively superior, especially on reasoning and commonsense tasks, but more demanding on VRAM.
  • mxfp4 is the better fit for a 32GB Mac, offering excellent factual QA at lower VRAM usage.

Both models are strong. Choose based on your use case: reasoning-heavy tasks → qx86-hi; efficiency plus factual QA → mxfp4.

How does the mxfp4 model perform compared to either of the base Qwen models?

Great question. Let's compare Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4 (the microscaling-FP4 quantized model) to the base Qwen3 models, specifically:

  • Qwen3-30B-A3B-Thinking-2507-bf16 (reasoning-focused)
  • Qwen3-Coder-30B-A3B-Instruct-bf16 (coding-focused)

We don't have metrics for the pure base model (Qwen3-30B-A3B-Base), but we do have metrics for the two specialized variants, and since the YOYO-V2 model is a merge of these, we can reasonably infer how the quantized version stacks up.

📊 Performance Comparison

🔹 Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4 (HQ16 source, FP4 quantized)

arc_challenge:  0.498
arc_easy:       0.633
boolq:          0.881
hellaswag:      0.673
openbookqa:     0.436
piqa:           0.771
winogrande:     0.635

🔹 Qwen3-30B-A3B-Thinking-2507-bf16 (reasoning-focused)

arc_challenge:  0.421
arc_easy:       0.448
boolq:          0.682
hellaswag:      0.635
openbookqa:     0.402
piqa:           0.771
winogrande:     0.669

🔹 Qwen3-Coder-30B-A3B-Instruct-bf16 (coding-focused)

arc_challenge:  0.422
arc_easy:       0.537
boolq:          0.879
hellaswag:      0.550
openbookqa:     0.430
piqa:           0.720
winogrande:     0.579

🧠 Cognitive Comparison

✅ Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4 vs Qwen3-Thinking-2507

Metric         YOYO-mxfp4  Thinking  Delta
arc_challenge  0.498       0.421     +7.7 pts
arc_easy       0.633       0.448     +18.5 pts
boolq          0.881       0.682     +19.9 pts
hellaswag      0.673       0.635     +3.8 pts
openbookqa     0.436       0.402     +3.4 pts
piqa           0.771       0.771     tie
winogrande     0.635       0.669     -3.4 pts

🎯 Conclusion: The YOYO-mxfp4 model is clearly ahead of the Thinking variant on nearly every metric, especially reasoning, factual QA, and commonsense inference. The only area where Thinking wins is winogrande (by 0.034), which may reflect its more deliberate, deductive style, but YOYO-mxfp4 is the more balanced generalist.

✅ Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4 vs Qwen3-Coder-30B-A3B-Instruct

Metric         YOYO-mxfp4  Coder   Delta
arc_challenge  0.498       0.422   +7.6 pts
arc_easy       0.633       0.537   +9.6 pts
boolq          0.881       0.879   +0.2 pts
hellaswag      0.673       0.550   +12.3 pts
openbookqa     0.436       0.430   +0.6 pts
piqa           0.771       0.720   +5.1 pts
winogrande     0.635       0.579   +5.6 pts

🎯 Conclusion: The YOYO-mxfp4 model leads on every metric, though boolq is essentially a tie (+0.2 pts). It outperforms the Coder model on reasoning, commonsense inference, and even factual QA, which makes sense: YOYO-V2 is a merge of Thinking + Coder + Base + Instruct, combining their strengths into a more general-purpose model.

🧩 Cognitive Interpretation

🔹 YOYO-mxfp4 vs Thinking:

  • YOYO is more balanced: Thinking excels at logical deduction, but YOYO-mxfp4 does better on commonsense QA (hellaswag), factual knowledge (boolq), and general reasoning (arc).
  • YOYO is more "human-like": it is not just a logic engine; it handles context, inference chains, and subtle language nuance better than Thinking.

🔹 YOYO-mxfp4 vs Coder:

  • YOYO is more general: Coder excels at coding tasks, but YOYO-mxfp4 does better on reasoning, commonsense QA, and even factual QA, suggesting greater versatility.
  • YOYO is not a coding specialist: none of these benchmarks measure code generation, so while YOYO-mxfp4 wins on piqa (0.771 vs 0.720) and winogrande (0.635 vs 0.579), heavy coding remains the specialized Coder model's territory.

📈 Summary: YOYO-mxfp4 vs Base Models

Metric         YOYO-mxfp4  Thinking (base)  Coder (base)
arc_challenge  0.498       0.421            0.422
arc_easy       0.633       0.448            0.537
boolq          0.881       0.682            0.879
hellaswag      0.673       0.635            0.550
openbookqa     0.436       0.402            0.430
piqa           0.771       0.771            0.720
winogrande     0.635       0.669            0.579
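
To verify the claim below mechanically, a short script (scores copied from the summary table above) prints the best model per benchmark:

```python
# Scores copied from the summary table above; list order follows the rows.
tasks = ["arc_challenge", "arc_easy", "boolq", "hellaswag",
         "openbookqa", "piqa", "winogrande"]
scores = {
    "YOYO-mxfp4": [0.498, 0.633, 0.881, 0.673, 0.436, 0.771, 0.635],
    "Thinking":   [0.421, 0.448, 0.682, 0.635, 0.402, 0.771, 0.669],
    "Coder":      [0.422, 0.537, 0.879, 0.550, 0.430, 0.720, 0.579],
}

for i, task in enumerate(tasks):
    best = max(scores, key=lambda m: scores[m][i])  # note: piqa is a tie
    print(f"{task:14s} best: {best} ({scores[best][i]:.3f})")
```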

✅ YOYO-mxfp4 beats both parent models on nearly every metric (winogrande, where Thinking leads, is the lone exception): the merge combines their strengths into a more balanced, general-purpose model with strong reasoning, factual QA, and commonsense inference.

🧠 Cognitive Profile of YOYO-mxfp4 vs Base Models

Aspect                 YOYO-mxfp4                     Thinking (base)   Coder (base)
Reasoning              Strong (arc challenge + easy)  Good, but narrow  Moderate
Commonsense            Excellent (hellaswag)          Good              Weak
Factual QA             Excellent (boolq)              Moderate          Very good
Dense reading          Decent (openbookqa)            Weak              Moderate
Coding                 Not optimized                  Not optimized     Excellent
"Human-like" behavior  Yes: balanced, conversational  More robotic      More technical
VRAM efficiency        High (FP4 quantized)           Moderate          Moderate

🎯 Final Takeaways

YOYO-mxfp4 is a generalist: it is not specialized like Thinking or Coder, yet it outperforms both on nearly every metric.

  • It is the result of a merge, combining reasoning, coding, and instruction-following, which makes it well suited to general-purpose tasks.
  • It is FP4-quantized, so it is efficient and fits smaller hardware like Macs.
  • It is not a coding specialist; if you need heavy coding, stick with Coder.
  • It is not a pure reasoning model; if you need deep logical deduction, Thinking may still have an edge on winogrande-style tasks, but YOYO-mxfp4 is more balanced overall.

📌 Recommendation

Use Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4 for general-purpose work: it is the most balanced, capable model among these variants, excelling at reasoning, commonsense inference, and factual QA while staying memory-efficient.

Use Thinking if you need pure logical deduction, though YOYO-mxfp4 is better overall.

Use Coder for heavy coding tasks, with YOYO-mxfp4 as a strong general fallback.

Reviewed by Qwen3-VL-12B-BX20-F16-qx86-hi-mlx (0.509, 0.662, 0.874, 0.639, 0.416, 0.763, 0.650)

This model Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4-mlx was converted to MLX format from DavidAU/Qwen3-30B-A3B-YOYO-V2-HQ16 using mlx-lm version 0.28.4.

Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

# Load the quantized model and its tokenizer from the Hugging Face Hub.
model, tokenizer = load("Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4-mlx")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is defined.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
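
For interactive use you may prefer streaming output; here is a small sketch using mlx-lm's stream_generate (the max_tokens value is an arbitrary choice):

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("Qwen3-30B-A3B-YOYO-V2-HQ16-mxfp4-mlx")

# Print tokens as they are produced instead of waiting for the full reply.
for chunk in stream_generate(model, tokenizer, prompt="hello", max_tokens=256):
    print(chunk.text, end="", flush=True)
```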