Qwen3-TTS-12Hz-0.6B-Base - MLX fp32
Full-precision (float32) MLX conversion of Qwen/Qwen3-TTS-12Hz-0.6B-Base for Apple Silicon.
Converted using mlx-audio 0.3.2 (from GitHub main, 2026-02-08).
Why fp32?
The existing mlx-community conversions are bf16. On Apple Silicon:
- fp32 and bf16 run at the same speed: the model is memory-bandwidth-bound during decode, not compute-bound
- fp32 produces better audio quality: more precise weights = cleaner output
Note: The 0.6B model produces noticeably lower-quality voice cloning than the 1.7B. For best results, use cr2k2/Qwen3-TTS-12Hz-1.7B-Base-fp32 instead. The 0.6B is faster (RTF 1.8x vs 2.8x, where RTF is wall time divided by audio duration, so lower is faster), but the quality tradeoff is significant for voice cloning.
Benchmark (M5 MacBook Pro, macOS 26.2)
| Model | Wall time | Audio duration | RTF | Quality |
|---|---|---|---|---|
| 1.7B fp32 | ~25s | ~9s | 2.8x | Best |
| 0.6B fp32 (this) | ~15s | ~8s | 1.8x | Noticeably worse |
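The RTF column is consistent with wall time divided by audio duration, so values above 1 mean generation is slower than real time. Below is a minimal sketch of that arithmetic, assuming this definition. `real_time_factor` and `measure_rtf` are hypothetical helpers, not part of mlx-audio, and `synthesize` stands in for any call that returns audio samples.

```python
import time
from typing import Callable, Sequence


def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Wall-clock time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


def measure_rtf(synthesize: Callable[[], Sequence[float]], sample_rate: int) -> float:
    """Time one call to `synthesize` and relate it to the audio it returns."""
    start = time.perf_counter()
    samples = synthesize()
    wall = time.perf_counter() - start
    return real_time_factor(wall, len(samples) / sample_rate)


if __name__ == "__main__":
    # Inputs are the rounded table values, so the results only
    # approximate the RTF column.
    print(f"1.7B: {real_time_factor(25, 9):.2f}x")
    print(f"0.6B: {real_time_factor(15, 8):.2f}x")
```

To time a real run, `measure_rtf` could wrap the generation call from the usage example with the output sample rate of 24000.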
Usage
Requires mlx-audio>=0.3.2 and transformers==5.0.0rc3:
pip install git+https://github.com/Blaizzy/mlx-audio.git transformers==5.0.0rc3
from mlx_audio.tts.utils import load_model
import numpy as np
import soundfile as sf
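# Load the fp32 MLX weights by repo ID.
model = load_model("cr2k2/Qwen3-TTS-12Hz-0.6B-Base-fp32")
# Clone the voice in reference.wav; ref_text is the transcript of that clip.
results = list(model.generate(
    text="Hello, this is a test of voice cloning.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
))
# Join the generated chunks and write a 24 kHz WAV file.
audio = np.concatenate([np.array(r.audio, copy=False) for r in results])
sf.write("output.wav", audio.astype(np.float32), 24000)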
Conversion
Converted from the original HuggingFace weights using mlx-audio 0.3.2 (installed from GitHub main):
pip install git+https://github.com/Blaizzy/mlx-audio.git
python -m mlx_audio.convert --hf-path Qwen/Qwen3-TTS-12Hz-0.6B-Base --mlx-path models/Qwen3-TTS-12Hz-0.6B-Base-fp32 --dtype float32
This model was previously converted with an older mlx-audio version. It was reconverted on 2026-02-08 with 0.3.2 to pick up fixes, including #407 (convert skipping subdirs, voice cloning silence fix) and #444 (0.6B silence fix).
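One way to sanity-check the result of the conversion command is to read the dtype recorded for each tensor in the safetensors headers. The sketch below uses only the standard library and relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header). `safetensors_dtypes` and `summarize` are hypothetical helper names, and the directory is simply the `--mlx-path` used above. Non-weight tensors, if any, may legitimately report other dtypes.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def safetensors_dtypes(path: str) -> Counter:
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length,
        # followed by that many bytes of JSON describing each tensor.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return Counter(entry["dtype"] for entry in header.values())


def summarize(model_dir: str) -> Counter:
    """Sum dtype counts over every .safetensors file under model_dir."""
    total = Counter()
    # rglob also descends into subdirectories of the converted model.
    for file in sorted(Path(model_dir).rglob("*.safetensors")):
        total += safetensors_dtypes(str(file))
    return total


if __name__ == "__main__":
    # Path taken from the --mlx-path argument of the conversion command.
    print(summarize("models/Qwen3-TTS-12Hz-0.6B-Base-fp32"))
```

A float32 conversion would be expected to report its weights under the `F32` key, whereas a bf16 conversion would show `BF16`.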
Model Details
- Base model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
- Format: MLX safetensors (float32)
- Size: ~1.2 GB
- Architecture: Qwen3-TTS dual-track LM with 12.5Hz multi-codebook tokenizer
- Parameters: 0.6B
- Languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Capabilities: Voice cloning (3-second reference), multilingual TTS. Does NOT support instruct/voice design mode.
- License: Apache 2.0
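Because voice cloning is described as working from a 3-second reference, it can help to check a clip's length before passing it as `ref_audio`. The sketch below uses only the standard library and assumes an uncompressed PCM WAV file. Treating 3 seconds as a lower bound is an assumption, since the model details do not say whether it is a minimum. `MIN_REF_SECONDS`, `wav_duration_seconds`, and `is_long_enough` are hypothetical names.

```python
import wave

MIN_REF_SECONDS = 3.0  # assumed threshold, taken from "3-second reference"


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_long_enough(path: str, minimum: float = MIN_REF_SECONDS) -> bool:
    """True if the clip lasts at least `minimum` seconds."""
    return wav_duration_seconds(path) >= minimum


if __name__ == "__main__":
    import os
    import tempfile

    # Demo on a generated 4-second silent clip (24 kHz, 16-bit mono).
    path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    with wave.open(path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(24000)
        out.writeframes(b"\x00\x00" * 24000 * 4)
    print(is_long_enough(path))
```

In practice the check would be run on the same file that is passed as `ref_audio`, for example `is_long_enough("reference.wav")`.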