Qwen3-TTS-12Hz-0.6B-Base (MLX fp32)

Full-precision (float32) MLX conversion of Qwen/Qwen3-TTS-12Hz-0.6B-Base for Apple Silicon.

Converted using mlx-audio 0.3.2 (from GitHub main, 2026-02-08).

Why fp32?

The existing mlx-community conversions are bf16. On Apple Silicon:

  • fp32 and bf16 run at the same speed: during decode the model is memory-bandwidth-bound, not compute-bound
  • fp32 produces better audio quality: higher-precision weights yield cleaner output

Note: The 0.6B model produces noticeably lower voice-cloning quality than the 1.7B. For best results, use cr2k2/Qwen3-TTS-12Hz-1.7B-Base-fp32 instead. The 0.6B is faster (RTF 1.8x vs 2.8x), but the quality tradeoff is significant for voice cloning.

Benchmark (M5 MacBook Pro, macOS 26.2)

Model              Wall time   Audio duration   RTF    Quality
1.7B fp32          ~25s        ~9s              2.8x   Best
0.6B fp32 (this)   ~15s        ~8s              1.8x   Noticeably worse
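RTF here is wall-clock time divided by generated audio duration, so lower is better and 1.0x would be real time. A quick sanity check (the ~25s/~9s/~15s/~8s timings above are rounded, so recomputed values only approximately match the table):

```python
def rtf(wall_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor as used in the table above:
    seconds of wall-clock time per second of generated audio."""
    return wall_time_s / audio_duration_s

print(rtf(25, 9))  # 1.7B fp32: roughly 2.8
print(rtf(15, 8))  # 0.6B fp32: 1.875 from the rounded inputs, reported as 1.8x
```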

Usage

Requires mlx-audio>=0.3.2 and transformers==5.0.0rc3:

pip install git+https://github.com/Blaizzy/mlx-audio.git transformers==5.0.0rc3

from mlx_audio.tts.utils import load_model
import numpy as np
import soundfile as sf

model = load_model("cr2k2/Qwen3-TTS-12Hz-0.6B-Base-fp32")

results = list(model.generate(
    text="Hello, this is a test of voice cloning.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
))

audio = np.concatenate([np.array(r.audio, copy=False) for r in results])
sf.write("output.wav", audio.astype(np.float32), 24000)
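For long inputs you can write audio incrementally instead of concatenating every chunk in memory first. A minimal sketch using only the standard-library wave module (16-bit PCM rather than the float32 written above); the synthetic arrays stand in for the r.audio chunks yielded by model.generate():

```python
import wave
import numpy as np

def stream_chunks_to_wav(path, chunks, sr=24000):
    """Write float32 audio chunks one at a time to a mono 16-bit PCM
    WAV file, avoiding one large np.concatenate for long outputs."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(sr)
        for chunk in chunks:
            pcm = np.clip(np.asarray(chunk, dtype=np.float32), -1.0, 1.0)
            w.writeframes((pcm * 32767).astype(np.int16).tobytes())

# Synthetic stand-ins for the r.audio arrays from model.generate()
demo_chunks = [np.zeros(2400, dtype=np.float32),
               np.full(2400, 0.1, dtype=np.float32)]
stream_chunks_to_wav("output_streamed.wav", demo_chunks)

with wave.open("output_streamed.wav", "rb") as w:
    print(w.getnframes(), w.getframerate())  # 4800 24000
```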

Conversion

Converted from the original HuggingFace weights using mlx-audio 0.3.2 (installed from GitHub main):

pip install git+https://github.com/Blaizzy/mlx-audio.git
python -m mlx_audio.convert --hf-path Qwen/Qwen3-TTS-12Hz-0.6B-Base --mlx-path models/Qwen3-TTS-12Hz-0.6B-Base-fp32 --dtype float32

Previously converted with an older mlx-audio version. Reconverted on 2026-02-08 with 0.3.2 to pick up fixes including #407 (convert skipping subdirs, voice cloning silence fix) and #444 (0.6B silence fix).
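To confirm a conversion actually landed in float32, you can inspect the safetensors header directly: the file format is an 8-byte little-endian length prefix followed by JSON metadata carrying a dtype per tensor. The helper below is a generic sketch (not part of mlx-audio), demonstrated on a tiny synthetic file; run against a real converted model.safetensors, every tensor should report "F32":

```python
import json
import struct

def tensor_dtypes(path):
    """Return {tensor_name: dtype} from a .safetensors header
    without loading any tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return {k: v["dtype"] for k, v in header.items() if k != "__metadata__"}

# Build a minimal synthetic file just to demonstrate the header layout
meta = {"layer.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
blob = json.dumps(meta).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00" * 16)

print(tensor_dtypes("demo.safetensors"))  # {'layer.weight': 'F32'}
```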

Model Details

  • Base model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
  • Format: MLX safetensors (float32)
  • Size: ~1.2 GB
  • Architecture: Qwen3-TTS dual-track LM with 12.5Hz multi-codebook tokenizer
  • Parameters: 0.6B
  • Languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Capabilities: Voice cloning (3-second reference), multilingual TTS. Does NOT support instruct/voice design mode.
  • License: Apache 2.0