Qwen3-TTS-12Hz-0.6B-Base (MLX fp32)

Full-precision (float32) MLX conversion of Qwen/Qwen3-TTS-12Hz-0.6B-Base for Apple Silicon.

Converted using mlx-audio 0.3.2 (from GitHub main, 2026-02-08).

Why fp32?

The existing mlx-community conversions are bf16. On Apple Silicon:

  • fp32 and bf16 run at the same speed: during decode the model is memory-bandwidth-bound, not compute-bound
  • fp32 produces better audio quality: higher-precision weights yield cleaner output

Note: The 0.6B model produces noticeably lower voice-cloning quality than the 1.7B. For best results, use cr2k2/Qwen3-TTS-12Hz-1.7B-Base-fp32 instead. The 0.6B is faster (RTF 1.8x vs 2.8x), but the quality tradeoff is significant for voice cloning.

Benchmark (M5 MacBook Pro, macOS 26.2)

Model              Wall time   Audio duration   RTF    Quality
1.7B fp32          ~25s        ~9s              2.8x   Best
0.6B fp32 (this)   ~15s        ~8s              1.8x   Noticeably worse
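RTF here is wall-clock time divided by generated audio duration, so lower is better and 1.0x would be real time. A quick sanity check (the ~25s/~9s/~15s/~8s timings above are rounded, so recomputed values only approximately match the table):

```python
def rtf(wall_time_s: float, audio_duration_s: float) -> float:
    """Real-time factor as used in the table above:
    seconds of wall-clock time per second of generated audio."""
    return wall_time_s / audio_duration_s

print(rtf(25, 9))  # 1.7B fp32: roughly 2.8
print(rtf(15, 8))  # 0.6B fp32: 1.875 from the rounded inputs, reported as 1.8x
```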

Usage

Requires mlx-audio>=0.3.2 and transformers==5.0.0rc3:

pip install git+https://github.com/Blaizzy/mlx-audio.git transformers==5.0.0rc3

from mlx_audio.tts.utils import load_model
import numpy as np
import soundfile as sf

model = load_model("cr2k2/Qwen3-TTS-12Hz-0.6B-Base-fp32")

results = list(model.generate(
    text="Hello, this is a test of voice cloning.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio.",
))

audio = np.concatenate([np.array(r.audio, copy=False) for r in results])
sf.write("output.wav", audio.astype(np.float32), 24000)
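For long inputs you can write audio incrementally instead of concatenating every chunk in memory first. A minimal sketch using only the standard-library wave module (16-bit PCM rather than the float32 written above); the synthetic arrays stand in for the r.audio chunks yielded by model.generate():

```python
import wave
import numpy as np

def stream_chunks_to_wav(path, chunks, sr=24000):
    """Write float32 audio chunks one at a time to a mono 16-bit PCM
    WAV file, avoiding one large np.concatenate for long outputs."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(sr)
        for chunk in chunks:
            pcm = np.clip(np.asarray(chunk, dtype=np.float32), -1.0, 1.0)
            w.writeframes((pcm * 32767).astype(np.int16).tobytes())

# Synthetic stand-ins for the r.audio arrays from model.generate()
demo_chunks = [np.zeros(2400, dtype=np.float32),
               np.full(2400, 0.1, dtype=np.float32)]
stream_chunks_to_wav("output_streamed.wav", demo_chunks)

with wave.open("output_streamed.wav", "rb") as w:
    print(w.getnframes(), w.getframerate())  # 4800 24000
```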

Conversion

Converted from the original HuggingFace weights using mlx-audio 0.3.2 (installed from GitHub main):

pip install git+https://github.com/Blaizzy/mlx-audio.git
python -m mlx_audio.convert --hf-path Qwen/Qwen3-TTS-12Hz-0.6B-Base --mlx-path models/Qwen3-TTS-12Hz-0.6B-Base-fp32 --dtype float32

Previously converted with an older mlx-audio version. Reconverted on 2026-02-08 with 0.3.2 to pick up fixes including #407 (convert skipping subdirs, voice cloning silence fix) and #444 (0.6B silence fix).
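To confirm a conversion actually landed in float32, you can inspect the safetensors header directly: the file format is an 8-byte little-endian length prefix followed by JSON metadata carrying a dtype per tensor. The helper below is a generic sketch (not part of mlx-audio), demonstrated on a tiny synthetic file; run against a real converted model.safetensors, every tensor should report "F32":

```python
import json
import struct

def tensor_dtypes(path):
    """Return {tensor_name: dtype} from a .safetensors header
    without loading any tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return {k: v["dtype"] for k, v in header.items() if k != "__metadata__"}

# Build a minimal synthetic file just to demonstrate the header layout
meta = {"layer.weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
blob = json.dumps(meta).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00" * 16)

print(tensor_dtypes("demo.safetensors"))  # {'layer.weight': 'F32'}
```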

Model Details

  • Base model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
  • Format: MLX safetensors (float32)
  • Size: ~1.2 GB
  • Architecture: Qwen3-TTS dual-track LM with 12.5Hz multi-codebook tokenizer
  • Parameters: 0.6B
  • Languages: English, Chinese, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Capabilities: Voice cloning (3-second reference), multilingual TTS. Does NOT support instruct/voice design mode.
  • License: Apache 2.0