# Chatterbox Finnish – ONNX / WebGPU
Finnish fine-tuned Chatterbox TTS exported to ONNX for browser inference via WebGPU + transformers.js / ONNX Runtime Web.
Based on the Finnish-NLP/Chatterbox-Finnish fine-tune.
## Repository Contents
```
onnx/
  language_model.onnx          # Finnish fine-tuned T3 LM (fp32, ~2 GB)
  language_model.onnx_data
  finnish_cond_emb.bin         # Precomputed Finnish conditioning embedding [1, 34, 1024]
  finnish_cond_emb_meta.json   # Metadata for the above
scripts/
  compare_onnx_vs_pytorch.py   # Browser-worker simulator + PyTorch parity checker
  browser_pipeline_sim.py      # Python mirror of the browser WebGPU worker
  analyze_audio.py             # MOS (Gemini) + WER (Groq Whisper) quality evaluator
  export_finnish_embeddings.py # Exports embed_tokens.onnx + voice_encoder.onnx
samples/
  reference_finnish.wav        # Reference voice for zero-shot Finnish TTS
```
Base model components (from onnx-community/chatterbox-multilingual-ONNX):

- `onnx/speech_encoder.onnx` – reference audio → prompt tokens + speaker embeddings
- `onnx/embed_tokens.onnx` – text token embeddings
- `onnx/conditional_decoder.onnx` – speech tokens → waveform (S3Gen flow + HiFiGAN)
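Of the files above, `finnish_cond_emb.bin` is a raw float32 buffer. A minimal loader sketch; the default shape mirrors the listing above, but in practice it should be confirmed against `finnish_cond_emb_meta.json`, and the function name is ours:

```python
import numpy as np

def load_cond_emb(bin_path, shape=(1, 34, 1024)):
    """Load the precomputed Finnish conditioning embedding from a raw
    float32 .bin file. Confirm shape/dtype via finnish_cond_emb_meta.json."""
    emb = np.fromfile(bin_path, dtype=np.float32)
    return emb.reshape(shape)
```

The resulting array is fed to the language model as the `cond_emb` prefix described in the pipeline below.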
## Pipeline Architecture
```
Reference audio (24 kHz)
        │
        ▼
speech_encoder ──────────► prompt_tokens     [1, N]
        │                  speaker_emb       [1, 192]
        │                  speaker_features  [1, T, 80] (mel)
        │
finnish_cond_emb.bin ────► cond_emb [1, 34, 1024] (Finnish voice conditioning)
        │
        ▼
Text ──► EnTokenizer ──► embed_tokens ──► text_embeds [1, T, 1024]
        │
        ▼
Language Model (Finnish T3) with CFG, cfg_weight=0.5
  Conditioned:   [cond_emb | text_embeds | BOS] → speech tokens
  Unconditioned: [cond_emb | zeros | BOS]       → speech tokens
  Final logits = cond + 0.5 * (cond - uncond)
        │
        ▼  generated speech tokens [1, N_gen]
        │
        ├── prepend prompt_tokens ──► [prompt_tokens | generated]  [1, N_prompt + N_gen]
        │
        ▼
conditional_decoder (speaker_emb, speaker_features)
        │
        ▼
waveform (24 kHz)
```
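The CFG step in the diagram runs the language model on two rows (conditioned and unconditioned) and mixes the resulting logits. A NumPy sketch of just the mixing formula, with `cond_logits` / `uncond_logits` standing in for the two rows of the LM output:

```python
import numpy as np

def cfg_mix(cond_logits, uncond_logits, cfg_weight=0.5):
    """Classifier-free-guidance mix from the pipeline:
    final = cond + cfg_weight * (cond - uncond).
    cfg_weight=0 reduces to the conditioned logits alone."""
    return cond_logits + cfg_weight * (cond_logits - uncond_logits)
```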
**Critical:** The `conditional_decoder` uses a CosyVoice-style flow model. You must prepend `prompt_tokens` (from `speech_encoder`) to the generated tokens before calling the decoder. Without this, you get ~0.18 s of noise instead of speech.
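The prepend step itself is a plain concatenation along the token axis; a sketch, where both arrays are `[1, N]` int64 as in the diagram and the function name is ours:

```python
import numpy as np

def build_decoder_tokens(prompt_tokens, generated_tokens):
    """Prepend the speech_encoder prompt tokens to the LM's generated
    speech tokens before calling conditional_decoder."""
    return np.concatenate([prompt_tokens, generated_tokens], axis=1)
```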
## Python Usage
Install dependencies:
```bash
pip install onnxruntime-gpu huggingface_hub librosa soundfile numpy

# Plus Chatterbox-Finnish for EnTokenizer:
# git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
```
Run the browser-worker simulator (mirrors the WebGPU worker logic):
```bash
# Full parity check (PyTorch vs ONNX)
LD_LIBRARY_PATH=/path/to/cudnn/lib python scripts/compare_onnx_vs_pytorch.py --mode parity

# ONNX only (skip PyTorch)
python scripts/compare_onnx_vs_pytorch.py --mode parity --skip-pytorch

# Component-level debug
python scripts/compare_onnx_vs_pytorch.py --mode debug
```
Key generation parameters (matching `inference_example.py`):

| Parameter | Value |
|---|---|
| `repetition_penalty` | 1.2 |
| `temperature` | 0.8 |
| `exaggeration` | 0.6 |
| `cfg_weight` | 0.5 |
| `min_p` | 0.05 |
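How these parameters interact in a single sampling step can be sketched as follows. This is an illustrative NumPy implementation, not the reference code; `exaggeration` is a conditioning-side control in Chatterbox and so does not appear in the sampling step:

```python
import numpy as np

def sample_next_token(logits, generated, repetition_penalty=1.2,
                      temperature=0.8, min_p=0.05, rng=None):
    """One sampling step: penalize already-generated tokens, scale by
    temperature, apply min-p filtering, then sample from the remainder."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()
    # Repetition penalty: shrink positive logits, amplify negative ones.
    for tok in set(int(t) for t in generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    logits /= temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # min-p: drop tokens whose probability is below min_p * max probability.
    probs[probs < min_p * probs.max()] = 0.0
    probs /= probs.sum()
    return int(rng.choice(probs.size, p=probs))
```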
## Quality Results
Evaluated with Gemini 2.5 Flash (MOS) and Groq Whisper (WER):
| Metric | PyTorch | ONNX |
|---|---|---|
| MOS (1–5) | 3.0 | 3.0 |
| WER | 20% | 20% |
| MFCC cosine | – | 0.996 |
| Duration | 5.98 s | ~5.6 s |
Waveforms differ (mel cosine ~0.65–0.75) due to stochastic sampling and a different conditioning voice, but the phonetic content is nearly identical (MFCC cosine = 0.996).
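One plausible reading of the MFCC-cosine metric is the cosine similarity of time-averaged MFCC matrices; whether `analyze_audio.py` computes it this way is an assumption, and the MFCC extraction itself (e.g. `librosa.feature.mfcc`) is left to the caller:

```python
import numpy as np

def mean_feature_cosine(feats_a, feats_b):
    """Cosine similarity of time-averaged feature matrices [n_coeff, T]
    (e.g. MFCCs); averaging makes it robust to differing clip lengths."""
    a, b = feats_a.mean(axis=1), feats_b.mean(axis=1)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```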
## Known Limitations
- **Fixed conditioning voice:** `finnish_cond_emb.bin` was computed from a specific reference recording using the Finnish `cond_enc` weights. Custom reference audio changes speaker identity (via `speaker_emb` + `speaker_features`) but not the T3 conditioning. A `finnish_cond_enc.onnx` export would fix this; see `scripts/export_finnish_embeddings.py`.
- **Watermarking skipped:** The PyTorch model applies Perth watermarking; the ONNX pipeline does not.
- **Minimum token length:** The decoder requires the combined `[prompt_tokens | generated_tokens]` sequence to be at least ~150 tokens; otherwise an Expand error occurs.
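A cheap guard against the Expand error is to validate the combined token length before invoking the decoder; a sketch, where the 150 threshold is the approximate figure above and the function name is ours:

```python
def check_decoder_tokens(tokens, min_len=150):
    """Raise early if [prompt_tokens | generated_tokens] is too short for
    conditional_decoder. tokens has shape [1, N]."""
    n = tokens.shape[1]
    if n < min_len:
        raise ValueError(
            f"decoder input has {n} tokens, needs ~{min_len}+; "
            "generate more speech tokens or use a longer reference prompt")
    return tokens
```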
## Related
- Finnish-NLP/Chatterbox-Finnish – PyTorch fine-tune + training code
- onnx-community/chatterbox-multilingual-ONNX – base ONNX components
- ResembleAI/chatterbox – original model