# Chatterbox Finnish – ONNX / WebGPU
Finnish fine-tuned Chatterbox TTS exported to ONNX for browser inference via WebGPU + transformers.js / ONNX Runtime Web.
Based on the Finnish-NLP/Chatterbox-Finnish fine-tune.
## Repository Contents
```
onnx/
  language_model.onnx          # Finnish fine-tuned T3 LM (fp32, ~2 GB)
  language_model.onnx_data
  finnish_cond_emb.bin         # Precomputed Finnish conditioning embedding [1, 34, 1024]
  finnish_cond_emb_meta.json   # Metadata for the above
scripts/
  compare_onnx_vs_pytorch.py   # Browser-worker simulator + PyTorch parity checker
  browser_pipeline_sim.py      # Python mirror of the browser WebGPU worker
  analyze_audio.py             # MOS (Gemini) + WER (Groq Whisper) quality evaluator
  export_finnish_embeddings.py # Exports embed_tokens.onnx + voice_encoder.onnx
samples/
  reference_finnish.wav        # Reference voice for zero-shot Finnish TTS
```
Base model components (from onnx-community/chatterbox-multilingual-ONNX):

- `onnx/speech_encoder.onnx` – reference audio → prompt tokens + speaker embeddings
- `onnx/embed_tokens.onnx` – text token embeddings
- `onnx/conditional_decoder.onnx` – speech tokens → waveform (S3Gen flow + HiFiGAN)
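Of the files above, `finnish_cond_emb.bin` is a raw float32 buffer. A minimal loader sketch; the default shape mirrors the listing above, but in practice it should be confirmed against `finnish_cond_emb_meta.json`, and the function name is ours:

```python
import numpy as np

def load_cond_emb(bin_path, shape=(1, 34, 1024)):
    """Load the precomputed Finnish conditioning embedding from a raw
    float32 .bin file. Confirm shape/dtype via finnish_cond_emb_meta.json."""
    emb = np.fromfile(bin_path, dtype=np.float32)
    return emb.reshape(shape)
```

The resulting array is fed to the language model as the `cond_emb` prefix described in the pipeline below.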
## Pipeline Architecture
```
Reference audio (24 kHz)
        │
        ▼
speech_encoder ──────────► prompt_tokens     [1, N]
        │                  speaker_emb       [1, 192]
        │                  speaker_features  [1, T, 80] (mel)
        │
finnish_cond_emb.bin ────► cond_emb [1, 34, 1024] (Finnish voice conditioning)
        │
        ▼
Text ──► EnTokenizer ──► embed_tokens ──► text_embeds [1, T, 1024]
        │
        ▼
Language Model (Finnish T3) with CFG, cfg_weight=0.5
  Conditioned:   [cond_emb | text_embeds | BOS] → speech tokens
  Unconditioned: [cond_emb | zeros | BOS]       → speech tokens
  Final logits = cond + 0.5 * (cond - uncond)
        │
        ▼  generated speech tokens [1, N_gen]
        │
        ├── prepend prompt_tokens ──► [prompt_tokens | generated]  [1, N_prompt + N_gen]
        │
        ▼
conditional_decoder (speaker_emb, speaker_features)
        │
        ▼
waveform (24 kHz)
```
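The CFG step in the diagram runs the language model on two rows (conditioned and unconditioned) and mixes the resulting logits. A NumPy sketch of just the mixing formula, with `cond_logits` / `uncond_logits` standing in for the two rows of the LM output:

```python
import numpy as np

def cfg_mix(cond_logits, uncond_logits, cfg_weight=0.5):
    """Classifier-free-guidance mix from the pipeline:
    final = cond + cfg_weight * (cond - uncond).
    cfg_weight=0 reduces to the conditioned logits alone."""
    return cond_logits + cfg_weight * (cond_logits - uncond_logits)
```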
**Critical:** The `conditional_decoder` uses a CosyVoice-style flow model. You must prepend `prompt_tokens` (from `speech_encoder`) to the generated tokens before calling the decoder. Without this, you get ~0.18 s of noise instead of speech.
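The prepend step itself is a plain concatenation along the token axis; a sketch, where both arrays are `[1, N]` int64 as in the diagram and the function name is ours:

```python
import numpy as np

def build_decoder_tokens(prompt_tokens, generated_tokens):
    """Prepend the speech_encoder prompt tokens to the LM's generated
    speech tokens before calling conditional_decoder."""
    return np.concatenate([prompt_tokens, generated_tokens], axis=1)
```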
## Python Usage
Install dependencies:
```bash
pip install onnxruntime-gpu huggingface_hub librosa soundfile numpy

# Plus Chatterbox-Finnish for EnTokenizer:
# git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish
```
Run the browser-worker simulator (mirrors the WebGPU worker logic):
```bash
# Full parity check (PyTorch vs ONNX)
LD_LIBRARY_PATH=/path/to/cudnn/lib python scripts/compare_onnx_vs_pytorch.py --mode parity

# ONNX only (skip PyTorch)
python scripts/compare_onnx_vs_pytorch.py --mode parity --skip-pytorch

# Component-level debug
python scripts/compare_onnx_vs_pytorch.py --mode debug
```
Key generation parameters (matching `inference_example.py`):

| Parameter | Value |
|---|---|
| `repetition_penalty` | 1.2 |
| `temperature` | 0.8 |
| `exaggeration` | 0.6 |
| `cfg_weight` | 0.5 |
| `min_p` | 0.05 |
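How these parameters interact in a single sampling step can be sketched as follows. This is an illustrative NumPy implementation, not the reference code; `exaggeration` is a conditioning-side control in Chatterbox and so does not appear in the sampling step:

```python
import numpy as np

def sample_next_token(logits, generated, repetition_penalty=1.2,
                      temperature=0.8, min_p=0.05, rng=None):
    """One sampling step: penalize already-generated tokens, scale by
    temperature, apply min-p filtering, then sample from the remainder."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64).copy()
    # Repetition penalty: shrink positive logits, amplify negative ones.
    for tok in set(int(t) for t in generated):
        if logits[tok] > 0:
            logits[tok] /= repetition_penalty
        else:
            logits[tok] *= repetition_penalty
    logits /= temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # min-p: drop tokens whose probability is below min_p * max probability.
    probs[probs < min_p * probs.max()] = 0.0
    probs /= probs.sum()
    return int(rng.choice(probs.size, p=probs))
```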
## Quality Results
Evaluated with Gemini 2.5 Flash (MOS) and Groq Whisper (WER):
| Metric | PyTorch | ONNX |
|---|---|---|
| MOS (1–5) | 3.0 | 3.0 |
| WER | 20% | 20% |
| MFCC cosine | – | 0.996 |
| Duration | 5.98 s | ~5.6 s |
Waveforms differ (mel cosine ~0.65–0.75) due to stochastic sampling and a different conditioning voice, but the phonetic content is nearly identical (MFCC cosine = 0.996).
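One plausible reading of the MFCC-cosine metric is the cosine similarity of time-averaged MFCC matrices; whether `analyze_audio.py` computes it this way is an assumption, and the MFCC extraction itself (e.g. `librosa.feature.mfcc`) is left to the caller:

```python
import numpy as np

def mean_feature_cosine(feats_a, feats_b):
    """Cosine similarity of time-averaged feature matrices [n_coeff, T]
    (e.g. MFCCs); averaging makes it robust to differing clip lengths."""
    a, b = feats_a.mean(axis=1), feats_b.mean(axis=1)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```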
## Known Limitations
- **Fixed conditioning voice:** `finnish_cond_emb.bin` was computed from a specific reference recording using the Finnish `cond_enc` weights. Custom reference audio changes speaker identity (via `speaker_emb` + `speaker_features`) but not the T3 conditioning. A `finnish_cond_enc.onnx` export would fix this; see `scripts/export_finnish_embeddings.py`.
- **Watermarking skipped:** The PyTorch model applies Perth watermarking; the ONNX pipeline does not.
- **Minimum token length:** The decoder requires the combined `[prompt_tokens | generated_tokens]` sequence to be at least ~150 tokens; otherwise an Expand error occurs.
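A cheap guard against the Expand error is to validate the combined token length before invoking the decoder; a sketch, where the 150 threshold is the approximate figure above and the function name is ours:

```python
def check_decoder_tokens(tokens, min_len=150):
    """Raise early if [prompt_tokens | generated_tokens] is too short for
    conditional_decoder. tokens has shape [1, N]."""
    n = tokens.shape[1]
    if n < min_len:
        raise ValueError(
            f"decoder input has {n} tokens, needs ~{min_len}+; "
            "generate more speech tokens or use a longer reference prompt")
    return tokens
```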
## Related
- Finnish-NLP/Chatterbox-Finnish – PyTorch fine-tune + training code
- onnx-community/chatterbox-multilingual-ONNX – base ONNX components
- ResembleAI/chatterbox – original model