Chatterbox Finnish – ONNX / WebGPU

Finnish fine-tuned Chatterbox TTS exported to ONNX for browser inference via WebGPU + transformers.js / ONNX Runtime Web.

Based on the Finnish-NLP/Chatterbox-Finnish fine-tune.


Repository Contents

onnx/
  language_model.onnx          # Finnish fine-tuned T3 LM (fp32, ~2 GB)
  language_model.onnx_data
  finnish_cond_emb.bin         # Precomputed Finnish conditioning embedding [1, 34, 1024]
  finnish_cond_emb_meta.json   # Metadata for the above

scripts/
  compare_onnx_vs_pytorch.py   # Browser-worker simulator + PyTorch parity checker
  browser_pipeline_sim.py      # Python mirror of the browser WebGPU worker
  analyze_audio.py             # MOS (Gemini) + WER (Groq Whisper) quality evaluator
  export_finnish_embeddings.py # Exports embed_tokens.onnx + voice_encoder.onnx

samples/
  reference_finnish.wav        # Reference voice for zero-shot Finnish TTS
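The conditioning embedding ships as a raw fp32 buffer. A minimal loading sketch (the [1, 34, 1024] shape comes from the metadata file above; the function name and relative path are illustrative):

```python
import numpy as np

def load_cond_emb(path: str = "onnx/finnish_cond_emb.bin") -> np.ndarray:
    """Load the precomputed Finnish conditioning embedding as [1, 34, 1024] fp32."""
    cond = np.fromfile(path, dtype=np.float32)
    assert cond.size == 34 * 1024, f"unexpected buffer size: {cond.size}"
    return cond.reshape(1, 34, 1024)
```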

Base model components (from onnx-community/chatterbox-multilingual-ONNX):

  • onnx/speech_encoder.onnx – reference audio → prompt tokens + speaker embeddings
  • onnx/embed_tokens.onnx – text token embeddings
  • onnx/conditional_decoder.onnx – speech tokens → waveform (S3Gen flow + HiFiGAN)

Pipeline Architecture

Reference audio (24 kHz)
    │
    ▼
speech_encoder ──────────► prompt_tokens [1, N]
    │                      speaker_emb  [1, 192]
    │                      speaker_features [1, T, 80]  (mel)
    │
finnish_cond_emb.bin ────► cond_emb [1, 34, 1024]  (Finnish voice conditioning)
    │
    ▼
Text ──► EnTokenizer ──► embed_tokens ──► text_embeds [1, T, 1024]
    │
    ▼
Language Model (Finnish T3) – CFG, cfg_weight=0.5
  Conditioned:   [cond_emb | text_embeds | BOS] → speech tokens
  Unconditioned: [cond_emb | zeros       | BOS] → speech tokens
  Final logits = cond + 0.5 * (cond - uncond)
    │
    ▼ generated speech tokens [1, N_gen]
    │
    ├── prepend prompt_tokens ──► [prompt_tokens | generated] [1, N_prompt + N_gen]
    │
    ▼
conditional_decoder (speaker_emb, speaker_features)
    │
    ▼
waveform (24 kHz)
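The CFG step in the diagram combines the conditioned and unconditioned logit streams at each decode step. A minimal sketch (the function name is illustrative, not from the scripts):

```python
import numpy as np

def apply_cfg(cond_logits: np.ndarray, uncond_logits: np.ndarray,
              cfg_weight: float = 0.5) -> np.ndarray:
    """Classifier-free guidance: push conditioned logits away from
    the unconditioned ones by cfg_weight."""
    return cond_logits + cfg_weight * (cond_logits - uncond_logits)
```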

Critical: The conditional_decoder uses a CosyVoice-style flow model. You must prepend prompt_tokens (from speech_encoder) to the generated tokens before calling the decoder. Without this, you get ~0.18 s of noise instead of speech.
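The required concatenation can be sketched as follows (token tensors are [1, N] as in the diagram; the function name is illustrative):

```python
import numpy as np

def build_decoder_tokens(prompt_tokens: np.ndarray,
                         generated_tokens: np.ndarray) -> np.ndarray:
    """Prepend the speech_encoder prompt tokens to the LM output.

    The CosyVoice-style flow decoder treats the prompt tokens as its
    acoustic context; feeding the generated tokens alone yields noise.
    """
    return np.concatenate([prompt_tokens, generated_tokens], axis=1)
```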


Python Usage

Install dependencies:

pip install onnxruntime-gpu huggingface_hub librosa soundfile numpy
# Plus Chatterbox-Finnish for EnTokenizer:
# git clone https://huggingface.co/Finnish-NLP/Chatterbox-Finnish

Run the browser-worker simulator (mirrors the WebGPU worker logic):

# Full parity check (PyTorch vs ONNX)
LD_LIBRARY_PATH=/path/to/cudnn/lib python scripts/compare_onnx_vs_pytorch.py --mode parity

# ONNX only (skip PyTorch)
python scripts/compare_onnx_vs_pytorch.py --mode parity --skip-pytorch

# Component-level debug
python scripts/compare_onnx_vs_pytorch.py --mode debug

Key generation parameters (matching inference_example.py):

Parameter            Value
repetition_penalty   1.2
temperature          0.8
exaggeration         0.6
cfg_weight           0.5
min_p                0.05
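A NumPy sketch of how the sampling parameters interact at each decode step (a simplified stand-in, not the exact sampler in the scripts; exaggeration and cfg_weight act on the conditioning and logits upstream, so they are omitted here):

```python
import numpy as np

def sample_token(logits, history, repetition_penalty=1.2,
                 temperature=0.8, min_p=0.05, rng=None):
    """Sample the next speech token from raw LM logits."""
    rng = rng or np.random.default_rng()
    logits = np.array(logits, dtype=np.float64)
    # Repetition penalty: dampen tokens that were already generated
    for t in set(history):
        logits[t] = logits[t] / repetition_penalty if logits[t] > 0 \
            else logits[t] * repetition_penalty
    # Temperature-scaled softmax (shift by max for numerical stability)
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    # min_p filter: drop tokens below min_p * top probability
    probs[probs < min_p * probs.max()] = 0.0
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```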

Quality Results

Evaluated with Gemini 2.5 Flash (MOS) and Groq Whisper (WER):

                PyTorch   ONNX
MOS (1–5)       3.0       3.0
WER             20%       20%
MFCC cosine     –         0.996
Duration        5.98 s    ~5.6 s

Waveforms differ (mel cosine ~0.65–0.75) because of stochastic sampling and the fixed conditioning voice, but the phonetic content is nearly identical (MFCC cosine = 0.996).
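One plausible way such an MFCC score is computed is cosine similarity between time-averaged MFCC matrices; a numpy-only sketch (the actual script extracts MFCCs with an audio library such as librosa, which is assumed here):

```python
import numpy as np

def mfcc_cosine(mfcc_a: np.ndarray, mfcc_b: np.ndarray) -> float:
    """Cosine similarity between time-averaged MFCC matrices [n_mfcc, T].

    Averaging over time makes the comparison robust to the small
    duration mismatch between the two outputs."""
    a = mfcc_a.mean(axis=1)
    b = mfcc_b.mean(axis=1)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```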


Known Limitations

  1. Fixed conditioning voice: finnish_cond_emb.bin was computed from a specific reference recording using the Finnish cond_enc weights. Custom reference audio changes speaker identity (via speaker_emb + speaker_features) but not the T3 conditioning. A finnish_cond_enc.onnx export would fix this – see scripts/export_finnish_embeddings.py.

  2. Watermarking skipped: The PyTorch model applies Perth watermarking; the ONNX pipeline does not.

  3. Minimum token length: The decoder requires the combined [prompt_tokens | generated_tokens] to be at least ~150 tokens, otherwise an Expand error occurs.
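A guard for limitation 3, using the ~150-token floor stated above (the constant is approximate and empirically observed; the function name is illustrative):

```python
import numpy as np

MIN_DECODER_TOKENS = 150  # approximate floor; shorter inputs trigger the Expand error

def check_decoder_length(prompt_tokens: np.ndarray,
                         generated_tokens: np.ndarray) -> int:
    """Validate the combined token count before calling conditional_decoder."""
    total = prompt_tokens.shape[1] + generated_tokens.shape[1]
    if total < MIN_DECODER_TOKENS:
        raise ValueError(
            f"conditional_decoder needs ~{MIN_DECODER_TOKENS}+ tokens, got {total}; "
            "generate more speech tokens or use a longer reference prompt")
    return total
```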


Related

  • Finnish-NLP/Chatterbox-Finnish – the Finnish fine-tune this export is based on
  • onnx-community/chatterbox-multilingual-ONNX – source of the base model components
