Finnish ASR: Canary-v2 Finetuning & Progress
This document provides a high-level overview of our Finnish ASR finetuning process, model architecture, and current progress for the Data Science team.
Project Overview
Our goal is to adapt NVIDIA's Canary-v2 (a 1-billion parameter multilingual model) for high-accuracy Finnish Automatic Speech Recognition (ASR). We leverage four diverse datasets to ensure robustness across different domains and speaking styles.
Model Architecture
Canary-v2 is an Attention-Encoder-Decoder (AED) model that utilizes the Fast-Conformer architecture. This design allows for efficient processing of long audio sequences while maintaining high accuracy.
```mermaid
graph TD
    A[Audio Input] -->|Preprocessing| B[Mel Spectrogram]

    subgraph TrainingBlock [Finetuned Components]
        direction TB
        subgraph Encoder [Encoder: Acoustic Modeling]
            C1[Convolutional Subsampling] -->|Downsample| C2[Conformer Blocks]
            C2 -->|Latent Features| C_Out[Acoustic Latents]
        end
        subgraph Decoder [Decoder: Language Modeling]
            D1[Masked Self-Attention] --> D2[Cross-Attention]
            D2 --> D3[Feed Forward]
            D3 --> D_Out[Text Generation]
        end
    end

    B -->|Input| C1
    P[Input Prompts:<br/>Lang, Task, PnC] -->|Conditioning| D1
    C_Out -->|Acoustic Context| D2
    D_Out -->|Output| E[Finnish Text]

    %% Styling
    style TrainingBlock fill:#f0f7ff,stroke:#0052cc,stroke-width:3px,stroke-dasharray: 5 5
    style A fill:#ffffff,stroke:#333,stroke-width:2px
    style B fill:#ffffff,stroke:#333,stroke-width:2px
    style P fill:#ffffff,stroke:#333,stroke-width:2px
    style E fill:#e6ffed,stroke:#28a745,stroke-width:2px
    style Encoder fill:#ffffff,stroke:#0052cc,stroke-width:1px
    style Decoder fill:#ffffff,stroke:#0052cc,stroke-width:1px
```
Component Roles & Finetuning:
- Highlighted Area (Blue Dashed Box): This represents the core weights of the Canary-v2 model. During our finetuning, we update the parameters in both the Encoder and Decoder to specifically recognize Finnish phonemes and grammar.
- Mel Spectrogram: The "Vision" stage. It turns raw audio waves into a structured 2D representation of sound frequencies over time.
- Fast-Conformer Encoder: The "Acoustic Processor." We finetuned this to understand the characteristic sounds of Finnish (such as doubled vowels and geminate consonants).
- Input Prompts: The "Context Injector." These are the same color as other inputs because they are part of the model's standard input pipeline, telling it: "Act as a Finnish ASR system."
- Attention-Decoder: The "Linguistic Brain." We finetuned this to map the Finnish sounds from the encoder into grammatically correct Finnish text, guided by the prompts.
Finetuning Workflow
Our pipeline is fully automated, from data ingestion to multi-dataset evaluation.
```mermaid
graph TD
    subgraph DataPrep [Data Preparation]
        D1[CSS10 Finnish] --> P[Unified Processing Script]
        D2[FLEURS Finnish] --> P
        D3[VoxPopuli Finnish] --> P
        D4[Common Voice v24] --> P
        P --> M1[train_manifest.json]
        P --> M2[eval_fleurs.json]
        P --> M3[eval_common_voice.json]
        P --> M4[eval_css10.json]
        P --> M5[eval_voxpopuli.json]
    end

    subgraph Training [Canary-v2 Finetuning]
        M1 --> T[NVIDIA NeMo Trainer]
        CM[nvidia/canary-1b-v2] --> T
        T --> CK[Model Checkpoints]
        M2 & M3 & M4 & M5 --> V[Multi-Validation]
        V --> W[WandB Tracking]
    end

    subgraph Inference [Post-Processing]
        CK --> Inf[Inference]
        Inf --> K[KenLM/NGPU-LM Integration]
        K --> R[Final ASR Output]
    end
```
Datasets
We use a balanced mix of datasets to cover various audio qualities and transcript styles:
| Dataset | Source | Characteristics |
|---|---|---|
| FLEURS | Google | High-quality, diverse speakers (benchmark) |
| Common Voice | Mozilla | Crowdsourced, varied quality and accents |
| CSS10 | LibriVox audiobooks | Clean, high-quality single-speaker audio |
| VoxPopuli | EU Parliament | European Parliament speeches (formal) |
Training Data Analysis
This section documents the composition and length distribution of our training data (from RASMUS/canary-finnish-asr-data, accessed 2026-02-26).
Dataset Summary
| Dataset | Samples | Mean Duration | Max Duration | Total Hours |
|---|---|---|---|---|
| Common Voice v24 | 9,086 | 4.5s | 10.5s | 11.2h |
| VoxPopuli | 8,164 | 10.1s | 50.5s | 23.0h |
| CSS10 | 3,226 | 7.7s | 20.2s | 6.9h |
| FLEURS | 2,704 | 11.7s | 43.2s | 8.8h |
| TOTAL | 23,180 | 7.8s | 50.5s | ~50h |
Duration Distribution (Training Set)
```
0–5s   : 33.3% (7,725 samples)  █████████████████████████████████
5–10s  : 43.7% (10,139 samples) ████████████████████████████████████████████
10–15s : 15.0% (3,473 samples)  ███████████████
15–20s :  5.4% (1,241 samples)  █████
20–30s :  2.4% (562 samples)    ██
>30s   :  0.2% (40 samples)
```
Key insight: 77% of training samples are shorter than 10 seconds. The model has very little exposure to longer audio segments (only 0.2% are >30s). This has direct implications for long-form inference stability.
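The distribution above can be recomputed from the manifest itself; a minimal sketch, assuming a NeMo-style JSONL manifest with a `duration` field in seconds (the bucket edges mirror the histogram above):

```python
import json
from collections import Counter

# Bucket edges mirror the histogram above (seconds).
BUCKETS = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 30), (30, float("inf"))]

def duration_histogram(manifest_lines):
    """Count manifest entries per duration bucket."""
    counts = Counter()
    for line in manifest_lines:
        d = json.loads(line)["duration"]
        for lo, hi in BUCKETS:
            if lo <= d < hi:
                counts[(lo, hi)] += 1
                break
    return counts

# Usage: duration_histogram(open("manifests/train_manifest.json"))
```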
Evaluation Set Durations
| Eval Set | Samples | Mean Duration | Max Duration |
|---|---|---|---|
| FLEURS | 918 | 13.0s | 33.7s |
| Common Voice | 1,554 | 5.1s | 10.5s |
| CSS10 | 170 | 7.5s | 10.2s |
| VoxPopuli | 430 | 10.6s | 47.5s |
Number Handling Analysis
Live Inference Results: Base vs Finetuned (2026-02-26)
We ran both models on 5 FLEURS test samples to determine each model's number output style.
| # | Scenario | Reference | Base Canary-v2 | Our Finetuned |
|---|---|---|---|---|
| 1 | Spoken "sata" (hundred) | yli sata vuotta | yli 100 vuotta ✅ | yli 100 vuotta ✅ |
| 2 | Spoken "seitsemäntoista" (17) | surmaten seitsemäntoista henkeä | surmaten 17 henkeä ✅ | surmaten seitsemäntoista henkeä ❌ |
| 3 | Digits in reference (15, 2011, 2017) | 15 metriä... 2011... 2017 | Correct ✅ | Correct ✅ |
| 4 | Abbreviation "jKr." (AD) | 400 jKr. | 400 jälkeen Kristuksen | 400 jälkeen Kristuksen |
| 5 | Range "25–30" (en-dash U+2013) | 25–30 vuodella | 25-30 vuodella (ASCII hyphen) | 25 ⁇ 30 vuodella ❌ (UNK token) |
Key findings:
- **Base model outputs digits.** When the speaker says "sata" (hundred) or "seitsemäntoista" (seventeen), the base Canary-v2 outputs `100` and `17`. This is NVIDIA's built-in text normalisation: Canary always outputs digit form for numbers.
- **Finetuning introduced inconsistency.** Our finetuning partially reversed this: for `seitsemäntoista` the finetuned model now outputs the written word (because FLEURS training transcripts used written-out numbers), but still outputs `100` for `sata`. This inconsistency is worse than either consistent policy.
- **En-dash produces a UNK token in the finetuned model.** The character `–` (U+2013 en-dash) in `25–30` causes the finetuned model to emit `⁇` (SentencePiece UNK). The base model degrades gracefully to an ASCII hyphen, `25-30`. This is a regression introduced by finetuning, likely because the en-dash was absent or inconsistently encoded in our training data.
- **Abbreviations are expanded by both models.** `jKr.` → `jälkeen Kristuksen` in both; this is base-model behaviour, not a finetuning artifact.
Policy Decision
We want digit output (not written-out Finnish number words). The base model's behaviour is correct here. The finetuned model regressed on consistency because our FLEURS training transcripts used written-out numbers.
Training Data Issues Found
- Only 2.5% (578 / 23,180) of training samples contain digit characters at all.
- FLEURS transcripts use written-out numbers (`sata vuotta`) while VoxPopuli and Common Voice use digits, giving the model a conflicting signal.
- En-dash (`–`, U+2013) may be absent or mis-encoded in training manifests, causing UNK tokens at inference time.
Action Plan: Numbers & UNK Token
Step 1 – Normalise training transcripts to digit form

Run a pre-processing pass on `train_manifest.json` before the next training run:
- Use the Python library `num2words` with locale `fi` to map written-out Finnish numbers back to digits (num2words generates the written form for each digit value, so its output can be inverted into a word→digit table): e.g. `sata` → `100`, `seitsemäntoista` → `17`.
- OR (simpler / safer): replace the FLEURS transcripts in the manifest with the dataset's raw reference texts, which already contain digits (FLEURS provides both `raw_transcription` and `transcription` columns; we currently train on the variant with written-out numbers).
- Target: all numeric quantities consistently in digit form across all four datasets.
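The lookup-table route can be sketched as follows. The handful of word→digit pairs below are illustrative only; a real pass would generate the full table programmatically (e.g. by inverting `num2words` output for `lang="fi"`) and handle compound number words:

```python
# Illustrative inverse lookup from written Finnish numbers to digit strings.
# Only a few hand-picked pairs are shown; a real table would be generated
# programmatically and handle compounds and inflections.
WORD_TO_DIGIT = {
    "sata": "100",
    "seitsemäntoista": "17",
    "tuhat": "1000",
}

def words_to_digits(text: str) -> str:
    # Token-by-token substitution; punctuation handling is omitted here.
    return " ".join(WORD_TO_DIGIT.get(tok, tok) for tok in text.split())

words_to_digits("yli sata vuotta")  # "yli 100 vuotta"
```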
Step 2 – Fix en-dash encoding (ROOT CAUSE CONFIRMED)
Confirmed via tokenizer inspection (2026-02-26):
```python
m.tokenizer.text_to_ids("25–30")  # → [16053, 1125, 1128, 0, 1126, 1123]
                                  #    id 0 = UNK for the en-dash!
m.tokenizer.text_to_ids("25-30")  # → [16053, 1125, 1128, 16107, 1126, 1123]
                                  #    ASCII hyphen tokenises correctly
```
- En-dash `–` (U+2013) and em-dash `—` (U+2014) are NOT in the CanaryBPETokenizer vocabulary (both map to UNK id 0).
- Training data contains 85 entries with en-dash (83 FLEURS, 2 Common Voice). During training, the en-dash in the TARGET text was encoded as UNK, so the model learned to produce UNK for the corresponding speech sounds.
- Fix: replace all `–` and `—` with ASCII hyphen `-` in all training transcripts before the next training run. This is a one-line preprocessing step.
```python
# In manifest preprocessing:
text = text.replace('\u2013', '-').replace('\u2014', '-')
```
Step 3 – Re-evaluate after normalisation

After normalising transcripts, re-run the 5-sample live inference test to verify:
- `sata vuotta` audio → model outputs `100 vuotta`
- `seitsemäntoista` audio → model outputs `17`
- `25–30` audio → model outputs `25-30` or `25–30` (no UNK)
Long-Form Audio: Root Cause Analysis
Our test file `moo.wav` is 30 minutes (1,800s) of continuous Finnish speech. This reveals a core gap vs. our finetuned Whisper model.
How Canary-v2 Handles Long Audio (Natively)
- NVIDIA's Canary-v2 uses dynamic chunking with 1-second overlap between chunks.
- This is automatically triggered for audio longer than 40 seconds.
- The model was pre-trained on a 1.7M-hour multilingual corpus with this chunking strategy baked in.
Our Current Approach (`inference_vad.py`)
- Silero VAD detects speech segments.
- Segments are merged into chunks up to `chunk_len` seconds (default: 15s).
- Each chunk is transcribed independently; there is no shared context between chunks.
Root Causes of Degradation on Long-Form
| Issue | Detail |
|---|---|
| Training length mismatch | 77% of fine-tuning data is <10s. Inference chunks at 15s are longer than nearly all training examples, creating distribution shift. |
| No cross-chunk context | Each 15s chunk is transcribed in isolation. Canary's attention decoder has no memory of previous chunks, so topic/speaker continuity is lost at boundaries. |
| VAD vs. native chunking | Our VAD-based approach differs from Canary's built-in dynamic chunking. The model was not fine-tuned with this chunking strategy. |
| Repetition / hallucination | At chunk boundaries with silence or music, the decoder can loop. This is worsened when segments are near the edge of the model's training length distribution. |
| No overlap | Without overlap between chunks, words at segment boundaries can be dropped or doubled. |
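The overlap fix from the last row can be sketched as fixed-length chunking with a small overlap so boundary words appear in two adjacent chunks. This is illustrative only (times in seconds; the transcript stitching/deduplication at the overlaps, which `inference_vad.py` would need, is not shown):

```python
# Illustrative: split a long recording into chunks that overlap slightly,
# so words at chunk boundaries are seen by two chunks instead of being cut.
def make_chunks(total_dur: float, chunk_len: float = 10.0, overlap: float = 0.5):
    chunks, start = [], 0.0
    step = chunk_len - overlap
    while start < total_dur:
        chunks.append((start, min(start + chunk_len, total_dur)))
        start += step
    return chunks

make_chunks(25.0)  # [(0.0, 10.0), (9.5, 19.5), (19.0, 25.0)]
```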
Comparison: Canary vs. Our Finetuned Whisper on Long-Form
Whisper was explicitly designed and trained for long-form audio with:
- Sliding window inference with overlap
- Previous-chunk text as conditioning (prompt-based context)
- Timestamps for alignment
Canary's AED architecture does not use previous-chunk text as input, making long-form continuity fundamentally harder to achieve without careful chunk overlap and stitching.
Progress & Results
Current Status: Model Released & Repository Consolidated
We have successfully completed the finetuning, KenLM integration, and repository consolidation phases. The model and its associated language models are now hosted on Hugging Face at RASMUS/Finnish-ASR-Canary-v2.
- Infrastructure: Finetuned on RTX 6000 PRO Blackwell (96 GB VRAM) on Verda.com platform in Finland.
- Model Suite: Acoustic model + 3 KenLM variants (1M, 2M, 5M sentences).
- Best Performance (with KenLM 5M):
- FLEURS: 7.86% WER
- Common Voice: 4.70% WER
- CSS10: 7.07% WER
- VoxPopuli: 11.65% WER
- Deployment: Integrated Silero VAD-based inference for robust long-form audio processing.
Next Steps:
- Long-form Tuning: Reduce the default `chunk_len` to 8–10s (closer to the training distribution median) and add 0.5–1s overlap between chunks to reduce boundary artifacts.
- Data Quality Audit: Fix 28 confirmed corrupted Common Voice entries where raw TSV metadata (client ID hashes, gender tags) was accidentally written into the `text` field. Audit VoxPopuli for missing capitalisation (all-lowercase transcripts despite `pnc: yes`).
- Number Handling: Add Finnish-specific training data with numeric content. Consider TTS-synthesised samples covering phone numbers, years, statistics, and measurements (digit and written-out forms paired).
- Long-form Training Data: Incorporate longer audio segments (TTS synthetic long-form audio: `fbc_monolog_processed`, parliament data) into the training manifest to shift the duration distribution toward 15–30s.
- KenLM Refinement: Re-train KenLM with high-quality punctuated text; the current LM was trained on mixed-quality data.
- Advanced Evaluation: Implement CER evaluation on non-normalised test sets to better capture punctuation/casing accuracy.
- Repetition Penalty: Explore repetition penalty in decoding if chunk-level loops persist after chunk length tuning.
- Real-world Evaluation: Benchmark on diverse long-form samples (podcasts, meetings, call-centre audio).
Action Plan: Next Training Run
This section details the concrete steps for the next finetuning iteration, based on the root-cause analysis above.
Priority 1 – Fix Training Data (before re-training)
1a. Normalise numbers to digit form (Gemini Flash)
Finnish written-out numbers in FLEURS transcripts cause the finetuned model to output inconsistent number forms. We will use the Gemini Flash API to convert all training transcripts in a single batch pass:
```python
# Pseudocode – run once on train_manifest.json before the next training run
import os
import json
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

SYSTEM_PROMPT = """You are a Finnish text normalizer.
Convert any written-out Finnish numbers, ordinals, or number words in the text to digit form.
Examples:
  "yli sata vuotta" → "yli 100 vuotta"
  "seitsemäntoista henkeä" → "17 henkeä"
  "vuonna tuhat yhdeksänsataa" → "vuonna 1900"
Keep all other text exactly as-is. Return only the modified text, nothing else."""

entries = []
with open('manifests/train_manifest.json') as f:
    for line in f:
        d = json.loads(line)
        response = model.generate_content(f"{SYSTEM_PROMPT}\n\n{d['text']}")
        d['text'] = response.text.strip()
        entries.append(d)

with open('manifests/train_manifest_normalised.json', 'w') as f:
    for e in entries:
        f.write(json.dumps(e, ensure_ascii=False) + '\n')
```
Cost estimate: 23,180 entries × ~50 tokens each ≈ 1.2M input tokens. At Gemini Flash pricing ($0.075 per 1M input tokens), that is under $0.10 total.
1b. Fix en-dash UNK token (confirmed root cause)
The en-dash `–` (U+2013) is NOT in the tokenizer vocabulary; it maps to UNK (id 0). Replace it with an ASCII hyphen before training:

```python
# Add to the manifest preprocessing step
text = text.replace('\u2013', '-').replace('\u2014', '-')
```

This affects 85 entries in `train_manifest.json` (83 FLEURS, 2 Common Voice).
1c. Fix 28 corrupted Common Voice entries
Replace entries where the `text` field contains raw TSV metadata (tabs plus `client_id` hashes). Strip everything after the first tab character.
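A minimal sketch of the repair, assuming the manifest is JSONL with one entry per line and that corrupted rows carry the leaked TSV columns after the real transcript, as described above:

```python
import json

def fix_tsv_contamination(line: str) -> dict:
    """Repair a manifest entry whose text field contains leaked TSV columns."""
    entry = json.loads(line)
    # Corrupted rows look like "real transcript\t<client_id hash>\t<gender>...":
    # keep only the text before the first tab.
    entry["text"] = entry["text"].split("\t", 1)[0].rstrip()
    return entry
```

Clean entries pass through unchanged, so the fix can be applied to the whole manifest.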
Priority 2 – Add Long-Form Training Data
TTS Long-Form Dataset: RASMUS/canary_asr_finetune_tts_long_data
| Property | Value |
|---|---|
| Size | 8.0 GB zip |
| Format | FLAC audio + JSONL manifest |
| Mean duration | 16.5s (vs 7.8s in current data) |
| Median duration | 15.9s |
| Max duration | 25.0s |
| Content | Finnish speech: lectures, podcasts, YouTube |
| Segments >20s | ~25% |
This dataset directly addresses the training length mismatch. Adding it will shift the duration distribution from a mean of 7.8s toward ~10–12s and significantly increase the proportion of 15–25s segments that match inference chunk lengths.
Integration plan:
```shell
# Download the dataset
curl -L -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/datasets/RASMUS/canary_asr_finetune_tts_long_data/resolve/main/canary_dataset.zip" \
  -o /workspace/data/tts_long_data.zip

# Extract
unzip /workspace/data/tts_long_data.zip -d /workspace/data/tts_long_data/

# Apply number normalisation and dash fix to canary_manifest.jsonl,
# then merge with the existing train_manifest_normalised.json
```
After applying number normalisation and dash fixes to the new manifest, concatenate it with the existing training set. Expected combined size: ~23,180 + N (estimate 5,000–20,000+ entries depending on total dataset size).
Priority 3 – Inference Tuning (without re-training)
Even before re-training, we can improve `moo.wav` performance by adjusting `inference_vad.py`:
| Parameter | Current | Recommended |
|---|---|---|
| `chunk_len` | 15s | 8–10s (match training median of 7.8s) |
| chunk overlap | 0s | 0.5s (reduce boundary word drops) |
| `alpha` (KenLM) | 0.2 | Try 0.1–0.15 (current may over-constrain the decoder) |
Round 2: Data Pipeline & Splits
This section documents the data preparation methodology for Round 2 finetuning, including all new eval sets, the TTS integration, and the final manifest composition.
Overview of Changes vs Round 1
| Item | Round 1 | Round 2 |
|---|---|---|
| Base model | `canary-1b-v2.nemo` | `canary-1b-v2.nemo` (fresh start) |
| Training samples | 23,180 | 28,858 |
| Training hours | ~50h | 75.6h |
| Mean duration | 7.8s | 9.4s |
| Max duration allowed | 20.0s | 30.0s |
| Transcripts normalised | No | Yes (digits, dashes fixed) |
| Eval sets | 4 | 6 |
Step 1 – Transcript Normalisation (`normalize_manifests.py`)
All training transcripts were cleaned in two layers:
Deterministic fixes (no API call needed):
- En-dash `–` (U+2013) and em-dash `—` (U+2014) → ASCII hyphen `-` (fixes the UNK token regression)
- Corrupted Common Voice entries (raw TSV metadata in the `text` field) → strip everything after the first tab

Gemini 2.5 Flash API calls (2,586 of 23,180 entries needed conversion):
- Pre-filtered with a Finnish number-word regex so only entries that actually contain written numbers are sent to the API (cost: ~$0.62)
- Written Finnish numbers converted to digit form: `sata vuotta` → `100 vuotta`, `seitsemäntoista` → `17`
- Explicit DO NOT CONVERT rules: ordinals (`ensimmäinen`, `toinen`), superlative constructions (`yksi tärkeimmistä`), and `toinen` as "another/other"
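The pre-filter can be sketched as a simple stem regex. The stem list below is illustrative, not the exact list used by `normalize_manifests.py`; a prefilter may over-match (e.g. `sata` inside `satama`), which is acceptable since false positives only cost one extra API call:

```python
import re

# Illustrative stem list for Finnish number words; not the exact production list.
NUMBER_STEMS = [
    "nolla", "yksi", "kaksi", "kolme", "neljä", "viisi", "kuusi",
    "seitsemän", "kahdeksan", "yhdeksän", "kymmenen", "toista",
    "sata", "sataa", "tuhat", "tuhatta", "miljoona",
]
NUMBER_RE = re.compile(r"\b(" + "|".join(NUMBER_STEMS) + r")", re.IGNORECASE)

def needs_normalisation(text: str) -> bool:
    """True if the transcript may contain a written-out number."""
    return NUMBER_RE.search(text) is not None
```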
Step 2 – TTS Long-Form Data Integration
Downloaded `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB, 6,365 entries, mean 16.4s).
Aligned to NeMo training format:
- Paths rewritten to relative style: `data/tts_long_data/audio/(unknown)`
- Fields mapped: `language` → `source_lang`/`target_lang`, `task: "transcription"` → `taskname: "asr"`, added `pnc: "yes"`
- Same Gemini normalisation pass applied (888 entries converted)
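The field mapping can be sketched as follows; the input field names (`language`, `task`) are assumptions based on the description above, not the exact TTS manifest schema:

```python
# Sketch of the TTS→NeMo manifest field alignment described above.
def to_nemo_entry(tts_entry: dict) -> dict:
    entry = dict(tts_entry)        # don't mutate the input
    lang = entry.pop("language", "fi")
    entry.pop("task", None)        # was "transcription"
    entry["source_lang"] = lang
    entry["target_lang"] = lang
    entry["taskname"] = "asr"
    entry["pnc"] = "yes"
    return entry
```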
Step 3 – Eval Set Construction (TTS Data)
The 6,365 normalised TTS entries were split into train / eval / long-form-test:
```
All TTS entries (6,365)
│
├── Long-form pool (>20s): 1,501 entries
│   ├── eval_long_form (sampled): 200 entries ← random.seed(42) shuffle → first 200
│   └── Returned to training pool: 1,301 entries
│
└── Medium pool (10–20s): 4,864 entries
    ├── eval_tts (10% hold-out): 487 entries ← stratified by duration bucket
    └── tts_train: 4,377 entries
```
Why `eval_long_form` = 200 entries? The original 1,501 long-form entries (>20s) total ~9.4 hours of audio, far too long to run as a validation set every epoch. At `batch_size=32` on a single GPU, each validation pass over 1,501 entries takes ~25 minutes, adding 2.5h per epoch. 200 entries (≈75 minutes of audio) provide a representative sample of the long-form distribution at reasonable cost: ~4 minutes of eval time per epoch.

`eval_tts` construction: 487 entries were held out from the 10–20s duration range (a 10% stratified sample). This tests the model's ability to handle medium-length audio and is separate from the original 4 eval sets.
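The split logic above can be sketched as below (the duration-bucket stratification used for `eval_tts` is omitted for brevity; entries are dicts with a `duration` field in seconds):

```python
import random

def split_tts(entries):
    """Split TTS entries into long-form eval, training overflow, and medium splits."""
    long_form = [e for e in entries if e["duration"] > 20.0]
    medium = [e for e in entries if 10.0 <= e["duration"] <= 20.0]

    random.seed(42)
    random.shuffle(long_form)
    eval_long_form = long_form[:200]     # held-out long-form eval set
    train_overflow = long_form[200:]     # returned to the training pool

    n_eval = len(medium) // 10           # ~10% hold-out (stratification omitted)
    eval_tts = medium[:n_eval]
    tts_train = medium[n_eval:]
    return eval_long_form, train_overflow, eval_tts, tts_train
```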
Step 4 – Combined Training Manifest
Final `train_manifest_combined.jsonl` composition:
| Source | Entries | Notes |
|---|---|---|
| Original train (normalised) | 23,180 | Digits + dash fix applied |
| TTS train (10โ20s) | 4,377 | Synthesised long-form speech |
| Long-form overflow | 1,301 | >20s entries not selected for eval_long_form |
| Total | 28,858 | Mean 9.4s, 75.6h |
Final Eval Sets (Round 2)
| Set | File | Entries | Mean Duration | Purpose |
|---|---|---|---|---|
| `eval_fleurs` | `eval_fleurs.json` | 918 | 13.0s | Primary benchmark (monitored for checkpointing) |
| `eval_common_voice` | `eval_common_voice.json` | 1,554 | 5.1s | Crowdsourced quality |
| `eval_css10` | `eval_css10.json` | 170 | 7.5s | Clean single-speaker |
| `eval_voxpopuli` | `eval_voxpopuli.json` | 430 | 10.6s | Formal/parliament speech |
| `eval_tts` | `eval_tts.jsonl` | 487 | 14.5s | Medium-length TTS (new) |
| `eval_long_form` | `eval_long_form.jsonl` | 200 | 22.5s | Long-form >20s sample (new) |
Checkpoint monitoring: `val_wer` tracks FLEURS (the first validation set). All 6 WERs are logged independently to WandB.
Round 2 Training Config
File: `configs/canary_finetune_finnish_v2.yaml`

Key settings:
- `init_from_nemo_model`: `/workspace/Finnish-ASR-Canary-v2/models/canary-1b-v2.nemo` (fresh start from base)
- `max_duration`: 30.0s (up from 20.0s, to include TTS segments up to 25s)
- `max_steps`: 18,000 (scaled: 28,858 / 32 ≈ 902 steps/epoch × 20 epochs ≈ 18,040)
- `lr`: 1e-5, `WarmupAnnealing`, 500 warmup steps
- `precision`: bf16, single GPU, `strategy: auto`
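The `max_steps` scaling above can be sanity-checked in a couple of lines:

```python
import math

# Sanity-check the max_steps scaling quoted in the config notes.
train_samples = 28_858
batch_size = 32
epochs = 20

steps_per_epoch = math.ceil(train_samples / batch_size)   # 902
total_steps = steps_per_epoch * epochs                    # 18040
```

18,040 is then rounded down to the configured `max_steps` of 18,000.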
Workflow Status Details
1. Data Preparation - DONE
- Identify and inventory all 4 datasets
- Create unified processing script (`scripts/prepare_all_manifests.py`)
- Run `scripts/prepare_all_manifests.py` on the devcontainer
- Verify manifest sample counts and audio file integrity
2. Configuration Setup - DONE
- Create Hydra training config (`configs/canary_finetune_finnish.yaml`)
- Configure multi-validation with 4 eval datasets
- Checkpoint monitors the primary eval set (FLEURS) via `val_wer`
- All 4 eval WERs logged independently to WandB
3. Training - DONE
- Run finetuning via `run_training.sh`
- Monitor per-dataset WER in WandB
4. KenLM / NGPU-LM Language Model Integration - DONE
- Install KenLM tools (`install_beamsearch_decoders.sh`)
- Gather Finnish text (ASR transcripts + Wikipedia + mc4)
- Train 3 variants of KenLM (1M, 2M, 5M sentences)
- Evaluate with LM fusion on all 4 test sets
5. Repository & Long-Form Inference - IN PROGRESS
- Consolidate README and model metadata for Hugging Face release
- Upload model checkpoints and KenLM bundles to HF Hub
- Implement Silero VAD-based chunking for long-form audio (`inference_vad.py`)
- Root-cause analysis of long-form degradation vs. Whisper (see above)
- Reduce `chunk_len` to 8–10s and add chunk overlap (Current Focus)
- Optimize `alpha` for stability on `moo.wav` (30 min test file)
6. Data Quality & Advanced Evaluation - PARTIALLY DONE
- Fix 28 corrupted Common Voice manifest entries (raw TSV data in the text field) – done in the normalisation pass.
- Fix en-dash/em-dash UNK token regression – done in the normalisation pass.
- Audit VoxPopuli transcripts for all-lowercase entries (capitalisation missing).
- Re-train KenLM with high-quality punctuated text.
- Evaluate CER on non-normalized test sets.
7. Number Normalisation & UNK Token Fix - DONE
- Replace en-dash `–` and em-dash `—` with ASCII hyphen `-` in all training manifests (85 train + 70 TTS entries fixed).
- Use Gemini 2.5 Flash to normalise written-out Finnish numbers to digit form (2,586 API calls across train + TTS).
- Re-evaluate on the 5-sample number test set after Round 2 training to verify consistency.
8. Long-Form Data Expansion - DONE
- Download `RASMUS/canary_asr_finetune_tts_long_data` (4.8 GB zip, 6,365 entries, mean 16.4s).
- Align the TTS manifest to NeMo training format and integrate it into the combined training manifest.
- Round 2 training configured and ready to launch (see the Round 2 section above).
- Benchmark the Round 2 model against Round 1 and the finetuned Whisper on `moo.wav`.
NeMo Environment Setup
This section documents the exact steps to set up a working NeMo inference/training environment, including the fixes required for the `nvcr.io/nvidia/pytorch:25.01-py3` container.
Installation (from scratch on the `pytorch:25.01-py3` base image)
```shell
# 1. Clone the HF model repo (contains NeMo source with patches applied)
#    Skip LFS to avoid downloading the 3.6 GB model during clone
GIT_LFS_SKIP_SMUDGE=1 git clone \
  "https://user:${HF_TOKEN}@huggingface.co/RASMUS/Finnish-ASR-Canary-v2" \
  /workspace/Finnish-ASR-Canary-v2

# 2. Install NeMo in editable mode from the patched source
cd /workspace/Finnish-ASR-Canary-v2/NeMo
pip install -e ".[asr]"

# 3. Install pinned dependencies
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' kaldialign wandb
```
Required Compatibility Fixes
The `pytorch:25.01-py3` container ships with packages that conflict with NeMo 2.8.0rc0:
```shell
# Fix 1: Pin lightning to the version NeMo requires (<=2.4.0).
# The container ships lightning 2.4.0, but pip may upgrade it; pin it back.
pip install "lightning==2.4.0" "pytorch-lightning==2.4.0"

# Fix 2: Remove incompatible torchvision.
# The container's torchvision (0.20.0a0) was built against torch 2.6.0a0 (the
# original container torch), but NeMo's install upgrades torch to ~2.10.
# torchvision then fails on import and blocks NeMo. ASR does not need torchvision.
pip uninstall -y torchvision
```
Downloading the Finetuned Model
```shell
# Download the finetuned acoustic model (3.6 GB)
curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/canary-finnish.nemo" \
  -o /workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo

# KenLM models are also in LFS – download the 5M variant (best WER):
curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/kenlm_5M.nemo" \
  -o /workspace/Finnish-ASR-Canary-v2/kenlm_5M.nemo
```
Quick Inference Smoke Test
```python
import warnings; warnings.filterwarnings('ignore')
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.restore_from(
    '/workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo',
    map_location='cuda'
)
model.eval()

results = model.transcribe(
    audio=['path/to/audio.wav'],
    task='asr', source_lang='fi', target_lang='fi', pnc='yes'
)
print(results[0].text)
```
Loading the Base Model (for comparison)
```python
# Downloads ~3.6 GB on first run, cached in ~/.cache/huggingface/
model_base = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2", map_location='cuda')
```
Progress Log
- 2026-01-11: Initial project setup.
- 2026-02-08: Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice).
- 2026-02-10: Finetuning complete. Epoch 11 reached `val_wer=0.1258` on FLEURS.
- 2026-02-13: Mermaid diagrams and project documentation for the DS team.
- 2026-02-18: KenLM benchmarks finished. Consolidated repository structure. Applied NeMo patches for inference stability.
- 2026-02-20: Model released. `Finnish-ASR-Canary-v2` published on HF. Implemented the VAD-based inference pipeline. Currently tuning for long-form stability on `moo.wav` with various `alpha` settings (0.0–0.4 tested).
- 2026-02-26: Root-cause analysis complete. Investigated the long-form gap vs. Whisper and number handling. Key findings: (1) 77% of training data is <10s, creating distribution shift at inference chunk lengths; (2) no cross-chunk context in Canary's AED architecture; (3) only 2.5% of training samples contain digit characters, so numbers are a known weak point; (4) 28 corrupted Common Voice entries found (TSV metadata in the text field); (5) `moo.wav` test file confirmed as 30 minutes. Action plan: shorten `chunk_len`, add chunk overlap, fix data corruption, and plan a long-form training data expansion round.
- 2026-02-26: Live number inference + tokenizer audit completed. Ran base Canary-v2 vs. the finetuned model on 5 FLEURS samples. Confirmed: (1) the base model always outputs digits (`100`, `17`); (2) the finetuned model regressed to mixed output, sometimes written words, sometimes digits, due to inconsistent training transcripts; (3) en-dash (`–`) produces the UNK token `⁇` in the finetuned model, while the base model degrades gracefully to an ASCII hyphen. Policy decision: standardise on digit output and fix en-dash encoding in training manifests before the next training run. NeMo environment setup documented (with fixes for the `torchvision` and `lightning` version conflicts). TTS long-form dataset (`canary_asr_finetune_tts_long_data`, 8 GB, mean 16.5s/segment) identified as the key data source for the next training run. Action plan for the next run: (1) normalise numbers to digits via the Gemini Flash API, (2) fix en-dash → ASCII hyphen, (3) fix 28 corrupted CV entries, (4) add TTS long-form data.
normalize_manifests.py: 2,586 Gemini 2.5 Flash API calls (~$0.62), 1,137 number changes in train + 888 in TTS, 85 en-dash and 28 corrupted CV entries fixed. Downloaded and extracted TTS long-form dataset (6,365 entries, 4.8 GB). Split TTS data into train (4,377), eval_tts (487, mean 14.5s), and long-form pool (1,501 entries >20s). Sampled 200 entries intoeval_long_form.jsonl(seed 42) and returned 1,301 to training, yieldingtrain_manifest_combined.jsonl(28,858 entries, 75.6h). Round 2 training config created (configs/canary_finetune_finnish_v2.yaml). Training ready to launch. - 2026-03-01: Training crash diagnosed and fixed. Round 2 training ran 505 steps then crashed with CUDA
vectorized_gather_kernel index out of bounds. Root cause: entry 14857 intrain_manifest_combined.jsonlcontained 11,247 chars of Python code (Gemini normalization returned a code block instead of a transcript forvoxpopuli_005371.wav). When tokenized with the canary2 prompt format, the sequence far exceeded the decoder'smax_sequence_length=1024, causing position-embedding OOB. Additionally, 4 entries ineval_common_voice.jsonhad TSV metadata contamination (same v1 issue, not previously caught in the v2 eval set). Both manifests fixed. Config rewritten from full-architecture spec to minimal v1-style format (tokenizer: update_tokenizer: false) usingspeech_to_text_finetune.py(which restores the full model from the.nemofile). Training re-launched. Manifests synced tocanary-finnish-asr-dataHuggingFace dataset repo.