Finnish ASR: Canary-v2 Finetuning & Progress

This document provides a high-level overview of our Finnish ASR finetuning process, model architecture, and current progress for the Data Science team.

📊 Project Overview

Our goal is to adapt NVIDIA's Canary-v2 (a 1-billion-parameter multilingual model) for high-accuracy Finnish Automatic Speech Recognition (ASR). We use four diverse datasets to ensure robustness across different domains and speaking styles.

🏗️ Model Architecture

Canary-v2 is an Attention-Encoder-Decoder (AED) model that utilizes the Fast-Conformer architecture. This design allows for efficient processing of long audio sequences while maintaining high accuracy.

graph TD
    A[Audio Input] -->|Preprocessing| B[Mel Spectrogram]
    
    subgraph TrainingBlock [Finetuned Components]
        direction TB
        subgraph Encoder [Encoder: Acoustic Modeling]
            C1[Convolutional Subsampling] -->|Downsample| C2[Conformer Blocks]
            C2 -->|Latent Features| C_Out[Acoustic Latents]
        end

        subgraph Decoder [Decoder: Language Modeling]
            D1[Masked Self-Attention] --> D2[Cross-Attention]
            D2 --> D3[Feed Forward]
            D3 --> D_Out[Text Generation]
        end
    end

    B -->|Input| C1
    P[Input Prompts:<br/>Lang, Task, PnC] -->|Conditioning| D1
    C_Out -->|Acoustic Context| D2
    D_Out -->|Output| E[Finnish Text]

    %% Styling
    style TrainingBlock fill:#f0f7ff,stroke:#0052cc,stroke-width:3px,stroke-dasharray: 5 5
    style A fill:#ffffff,stroke:#333,stroke-width:2px
    style B fill:#ffffff,stroke:#333,stroke-width:2px
    style P fill:#ffffff,stroke:#333,stroke-width:2px
    style E fill:#e6ffed,stroke:#28a745,stroke-width:2px
    
    style Encoder fill:#ffffff,stroke:#0052cc,stroke-width:1px
    style Decoder fill:#ffffff,stroke:#0052cc,stroke-width:1px

Component Roles & Finetuning:

Highlighted Area (Blue Dashed Box): This represents the core weights of the Canary-v2 model. During our finetuning, we update the parameters in both the Encoder and Decoder to specifically recognize Finnish phonemes and grammar.
Mel Spectrogram: The "Vision" stage. It turns raw audio waves into a structured 2D representation of sound frequencies over time.
Fast-Conformer Encoder: The "Acoustic Processor." We finetuned this to understand the unique sounds of the Finnish language (like double vowels and consonants).
Input Prompts: The "Context Injector." These are the same color as other inputs because they are part of the model's standard input pipeline, telling it: "Act as a Finnish ASR system."
Attention-Decoder: The "Linguistic Brain." We finetuned this to map the Finnish sounds from the encoder into grammatically correct Finnish text, guided by the prompts.

🔄 Finetuning Workflow

Our pipeline is fully automated, from data ingestion to multi-dataset evaluation.

graph TD
    subgraph DataPrep [Data Preparation]
        D1[CSS10 Finnish] --> P[Unified Processing Script]
        D2[FLEURS Finnish] --> P
        D3[VoxPopuli Finnish] --> P
        D4[Common Voice v24] --> P
        P --> M1[train_manifest.json]
        P --> M2[eval_fleurs.json]
        P --> M3[eval_common_voice.json]
        P --> M4[eval_css10.json]
        P --> M5[eval_voxpopuli.json]
    end

    subgraph Training [Canary-v2 Finetuning]
        M1 --> T[NVIDIA NeMo Trainer]
        CM[nvidia/canary-1b-v2] --> T
        T --> CK[Model Checkpoints]
        M2 & M3 & M4 & M5 --> V[Multi-Validation]
        V --> W[WandB Tracking]
    end

    subgraph Inference [Post-Processing]
        CK --> Inf[Inference]
        Inf --> K[KenLM/NGPU-LM Integration]
        K --> R[Final ASR Output]
    end

📚 Datasets

We use a balanced mix of datasets to cover various audio qualities and transcript styles:

Dataset	Source	Characteristics
FLEURS	Google	High-quality, diverse speakers (Benchmark)
Common Voice	Mozilla	Crowdsourced, varied quality and accents
CSS10	LibriVox (single speaker)	Clean, high-quality audiobooks
VoxPopuli	EU Parliament	European Parliament speeches (Formal)

📊 Training Data Analysis

This section documents the composition and length distribution of our training data (from RASMUS/canary-finnish-asr-data, accessed 2026-02-26).

Dataset Summary

Dataset	Samples	Mean Duration	Max Duration	Total Hours
Common Voice v24	9,086	4.5s	10.5s	11.2h
VoxPopuli	8,164	10.1s	50.5s	23.0h
CSS10	3,226	7.7s	20.2s	6.9h
FLEURS	2,704	11.7s	43.2s	8.8h
TOTAL	23,180	7.8s	50.5s	~50h

Duration Distribution (Training Set)

 0–5s   : 33.3%  (7,725 samples)  ████████████████████████████████████████
 5–10s  : 43.7%  (10,139 samples) █████████████████████████████████████████████████████
10–15s  : 15.0%  (3,473 samples)  ██████████████████
15–20s  :  5.4%  (1,241 samples)  ██████
20–30s  :  2.4%  (562 samples)    ███
  >30s  :  0.2%  (40 samples)

Key insight: 77% of training samples are shorter than 10 seconds. The model has very little exposure to longer audio segments (only 0.2% are >30s). This has direct implications for long-form inference stability.
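The bucket counts above can be reproduced from any NeMo-style JSONL manifest. The following is a minimal sketch, assuming each manifest line is a JSON object with a `duration` field in seconds and that buckets are half-open (a 5.0s clip falls into `5-10s`). The helper names are illustrative and are not part of the project's scripts.

```python
import json
from collections import Counter

# Bucket edges in seconds, matching the distribution table; the last bucket is open-ended.
BUCKETS = [(0, 5), (5, 10), (10, 15), (15, 20), (20, 30), (30, float("inf"))]


def bucket_label(duration):
    """Return the label of the half-open bucket [low, high) a duration falls into."""
    for low, high in BUCKETS:
        if low <= duration < high:
            return f">{low}s" if high == float("inf") else f"{low}-{high}s"
    raise ValueError(f"negative duration: {duration}")


def duration_histogram(manifest_lines):
    """Count manifest entries per duration bucket."""
    counts = Counter()
    for line in manifest_lines:
        if line.strip():
            counts[bucket_label(json.loads(line)["duration"])] += 1
    return counts


def share_below(manifest_lines, threshold):
    """Fraction of entries whose duration is strictly below threshold seconds."""
    durations = [json.loads(line)["duration"] for line in manifest_lines if line.strip()]
    return sum(d < threshold for d in durations) / len(durations)
```

Calling `share_below(lines, 10.0)` on the list of training manifest lines should give roughly 0.77 if the table above is accurate.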

Evaluation Set Durations

Eval Set	Samples	Mean Duration	Max Duration
FLEURS	918	13.0s	33.7s
Common Voice	1,554	5.1s	10.5s
CSS10	170	7.5s	10.2s
VoxPopuli	430	10.6s	47.5s

🔢 Number Handling Analysis

Live Inference Results: Base vs Finetuned (2026-02-26)

We ran both models on 5 FLEURS test samples to determine each model's number output style.

#	Scenario	Reference	Base Canary-v2	Our Finetuned
1	Spoken "sata" (hundred)	`yli sata vuotta`	`yli 100 vuotta` ❌	`yli 100 vuotta` ❌
2	Spoken "seitsemäntoista" (17)	`surmaten seitsemäntoista henkeä`	`surmaten 17 henkeä` ❌	`surmaten seitsemäntoista henkeä` ✅
3	Digits in reference (15, 2011, 2017)	`15 metriä... 2011... 2017`	Correct ✅	Correct ✅
4	Abbreviation "jKr." (AD)	`400 jKr.`	`400 jälkeen Kristuksen`	`400 jälkeen Kristuksen`
5	Range "25–30" (en-dash U+2013)	`25–30 vuodella`	`25-30 vuodella` (ASCII hyphen)	`25 ⁇ 30 vuodella` ❌ UNK token

Key findings:

Base model outputs digits. When the speaker says "sata" (hundred) or "seitsemäntoista" (seventeen), the base Canary-v2 outputs 100 and 17. This looks like NVIDIA's built-in text normalisation: in all of our test samples the base Canary-v2 output digit form for numbers.
Finetuning introduced inconsistency. Our finetuning partially reversed this: for seitsemäntoista the finetuned model now outputs the written word (because FLEURS training transcripts used written-out numbers), but still outputs 100 for sata. This inconsistency is worse than either consistent policy.
En-dash produces a UNK token in the finetuned model. The character – (U+2013 en-dash) in 25–30 causes the finetuned model to emit ⁇ (SentencePiece UNK). The base model degrades gracefully to an ASCII hyphen 25-30. This is a regression introduced by finetuning — likely because the en-dash was absent or inconsistently encoded in our training data.
Abbreviations are expanded by both models. jKr. → jälkeen Kristuksen in both — this is model behaviour, not a finetuning artifact.

Policy Decision

We want digit output (not written-out Finnish number words). The base model's behaviour is correct here. The finetuned model regressed on consistency because our FLEURS training transcripts used written-out numbers.

Training Data Issues Found

Only 2.5% (578 / 23,180) of training samples contain digit characters at all.
FLEURS transcripts use written-out numbers (sata vuotta) while VoxPopuli and Common Voice use digits. This gives the model conflicting signal.
En-dash (– U+2013) may be absent or mis-encoded in training manifests, causing UNK tokens at inference time.
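A quick audit along these lines can be scripted. This is a rough sketch, assuming JSONL manifest lines with a `text` field; the function name is hypothetical and the original audit script is not shown in this document.

```python
import json
import re

DIGIT_RE = re.compile(r"\d")
DASH_RE = re.compile("[\u2013\u2014]")  # en-dash (U+2013) and em-dash (U+2014)


def audit_transcripts(manifest_lines):
    """Count entries that contain digits and entries that contain en/em-dashes."""
    total = with_digits = with_dashes = 0
    for line in manifest_lines:
        if not line.strip():
            continue
        text = json.loads(line)["text"]
        total += 1
        with_digits += int(bool(DIGIT_RE.search(text)))
        with_dashes += int(bool(DASH_RE.search(text)))
    return {"total": total, "with_digits": with_digits, "with_dashes": with_dashes}
```

Dividing `with_digits` by `total` gives the digit share quoted above (578 / 23,180 ≈ 2.5%).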

Action Plan: Numbers & UNK Token

Step 1 — Normalise training transcripts to digit form

Run a pre-processing pass on train_manifest.json before the next training run:

Convert Finnish written-out numbers to digits, e.g. sata → 100, seitsemäntoista → 17. Note that the Python library num2words (locale fi) converts in the opposite direction (digits → words), so it can at most be used to build a reverse lookup table; it is not a ready-made words-to-digits converter.
OR (simpler / safer): switch the FLEURS transcripts in the manifest to the FLEURS column that carries digit-form numbers, if one does (FLEURS provides both raw_transcription and transcription columns; we currently use raw_transcription, which has written-out numbers).
Target: all numeric quantities consistently in digit form across all four datasets.

Step 2 — Fix en-dash encoding (ROOT CAUSE CONFIRMED)

Confirmed via tokenizer inspection (2026-02-26):

m.tokenizer.text_to_ids("25–30")  # → [16053, 1125, 1128, 0, 1126, 1123]
#                                              ↑ id 0 = UNK for the en-dash!
m.tokenizer.text_to_ids("25-30")  # → [16053, 1125, 1128, 16107, 1126, 1123]
#                                              ↑ ASCII hyphen tokenises correctly

En-dash – (U+2013) and em-dash — (U+2014) are NOT in the CanaryBPETokenizer vocabulary (both map to UNK id 0).
Training data contains 85 entries with en-dash (83 FLEURS, 2 Common Voice). During training, the en-dash in the TARGET text was encoded as UNK, so the model learned to produce UNK for the corresponding speech sounds.
Fix: replace all – and — with ASCII hyphen - in all training transcripts before the next training run. This is a one-line preprocessing step.

# In manifest preprocessing:
text = text.replace('\u2013', '-').replace('\u2014', '-')
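To catch other out-of-vocabulary characters before training, the same tokenizer check can be run over every distinct character in the transcripts. A minimal sketch follows, assuming a `text_to_ids` callable such as `m.tokenizer.text_to_ids` from the inspection above and UNK id 0. Encoding characters one at a time is a simplification, since a subword tokenizer may treat a character differently in context.

```python
def find_unk_chars(texts, text_to_ids, unk_id=0):
    """Return the sorted list of non-space characters that encode to unk_id."""
    distinct_chars = {ch for text in texts for ch in text if not ch.isspace()}
    return sorted(ch for ch in distinct_chars if unk_id in text_to_ids(ch))
```

With the real model this would be called as `find_unk_chars(all_transcripts, m.tokenizer.text_to_ids)`, where `all_transcripts` is a hypothetical list of the `text` fields from the manifests.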

Step 3 — Re-evaluate after normalisation

After normalising transcripts, re-run the 5-sample live inference test to verify:

sata vuotta audio → model outputs 100 vuotta
seitsemäntoista audio → model outputs 17
25–30 audio → model outputs 25-30 (no UNK; the en-dash is not in the tokenizer vocabulary, so 25–30 itself cannot be emitted)

🔈 Long-Form Audio: Root Cause Analysis

Our test file moo.wav is 30 minutes (1,800s) of continuous Finnish speech. This reveals a core gap vs. our finetuned Whisper model.

How Canary-v2 Handles Long Audio (Natively)

NVIDIA's Canary-v2 uses dynamic chunking with 1-second overlap between chunks.
This is automatically triggered for audio longer than 40 seconds.
The model was pre-trained on a 1.7M-hour multilingual corpus with this chunking strategy baked in.

Our Current Approach (`inference_vad.py`)

Silero VAD detects speech segments.
Segments are merged into chunks up to chunk_len seconds (default: 15s).
Each chunk is transcribed independently — no shared context between chunks.
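The merging step can be pictured as a greedy pass over the VAD output. The sketch below is an assumption about how such merging might work, not the actual logic of `inference_vad.py` (which is not reproduced in this document). Segments are `(start, end)` tuples in seconds.

```python
def merge_segments(segments, chunk_len=15.0):
    """Greedily merge speech segments into chunks spanning at most chunk_len seconds.

    A single segment longer than chunk_len is kept as its own chunk (it is not split).
    """
    chunks = []
    for start, end in sorted(segments):
        if chunks and end - chunks[-1][0] <= chunk_len:
            chunks[-1][1] = end  # segment still fits: extend the current chunk
        else:
            chunks.append([start, end])  # otherwise start a new chunk
    return [tuple(chunk) for chunk in chunks]
```

Each resulting chunk is then cut from the audio and transcribed on its own, which is why no context carries over between chunks.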

Root Causes of Degradation on Long-Form

Issue	Detail
Training length mismatch	77% of fine-tuning data is <10s and about 92% is <15s. Inference chunks at 15s are therefore longer than the vast majority of training examples, creating distribution shift.
No cross-chunk context	Each 15s chunk is transcribed in isolation. Canary's attention decoder has no memory of previous chunks, so topic/speaker continuity is lost at boundaries.
VAD vs. native chunking	Our VAD-based approach differs from Canary's built-in dynamic chunking. The model was not fine-tuned with this chunking strategy.
Repetition / hallucination	At chunk boundaries with silence or music, the decoder can loop. This is worsened when segments are near the edge of the model's training length distribution.
No overlap	Without overlap between chunks, words at segment boundaries can be dropped or doubled.

Comparison: Canary vs. Our Finetuned Whisper on Long-Form

Whisper was explicitly designed and trained for long-form audio with:

Sliding window inference with overlap
Previous-chunk text as conditioning (prompt-based context)
Timestamps for alignment

Canary's AED architecture does not use previous-chunk text as input, making long-form continuity fundamentally harder to achieve without careful chunk overlap and stitching.

🚀 Progress & Results

Current Status: Model Released & Repository Consolidated

We have successfully completed the finetuning, KenLM integration, and repository consolidation phases. The model and its associated language models are now hosted on Hugging Face at RASMUS/Finnish-ASR-Canary-v2.

Infrastructure: Finetuned on an NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM) on the Verda.com platform in Finland.
Model Suite: Acoustic model + 3 KenLM variants (1M, 2M, 5M sentences).
Best Performance (with KenLM 5M):
- FLEURS: 7.86% WER
- Common Voice: 4.70% WER
- CSS10: 7.07% WER
- VoxPopuli: 11.65% WER
Deployment: Integrated Silero VAD-based inference for robust long-form audio processing.

Next Steps:

Long-form Tuning: Reduce default chunk_len to 8–10s (closer to training distribution median) and add 0.5–1s overlap between chunks to reduce boundary artifacts.
Data Quality Audit: Fix 28 confirmed corrupted Common Voice entries where raw TSV metadata (client ID hashes, gender tags) was accidentally written into the text field. Audit VoxPopuli for missing capitalisation (all-lowercase transcripts despite pnc: yes).
Number Handling: Add Finnish-specific training data with numeric content. Consider TTS-synthesised samples covering phone numbers, years, statistics, and measurements (both digit and written-out forms paired).
Long-form Training Data: Incorporate longer audio segments: TTS synthetic long-form audio (fbc_monolog_processed, parliament data) into the training manifest to shift the duration distribution toward 15–30s.
KenLM Refinement: Re-train KenLM with high-quality punctuated text. Current LM trained on mixed-quality data.
Advanced Evaluation: Implement CER evaluation on non-normalised test sets to better capture punctuation/casing accuracy.
Repetition Penalty: Explore repetition penalty in decoding if chunk-level loops persist after chunk length tuning.
Real-world Evaluation: Benchmark on diverse long-form samples (podcasts, meetings, call-centre audio).

🗺️ Action Plan: Next Training Run

This section details the concrete steps for the next finetuning iteration, based on the root-cause analysis above.

Priority 1 — Fix Training Data (before re-training)

1a. Normalise numbers to digit form (Gemini Flash)

Finnish written-out numbers in FLEURS transcripts cause the finetuned model to output inconsistent number forms. We will use the Gemini Flash API to convert all training transcripts in a single batch pass:

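# Sketch: run once on train_manifest.json before the next training run
import json
import os

import google.generativeai as genai

# The API key is read from the environment (GEMINI_API_KEY must be set beforehand)
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")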

SYSTEM_PROMPT = """You are a Finnish text normalizer.
Convert any written-out Finnish numbers, ordinals, or number words in the text to digit form.
Examples:
  "yli sata vuotta" → "yli 100 vuotta"
  "seitsemäntoista henkeä" → "17 henkeä"
  "vuonna tuhat yhdeksänsataa" → "vuonna 1900"
Keep all other text exactly as-is. Return only the modified text, nothing else."""

entries = []
with open('manifests/train_manifest.json') as f:
    for line in f:
        d = json.loads(line)
        response = model.generate_content(f"{SYSTEM_PROMPT}\n\n{d['text']}")
        d['text'] = response.text.strip()
        entries.append(d)

with open('manifests/train_manifest_normalised.json', 'w') as f:
    for e in entries:
        f.write(json.dumps(e, ensure_ascii=False) + '\n')

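Cost estimate: 23,180 entries × ~50 tokens average = ~1.2M tokens. At Gemini Flash pricing (~$0.075/1M input tokens) this comes to less than $0.10 in total. The estimate covers transcript tokens only; the system prompt sent with each request and the output tokens come on top.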

1b. Fix en-dash UNK token (confirmed root cause)

The en-dash – (U+2013) is NOT in the tokenizer vocabulary — it maps to UNK (id 0). Replace it with ASCII hyphen before training:

# Add to the manifest preprocessing step
text = text.replace('\u2013', '-').replace('\u2014', '-')

This affects 85 entries in train_manifest.json (83 FLEURS, 2 Common Voice).

1c. Fix 28 corrupted Common Voice entries

Replace entries where the text field contains raw TSV metadata (tabs + client_id hashes). Strip everything after the first tab character.
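Steps 1b and 1c are both deterministic and can share one cleaning pass. Below is a minimal sketch, assuming JSONL manifest lines with a `text` field; the function names are illustrative rather than taken from the project's preprocessing script.

```python
import json


def clean_text(text):
    """Apply the deterministic fixes: strip TSV metadata and replace en/em-dashes."""
    # 1c: keep only the transcript, i.e. everything before the first tab.
    text = text.split("\t", 1)[0]
    # 1b: en-dash and em-dash are not in the tokenizer vocabulary; use an ASCII hyphen.
    return text.replace("\u2013", "-").replace("\u2014", "-").strip()


def clean_manifest_lines(lines):
    """Return cleaned JSONL lines; all fields other than `text` are left untouched."""
    cleaned = []
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        entry["text"] = clean_text(entry["text"])
        cleaned.append(json.dumps(entry, ensure_ascii=False))
    return cleaned
```

Running this pass before the number normalisation keeps the API input free of metadata debris.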

Priority 2 — Add Long-Form Training Data

TTS Long-Form Dataset: `RASMUS/canary_asr_finetune_tts_long_data`

Property	Value
Size	8.0 GB zip
Format	FLAC audio + JSONL manifest
Mean duration	16.5s (vs 7.8s in current data)
Median duration	15.9s
Max duration	25.0s
Content	Finnish speech: lectures, podcasts, YouTube
Segments >20s	~25%

This dataset directly addresses the training length mismatch. Adding it will shift the duration distribution from a mean of 7.8s toward ~10–12s and significantly increase the proportion of 15–25s segments that match inference chunk lengths.

Integration plan:

# Download the dataset
curl -L -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/datasets/RASMUS/canary_asr_finetune_tts_long_data/resolve/main/canary_dataset.zip" \
  -o /workspace/data/tts_long_data.zip

# Extract
unzip /workspace/data/tts_long_data.zip -d /workspace/data/tts_long_data/

# Apply number normalisation and dash fix to canary_manifest.jsonl
# then merge with existing train_manifest_normalised.json

After applying number normalisation and dash fixes to the new manifest, concatenate with the existing training set. Expected combined size: ~23,180 + N (estimate 5,000–20,000+ entries depending on total dataset size).

Priority 3 — Inference Tuning (without re-training)

Even before re-training, we can improve moo.wav performance by adjusting inference_vad.py:

Parameter	Current	Recommended
`chunk_len`	15s	8–10s (close to the training mean of 7.8s)
chunk overlap	0s	0.5s (reduce boundary word drops)
`alpha` (KenLM)	0.2	Try 0.1–0.15 (current may over-constrain decoder)
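Once chunks overlap, the words inside the overlap are transcribed twice and have to be merged. The following sketch shows one simple way to do this, by dropping a repeated word sequence at the boundary. It is an illustration only: `stitch` and `max_overlap_words` are hypothetical names, and exact word matching will miss repeats that differ in punctuation or casing beyond lowercase.

```python
def stitch(prev_text, next_text, max_overlap_words=5):
    """Join two chunk transcripts, dropping words repeated at the chunk boundary."""
    prev_words, next_words = prev_text.split(), next_text.split()
    limit = min(max_overlap_words, len(prev_words), len(next_words))
    # Try the longest candidate overlap first, then shrink.
    for n in range(limit, 0, -1):
        tail = [w.lower() for w in prev_words[-n:]]
        head = [w.lower() for w in next_words[:n]]
        if tail == head:
            return " ".join(prev_words + next_words[n:])
    return " ".join(prev_words + next_words)
```

A genuine repetition in the speech that happens to sit on a boundary would also be collapsed, so the overlap window should stay small.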

🔄 Round 2: Data Pipeline & Splits

This section documents the data preparation methodology for Round 2 finetuning, including all new eval sets, the TTS integration, and the final manifest composition.

Overview of Changes vs Round 1

Item	Round 1	Round 2
Base model	`canary-1b-v2.nemo`	`canary-1b-v2.nemo` (fresh start)
Training samples	23,180	28,858
Training hours	~50h	75.6h
Mean duration	7.8s	9.4s
Max duration allowed	20.0s	30.0s
Transcripts normalised	No	Yes (digits, dashes fixed)
Eval sets	4	6

Step 1 — Transcript Normalisation (`normalize_manifests.py`)

All training transcripts were cleaned in two layers:

Deterministic fixes (no API call needed):

En-dash – (U+2013) and em-dash — (U+2014) → ASCII hyphen - (fixes UNK token regression)
Corrupted Common Voice entries (raw TSV metadata in text field) → strip everything after first tab

Gemini 2.5 Flash API calls (2,586 calls in total; 1,137 number changes resulted in the 23,180 training entries):

Pre-filtered with a Finnish number-word regex so only entries that actually contain written numbers are sent to the API (cost: ~$0.62)
Written Finnish numbers converted to digit form: sata vuotta → 100 vuotta, seitsemäntoista → 17
Explicit DO NOT CONVERT rules: ordinals (ensimmäinen, toinen), superlative constructions (yksi tärkeimmistä), and toinen as "another/other"
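The pre-filter could look roughly like the following. This is a sketch, not the regex used in `normalize_manifests.py` (which is not shown here). It only matches uninflected number words and compounds built from them, so inflected forms such as sadan or kahden would be missed, and it still flags cases like yksi tärkeimmistä that the API prompt is expected to leave alone.

```python
import re

# Base-form building blocks of Finnish cardinal numbers (longer variants listed first).
NUMBER_STEMS = [
    "seitsemän", "kahdeksan", "yhdeksän", "kymmenen", "kymmentä", "miljoona",
    "tuhatta", "toista", "nolla", "kaksi", "kolme", "neljä", "viisi", "kuusi",
    "sataa", "tuhat", "yksi", "sata",
]

# A word counts as a number word if it consists only of number stems,
# e.g. "sata", "seitsemäntoista", "kaksikymmentäviisi".
NUMBER_WORD_RE = re.compile(
    r"\b(?:" + "|".join(NUMBER_STEMS) + r")+\b", re.IGNORECASE
)


def needs_normalisation(text):
    """Return True if the transcript contains at least one written-out number word."""
    return NUMBER_WORD_RE.search(text) is not None
```

False positives only cost an extra API call, while false negatives leave written-out numbers in the training data, so a looser pattern may be the better trade-off in practice.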

Step 2 — TTS Long-Form Data Integration

Downloaded RASMUS/canary_asr_finetune_tts_long_data (4.8 GB, 6,365 entries, mean 16.4s).

Aligned to NeMo training format:

Path rewritten to relative style: data/tts_long_data/audio/{filename}
Fields mapped: language → source_lang/target_lang, task: "transcription" → taskname: "asr", added pnc: "yes"
Same Gemini normalisation pass applied (888 entries converted)
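The field mapping can be sketched as a small conversion function. The `language` and `task` source fields and the target fields come from the list above; the source keys `audio_filepath`, `duration` and `text` are assumptions about the TTS manifest layout, which is not documented here.

```python
import os


def to_nemo_entry(tts_entry, audio_dir="data/tts_long_data/audio"):
    """Map one TTS manifest entry to the Canary/NeMo training format."""
    # Assumed source keys: audio_filepath, duration, text, language, task.
    filename = os.path.basename(tts_entry["audio_filepath"])
    return {
        "audio_filepath": f"{audio_dir}/{filename}",  # relative path style
        "duration": tts_entry["duration"],
        "text": tts_entry["text"],
        "source_lang": tts_entry["language"],
        "target_lang": tts_entry["language"],
        "taskname": "asr",  # replaces task: "transcription"
        "pnc": "yes",
    }
```

The original `language` and `task` keys are dropped on purpose so that the combined manifest has one consistent schema.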

Step 3 — Eval Set Construction (TTS Data)

The 6,365 normalised TTS entries were split into train / eval / long-form-test:

All TTS entries (6,365)
│
├── Long-form pool (>20s): 1,501 entries
│   ├── eval_long_form (sampled): 200 entries  ← random.seed(42) shuffle → first 200
│   └── Returned to training pool: 1,301 entries
│
└── Medium pool (10–20s): 4,864 entries
    ├── eval_tts (10% hold-out): 487 entries  ← stratified by duration bucket
    └── tts_train: 4,377 entries

Why eval_long_form = 200 entries? The original 1,501 long-form entries (>20s) had a total duration of ~9.4 hours, far too long to run as a validation set every epoch. At batch_size=32 on a single GPU, each validation pass over 1,501 entries takes ~25 minutes, i.e. more than 8 hours of extra validation time over the 20 planned epochs. 200 entries (≈75 minutes of audio) provide a representative sample of the long-form distribution at reasonable cost: ~4 minutes of eval time per epoch.

eval_tts construction: 487 entries were held out from the 10–20s duration range (10% stratified sample). This tests the model's ability to handle medium-length audio and is separate from the original 4 eval sets.
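The split described above can be sketched as follows. This is an approximation under stated assumptions: entries are dicts with a `duration` field, stratification uses 2-second buckets (the actual bucket width is not documented), and a local `random.Random(42)` instance is used instead of the global `random.seed(42)`, so the selected entries would not match the original split.

```python
import random


def split_tts(entries, long_threshold=20.0, n_long_eval=200,
              holdout=0.10, bucket_width=2.0, seed=42):
    """Split TTS entries into (eval_long_form, eval_tts, train)."""
    rng = random.Random(seed)
    long_pool = [e for e in entries if e["duration"] > long_threshold]
    medium_pool = [e for e in entries if e["duration"] <= long_threshold]

    # Long-form pool: shuffle, take the first n_long_eval, return the rest to training.
    rng.shuffle(long_pool)
    eval_long_form, long_train = long_pool[:n_long_eval], long_pool[n_long_eval:]

    # Medium pool: hold out the same fraction from every duration bucket.
    buckets = {}
    for e in medium_pool:
        buckets.setdefault(int(e["duration"] // bucket_width), []).append(e)
    eval_tts, tts_train = [], []
    for key in sorted(buckets):
        group = buckets[key]
        rng.shuffle(group)
        k = round(len(group) * holdout)
        eval_tts += group[:k]
        tts_train += group[k:]

    return eval_long_form, eval_tts, tts_train + long_train
```

Because rounding happens per bucket, the hold-out size can differ by a few entries from exactly 10% of the medium pool.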

Step 4 — Combined Training Manifest

Final train_manifest_combined.jsonl composition:

Source	Entries	Notes
Original train (normalised)	23,180	Digits + dash fix applied
TTS train (10–20s)	4,377	Synthesised long-form speech
Long-form overflow	1,301	>20s entries not selected for eval_long_form
Total	28,858	Mean 9.4s, 75.6h

Final Eval Sets (Round 2)

Set	File	Entries	Mean Duration	Purpose
`eval_fleurs`	`eval_fleurs.json`	918	13.0s	Primary benchmark (monitored for checkpointing)
`eval_common_voice`	`eval_common_voice.json`	1,554	5.1s	Crowdsourced quality
`eval_css10`	`eval_css10.json`	170	7.5s	Clean single-speaker
`eval_voxpopuli`	`eval_voxpopuli.json`	430	10.6s	Formal/parliament speech
`eval_tts`	`eval_tts.jsonl`	487	14.5s	Medium-length TTS (new)
`eval_long_form`	`eval_long_form.jsonl`	200	22.5s	Long-form >20s sample (new)

Checkpoint monitoring: val_wer tracks FLEURS (first validation set). All 6 WERs are logged independently to WandB.

Round 2 Training Config

File: configs/canary_finetune_finnish_v2.yaml

Key settings:

init_from_nemo_model: /workspace/Finnish-ASR-Canary-v2/models/canary-1b-v2.nemo (fresh start from base)
max_duration: 30.0s (up from 20.0s to include TTS segments up to 25s)
max_steps: 18,000 (scaled: 28,858 / 32 ≈ 902 steps/epoch × 20 epochs ≈ 18,040)
lr: 1e-5, WarmupAnnealing, 500 warmup steps
precision: bf16, single GPU, strategy: auto
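The max_steps arithmetic can be checked with a few lines. The batch size of 32 and the 20 epochs are taken from the formula above; the helper name is illustrative.

```python
import math


def total_steps(num_samples, batch_size, epochs):
    """Return (steps per epoch, total steps) for a fixed batch size."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, steps = total_steps(28_858, 32, 20)
print(steps_per_epoch, steps)  # 902 18040
```

If entries longer than max_duration are filtered out at load time, the real number of steps per epoch will be slightly lower than this estimate.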

🛠️ Workflow Status Details

1. Data Preparation - DONE

Identify and inventory all 4 datasets
Create unified processing script (scripts/prepare_all_manifests.py)
Run scripts/prepare_all_manifests.py on devcontainer
Verify manifest sample counts and audio file integrity

2. Configuration Setup - DONE

Create Hydra training config (configs/canary_finetune_finnish.yaml)
Configure multi-validation with 4 eval datasets
Checkpoint monitors primary eval set (FLEURS) via val_wer
All 4 eval WERs logged independently to WandB

3. Training - DONE

Run finetuning via run_training.sh
Monitor per-dataset WER in WandB

4. KenLM / NGPU-LM Language Model Integration - DONE

Install KenLM tools (install_beamsearch_decoders.sh)
Gather Finnish text (ASR transcripts + Wikipedia + mc4)
Train 3 variants of KenLM (1M, 2M, 5M sentences)
Evaluate with LM fusion on all 4 test sets

5. Repository & Long-Form Inference - IN PROGRESS

Consolidate README and model metadata for Hugging Face release
Upload model checkpoints and KenLM bundles to HF Hub
Implement Silero VAD-based chunking for long-form audio (inference_vad.py)
Root-cause analysis of long-form degradation vs. Whisper (see above)
Reduce chunk_len to 8–10s and add chunk overlap (Current Focus)
Optimize alpha for stability on moo.wav (30 min test file)

6. Data Quality & Advanced Evaluation - PARTIALLY DONE

Fix 28 corrupted Common Voice manifest entries (raw TSV data in text field) — done in normalisation pass.
Fix en-dash/em-dash UNK token regression — done in normalisation pass.
Audit VoxPopuli transcripts for all-lowercase entries (capitalisation missing).
Re-train KenLM with high-quality punctuated text.
Evaluate CER on non-normalized test sets.

7. Number Normalisation & UNK Token Fix - DONE

Replace en-dash – and em-dash — with ASCII hyphen - in all training manifests (85 train + 70 TTS entries fixed).
Use Gemini 2.5 Flash to normalise written-out Finnish numbers to digit form (2,586 API calls across train + TTS).
Re-evaluate on the 5-sample number test set after Round 2 training to verify consistency.

8. Long-Form Data Expansion - DONE

Download RASMUS/canary_asr_finetune_tts_long_data (4.8 GB zip, 6,365 entries, mean 16.4s).
Align TTS manifest to NeMo training format and integrate into combined training manifest.
Round 2 training configured and ready to launch (see the Round 2 section above).
Benchmark Round 2 model against Round 1 and finetuned Whisper on moo.wav.

🛠️ NeMo Environment Setup

This section documents the exact steps to set up a working NeMo inference/training environment, including the fixes required for the nvcr.io/nvidia/pytorch:25.01-py3 container.

Installation (from scratch on pytorch:25.01-py3 base image)

# 1. Clone the HF model repo (contains NeMo source with patches applied)
#    Skip LFS to avoid downloading the 3.6 GB model during clone
GIT_LFS_SKIP_SMUDGE=1 git clone \
  "https://user:${HF_TOKEN}@huggingface.co/RASMUS/Finnish-ASR-Canary-v2" \
  /workspace/Finnish-ASR-Canary-v2

# 2. Install NeMo in editable mode from the patched source
cd /workspace/Finnish-ASR-Canary-v2/NeMo
pip install -e ".[asr]"

# 3. Install pinned dependencies
pip install 'fsspec==2024.12.0' 'numpy<2.0' 'librosa>=0.11.0' kaldialign wandb

Required Compatibility Fixes

The pytorch:25.01-py3 container ships with packages that conflict with NeMo 2.8.0rc0:

# Fix 1: Downgrade lightning to the version NeMo requires (<=2.4.0)
# The container ships lightning 2.4.0 but pip may upgrade it — pin it back.
pip install "lightning==2.4.0" "pytorch-lightning==2.4.0"

# Fix 2: Remove incompatible torchvision
# The container's torchvision (0.20.0a0) was built against torch 2.6.0a0 (the original
# container torch), but NeMo's install upgrades torch to ~2.10. torchvision then fails
# on import and blocks NeMo. ASR does not need torchvision.
pip uninstall -y torchvision

Downloading the Finetuned Model

# Download the finetuned acoustic model (3.6 GB)
curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/canary-finnish.nemo" \
  -o /workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo

# KenLM models are also LFS — download the 5M variant (best WER):
curl -L \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  "https://huggingface.co/RASMUS/Finnish-ASR-Canary-v2/resolve/main/kenlm_5M.nemo" \
  -o /workspace/Finnish-ASR-Canary-v2/kenlm_5M.nemo

Quick Inference Smoke Test

import warnings; warnings.filterwarnings('ignore')
from nemo.collections.asr.models import EncDecMultiTaskModel

model = EncDecMultiTaskModel.restore_from(
    '/workspace/Finnish-ASR-Canary-v2/canary-finnish.nemo',
    map_location='cuda'
)
model.eval()

results = model.transcribe(
    audio=['path/to/audio.wav'],
    task='asr', source_lang='fi', target_lang='fi', pnc='yes'
)
print(results[0].text)

Loading the Base Model (for comparison)

# Downloads ~3.6 GB on first run, cached in ~/.cache/huggingface/
model_base = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2", map_location='cuda')

📝 Progress Log

2026-01-11: Initial project setup.
2026-02-08: Redesigned data pipeline for 4 real datasets (CSS10, FLEURS, VoxPopuli, Common Voice).
2026-02-10: Finetuning complete. Epoch 11 reached val_wer=0.1258 on FLEURS.
2026-02-13: Mermaid diagrams and project documentation for DS team.
2026-02-18: KenLM benchmarks finished. Consolidated repository structure. Applied NeMo patches for inference stability.
2026-02-20: Model Released. Release of Finnish-ASR-Canary-v2 on HF. Implemented VAD-based inference pipeline. Currently tuning for long-form stability on moo.wav with various alpha settings (0.0 - 0.4 tested).
2026-02-26: Root-cause analysis complete. Investigated long-form gap vs. Whisper and number handling. Key findings: (1) 77% of training data is <10s, creating distribution shift at inference chunk lengths; (2) No cross-chunk context in Canary's AED architecture; (3) Only 2.5% of training samples contain digit characters — numbers are a known weak point; (4) 28 corrupted Common Voice entries found (TSV metadata in text field); (5) moo.wav test file confirmed as 30 minutes. Action plan: shorten chunk_len, add chunk overlap, fix data corruption, and plan a long-form training data expansion round.
2026-02-26: Live number inference + tokenizer audit completed. Ran base Canary-v2 vs. finetuned model on 5 FLEURS samples. Confirmed: (1) base model always outputs digits (100, 17); (2) finetuned model regressed to mixed output — sometimes written words, sometimes digits — due to inconsistent training transcripts; (3) en-dash (–) produces UNK token ⁇ in finetuned model, base model degrades gracefully to ASCII hyphen. Policy decision: standardise on digit output and fix en-dash encoding in training manifests before next training run. NeMo environment setup documented (with fixes for torchvision and lightning version conflicts). TTS long-form dataset (canary_asr_finetune_tts_long_data, 8GB, mean 16.5s/segment) identified as key data source for next training run. Action plan for next run: (1) normalise numbers to digits via Gemini Flash API, (2) fix en-dash → ASCII hyphen, (3) fix 28 corrupted CV entries, (4) add TTS long-form data.
2026-03-01: Round 2 data pipeline complete. Ran normalize_manifests.py: 2,586 Gemini 2.5 Flash API calls (~$0.62), 1,137 number changes in train + 888 in TTS, 85 en-dash and 28 corrupted CV entries fixed. Downloaded and extracted TTS long-form dataset (6,365 entries, 4.8 GB). Split TTS data into train (4,377), eval_tts (487, mean 14.5s), and long-form pool (1,501 entries >20s). Sampled 200 entries into eval_long_form.jsonl (seed 42) and returned 1,301 to training, yielding train_manifest_combined.jsonl (28,858 entries, 75.6h). Round 2 training config created (configs/canary_finetune_finnish_v2.yaml). Training ready to launch.
2026-03-01: Training crash diagnosed and fixed. Round 2 training ran 505 steps then crashed with CUDA vectorized_gather_kernel index out of bounds. Root cause: entry 14857 in train_manifest_combined.jsonl contained 11,247 chars of Python code (Gemini normalization returned a code block instead of a transcript for voxpopuli_005371.wav). When tokenized with the canary2 prompt format, the sequence far exceeded the decoder's max_sequence_length=1024, causing position-embedding OOB. Additionally, 4 entries in eval_common_voice.json had TSV metadata contamination (same v1 issue, not previously caught in the v2 eval set). Both manifests fixed. Config rewritten from full-architecture spec to minimal v1-style format (tokenizer: update_tokenizer: false) using speech_to_text_finetune.py (which restores the full model from the .nemo file). Training re-launched. Manifests synced to canary-finnish-asr-data HuggingFace dataset repo.
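A cheap sanity check on the manifests before launch would likely have caught both problems from the last entry. The following is a hedged sketch: the function name and both thresholds (`max_chars_per_second`, `max_chars`) are assumptions chosen for illustration and would need tuning against real transcripts.

```python
import json

CODE_FENCE = "`" * 3  # marker of a Markdown code block in an LLM response


def find_suspicious_entries(manifest_lines, max_chars_per_second=40.0, max_chars=1000):
    """Return (line index, reason) for entries whose text looks corrupted."""
    problems = []
    for idx, line in enumerate(manifest_lines):
        if not line.strip():
            continue
        entry = json.loads(line)
        text, duration = entry["text"], entry["duration"]
        if "\t" in text:
            problems.append((idx, "tab in text"))  # TSV metadata contamination
        elif CODE_FENCE in text or "\n" in text:
            problems.append((idx, "code block or newline in text"))
        elif len(text) > max_chars or len(text) > max_chars_per_second * duration:
            problems.append((idx, "text too long for audio"))
    return problems
```

An 11,247-character text attached to a short clip fails the length check immediately, long before it reaches the tokenizer.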