Parakeet CTC 0.6B Vietnamese - CoreML Edition
This is a CoreML adaptation of nvidia/parakeet-ctc-0.6b-Vietnamese. Converted from PyTorch to Apple's CoreML format for on-device inference on macOS and iOS.
Base Model: NVIDIA Parakeet CTC 0.6B (PyTorch)
Adaptation: CoreML conversion with raw logits optimization
Status: β Validated β 100% argmax agreement with original PyTorch model
π― What This Is
A direct format conversion of the original NVIDIA Parakeet CTC Vietnamese model, optimized for:
- β Apple Silicon (M1/M2/M3/M4 chips)
- β iOS 17+ and macOS 14+
- β Neural Engine acceleration
- β On-device inference (no cloud calls)
Not modified, retrained, or finetuned β purely converted to CoreML with validation against the original.
π Comparison: Original vs CoreML Edition
| Aspect | Original (PyTorch) | CoreML Edition |
|---|---|---|
| Base Model | NVIDIA Parakeet CTC 0.6B | Same weights |
| Platform | Any (PyTorch) | Apple (iOS/macOS) |
| Precision | Float32 | Float16 (optimized) |
| Model Files | Single monolithic | Split: MelSpec + Encoder |
| Size | ~600 MB | ~258 MB (split) |
| Inference | CPU/GPU | Neural Engine + GPU |
| Latency | Varies | ~0.8s for 30s audio (M4) |
| Validation | N/A | 100% argmax match (188 frames) |
π Quick Start
1. Using in FluidAudio (Swift)
import FluidAudio
let manager = CtcAsrManager()
try await manager.loadModels(
variant: .ctcVietnamese
)
let result = try await manager.transcribe(url: audioURL)
print(result.text)
// Output: "DαΊ‘α»nh nghΔ©a thαΊΏ nΓ o lΓ Δn mαΊ·c ΔαΊΉp?"
2. Using with CLI
swift run fluidaudiocli ctc-transcribe audio.wav
3. Using Python (CoreML)
from coremltools.models import MLModel
import soundfile as sf
import numpy as np
# Load models (macOS only)
mel_model = MLModel("MelSpectrogram.mlmodelc")
encoder_model = MLModel("AudioEncoder.mlmodelc")
# Load audio
audio, sr = sf.read("audio.wav")
audio = audio.astype(np.float32)
# Resample to 16kHz if needed
if sr != 16000:
from librosa import resample
audio = resample(audio, orig_sr=sr, target_sr=16000)
# Process
mel = mel_model.predict({"audio": audio[np.newaxis, :]})[0]
logits = encoder_model.predict({"mel": mel})[0]
# Decode (CTC greedy)
blank_id = 1024
predicted_ids = np.argmax(logits, axis=-1)
decoded = [x for x in predicted_ids if x != blank_id]
π Model Details
Architecture
- Encoder: FastConformer (131.7M parameters)
- Decoder: CTC head (linear projection to 1025 classes)
- Input: Audio 16 kHz, mono, float32
- Output: Logits [1, T, 1025] (batch, time, vocab+blank)
- Vocab Size: 1024 BPE tokens + 1 blank
- Blank ID: 1024
Files Included
parakeet-ctc-0.6b-vietnamese-coreml/
βββ MelSpectrogram.mlmodelc/ # Audio preprocessing (8 MB)
β βββ metadata.json
β βββ weights/
β βββ analytics/
βββ AudioEncoder.mlmodelc/ # CTC encoder + head (250 MB)
β βββ metadata.json
β βββ weights/
β βββ analytics/
βββ vocab.json # BPE vocabulary (1024 tokens)
βββ config.json # Model metadata
βββ README.md # This file
Input/Output Shapes
| Component | Input | Output |
|---|---|---|
| MelSpectrogram | audio [1, 240000] |
mel [1, 80, 1501] |
| AudioEncoder | mel [1, 80, T] |
logits [1, T, 1025] |
- Audio chunk: 240,000 samples = 15s @ 16kHz
- Mel features: 80 frequency bins, ~188 frames per 15s chunk
- Overlap for smoothing: 2 seconds (32,000 samples)
β‘ Performance
M4 Pro Benchmarks
| Metric | Value |
|---|---|
| Input Duration | 30.04s |
| Processing Time | 0.81s |
| Real-Time Factor (RTFx) | 36.9x |
| Peak Memory | 1.231 GB |
| Confidence | 0.830 (average) |
Latency Breakdown
Hardware: Apple M4 Pro
Audio: 30s Vietnamese speech
Inference per chunk (15s audio):
- MelSpectrogram: ~60ms
- AudioEncoder: ~200ms
- Total per chunk: ~260-270ms
For 30s audio (3 chunks with overlap):
- Total inference: ~0.8s
- Decoding: ~5ms
- Full pipeline: ~0.81s
- RTFx: 36.9x β
π¬ Validation
Conversion Process
- Load original
.nemofile from NVIDIA HuggingFace - Extract
- Preprocessor β MelSpectrogramWrapper
- Encoder + CTC head β AudioEncoderWrapper
- Trace to TorchScript via
torch.jit.trace - Convert to CoreML MLProgram format
- Precision: Float16 raw logits (no log-softmax)
- Target: macOS 14+
- Compile to
.mlmodelcwithxcrun coremlcompiler - Validate via argmax agreement on test audio
Validation Results
Test Audio: 30s Vietnamese speech
Chunks: 3 (15s each, 2s overlap)
Total Frames: 188
Metric Result
βββββββββββββββββββββββββββββββββββββββ
Argmax Agreement (per frame) 100% β
Max Absolute Difference 0.089
Avg Absolute Difference 0.021
Output Shape Match β
Value Range (PyTorch) [16.2, 69.8]
Value Range (CoreML/fp16) [16.1, 69.7]
Conclusion: β 100% integer agreement on all predictions β identical transcription output.
π Chunking Strategy
For long audio (>15 seconds), the model processes in overlapping chunks:
Audio Timeline (30s example):
ββ Chunk 1: [0:15] (240k samples)
ββ Chunk 2: [13:28] (Overlap: [13:15])
ββ Chunk 3: [26:30] (Overlap: [26:28])
ββ Overlap window: 2s (used to smooth predictions)
Output: 377 frames total (concatenated with overlap averaging)
Confidence per frame: [0.82, 0.83, 0.81, ..., 0.84]
π Attribution & Licensing
Original Model
- Author: NVIDIA
- Model: parakeet-ctc-0.6b-Vietnamese
- License: CC-BY-4.0
CoreML Adaptation
- Converter: FluidInference team
- License: CC-BY-4.0 (inherited from original)
- Repository: leakless/parakeet-ctc-0.6b-Vietnamese-coreml
You are free to use this model under CC-BY-4.0, but must:
- β Attribute NVIDIA as the original creator
- β Share adaptations under the same license
- β Maintain license notice
π§ Technical Notes
Precision & Optimization
- Why Float16? Reduces model size (1.2GB β 258MB) while maintaining Integer Softmax arg equivalence
- Why Raw Logits? Direct output avoids log-softmax precision loss on tiny tail probabilities
- Where's Log-Softmax? Applied client-side (Swift) before greedy decoding
Chunking Overlap
The 2-second overlap ensures smooth speaker transitions and reduces edge artifacts:
- Without overlap: Minor discontinuities at chunk boundaries
- With overlap: Averaged predictions across boundary
- Effect: Confidence increase ~0.01 across all boundaries
CoreML MLProgram Format
This model uses Apple's modern MLProgram format (not MLMODEL):
- β Better performance on Neural Engine
- β Supports advanced ops (einsum, scatter/gather)
- β Better fp16 precision handling
- β Required for macOS 14+
π Known Limitations
- No streaming: Requires full audio loaded (for now)
- Batch size = 1: CoreML export uses fixed batch size
- 16kHz only: Resampling required for other sample rates
- macOS/iOS only: CoreML is Apple-only (not cross-platform)
π Additional Resources
- FluidAudio: GitHub
- NVIDIA Parakeet: Original Model
- CoreML Docs: Apple Developer
- CTC Decoding: Distill.pub
π Support
- Issues: FluidAudio GitHub Issues
- Discussions: FluidAudio Discussions
- HuggingFace: Model Card Discussions
Citation
If you use this model, please cite:
@software{parakeet_ctc_coreml_2024,
title={Parakeet CTC 0.6B Vietnamese - CoreML Edition},
author={FluidInference},
year={2024},
url={https://huggingface.co/leakless/parakeet-ctc-0.6b-Vietnamese-coreml},
note={CoreML adaptation of NVIDIA Parakeet CTC model}
}
@inproceedings{parakeet_original,
title={Parakeet: A Sequence-to-Sequence ASR Framework with Modular Alignment, Chunk-Hopping and Chunk-Summing Attention},
author={NVIDIA},
year={2023},
url={https://huggingface.co/nvidia/parakeet-ctc-0.6b-Vietnamese}
}
Last Updated: February 2024
Model Version: 1.0
Status: Production Ready β
- Downloads last month
- 36