# Sukuma-STT: Automatic Speech Recognition for Sukuma

## Model Description
Sukuma-STT is the first automatic speech recognition (ASR) model for Sukuma (Kisukuma), a Bantu language spoken by approximately 10 million people in northern Tanzania. This model was developed as part of the research presented in "Learning from Scarcity: Building and Benchmarking Speech Technology for Sukuma".
- Model Architecture: Whisper Large V3
- Fine-tuning Framework: Unsloth
- Language: Sukuma (ISO 639-3: suk)
- Task: Speech-to-Text Transcription
- License: [Specify your license]
## Intended Use

### Primary Use Cases
- Transcription of Sukuma speech to text
- Research on low-resource African language ASR
- Development of inclusive voice-enabled applications for Sukuma speakers
- Language documentation and preservation efforts
### Out-of-Scope Use
- Real-time transcription in safety-critical applications without human verification
- Languages other than Sukuma
- Noisy environments significantly different from training data
## Training Data

### Dataset: Sukuma Voices
The model was trained on the Sukuma Voices dataset, the first open speech corpus for Sukuma.
| Metric | Value |
|---|---|
| Total Samples | 2,588 |
| Total Duration | 7.47 hours |
| Average Duration | 10.39 ± 4.20 sec |
| Duration Range | 1.74 - 30.36 sec |
| Total Words | 51,107 |
| Unique Vocabulary | 10,007 |
| Avg. Words/Sample | 19.7 |
| Speaking Rate | 115.5 WPM |
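
The summary statistics above can be recomputed from the released corpus. A minimal sketch is shown below; the `train` split name and the `audio`/`text` column names are assumptions about the dataset schema, not confirmed values.

```python
# Sketch: recomputing the corpus statistics above from the released dataset.
# Assumptions: a "train" split and "audio"/"text" column names; adjust to the
# actual schema of sartifyllc/Sukuma-Voices-ACL if it differs.
import numpy as np
from datasets import Audio, load_dataset

ds = load_dataset("sartifyllc/Sukuma-Voices-ACL", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

durations = np.array([len(ex["audio"]["array"]) / 16_000 for ex in ds])
word_counts = np.array([len(ex["text"].split()) for ex in ds])
vocab = {w for ex in ds for w in ex["text"].split()}

print(f"Total samples:     {len(ds)}")
print(f"Total duration:    {durations.sum() / 3600:.2f} h")
print(f"Average duration:  {durations.mean():.2f} ± {durations.std():.2f} s")
print(f"Duration range:    {durations.min():.2f} - {durations.max():.2f} s")
print(f"Total words:       {word_counts.sum()}")
print(f"Unique vocabulary: {len(vocab)}")
print(f"Avg. words/sample: {word_counts.mean():.1f}")
print(f"Speaking rate:     {word_counts.sum() / (durations.sum() / 60):.1f} WPM")
```
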
### Data Source
Audio recordings and textual transcriptions from the Sukuma New Testament 2000 translation, sourced from Bible.com.
### Data Split
- Training: 80%
- Test: 20%
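
A sketch of an equivalent 80/20 split with `datasets.train_test_split` is shown below; the seed is illustrative, so the resulting partition may not match the exact split used in the paper.

```python
# Sketch: an 80/20 hold-out split; the seed is illustrative and may not
# reproduce the exact partition used in the paper.
from datasets import load_dataset

ds = load_dataset("sartifyllc/Sukuma-Voices-ACL", split="train")
splits = ds.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))
```
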
## Training Procedure

### Hyperparameters
| Parameter | Value |
|---|---|
| Base Model | Whisper Large V3 |
| Fine-tuning Method | Full fine-tuning |
| Batch Size | 2 |
| Gradient Accumulation Steps | 4 |
| Effective Batch Size | 8 |
| Learning Rate | 1e-4 |
| LR Scheduler | Linear |
| Warmup Steps | 5 |
| Epochs | 4 |
| Weight Decay | 0.01 |
| Optimizer | AdamW 8-bit |
| Precision | FP32 |
| Audio Sampling Rate | 16kHz |
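
For orientation, these hyperparameters map onto Hugging Face `Seq2SeqTrainingArguments` roughly as sketched below. This is not the released `train_asr.py`; `output_dir` and `report_to` are illustrative, and the `optim` string assumes the bitsandbytes-backed 8-bit AdamW available in recent `transformers` releases.

```python
# Sketch: Seq2SeqTrainingArguments mirroring the hyperparameter table above.
# Not the released train_asr.py; output_dir and report_to are illustrative.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="sukuma-stt-whisper-large-v3",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,  # effective batch size: 2 x 4 = 8
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=5,
    num_train_epochs=4,
    weight_decay=0.01,
    optim="adamw_8bit",             # 8-bit AdamW (bitsandbytes)
    fp16=False,                     # FP32 training, per the table
    predict_with_generate=True,
    report_to="wandb",
)
```
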
### Training Infrastructure
- Framework: Unsloth + Hugging Face Transformers
- Logging: Weights & Biases
## Evaluation Results

### Performance Metrics
| Metric | Original Speech | Synthetic Speech |
|---|---|---|
| Final WER | 25.19% | 32.60% |
| Min WER | 22.01% | 29.97% |
| WER Reduction | 82.94% | 78.93% |
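
WER here is the standard word error rate. A minimal sketch of computing it with the `evaluate` library (which delegates to `jiwer`) is shown below; the strings are toy placeholders, not Sukuma examples from the test set.

```python
# Sketch: computing WER with the `evaluate` library (pip install evaluate jiwer).
# The strings below are toy placeholders, not real Sukuma data.
import evaluate

wer_metric = evaluate.load("wer")
references = ["this is a toy reference sentence"]
predictions = ["this is a toy predicted sentence"]
wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {100 * wer:.2f}%")  # 1 substitution over 6 reference words -> 16.67%
```
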
### Training Dynamics
The model demonstrated stable convergence with the following characteristics:
Final 5 evaluation steps (77-81):
- Original: mean WER 25.27% (σ = 0.44)
- Synthetic: mean WER 33.03% (σ = 0.51)

Learning trajectory: strong correlation between the original and synthetic evaluation curves (Pearson's r = 0.997, p < 0.001).
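
The reported correlation is an ordinary Pearson's r over the per-step WER curves; a sketch with `scipy.stats.pearsonr` is below, using made-up placeholder curves rather than the logged values.

```python
# Sketch: Pearson correlation between the original and synthetic WER curves.
# The two lists are made-up placeholders; in practice, use the per-step WER
# values logged to Weights & Biases during evaluation.
from scipy.stats import pearsonr

wer_original = [100.0, 60.0, 40.0, 30.0, 25.0]   # placeholder curve
wer_synthetic = [110.0, 72.0, 50.0, 38.0, 33.0]  # placeholder curve
r, p_value = pearsonr(wer_original, wer_synthetic)
print(f"Pearson's r = {r:.3f}, p = {p_value:.3g}")
```
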
### Comparison with Baseline

Whisper Large V3 significantly outperformed Wav2Vec2-large-XLSR-53 in this low-resource setting, converging faster and reaching a lower final WER.
## Limitations

### Data Limitations
- Domain Bias: Training data is primarily from biblical texts, which may not generalize well to everyday conversational Sukuma
- Limited Size: Only 7.47 hours of training data
- Single Source: All data from one recording source, limiting speaker diversity
### Orthographic Challenges
- Sukuma has two written forms (with and without diacritics)
- Current model trained on non-diacritic version only
- Standard WER metrics may not fully capture diacritic-related errors
### Performance Gaps
- 28.34% average relative increase in WER on synthetic speech compared to original recordings
- Model performance may vary on spontaneous speech, different accents, or noisy environments
## Ethical Considerations
- Bias: Models trained on religious texts may reflect biases present in that domain
- Consent: All human evaluators provided informed consent
- Intended Benefit: This work aims to develop inclusive speech technologies for underrepresented language communities
## How to Use

### Installation

```bash
pip install transformers torch unsloth
```
### Inference

```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("sartifyllc/sukuma-voices-asr")
processor = WhisperProcessor.from_pretrained("sartifyllc/sukuma-voices-asr")

# Load and preprocess audio
audio_array = ...  # Your audio as a NumPy array sampled at 16 kHz
input_features = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
).input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
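
If the audio is in a file rather than already in memory, one option (an assumption, not part of the released code) is to load it at 16 kHz with `librosa`, or to use the higher-level `pipeline` API; the file path below is a placeholder and the repo id follows the example above.

```python
# Sketch: file-based transcription via the high-level pipeline API.
# "audio.wav" is a placeholder path; the repo id follows the example above.
import librosa
from transformers import pipeline

audio_array, _ = librosa.load("audio.wav", sr=16000)  # resample to 16 kHz

asr = pipeline(
    "automatic-speech-recognition",
    model="sartifyllc/sukuma-voices-asr",
    chunk_length_s=30,  # Whisper processes audio in 30-second windows
)
print(asr(audio_array)["text"])
```
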
### Training Script

See `train_asr.py` for the complete training pipeline.
## Citation
If you use this model or the Sukuma Voices dataset, please cite:
```bibtex
@inproceedings{mgonzo2025sukuma,
  title={Learning from Scarcity: Building and Benchmarking Speech Technology for Sukuma},
  author={Mgonzo, Macton and Oketch, Kezia and Etori, Naome A. and Mang'eni, Winnie and Nyaki, Elizabeth and Mollel, Michael S.},
  booktitle={Proceedings of the Association for Computational Linguistics},
  year={2025},
  institution={Brown University, University of Notre Dame, University of Minnesota - Twin Cities, Pawa AI, Sartify Company Limited}
}
```
## Resources
- Code: https://github.com/Sartify/sukuma-voices
- Dataset: https://huggingface.co/datasets/sartifyllc/Sukuma-Voices-ACL
- Model: https://huggingface.co/sartifyllc/Sukuma-STT
## Acknowledgments
This work was supported by:
Special thanks to all volunteers who contributed to the evaluation process.
## Contact
For questions or collaboration inquiries, please contact:
- Macton Mgonzo: macton_mgonzo@brown.edu
Model Card Version: 1.0
Last Updated: February 2025