Sukuma-STT: Automatic Speech Recognition for Sukuma

Model Description

Sukuma-STT is the first automatic speech recognition (ASR) model for Sukuma (Kisukuma), a Bantu language spoken by approximately 10 million people in northern Tanzania. This model was developed as part of the research presented in "Learning from Scarcity: Building and Benchmarking Speech Technology for Sukuma".

  • Model Architecture: Whisper Large V3
  • Fine-tuning Framework: Unsloth
  • Language: Sukuma (suk)
  • Task: Speech-to-Text Transcription
  • License: [Specify your license]

Intended Use

Primary Use Cases

  • Transcription of Sukuma speech to text
  • Research on low-resource African language ASR
  • Development of inclusive voice-enabled applications for Sukuma speakers
  • Language documentation and preservation efforts

Out-of-Scope Use

  • Real-time transcription in safety-critical applications without human verification
  • Languages other than Sukuma
  • Transcription of audio from noisy environments significantly different from the training data

Training Data

Dataset: Sukuma Voices

The model was trained on the Sukuma Voices dataset, the first open speech corpus for Sukuma.

  • Total Samples: 2,588
  • Total Duration: 7.47 hours
  • Average Duration: 10.39 ± 4.20 sec
  • Duration Range: 1.74 - 30.36 sec
  • Total Words: 51,107
  • Unique Vocabulary: 10,007
  • Avg. Words/Sample: 19.7
  • Speaking Rate: 115.5 WPM
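
Statistics of this kind can be computed from (duration, transcript) pairs along the following lines; the sample rows and field layout are illustrative, not the actual Sukuma Voices schema.

import statistics

# Illustrative (duration_seconds, transcript) rows, not actual Sukuma Voices data.
samples = [
    (9.8, "placeholder transcript one"),
    (12.3, "another placeholder transcript"),
]

durations = [d for d, _ in samples]
words_per_sample = [len(t.split()) for _, t in samples]

total_hours = sum(durations) / 3600
total_words = sum(words_per_sample)
vocabulary = {w for _, t in samples for w in t.split()}

print(f"Total duration: {total_hours:.2f} h")
print(f"Average duration: {statistics.mean(durations):.2f} ± {statistics.stdev(durations):.2f} sec")
print(f"Total words: {total_words}, unique vocabulary: {len(vocabulary)}")
print(f"Speaking rate: {total_words / (sum(durations) / 60):.1f} WPM")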

Data Source

Audio recordings and textual transcriptions from the Sukuma New Testament 2000 translation, sourced from Bible.com.

Data Split

  • Training: 80%
  • Test: 20%
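
A split of this kind can be produced with the Hugging Face datasets library, as in the minimal sketch below; the dataset identifier, column name, and random seed are assumptions for illustration, not the exact values used in the paper.

from datasets import Audio, load_dataset

# Hypothetical dataset identifier; substitute the actual Sukuma Voices repository.
ds = load_dataset("sartifyllc/sukuma-voices", split="train")
# The "audio" column name and the 16 kHz cast mirror common speech dataset layouts.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# 80/20 train/test split; the seed here is illustrative.
splits = ds.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))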

Training Procedure

Hyperparameters

  • Base Model: Whisper Large V3
  • Fine-tuning Method: Full fine-tuning
  • Batch Size: 2
  • Gradient Accumulation Steps: 4
  • Effective Batch Size: 8
  • Learning Rate: 1e-4
  • LR Scheduler: Linear
  • Warmup Steps: 5
  • Epochs: 4
  • Weight Decay: 0.01
  • Optimizer: AdamW 8-bit
  • Precision: FP32
  • Audio Sampling Rate: 16 kHz
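
As a rough guide, the values above map onto a Transformers configuration like the sketch below. The actual pipeline fine-tunes through Unsloth (see train_asr.py); the output directory here is a placeholder, and fields not listed in the table keep their defaults.

from transformers import Seq2SeqTrainingArguments

# Sketch of the configuration implied by the hyperparameters above.
training_args = Seq2SeqTrainingArguments(
    output_dir="sukuma-whisper-large-v3",   # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,          # effective batch size 8
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=5,
    num_train_epochs=4,
    weight_decay=0.01,
    optim="adamw_bnb_8bit",                 # 8-bit AdamW (bitsandbytes)
    fp16=False,                             # FP32 training, as reported above
    report_to="wandb",                      # Weights & Biases logging
)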

Training Infrastructure

  • Framework: Unsloth + Hugging Face Transformers
  • Logging: Weights & Biases
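
Logging can be pointed at a Weights & Biases project with a minimal setup like this; the project name is a placeholder.

import os

# Placeholder project name; with report_to="wandb" (see the sketch above), the
# Trainer logs metrics automatically once the wandb API key is configured.
os.environ["WANDB_PROJECT"] = "sukuma-stt"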

Evaluation Results

Performance Metrics

  • Final WER: 25.19% (original speech) / 32.60% (synthetic speech)
  • Min WER: 22.01% (original speech) / 29.97% (synthetic speech)
  • WER Reduction: 82.94% (original speech) / 78.93% (synthetic speech)
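
WER values like these can be computed with a standard implementation such as jiwer; the reference and hypothesis strings below are placeholders rather than real Sukuma transcripts.

import jiwer

# Placeholder strings, not actual Sukuma transcripts.
reference = "placeholder reference transcript for one utterance"
hypothesis = "placeholder reference transcript for utterance"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")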

Training Dynamics

The model demonstrated stable convergence with the following characteristics:

  • Final 5 Evaluation Steps (77-81):

    • Original speech: Mean WER 25.27% (σ = 0.44)
    • Synthetic speech: Mean WER 33.03% (σ = 0.51)
  • Learning Trajectory: Strong correlation between the original and synthetic evaluation curves (Pearson's r = 0.997, p < 0.001)
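
The reported correlation corresponds to a computation like the one below; the two WER series are placeholders standing in for the logged evaluation curves.

from scipy.stats import pearsonr

# Placeholder per-step WER curves; substitute the values logged during training.
original_wer = [98.4, 61.2, 40.5, 29.8, 25.3]
synthetic_wer = [131.0, 79.6, 53.1, 37.9, 33.0]

r, p_value = pearsonr(original_wer, synthetic_wer)
print(f"Pearson's r = {r:.3f}, p = {p_value:.3g}")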

Comparison with Baseline

Whisper Large V3 significantly outperformed Wav2Vec2-large-XLSR-53 in this low-resource setting, converging faster and reaching a lower final WER.

Limitations

Data Limitations

  • Domain Bias: Training data is primarily from biblical texts, which may not generalize well to everyday conversational Sukuma
  • Limited Size: Only 7.47 hours of training data
  • Single Source: All data from one recording source, limiting speaker diversity

Orthographic Challenges

  • Sukuma has two written forms (with and without diacritics)
  • Current model trained on non-diacritic version only
  • Standard WER metrics may not fully capture diacritic-related errors
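
If outputs need to be compared against diacritic-bearing references, one simple option is to strip combining marks before scoring. The helper below is a sketch and is not part of the released evaluation code; the example input is illustrative.

import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), drop combining marks, then recompose (NFC).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("āēī"))  # illustrative input -> "aei"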

Performance Gaps

  • 28.34% average relative increase in WER on synthetic speech compared to original recordings
  • Model performance may vary on spontaneous speech, different accents, or noisy environments

Ethical Considerations

  • Bias: Models trained on religious texts may reflect biases present in that domain
  • Consent: All human evaluators provided informed consent
  • Intended Benefit: This work aims to develop inclusive speech technologies for underrepresented language communities

How to Use

Installation

pip install transformers torch unsloth

Inference

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("sartifyllc/sukuma-voices-asr")
processor = WhisperProcessor.from_pretrained("sartifyllc/sukuma-voices-asr")

# Load and preprocess audio
audio_array = ...  # Your audio as numpy array at 16kHz

input_features = processor(
    audio_array, 
    sampling_rate=16000, 
    return_tensors="pt"
).input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
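
One way to obtain audio_array at 16 kHz is with librosa (not included in the install line above); the file path is a placeholder.

import librosa

# Placeholder path; librosa resamples to 16 kHz on load.
audio_array, sampling_rate = librosa.load("sukuma_sample.wav", sr=16000)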

Training Script

See train_asr.py for the complete training pipeline.

Citation

If you use this model or the Sukuma Voices dataset, please cite:

@inproceedings{mgonzo2025sukuma,
  title={Learning from Scarcity: Building and Benchmarking Speech Technology for Sukuma},
  author={Mgonzo, Macton and Oketch, Kezia and Etori, Naome A. and Mang'eni, Winnie and Nyaki, Elizabeth and Mollel, Michael S.},
  booktitle={Proceedings of the Association for Computational Linguistics},
  year={2025},
  institution={Brown University, University of Notre Dame, University of Minnesota - Twin Cities, Pawa AI, Sartify Company Limited}
}

Resources

Acknowledgments

This work was supported by:

Special thanks to all volunteers who contributed to the evaluation process.

Contact

For questions or collaboration inquiries, please contact:


Model Card Version: 1.0
Last Updated: February 2025
