Sukuma-STT: Automatic Speech Recognition for Sukuma

Model Description

Sukuma-STT is the first automatic speech recognition (ASR) model for Sukuma (Kisukuma), a Bantu language spoken by approximately 10 million people in northern Tanzania. This model was developed as part of the research presented in "Learning from Scarcity: Building and Benchmarking Speech Technology for Sukuma".

  • Model Architecture: Whisper Large V3
  • Fine-tuning Framework: Unsloth
  • Language: Sukuma (suk)
  • Task: Speech-to-Text Transcription
  • License: [Specify your license]

Intended Use

Primary Use Cases

  • Transcription of Sukuma speech to text
  • Research on low-resource African language ASR
  • Development of inclusive voice-enabled applications for Sukuma speakers
  • Language documentation and preservation efforts

Out-of-Scope Use

  • Real-time transcription in safety-critical applications without human verification
  • Languages other than Sukuma
  • Transcription of audio from noisy environments significantly different from the training data

Training Data

Dataset: Sukuma Voices

The model was trained on the Sukuma Voices dataset, the first open speech corpus for Sukuma.

  • Total Samples: 2,588
  • Total Duration: 7.47 hours
  • Average Duration: 10.39 ± 4.20 sec
  • Duration Range: 1.74 - 30.36 sec
  • Total Words: 51,107
  • Unique Vocabulary: 10,007
  • Avg. Words/Sample: 19.7
  • Speaking Rate: 115.5 WPM
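
Statistics of this kind can be computed from (duration, transcript) pairs along the following lines; the sample rows and field layout are illustrative, not the actual Sukuma Voices schema.

import statistics

# Illustrative (duration_seconds, transcript) rows, not actual Sukuma Voices data.
samples = [
    (9.8, "placeholder transcript one"),
    (12.3, "another placeholder transcript"),
]

durations = [d for d, _ in samples]
words_per_sample = [len(t.split()) for _, t in samples]

total_hours = sum(durations) / 3600
total_words = sum(words_per_sample)
vocabulary = {w for _, t in samples for w in t.split()}

print(f"Total duration: {total_hours:.2f} h")
print(f"Average duration: {statistics.mean(durations):.2f} ± {statistics.stdev(durations):.2f} sec")
print(f"Total words: {total_words}, unique vocabulary: {len(vocabulary)}")
print(f"Speaking rate: {total_words / (sum(durations) / 60):.1f} WPM")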

Data Source

Audio recordings and textual transcriptions from the Sukuma New Testament 2000 translation, sourced from Bible.com.

Data Split

  • Training: 80%
  • Test: 20%
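
A split of this kind can be produced with the Hugging Face datasets library, as in the minimal sketch below; the dataset identifier, column name, and random seed are assumptions for illustration, not the exact values used in the paper.

from datasets import Audio, load_dataset

# Hypothetical dataset identifier; substitute the actual Sukuma Voices repository.
ds = load_dataset("sartifyllc/sukuma-voices", split="train")
# The "audio" column name and the 16 kHz cast mirror common speech dataset layouts.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# 80/20 train/test split; the seed here is illustrative.
splits = ds.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))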

Training Procedure

Hyperparameters

  • Base Model: Whisper Large V3
  • Fine-tuning Method: Full fine-tuning
  • Batch Size: 2
  • Gradient Accumulation Steps: 4
  • Effective Batch Size: 8
  • Learning Rate: 1e-4
  • LR Scheduler: Linear
  • Warmup Steps: 5
  • Epochs: 4
  • Weight Decay: 0.01
  • Optimizer: AdamW 8-bit
  • Precision: FP32
  • Audio Sampling Rate: 16 kHz
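
As a rough guide, the values above map onto a Transformers configuration like the sketch below. The actual pipeline fine-tunes through Unsloth (see train_asr.py); the output directory here is a placeholder, and fields not listed in the table keep their defaults.

from transformers import Seq2SeqTrainingArguments

# Sketch of the configuration implied by the hyperparameters above.
training_args = Seq2SeqTrainingArguments(
    output_dir="sukuma-whisper-large-v3",   # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,          # effective batch size 8
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=5,
    num_train_epochs=4,
    weight_decay=0.01,
    optim="adamw_bnb_8bit",                 # 8-bit AdamW (bitsandbytes)
    fp16=False,                             # FP32 training, as reported above
    report_to="wandb",                      # Weights & Biases logging
)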

Training Infrastructure

  • Framework: Unsloth + Hugging Face Transformers
  • Logging: Weights & Biases
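
Logging can be pointed at a Weights & Biases project with a minimal setup like this; the project name is a placeholder.

import os

# Placeholder project name; with report_to="wandb" (see the sketch above), the
# Trainer logs metrics automatically once the wandb API key is configured.
os.environ["WANDB_PROJECT"] = "sukuma-stt"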

Evaluation Results

Performance Metrics

  • Final WER: 25.19% (original speech) / 32.60% (synthetic speech)
  • Min WER: 22.01% (original speech) / 29.97% (synthetic speech)
  • WER Reduction: 82.94% (original speech) / 78.93% (synthetic speech)
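
WER values like these can be computed with a standard implementation such as jiwer; the reference and hypothesis strings below are placeholders rather than real Sukuma transcripts.

import jiwer

# Placeholder strings, not actual Sukuma transcripts.
reference = "placeholder reference transcript for one utterance"
hypothesis = "placeholder reference transcript for utterance"

wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer:.2%}")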

Training Dynamics

The model demonstrated stable convergence with the following characteristics:

  • Final 5 Evaluation Steps (77-81):

    • Original speech: Mean WER 25.27% (σ = 0.44)
    • Synthetic speech: Mean WER 33.03% (σ = 0.51)
  • Learning Trajectory: Strong correlation between the original and synthetic evaluation curves (Pearson's r = 0.997, p < 0.001)
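
The reported correlation corresponds to a computation like the one below; the two WER series are placeholders standing in for the logged evaluation curves.

from scipy.stats import pearsonr

# Placeholder per-step WER curves; substitute the values logged during training.
original_wer = [98.4, 61.2, 40.5, 29.8, 25.3]
synthetic_wer = [131.0, 79.6, 53.1, 37.9, 33.0]

r, p_value = pearsonr(original_wer, synthetic_wer)
print(f"Pearson's r = {r:.3f}, p = {p_value:.3g}")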

Comparison with Baseline

Whisper Large V3 significantly outperformed Wav2Vec2-large-XLSR-53 in this low-resource setting, converging faster and reaching a lower final WER.

Limitations

Data Limitations

  • Domain Bias: Training data is primarily from biblical texts, which may not generalize well to everyday conversational Sukuma
  • Limited Size: Only 7.47 hours of training data
  • Single Source: All data from one recording source, limiting speaker diversity

Orthographic Challenges

  • Sukuma has two written forms (with and without diacritics)
  • Current model trained on non-diacritic version only
  • Standard WER metrics may not fully capture diacritic-related errors
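
If outputs need to be compared against diacritic-bearing references, one simple option is to strip combining marks before scoring. The helper below is a sketch and is not part of the released evaluation code; the example input is illustrative.

import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose characters (NFD), drop combining marks, then recompose (NFC).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("āēī"))  # illustrative input -> "aei"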

Performance Gaps

  • 28.34% average relative increase in WER on synthetic speech compared to original recordings
  • Model performance may vary on spontaneous speech, different accents, or noisy environments

Ethical Considerations

  • Bias: Models trained on religious texts may reflect biases present in that domain
  • Consent: All human evaluators provided informed consent
  • Intended Benefit: This work aims to develop inclusive speech technologies for underrepresented language communities

How to Use

Installation

pip install transformers torch unsloth

Inference

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

# Load model and processor
model = WhisperForConditionalGeneration.from_pretrained("sartifyllc/sukuma-voices-asr")
processor = WhisperProcessor.from_pretrained("sartifyllc/sukuma-voices-asr")

# Load and preprocess audio
audio_array = ...  # Your audio as numpy array at 16kHz

input_features = processor(
    audio_array, 
    sampling_rate=16000, 
    return_tensors="pt"
).input_features

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
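
One way to obtain audio_array at 16 kHz is with librosa (not included in the install line above); the file path is a placeholder.

import librosa

# Placeholder path; librosa resamples to 16 kHz on load.
audio_array, sampling_rate = librosa.load("sukuma_sample.wav", sr=16000)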

Training Script

See train_asr.py for the complete training pipeline.

Citation

If you use this model or the Sukuma Voices dataset, please cite:

@inproceedings{mgonzo2025sukuma,
  title={Learning from Scarcity: Building and Benchmarking Speech Technology for Sukuma},
  author={Mgonzo, Macton and Oketch, Kezia and Etori, Naome A. and Mang'eni, Winnie and Nyaki, Elizabeth and Mollel, Michael S.},
  booktitle={Proceedings of the Association for Computational Linguistics},
  year={2025},
  institution={Brown University, University of Notre Dame, University of Minnesota - Twin Cities, Pawa AI, Sartify Company Limited}
}

Resources

Acknowledgments

This work was supported by:

Special thanks to all volunteers who contributed to the evaluation process.

Contact

For questions or collaboration inquiries, please contact:


Model Card Version: 1.0
Last Updated: February 2025
