---
library_name: transformers
language:
- ar
license: apache-2.0
base_model: openai/whisper-base
tags:
- generated_from_trainer
datasets:
- google/fleurs
- fixie-ai/common_voice_17_0
- UBC-NLP/Casablanca
- ymoslem/MediaSpeech
- deepdml/Tunisian_MSA
metrics:
- wer
model-index:
- name: Whisper Base ar
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice 17.0
type: fixie-ai/common_voice_17_0
metrics:
- name: Wer
type: wer
value: 40.550118433374344
---

# Whisper Base ar
This model is a fine-tuned version of [openai/whisper-base](https://huggingface.co/openai/whisper-base) for Arabic, trained on a mix of datasets including Common Voice 17.0. It achieves the following results on the evaluation set:
- Loss: 0.5179
- Wer: 40.5501
- Cer: 13.2382
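The card does not include a usage snippet; a minimal sketch with the `transformers` ASR pipeline would look like the following (the audio path is a placeholder, and passing `language` avoids Whisper's language auto-detection):

```python
# Hypothetical usage sketch (not part of the original card): transcribe an
# Arabic audio file with the fine-tuned checkpoint via the ASR pipeline.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="deepdml/whisper-base-ar-mix-norm",
)

# "audio.mp3" is a placeholder path to a local Arabic speech recording.
result = asr("audio.mp3", generate_kwargs={"language": "arabic"})
print(result["text"])
```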
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.04
- training_steps: 18000
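The list above maps onto a `Seq2SeqTrainingArguments` configuration roughly as follows. This is a sketch, not the exact training script: `output_dir` is an assumption, and with `warmup_ratio=0.04` over 18000 steps the warmup lasts 720 steps.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: output_dir is a placeholder; the remaining values mirror
# the hyperparameter list in this card.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-ar",  # assumed, not stated in the card
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_ratio=0.04,
    max_steps=18000,
)
```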
### Training results
| Training Loss | Epoch | Step | Validation Loss | Wer | Cer |
|---|---|---|---|---|---|
| 0.7397 | 0.0556 | 1000 | 0.6305 | 54.8668 | 18.9365 |
| 0.3962 | 0.1111 | 2000 | 0.5805 | 50.5793 | 16.9481 |
| 0.1913 | 0.1667 | 3000 | 0.5593 | 48.8019 | 16.2853 |
| 0.1031 | 0.2222 | 4000 | 0.5390 | 46.7766 | 15.6262 |
| 0.0743 | 0.2778 | 5000 | 0.5193 | 46.1321 | 15.5048 |
| 0.0463 | 0.3333 | 6000 | 0.5074 | 44.1857 | 14.5137 |
| 0.0296 | 1.0197 | 7000 | 0.5135 | 43.6074 | 14.0715 |
| 0.0288 | 1.0752 | 8000 | 0.5119 | 43.6514 | 14.6808 |
| 0.0232 | 1.1308 | 9000 | 0.4999 | 41.8538 | 13.6624 |
| 0.022 | 1.1863 | 10000 | 0.4930 | 41.8813 | 13.6632 |
| 0.0226 | 1.2419 | 11000 | 0.4779 | 41.8208 | 13.8859 |
| 0.0213 | 1.2974 | 12000 | 0.4795 | 41.0569 | 13.3648 |
| 0.0194 | 1.353 | 13000 | 0.4831 | 41.0881 | 13.3223 |
| 0.0148 | 2.0393 | 14000 | 0.5064 | 41.2644 | 13.4050 |
| 0.0131 | 2.0949 | 15000 | 0.5116 | 41.2570 | 13.5709 |
| 0.0116 | 2.1504 | 16000 | 0.5102 | 40.6860 | 13.2589 |
| 0.0088 | 2.206 | 17000 | 0.5196 | 40.4859 | 13.2482 |
| 0.0129 | 2.2616 | 18000 | 0.5179 | 40.5501 | 13.2382 |
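The Wer column above is the word error rate: the word-level edit (Levenshtein) distance between the reference transcript and the hypothesis, divided by the number of reference words, scaled to a percentage. A minimal self-contained sketch of the computation (production evaluations typically use a library such as `evaluate` or `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length * 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# One insertion against a 3-word reference -> 33.33...%
print(wer("the cat sat", "the cat sat on"))
```

CER is computed the same way at the character level instead of the word level.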
### Framework versions
- Transformers 4.48.0.dev0
- Pytorch 2.5.1+cu121
- Datasets 3.6.0
- Tokenizers 0.21.0
## Citation
Please cite the model using the following BibTeX entry:
```bibtex
@misc{deepdml/whisper-base-ar-mix-norm,
  title={Fine-tuned Whisper base ASR model for speech recognition in Arabic},
  author={Jimenez, David},
  howpublished={\url{https://huggingface.co/deepdml/whisper-base-ar-mix-norm}},
  year={2026}
}
```