Supported Languages

WLDetect supports 148 languages. The model is trained on OpenLID-v2 and evaluated on FLORES+.

Performance Summary

  • Average Accuracy: 92.92%
  • Macro Precision: 0.9399
  • Macro Recall: 0.9294
  • Macro F1: 0.9274
  • Languages ≥ 95% accuracy: 103/148
  • Languages ≥ 90% accuracy: 113/148
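
As a rough illustration of how summary figures like these can be derived from per-language results, the sketch below averages accuracy and counts languages at or above each threshold. The `summarize` helper and `sample_rows` list are hypothetical names introduced here, and the sample holds only a handful of rows copied from the language table, so its output will not match the 148-language figures above.

```python
from statistics import mean

# A few (code, accuracy %) rows copied from the language table.
# The full table has 148 rows, so these results differ from the summary.
sample_rows = [
    ("asm_Beng", 100.00),
    ("eng_Latn", 99.70),
    ("tir_Ethi", 95.09),
    ("hrv_Latn", 94.88),
    ("kas_Arab", 90.17),
    ("pag_Latn", 89.87),
    ("arz_Arab", 27.88),
]


def summarize(rows, thresholds=(95.0, 90.0)):
    """Return the mean accuracy and the number of rows at or above each threshold."""
    accuracies = [accuracy for _, accuracy in rows]
    summary = {
        "average_accuracy": mean(accuracies),
        "total": len(accuracies),
    }
    for threshold in thresholds:
        # Keys look like "at_least_95" and "at_least_90".
        summary[f"at_least_{threshold:g}"] = sum(a >= threshold for a in accuracies)
    return summary


summary = summarize(sample_rows)
print(f"Average accuracy: {summary['average_accuracy']:.2f}%")
print(f"Languages >= 95%: {summary['at_least_95']}/{summary['total']}")
print(f"Languages >= 90%: {summary['at_least_90']}/{summary['total']}")
```

Running the same function over all 148 rows should give counts comparable to the ones listed above, assuming the summary was computed as an unweighted mean over languages.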

Language List

Languages are sorted by FLORES+ accuracy, highest to lowest. Each row lists the language code, accuracy, precision, and F1.

Language Code Accuracy Precision F1
asm_Beng 100.00% 1.0000 1.0000
ben_Beng 100.00% 0.9930 0.9965
cmn_Hant 100.00% 0.9379 0.9680
dzo_Tibt 100.00% 1.0000 1.0000
ell_Grek 100.00% 0.9970 0.9985
guj_Gujr 100.00% 1.0000 1.0000
heb_Hebr 100.00% 1.0000 1.0000
hun_Latn 100.00% 0.9477 0.9732
hye_Armn 100.00% 1.0000 1.0000
jpn_Jpan 100.00% 0.9990 0.9995
kan_Knda 100.00% 1.0000 1.0000
kat_Geor 100.00% 1.0000 1.0000
khm_Khmr 100.00% 1.0000 1.0000
kor_Hang 100.00% 1.0000 1.0000
lao_Laoo 100.00% 1.0000 1.0000
mal_Mlym 100.00% 1.0000 1.0000
mya_Mymr 100.00% 1.0000 1.0000
ory_Orya 100.00% 1.0000 1.0000
pan_Guru 100.00% 1.0000 1.0000
pes_Arab 100.00% 0.8692 0.9300
sat_Olck 100.00% 1.0000 1.0000
shn_Mymr 100.00% 1.0000 1.0000
sin_Sinh 100.00% 1.0000 1.0000
snd_Arab 100.00% 0.9970 0.9985
tam_Taml 100.00% 1.0000 1.0000
taq_Tfng 100.00% 1.0000 1.0000
tel_Telu 100.00% 1.0000 1.0000
tha_Thai 100.00% 1.0000 1.0000
uig_Arab 100.00% 0.9990 0.9995
ukr_Cyrl 100.00% 0.9842 0.9920
urd_Arab 100.00% 0.9130 0.9545
vie_Latn 100.00% 0.9891 0.9945
ckb_Arab 99.90% 1.0000 0.9995
hin_Deva 99.90% 0.5605 0.7181
kir_Cyrl 99.90% 0.9891 0.9940
lit_Latn 99.90% 0.9755 0.9871
lvs_Latn 99.90% 0.8078 0.8933
npi_Deva 99.90% 0.9970 0.9980
rus_Cyrl 99.90% 0.9930 0.9960
amh_Ethi 99.80% 0.9531 0.9750
arb_Arab 99.80% 0.4802 0.6484
mar_Deva 99.80% 0.9891 0.9935
ron_Latn 99.80% 0.9698 0.9837
tuk_Latn 99.80% 0.9822 0.9900
tur_Latn 99.80% 0.9679 0.9827
eng_Latn 99.70% 0.8955 0.9435
kik_Latn 99.70% 0.9832 0.9900
pbt_Arab 99.70% 1.0000 0.9985
pol_Latn 99.70% 0.9395 0.9674
als_Latn 99.60% 0.9641 0.9798
bjn_Arab 99.60% 0.9940 0.9950
deu_Latn 99.60% 0.9697 0.9827
khk_Cyrl 99.60% 0.9990 0.9975
mlt_Latn 99.60% 0.9890 0.9925
por_Latn 99.60% 0.9077 0.9498
azj_Latn 99.50% 0.7619 0.8630
bul_Cyrl 99.50% 0.9940 0.9945
fra_Latn 99.50% 0.9026 0.9466
tat_Cyrl 99.40% 0.8528 0.9180
tgk_Cyrl 99.40% 1.0000 0.9970
ekk_Latn 99.30% 0.9252 0.9579
mni_Beng 99.30% 1.0000 0.9965
fin_Latn 99.20% 0.9556 0.9734
kaz_Cyrl 99.20% 0.9940 0.9930
uzn_Latn 99.20% 0.8942 0.9406
ilo_Latn 99.00% 0.7992 0.8844
nld_Latn 99.00% 0.7711 0.8669
slk_Latn 99.00% 0.9164 0.9518
epo_Latn 98.90% 0.9880 0.9885
bel_Cyrl 98.80% 1.0000 0.9939
cym_Latn 98.80% 0.9970 0.9924
mkd_Cyrl 98.80% 0.9572 0.9724
tpi_Latn 98.80% 0.9919 0.9899
hau_Latn 98.70% 0.9619 0.9743
ita_Latn 98.70% 0.8586 0.9183
nus_Latn 98.70% 1.0000 0.9934
eus_Latn 98.50% 0.9590 0.9718
ewe_Latn 98.50% 0.9534 0.9689
ces_Latn 97.99% 0.9939 0.9869
gaz_Latn 97.89% 0.9683 0.9736
swe_Latn 97.89% 0.9597 0.9692
bak_Cyrl 97.79% 1.0000 0.9888
spa_Latn 97.69% 0.9137 0.9443
ceb_Latn 97.59% 0.8935 0.9329
cmn_Hans 97.49% 1.0000 0.9873
slv_Latn 97.29% 0.9327 0.9524
tsn_Latn 97.19% 0.9133 0.9417
afr_Latn 96.89% 0.9244 0.9461
som_Latn 96.79% 0.9718 0.9698
fij_Latn 96.69% 0.9377 0.9521
hat_Latn 96.59% 0.9008 0.9322
gle_Latn 96.39% 0.9049 0.9335
fil_Latn 96.29% 0.9152 0.9384
ind_Latn 96.29% 0.5000 0.6582
lin_Latn 95.89% 0.9775 0.9681
srp_Cyrl 95.89% 0.9927 0.9755
yue_Hant 95.79% 1.0000 0.9785
twi_Latn 95.74% 0.9770 0.9671
ibo_Latn 95.59% 0.9958 0.9754
nya_Latn 95.59% 0.7975 0.8695
sna_Latn 95.39% 0.9342 0.9439
tso_Latn 95.29% 0.8482 0.8975
tir_Ethi 95.09% 0.9979 0.9738
hrv_Latn 94.88% 0.9643 0.9565
swh_Latn 94.18% 0.9418 0.9418
war_Latn 93.58% 0.9648 0.9501
kab_Latn 93.48% 0.9759 0.9549
bem_Latn 92.78% 0.9095 0.9186
run_Latn 92.38% 0.8583 0.8899
kmr_Latn 91.57% 0.9796 0.9466
yor_Latn 91.27% 0.9681 0.9396
nob_Latn 91.22% 0.9182 0.9152
kas_Arab 90.17% 0.9967 0.9468
pag_Latn 89.87% 0.9614 0.9290
pap_Latn 89.77% 0.9179 0.9077
gug_Latn 89.67% 0.8756 0.8860
oci_Latn 88.52% 0.9231 0.9037
lua_Latn 88.47% 0.8991 0.8918
gla_Latn 88.16% 0.9681 0.9228
lus_Latn 87.96% 0.9300 0.9041
quy_Latn 87.26% 0.9285 0.8997
dan_Latn 87.16% 0.8076 0.8384
ktu_Latn 87.06% 0.9538 0.9103
fao_Latn 85.96% 0.8248 0.8418
mos_Latn 85.96% 0.9695 0.9112
fur_Latn 85.36% 0.9092 0.8805
san_Deva 84.85% 1.0000 0.9181
smo_Latn 84.05% 0.9405 0.8877
cat_Latn 83.45% 0.9014 0.8667
isl_Latn 81.44% 0.8817 0.8467
lug_Latn 81.34% 0.9632 0.8820
tum_Latn 80.54% 0.9710 0.8805
zul_Latn 80.34% 0.7629 0.7826
vec_Latn 78.44% 0.9861 0.8737
xho_Latn 78.44% 0.7045 0.7423
jav_Latn 77.03% 0.8321 0.8000
ayr_Latn 76.43% 0.9361 0.8415
plt_Latn 75.93% 0.9895 0.8593
sag_Latn 72.42% 0.9014 0.8031
mri_Latn 71.01% 0.9944 0.8286
ban_Latn 63.59% 0.7955 0.7068
lim_Latn 63.09% 0.9844 0.7689
sun_Latn 55.07% 0.9515 0.6976
knc_Latn 53.56% 1.0000 0.6976
zsm_Latn 51.25% 0.8559 0.6412
knc_Arab 45.84% 1.0000 0.6286
bho_Deva 35.71% 0.9972 0.5258
arz_Arab 27.88% 0.8968 0.4254
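
The table has no recall column, but the reported F1 values appear consistent with treating per-language accuracy as recall (the share of a language's evaluation sentences that were labeled correctly). This reading is an assumption inferred from the numbers, not something the table states. A minimal sketch that checks it on three rows, using a hypothetical `f1_score` helper:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (code, accuracy %, precision, reported F1) copied from the table above.
rows = [
    ("hin_Deva", 99.90, 0.5605, 0.7181),
    ("arb_Arab", 99.80, 0.4802, 0.6484),
    ("arz_Arab", 27.88, 0.8968, 0.4254),
]

computed = {}
for code, accuracy_pct, precision, reported_f1 in rows:
    recall = accuracy_pct / 100  # assumption: per-language accuracy acts as recall
    computed[code] = f1_score(precision, recall)
    print(f"{code}: computed F1 = {computed[code]:.4f}, reported F1 = {reported_f1:.4f}")
```

Because the inputs are already rounded, small differences in the last digit are possible. Under this reading, a row such as hin_Deva or arb_Arab (high accuracy, low precision) means the language's own sentences are almost always recognized, while sentences from other languages are often assigned to it. The table alone does not show which languages are being confused.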

Notes

  • Language Codes: ISO 639-3 language code + ISO 15924 script code
    • Format: {lang}_{Script} (e.g., eng_Latn for English in Latin script)
  • FLORES Evaluation: FLORES+ dev set (997 sentences per language)
  • Removed Languages: excluded because of high confusion with other languages or insufficient training data:
    • crh_Latn (Crimean Tatar)
    • ltz_Latn (Luxembourgish)
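
The {lang}_{Script} format described in the notes can be split and shape-checked with a short helper. This is a sketch only: `parse_language_code` and `LanguageCode` are hypothetical names, not part of WLDetect, and the pattern checks only the shape of a code (three lowercase letters, an underscore, a four-letter title-case script tag), not whether the code is registered in ISO 639-3 or ISO 15924.

```python
import re
from typing import NamedTuple

# Three lowercase letters, an underscore, then a four-letter title-case script tag.
CODE_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


class LanguageCode(NamedTuple):
    lang: str    # ISO 639-3 style language part, e.g. "eng"
    script: str  # ISO 15924 style script part, e.g. "Latn"


def parse_language_code(code: str) -> LanguageCode:
    """Split a code such as 'eng_Latn' into its language and script parts."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not in {{lang}}_{{Script}} format: {code!r}")
    return LanguageCode(*match.groups())


for code in ("eng_Latn", "cmn_Hant", "knc_Arab"):
    parsed = parse_language_code(code)
    print(f"{code}: lang={parsed.lang}, script={parsed.script}")
```

Keeping the script as a separate field is useful for entries such as knc_Latn and knc_Arab, or cmn_Hans and cmn_Hant, where the same language appears in the table under two scripts.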