Upload languages.md
docs/languages.md
# Supported Languages

WLDetect supports **148 languages**. The model is trained on OpenLID-v2 and evaluated on FLORES+.

## Performance Summary

- **Average Accuracy**: 92.92%
- **Macro Precision**: 0.9399
- **Macro Recall**: 0.9294
- **Macro F1**: 0.9274
- **Languages ≥ 95% accuracy**: 103/148
- **Languages ≥ 90% accuracy**: 113/148

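
As a rough illustration of how summary figures of this kind can be derived from per-language rows, here is a minimal Python sketch. The `summarize` helper is hypothetical and is not part of WLDetect. The five rows are copied from the table in this document and are only a sample, so the output does not reproduce the headline numbers above.

```python
from statistics import mean

# A small sample of rows copied from the language table:
# (language code, accuracy in percent, precision, F1).
ROWS = [
    ("asm_Beng", 100.00, 1.0000, 1.0000),
    ("hin_Deva", 99.90, 0.5605, 0.7181),
    ("tir_Ethi", 95.09, 0.9979, 0.9738),
    ("kas_Arab", 90.17, 0.9967, 0.9468),
    ("arz_Arab", 27.88, 0.8968, 0.4254),
]


def summarize(rows):
    """Return unweighted (macro) averages and accuracy-threshold counts."""
    accuracies = [acc for _, acc, _, _ in rows]
    return {
        "average_accuracy": mean(accuracies),
        "macro_precision": mean(p for _, _, p, _ in rows),
        "macro_f1": mean(f for _, _, _, f in rows),
        "at_least_95": sum(acc >= 95.0 for acc in accuracies),
        "at_least_90": sum(acc >= 90.0 for acc in accuracies),
        "total": len(rows),
    }


summary = summarize(ROWS)
print(f"Average accuracy: {summary['average_accuracy']:.2f}%")
print(f"Languages >= 95%: {summary['at_least_95']}/{summary['total']}")
print(f"Languages >= 90%: {summary['at_least_90']}/{summary['total']}")
```

Macro averaging weights every language equally, so a few weak languages pull the average down regardless of how much text exists for them.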
## Language List

Languages are sorted by FLORES+ accuracy, highest to lowest.

| Language Code | Accuracy | Precision | F1 |
|--------------|----------|-----------|-----|
| asm_Beng | 100.00% | 1.0000 | 1.0000 |
| ben_Beng | 100.00% | 0.9930 | 0.9965 |
| cmn_Hant | 100.00% | 0.9379 | 0.9680 |
| dzo_Tibt | 100.00% | 1.0000 | 1.0000 |
| ell_Grek | 100.00% | 0.9970 | 0.9985 |
| guj_Gujr | 100.00% | 1.0000 | 1.0000 |
| heb_Hebr | 100.00% | 1.0000 | 1.0000 |
| hun_Latn | 100.00% | 0.9477 | 0.9732 |
| hye_Armn | 100.00% | 1.0000 | 1.0000 |
| jpn_Jpan | 100.00% | 0.9990 | 0.9995 |
| kan_Knda | 100.00% | 1.0000 | 1.0000 |
| kat_Geor | 100.00% | 1.0000 | 1.0000 |
| khm_Khmr | 100.00% | 1.0000 | 1.0000 |
| kor_Hang | 100.00% | 1.0000 | 1.0000 |
| lao_Laoo | 100.00% | 1.0000 | 1.0000 |
| mal_Mlym | 100.00% | 1.0000 | 1.0000 |
| mya_Mymr | 100.00% | 1.0000 | 1.0000 |
| ory_Orya | 100.00% | 1.0000 | 1.0000 |
| pan_Guru | 100.00% | 1.0000 | 1.0000 |
| pes_Arab | 100.00% | 0.8692 | 0.9300 |
| sat_Olck | 100.00% | 1.0000 | 1.0000 |
| shn_Mymr | 100.00% | 1.0000 | 1.0000 |
| sin_Sinh | 100.00% | 1.0000 | 1.0000 |
| snd_Arab | 100.00% | 0.9970 | 0.9985 |
| tam_Taml | 100.00% | 1.0000 | 1.0000 |
| taq_Tfng | 100.00% | 1.0000 | 1.0000 |
| tel_Telu | 100.00% | 1.0000 | 1.0000 |
| tha_Thai | 100.00% | 1.0000 | 1.0000 |
| uig_Arab | 100.00% | 0.9990 | 0.9995 |
| ukr_Cyrl | 100.00% | 0.9842 | 0.9920 |
| urd_Arab | 100.00% | 0.9130 | 0.9545 |
| vie_Latn | 100.00% | 0.9891 | 0.9945 |
| ckb_Arab | 99.90% | 1.0000 | 0.9995 |
| hin_Deva | 99.90% | 0.5605 | 0.7181 |
| kir_Cyrl | 99.90% | 0.9891 | 0.9940 |
| lit_Latn | 99.90% | 0.9755 | 0.9871 |
| lvs_Latn | 99.90% | 0.8078 | 0.8933 |
| npi_Deva | 99.90% | 0.9970 | 0.9980 |
| rus_Cyrl | 99.90% | 0.9930 | 0.9960 |
| amh_Ethi | 99.80% | 0.9531 | 0.9750 |
| arb_Arab | 99.80% | 0.4802 | 0.6484 |
| mar_Deva | 99.80% | 0.9891 | 0.9935 |
| ron_Latn | 99.80% | 0.9698 | 0.9837 |
| tuk_Latn | 99.80% | 0.9822 | 0.9900 |
| tur_Latn | 99.80% | 0.9679 | 0.9827 |
| eng_Latn | 99.70% | 0.8955 | 0.9435 |
| kik_Latn | 99.70% | 0.9832 | 0.9900 |
| pbt_Arab | 99.70% | 1.0000 | 0.9985 |
| pol_Latn | 99.70% | 0.9395 | 0.9674 |
| als_Latn | 99.60% | 0.9641 | 0.9798 |
| bjn_Arab | 99.60% | 0.9940 | 0.9950 |
| deu_Latn | 99.60% | 0.9697 | 0.9827 |
| khk_Cyrl | 99.60% | 0.9990 | 0.9975 |
| mlt_Latn | 99.60% | 0.9890 | 0.9925 |
| por_Latn | 99.60% | 0.9077 | 0.9498 |
| azj_Latn | 99.50% | 0.7619 | 0.8630 |
| bul_Cyrl | 99.50% | 0.9940 | 0.9945 |
| fra_Latn | 99.50% | 0.9026 | 0.9466 |
| tat_Cyrl | 99.40% | 0.8528 | 0.9180 |
| tgk_Cyrl | 99.40% | 1.0000 | 0.9970 |
| ekk_Latn | 99.30% | 0.9252 | 0.9579 |
| mni_Beng | 99.30% | 1.0000 | 0.9965 |
| fin_Latn | 99.20% | 0.9556 | 0.9734 |
| kaz_Cyrl | 99.20% | 0.9940 | 0.9930 |
| uzn_Latn | 99.20% | 0.8942 | 0.9406 |
| ilo_Latn | 99.00% | 0.7992 | 0.8844 |
| nld_Latn | 99.00% | 0.7711 | 0.8669 |
| slk_Latn | 99.00% | 0.9164 | 0.9518 |
| epo_Latn | 98.90% | 0.9880 | 0.9885 |
| bel_Cyrl | 98.80% | 1.0000 | 0.9939 |
| cym_Latn | 98.80% | 0.9970 | 0.9924 |
| mkd_Cyrl | 98.80% | 0.9572 | 0.9724 |
| tpi_Latn | 98.80% | 0.9919 | 0.9899 |
| hau_Latn | 98.70% | 0.9619 | 0.9743 |
| ita_Latn | 98.70% | 0.8586 | 0.9183 |
| nus_Latn | 98.70% | 1.0000 | 0.9934 |
| eus_Latn | 98.50% | 0.9590 | 0.9718 |
| ewe_Latn | 98.50% | 0.9534 | 0.9689 |
| ces_Latn | 97.99% | 0.9939 | 0.9869 |
| gaz_Latn | 97.89% | 0.9683 | 0.9736 |
| swe_Latn | 97.89% | 0.9597 | 0.9692 |
| bak_Cyrl | 97.79% | 1.0000 | 0.9888 |
| spa_Latn | 97.69% | 0.9137 | 0.9443 |
| ceb_Latn | 97.59% | 0.8935 | 0.9329 |
| cmn_Hans | 97.49% | 1.0000 | 0.9873 |
| slv_Latn | 97.29% | 0.9327 | 0.9524 |
| tsn_Latn | 97.19% | 0.9133 | 0.9417 |
| afr_Latn | 96.89% | 0.9244 | 0.9461 |
| som_Latn | 96.79% | 0.9718 | 0.9698 |
| fij_Latn | 96.69% | 0.9377 | 0.9521 |
| hat_Latn | 96.59% | 0.9008 | 0.9322 |
| gle_Latn | 96.39% | 0.9049 | 0.9335 |
| fil_Latn | 96.29% | 0.9152 | 0.9384 |
| ind_Latn | 96.29% | 0.5000 | 0.6582 |
| lin_Latn | 95.89% | 0.9775 | 0.9681 |
| srp_Cyrl | 95.89% | 0.9927 | 0.9755 |
| yue_Hant | 95.79% | 1.0000 | 0.9785 |
| twi_Latn | 95.74% | 0.9770 | 0.9671 |
| ibo_Latn | 95.59% | 0.9958 | 0.9754 |
| nya_Latn | 95.59% | 0.7975 | 0.8695 |
| sna_Latn | 95.39% | 0.9342 | 0.9439 |
| tso_Latn | 95.29% | 0.8482 | 0.8975 |
| tir_Ethi | 95.09% | 0.9979 | 0.9738 |
| hrv_Latn | 94.88% | 0.9643 | 0.9565 |
| swh_Latn | 94.18% | 0.9418 | 0.9418 |
| war_Latn | 93.58% | 0.9648 | 0.9501 |
| kab_Latn | 93.48% | 0.9759 | 0.9549 |
| bem_Latn | 92.78% | 0.9095 | 0.9186 |
| run_Latn | 92.38% | 0.8583 | 0.8899 |
| kmr_Latn | 91.57% | 0.9796 | 0.9466 |
| yor_Latn | 91.27% | 0.9681 | 0.9396 |
| nob_Latn | 91.22% | 0.9182 | 0.9152 |
| kas_Arab | 90.17% | 0.9967 | 0.9468 |
| pag_Latn | 89.87% | 0.9614 | 0.9290 |
| pap_Latn | 89.77% | 0.9179 | 0.9077 |
| gug_Latn | 89.67% | 0.8756 | 0.8860 |
| oci_Latn | 88.52% | 0.9231 | 0.9037 |
| lua_Latn | 88.47% | 0.8991 | 0.8918 |
| gla_Latn | 88.16% | 0.9681 | 0.9228 |
| lus_Latn | 87.96% | 0.9300 | 0.9041 |
| quy_Latn | 87.26% | 0.9285 | 0.8997 |
| dan_Latn | 87.16% | 0.8076 | 0.8384 |
| ktu_Latn | 87.06% | 0.9538 | 0.9103 |
| fao_Latn | 85.96% | 0.8248 | 0.8418 |
| mos_Latn | 85.96% | 0.9695 | 0.9112 |
| fur_Latn | 85.36% | 0.9092 | 0.8805 |
| san_Deva | 84.85% | 1.0000 | 0.9181 |
| smo_Latn | 84.05% | 0.9405 | 0.8877 |
| cat_Latn | 83.45% | 0.9014 | 0.8667 |
| isl_Latn | 81.44% | 0.8817 | 0.8467 |
| lug_Latn | 81.34% | 0.9632 | 0.8820 |
| tum_Latn | 80.54% | 0.9710 | 0.8805 |
| zul_Latn | 80.34% | 0.7629 | 0.7826 |
| vec_Latn | 78.44% | 0.9861 | 0.8737 |
| xho_Latn | 78.44% | 0.7045 | 0.7423 |
| jav_Latn | 77.03% | 0.8321 | 0.8000 |
| ayr_Latn | 76.43% | 0.9361 | 0.8415 |
| plt_Latn | 75.93% | 0.9895 | 0.8593 |
| sag_Latn | 72.42% | 0.9014 | 0.8031 |
| mri_Latn | 71.01% | 0.9944 | 0.8286 |
| ban_Latn | 63.59% | 0.7955 | 0.7068 |
| lim_Latn | 63.09% | 0.9844 | 0.7689 |
| sun_Latn | 55.07% | 0.9515 | 0.6976 |
| knc_Latn | 53.56% | 1.0000 | 0.6976 |
| zsm_Latn | 51.25% | 0.8559 | 0.6412 |
| knc_Arab | 45.84% | 1.0000 | 0.6286 |
| bho_Deva | 35.71% | 0.9972 | 0.5258 |
| arz_Arab | 27.88% | 0.8968 | 0.4254 |

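
When reading rows where accuracy and F1 diverge sharply (for example `hin_Deva` or `arb_Arab`), it may help to recompute F1. The sketch below assumes that the per-language accuracy plays the role of recall and that F1 is the harmonic mean of precision and recall. This is an observation from the numbers in the table, not something the source states, and `f1_score` is a hypothetical helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from the table: code -> (accuracy %, precision, reported F1).
ROWS = {
    "hin_Deva": (99.90, 0.5605, 0.7181),
    "arb_Arab": (99.80, 0.4802, 0.6484),
    "ind_Latn": (96.29, 0.5000, 0.6582),
    "knc_Arab": (45.84, 1.0000, 0.6286),
    "arz_Arab": (27.88, 0.8968, 0.4254),
}

# Assumption: accuracy on a language's own sentences is treated as recall.
recomputed = {
    code: f1_score(precision, accuracy / 100)
    for code, (accuracy, precision, _) in ROWS.items()
}

for code, (_, _, reported) in ROWS.items():
    print(f"{code}: reported F1 {reported:.4f}, recomputed {recomputed[code]:.4f}")
```

Under this reading, a row with high accuracy but low precision means the model finds nearly all sentences of that language but also assigns the label to text from other languages.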
## Notes

- **Language Codes**: ISO 639-3 language code + ISO 15924 script code
  - Format: `{lang}_{Script}` (e.g., `eng_Latn` for English in Latin script)
- **FLORES Evaluation**: FLORES+ dev set (997 sentences per language)
- **Removed Languages**: Languages with high confusion or insufficient training data:
  - `crh_Latn` (Crimean Tatar)
  - `ltz_Latn` (Luxembourgish)

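
The `{lang}_{Script}` format described above can be split programmatically. The following is a minimal sketch, assuming a three-letter lowercase language part and a four-letter title-case script part. `parse_lang_code` and `LangCode` are hypothetical names, not part of WLDetect, and the check covers only the shape of the code, not whether it is registered in ISO 639-3 or ISO 15924.

```python
import re
from typing import NamedTuple


class LangCode(NamedTuple):
    lang: str    # ISO 639-3 style language part, e.g. "eng"
    script: str  # ISO 15924 style script part, e.g. "Latn"


# Three lowercase letters, an underscore, then one uppercase
# letter followed by three lowercase letters.
_CODE_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


def parse_lang_code(code: str) -> LangCode:
    """Split a code such as 'eng_Latn' into its language and script parts."""
    match = _CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a {{lang}}_{{Script}} code: {code!r}")
    return LangCode(lang=match.group(1), script=match.group(2))


english = parse_lang_code("eng_Latn")
print(english.lang, english.script)
```

Keeping the script separate is useful for entries such as `knc_Latn` and `knc_Arab`, which share a language part but differ in script.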