Commit 87c23d2 by dleemiller · verified · 1 Parent(s): 8cc94a0

Upload languages.md

Files changed (1)
  1. docs/languages.md +176 -0
docs/languages.md ADDED
@@ -0,0 +1,176 @@
+ # Supported Languages
+
+ WLDetect supports **148 languages**. It is trained on OpenLID-v2 and evaluated on FLORES+.
+
+ ## Performance Summary
+
+ - **Average Accuracy**: 92.92%
+ - **Macro Precision**: 0.9399
+ - **Macro Recall**: 0.9294
+ - **Macro F1**: 0.9274
+ - **Languages ≥ 95% accuracy**: 103/148
+ - **Languages ≥ 90% accuracy**: 113/148
+
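Summary figures of this kind can be derived from per-language rows. The following is a minimal Python sketch, assuming the macro metrics are unweighted means over languages and the threshold counts compare each language's accuracy against 95% and 90%. The `rows` sample and the `summarize` helper are hypothetical names used for illustration. Only four rows from the Language List table are included, so the output does not reproduce the full-table figures.

```python
from statistics import mean

# Hypothetical sample: (language code, accuracy %, precision, F1),
# copied from four rows of the Language List table in this file.
rows = [
    ("asm_Beng", 100.00, 1.0000, 1.0000),
    ("eng_Latn", 99.70, 0.8955, 0.9435),
    ("dan_Latn", 87.16, 0.8076, 0.8384),
    ("arz_Arab", 27.88, 0.8968, 0.4254),
]


def summarize(rows):
    """Unweighted (macro) averages and threshold counts over languages."""
    accuracies = [acc for _, acc, _, _ in rows]
    return {
        "average_accuracy": mean(accuracies),
        "macro_precision": mean(p for _, _, p, _ in rows),
        "macro_f1": mean(f1 for _, _, _, f1 in rows),
        "at_least_95": sum(acc >= 95.0 for acc in accuracies),
        "at_least_90": sum(acc >= 90.0 for acc in accuracies),
        "total": len(rows),
    }


summary = summarize(rows)
print(f"{summary['at_least_95']}/{summary['total']} languages at or above 95% accuracy")
```

Because every language carries equal weight in a macro average, a handful of weak languages lowers the summary noticeably even when most languages score near 100%.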
+ ## Language List
+
+ Languages sorted by FLORES accuracy (highest to lowest).
+
+ | Language Code | Accuracy | Precision | F1 |
+ |--------------|----------|-----------|-----|
+ | asm_Beng | 100.00% | 1.0000 | 1.0000 |
+ | ben_Beng | 100.00% | 0.9930 | 0.9965 |
+ | cmn_Hant | 100.00% | 0.9379 | 0.9680 |
+ | dzo_Tibt | 100.00% | 1.0000 | 1.0000 |
+ | ell_Grek | 100.00% | 0.9970 | 0.9985 |
+ | guj_Gujr | 100.00% | 1.0000 | 1.0000 |
+ | heb_Hebr | 100.00% | 1.0000 | 1.0000 |
+ | hun_Latn | 100.00% | 0.9477 | 0.9732 |
+ | hye_Armn | 100.00% | 1.0000 | 1.0000 |
+ | jpn_Jpan | 100.00% | 0.9990 | 0.9995 |
+ | kan_Knda | 100.00% | 1.0000 | 1.0000 |
+ | kat_Geor | 100.00% | 1.0000 | 1.0000 |
+ | khm_Khmr | 100.00% | 1.0000 | 1.0000 |
+ | kor_Hang | 100.00% | 1.0000 | 1.0000 |
+ | lao_Laoo | 100.00% | 1.0000 | 1.0000 |
+ | mal_Mlym | 100.00% | 1.0000 | 1.0000 |
+ | mya_Mymr | 100.00% | 1.0000 | 1.0000 |
+ | ory_Orya | 100.00% | 1.0000 | 1.0000 |
+ | pan_Guru | 100.00% | 1.0000 | 1.0000 |
+ | pes_Arab | 100.00% | 0.8692 | 0.9300 |
+ | sat_Olck | 100.00% | 1.0000 | 1.0000 |
+ | shn_Mymr | 100.00% | 1.0000 | 1.0000 |
+ | sin_Sinh | 100.00% | 1.0000 | 1.0000 |
+ | snd_Arab | 100.00% | 0.9970 | 0.9985 |
+ | tam_Taml | 100.00% | 1.0000 | 1.0000 |
+ | taq_Tfng | 100.00% | 1.0000 | 1.0000 |
+ | tel_Telu | 100.00% | 1.0000 | 1.0000 |
+ | tha_Thai | 100.00% | 1.0000 | 1.0000 |
+ | uig_Arab | 100.00% | 0.9990 | 0.9995 |
+ | ukr_Cyrl | 100.00% | 0.9842 | 0.9920 |
+ | urd_Arab | 100.00% | 0.9130 | 0.9545 |
+ | vie_Latn | 100.00% | 0.9891 | 0.9945 |
+ | ckb_Arab | 99.90% | 1.0000 | 0.9995 |
+ | hin_Deva | 99.90% | 0.5605 | 0.7181 |
+ | kir_Cyrl | 99.90% | 0.9891 | 0.9940 |
+ | lit_Latn | 99.90% | 0.9755 | 0.9871 |
+ | lvs_Latn | 99.90% | 0.8078 | 0.8933 |
+ | npi_Deva | 99.90% | 0.9970 | 0.9980 |
+ | rus_Cyrl | 99.90% | 0.9930 | 0.9960 |
+ | amh_Ethi | 99.80% | 0.9531 | 0.9750 |
+ | arb_Arab | 99.80% | 0.4802 | 0.6484 |
+ | mar_Deva | 99.80% | 0.9891 | 0.9935 |
+ | ron_Latn | 99.80% | 0.9698 | 0.9837 |
+ | tuk_Latn | 99.80% | 0.9822 | 0.9900 |
+ | tur_Latn | 99.80% | 0.9679 | 0.9827 |
+ | eng_Latn | 99.70% | 0.8955 | 0.9435 |
+ | kik_Latn | 99.70% | 0.9832 | 0.9900 |
+ | pbt_Arab | 99.70% | 1.0000 | 0.9985 |
+ | pol_Latn | 99.70% | 0.9395 | 0.9674 |
+ | als_Latn | 99.60% | 0.9641 | 0.9798 |
+ | bjn_Arab | 99.60% | 0.9940 | 0.9950 |
+ | deu_Latn | 99.60% | 0.9697 | 0.9827 |
+ | khk_Cyrl | 99.60% | 0.9990 | 0.9975 |
+ | mlt_Latn | 99.60% | 0.9890 | 0.9925 |
+ | por_Latn | 99.60% | 0.9077 | 0.9498 |
+ | azj_Latn | 99.50% | 0.7619 | 0.8630 |
+ | bul_Cyrl | 99.50% | 0.9940 | 0.9945 |
+ | fra_Latn | 99.50% | 0.9026 | 0.9466 |
+ | tat_Cyrl | 99.40% | 0.8528 | 0.9180 |
+ | tgk_Cyrl | 99.40% | 1.0000 | 0.9970 |
+ | ekk_Latn | 99.30% | 0.9252 | 0.9579 |
+ | mni_Beng | 99.30% | 1.0000 | 0.9965 |
+ | fin_Latn | 99.20% | 0.9556 | 0.9734 |
+ | kaz_Cyrl | 99.20% | 0.9940 | 0.9930 |
+ | uzn_Latn | 99.20% | 0.8942 | 0.9406 |
+ | ilo_Latn | 99.00% | 0.7992 | 0.8844 |
+ | nld_Latn | 99.00% | 0.7711 | 0.8669 |
+ | slk_Latn | 99.00% | 0.9164 | 0.9518 |
+ | epo_Latn | 98.90% | 0.9880 | 0.9885 |
+ | bel_Cyrl | 98.80% | 1.0000 | 0.9939 |
+ | cym_Latn | 98.80% | 0.9970 | 0.9924 |
+ | mkd_Cyrl | 98.80% | 0.9572 | 0.9724 |
+ | tpi_Latn | 98.80% | 0.9919 | 0.9899 |
+ | hau_Latn | 98.70% | 0.9619 | 0.9743 |
+ | ita_Latn | 98.70% | 0.8586 | 0.9183 |
+ | nus_Latn | 98.70% | 1.0000 | 0.9934 |
+ | eus_Latn | 98.50% | 0.9590 | 0.9718 |
+ | ewe_Latn | 98.50% | 0.9534 | 0.9689 |
+ | ces_Latn | 97.99% | 0.9939 | 0.9869 |
+ | gaz_Latn | 97.89% | 0.9683 | 0.9736 |
+ | swe_Latn | 97.89% | 0.9597 | 0.9692 |
+ | bak_Cyrl | 97.79% | 1.0000 | 0.9888 |
+ | spa_Latn | 97.69% | 0.9137 | 0.9443 |
+ | ceb_Latn | 97.59% | 0.8935 | 0.9329 |
+ | cmn_Hans | 97.49% | 1.0000 | 0.9873 |
+ | slv_Latn | 97.29% | 0.9327 | 0.9524 |
+ | tsn_Latn | 97.19% | 0.9133 | 0.9417 |
+ | afr_Latn | 96.89% | 0.9244 | 0.9461 |
+ | som_Latn | 96.79% | 0.9718 | 0.9698 |
+ | fij_Latn | 96.69% | 0.9377 | 0.9521 |
+ | hat_Latn | 96.59% | 0.9008 | 0.9322 |
+ | gle_Latn | 96.39% | 0.9049 | 0.9335 |
+ | fil_Latn | 96.29% | 0.9152 | 0.9384 |
+ | ind_Latn | 96.29% | 0.5000 | 0.6582 |
+ | lin_Latn | 95.89% | 0.9775 | 0.9681 |
+ | srp_Cyrl | 95.89% | 0.9927 | 0.9755 |
+ | yue_Hant | 95.79% | 1.0000 | 0.9785 |
+ | twi_Latn | 95.74% | 0.9770 | 0.9671 |
+ | ibo_Latn | 95.59% | 0.9958 | 0.9754 |
+ | nya_Latn | 95.59% | 0.7975 | 0.8695 |
+ | sna_Latn | 95.39% | 0.9342 | 0.9439 |
+ | tso_Latn | 95.29% | 0.8482 | 0.8975 |
+ | tir_Ethi | 95.09% | 0.9979 | 0.9738 |
+ | hrv_Latn | 94.88% | 0.9643 | 0.9565 |
+ | swh_Latn | 94.18% | 0.9418 | 0.9418 |
+ | war_Latn | 93.58% | 0.9648 | 0.9501 |
+ | kab_Latn | 93.48% | 0.9759 | 0.9549 |
+ | bem_Latn | 92.78% | 0.9095 | 0.9186 |
+ | run_Latn | 92.38% | 0.8583 | 0.8899 |
+ | kmr_Latn | 91.57% | 0.9796 | 0.9466 |
+ | yor_Latn | 91.27% | 0.9681 | 0.9396 |
+ | nob_Latn | 91.22% | 0.9182 | 0.9152 |
+ | kas_Arab | 90.17% | 0.9967 | 0.9468 |
+ | pag_Latn | 89.87% | 0.9614 | 0.9290 |
+ | pap_Latn | 89.77% | 0.9179 | 0.9077 |
+ | gug_Latn | 89.67% | 0.8756 | 0.8860 |
+ | oci_Latn | 88.52% | 0.9231 | 0.9037 |
+ | lua_Latn | 88.47% | 0.8991 | 0.8918 |
+ | gla_Latn | 88.16% | 0.9681 | 0.9228 |
+ | lus_Latn | 87.96% | 0.9300 | 0.9041 |
+ | quy_Latn | 87.26% | 0.9285 | 0.8997 |
+ | dan_Latn | 87.16% | 0.8076 | 0.8384 |
+ | ktu_Latn | 87.06% | 0.9538 | 0.9103 |
+ | fao_Latn | 85.96% | 0.8248 | 0.8418 |
+ | mos_Latn | 85.96% | 0.9695 | 0.9112 |
+ | fur_Latn | 85.36% | 0.9092 | 0.8805 |
+ | san_Deva | 84.85% | 1.0000 | 0.9181 |
+ | smo_Latn | 84.05% | 0.9405 | 0.8877 |
+ | cat_Latn | 83.45% | 0.9014 | 0.8667 |
+ | isl_Latn | 81.44% | 0.8817 | 0.8467 |
+ | lug_Latn | 81.34% | 0.9632 | 0.8820 |
+ | tum_Latn | 80.54% | 0.9710 | 0.8805 |
+ | zul_Latn | 80.34% | 0.7629 | 0.7826 |
+ | vec_Latn | 78.44% | 0.9861 | 0.8737 |
+ | xho_Latn | 78.44% | 0.7045 | 0.7423 |
+ | jav_Latn | 77.03% | 0.8321 | 0.8000 |
+ | ayr_Latn | 76.43% | 0.9361 | 0.8415 |
+ | plt_Latn | 75.93% | 0.9895 | 0.8593 |
+ | sag_Latn | 72.42% | 0.9014 | 0.8031 |
+ | mri_Latn | 71.01% | 0.9944 | 0.8286 |
+ | ban_Latn | 63.59% | 0.7955 | 0.7068 |
+ | lim_Latn | 63.09% | 0.9844 | 0.7689 |
+ | sun_Latn | 55.07% | 0.9515 | 0.6976 |
+ | knc_Latn | 53.56% | 1.0000 | 0.6976 |
+ | zsm_Latn | 51.25% | 0.8559 | 0.6412 |
+ | knc_Arab | 45.84% | 1.0000 | 0.6286 |
+ | bho_Deva | 35.71% | 0.9972 | 0.5258 |
+ | arz_Arab | 27.88% | 0.8968 | 0.4254 |
+
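The F1 column can be cross-checked against the other two columns. The sketch below assumes that the per-language Accuracy column is the fraction of that language's sentences labelled correctly, that is, per-language recall (the file does not state this explicitly), and uses the standard definition F1 = 2PR / (P + R). `f1_score` is a hypothetical helper, not part of WLDetect.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (language code, accuracy %, precision, reported F1) from the table above.
# Assumption: accuracy / 100 is treated as per-language recall.
rows = [
    ("hin_Deva", 99.90, 0.5605, 0.7181),
    ("arb_Arab", 99.80, 0.4802, 0.6484),
    ("ind_Latn", 96.29, 0.5000, 0.6582),
]

for code, accuracy, precision, reported_f1 in rows:
    computed = f1_score(precision, accuracy / 100)
    print(f"{code}: computed F1 = {computed:.4f}, reported F1 = {reported_f1:.4f}")
```

For these three rows the computed value agrees with the reported F1 to four decimal places. They also show how low precision pulls F1 down even when accuracy is close to 100%: sentences from related languages are frequently assigned to these labels.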
+ ## Notes
+
+ - **Language Codes**: ISO 639-3 language code + ISO 15924 script code
+   - Format: `{lang}_{Script}` (e.g., `eng_Latn` for English in Latin script)
+ - **FLORES Evaluation**: FLORES+ dev set (997 sentences per language)
+ - **Removed Languages**: Languages with high confusion or insufficient training data:
+   - `crh_Latn` (Crimean Tatar)
+   - `ltz_Latn` (Luxembourgish)
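The `{lang}_{Script}` format described in the notes can be split mechanically. This is a minimal Python sketch, assuming every label is three lowercase letters (ISO 639-3), an underscore, and a four-letter title-case script code (ISO 15924). `CODE_PATTERN` and `split_language_code` are hypothetical names for illustration and are not part of WLDetect.

```python
import re

# Three lowercase letters, an underscore, then a four-letter
# title-case script code, e.g. "eng_Latn".
CODE_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


def split_language_code(code: str) -> tuple[str, str]:
    """Split a `{lang}_{Script}` label into (language, script)."""
    match = CODE_PATTERN.fullmatch(code)
    if match is None:
        raise ValueError(f"not a {{lang}}_{{Script}} code: {code!r}")
    return match.group(1), match.group(2)


language, script = split_language_code("cmn_Hant")
print(language, script)  # cmn Hant
```

Keeping the script separate matters for labels such as `knc_Latn` and `knc_Arab`, or `cmn_Hans` and `cmn_Hant`, where the same language appears in the table under two scripts.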