---
language: th
tags:
- thai
- pos-tagging
- token-classification
- phayathaibert
datasets:
- universal_dependencies
license: mit
base_model:
- clicknext/phayathaibert
metrics:
- accuracy
- f1
pipeline_tag: token-classification
model-index:
- name: phayathaibert-thai-pos-tagger
  results:
  - task:
      type: token-classification
      name: POS Tagging
    dataset:
      type: universal_dependencies
      name: UD_Thai-TUD
    metrics:
    - name: Test Accuracy
      type: accuracy
      value: 0.9064
    - name: Test Micro F1
      type: f1
      value: 0.9064
    - name: Test Macro F1
      type: f1
      value: 0.8134
---

# PhayaThaiBERT Thai POS Tagger

Fine-tuned **PhayaThaiBERT** for **UPOS part-of-speech tagging** on Thai sentences, trained on the **UD_Thai-TUD** treebank following Universal Dependencies conventions.

---

## Model Description

- **Base model:** PhayaThaiBERT
- **Task:** Token classification (POS tagging)
- **Dataset:** UD_Thai-TUD
- **Tags (15 UPOS):** `ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB`

This model predicts word-level POS tags for Thai text. For best accuracy, pass **pre-segmented Thai words**.

---

## Usage

### 1) Load the model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("sandpapat/phayathaibert-thai-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("sandpapat/phayathaibert-thai-pos-tagger")
model.eval()
```

### Option 1: Raw Thai text (automatic word segmentation)

```python
from pythainlp.tokenize import word_tokenize

def predict_pos(text: str):
    """POS tag Thai text (raw string)."""
    # 1. Word segmentation
    words = word_tokenize(text)

    # 2. Tokenize with word alignment
    encoded = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt"
    )
    word_ids = encoded.word_ids()

    with torch.no_grad():
        outputs = model(**encoded)
    preds = outputs.logits.argmax(dim=-1)[0]

    # 3. Align subwords back to words (keep the first subword's prediction)
    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:
            continue
        if w_id != prev:
            label = model.config.id2label[preds[idx].item()]
            results.append((words[w_id], label))
        prev = w_id
    return results

# Example
text = "ฉันกินข้าวที่ร้านอาหาร"
for w, p in predict_pos(text):
    print(f"{w:15s} {p}")
```

### Option 2: Pre-segmented words (recommended, better accuracy)

```python
def predict_pos_from_words(words):
    """POS tag a list of pre-segmented Thai words."""
    encoded = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt"
    )
    word_ids = encoded.word_ids()

    with torch.no_grad():
        outputs = model(**encoded)
    preds = outputs.logits.argmax(dim=-1)[0]

    # Align subwords back to words (keep the first subword's prediction)
    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:
            continue
        if w_id != prev:
            label = model.config.id2label[preds[idx].item()]
            results.append((words[w_id], label))
        prev = w_id
    return results

# Example
words = ["ฉัน", "กิน", "ข้าว", "ที่", "ร้านอาหาร"]
for w, p in predict_pos_from_words(words):
    print(f"{w:15s} {p}")
```

### Example Output

```
Input: "ฉันกินข้าวที่ร้านอาหาร"

ฉัน: PRON
กิน: VERB
ข้าว: NOUN
ที่: ADP
ร้านอาหาร: NOUN
```

## Training Details

### Dataset

- **Source**: [UD_Thai-TUD](https://github.com/UniversalDependencies/UD_Thai-TUD)
- **Training set**: 2,902 sentences
- **Development set**: 362 sentences
- **Test set**: 363 sentences

### Training Configuration

- **Epochs**: 3
- **Batch size**: 16
- **Learning rate**: 3e-5
- **Optimizer**: AdamW
- **Warmup ratio**: 0.1
- **Weight decay**: 0.01

### Hardware

- Trained on GPU (CUDA)
- Mixed-precision training (FP16)

## Limitations

- Trained on Universal Dependencies data, which may not cover all domains
- Performance may vary on informal text, social media, or specialized domains
- Word-segmentation quality affects accuracy (use consistent segmentation)
- Limited to 15 UPOS tags (coarse-grained POS categories)

## Ethical Considerations

- The model should not be used as the sole basis for critical decisions
- Performance may vary across different text types and domains
- Users should validate outputs for their specific use cases

## Citation

If you use this model, please cite WangchanBERTa, the model family that the PhayaThaiBERT base model extends:

```bibtex
@inproceedings{lowphansirikul2021wangchanberta,
  title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
  author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
  booktitle={2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)},
  pages={1--6},
  year={2021},
  organization={IEEE}
}
```

## Acknowledgements

- Base model: [PhayaThaiBERT](https://huggingface.co/clicknext/phayathaibert)
- Training data: [Universal Dependencies Thai Treebank](https://github.com/UniversalDependencies/UD_Thai-TUD)

## License

MIT License