---
language: th
tags:
- thai
- pos-tagging
- token-classification
- phayathaibert
datasets:
- universal_dependencies
license: mit
base_model:
- clicknext/phayathaibert
metrics:
- accuracy
- f1
pipeline_tag: token-classification
model-index:
- name: phayathaibert-thai-pos-tagger
  results:
  - task:
      type: token-classification
      name: POS Tagging
    dataset:
      type: universal_dependencies
      name: UD_Thai-TUD
    metrics:
    - name: Test Accuracy
      type: accuracy
      value: 0.9064
    - name: Test Micro F1
      type: f1
      value: 0.9064
    - name: Test Macro F1
      type: f1
      value: 0.8134
---

# PhayaThaiBERT Thai POS Tagger

Fine-tuned **PhayaThaiBERT** for **UPOS part-of-speech tagging** on Thai sentences, trained on the **UD_Thai-TUD** treebank following Universal Dependencies conventions.

---

## Model Description

- **Base model:** PhayaThaiBERT
- **Task:** Token classification (POS tagging)
- **Dataset:** UD_Thai-TUD
- **Tags (15 UPOS):** `ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB`

This model predicts word-level POS tags for Thai text. For best accuracy, pass **pre-segmented Thai words**.

---

## Usage

### 1) Load the model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("sandpapat/phayathaibert-thai-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("sandpapat/phayathaibert-thai-pos-tagger")
model.eval()
```

### Option 1: Raw Thai text (automatic word segmentation)

```python
from pythainlp.tokenize import word_tokenize

def predict_pos(text: str):
    """POS tag Thai text (raw string)."""
    # 1. Word segmentation
    words = word_tokenize(text)

    # 2. Tokenize with word alignment
    encoded = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt"
    )
    word_ids = encoded.word_ids()

    with torch.no_grad():
        outputs = model(**encoded)
    preds = outputs.logits.argmax(dim=-1)[0]

    # 3. Align subwords back to words (keep the first subword's prediction)
    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:
            continue
        if w_id != prev:
            label = model.config.id2label[preds[idx].item()]
            results.append((words[w_id], label))
        prev = w_id
    return results

# Example
text = "ฉันกินข้าวที่ร้านอาหาร"
for w, p in predict_pos(text):
    print(f"{w:15s} {p}")
```

### Option 2: Pre-segmented words (recommended, better accuracy)

```python
def predict_pos_from_words(words):
    """POS tag a list of pre-segmented Thai words."""
    encoded = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt"
    )
    word_ids = encoded.word_ids()

    with torch.no_grad():
        outputs = model(**encoded)
    preds = outputs.logits.argmax(dim=-1)[0]

    # Align subwords back to words (keep the first subword's prediction)
    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:
            continue
        if w_id != prev:
            label = model.config.id2label[preds[idx].item()]
            results.append((words[w_id], label))
        prev = w_id
    return results

# Example
words = ["ฉัน", "กิน", "ข้าว", "ที่", "ร้านอาหาร"]
for w, p in predict_pos_from_words(words):
    print(f"{w:15s} {p}")
```

### Example Output

```
Input: "ฉันกินข้าวที่ร้านอาหาร"

ฉัน: PRON
กิน: VERB
ข้าว: NOUN
ที่: ADP
ร้านอาหาร: NOUN
```

## Training Details

### Dataset

- **Source**: [UD_Thai-TUD](https://github.com/UniversalDependencies/UD_Thai-TUD)
- **Training set**: 2,902 sentences
- **Development set**: 362 sentences
- **Test set**: 363 sentences

### Training Configuration

- **Epochs**: 3
- **Batch size**: 16
- **Learning rate**: 3e-5
- **Optimizer**: AdamW
- **Warmup ratio**: 0.1
- **Weight decay**: 0.01

### Hardware

- Trained on GPU (CUDA)
- Mixed-precision training (FP16)

## Limitations

- Trained on Universal Dependencies data, which may not cover all domains
- Performance may vary on informal text, social media, or specialized domains
- Word-segmentation quality affects accuracy (use consistent segmentation)
- Limited to 15 UPOS tags (coarse-grained POS categories)

## Ethical Considerations

- The model should not be used as the sole basis for critical decisions
- Performance may vary across different text types and domains
- Users should validate outputs for their specific use cases

## Citation

If you use this model, please cite WangchanBERTa, the model family that the PhayaThaiBERT base model extends:

```bibtex
@inproceedings{lowphansirikul2021wangchanberta,
  title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
  author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
  booktitle={2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)},
  pages={1--6},
  year={2021},
  organization={IEEE}
}
```

## Acknowledgements

- Base model: [PhayaThaiBERT](https://huggingface.co/clicknext/phayathaibert)
- Training data: [Universal Dependencies Thai Treebank](https://github.com/UniversalDependencies/UD_Thai-TUD)

## License

MIT License