language: th
tags:
  - thai
  - pos-tagging
  - token-classification
  - phayathaibert
datasets:
  - universal_dependencies
license: mit
base_model:
  - clicknext/phayathaibert
metrics:
  - accuracy
  - f1
pipeline_tag: token-classification
model-index:
  - name: phayathaibert-thai-pos-tagger
    results:
      - task:
          type: token-classification
          name: POS Tagging
        dataset:
          type: universal_dependencies
          name: UD_Thai-TUD
        metrics:
          - name: Test Accuracy
            type: accuracy
            value: 0.9064
          - name: Test Micro F1
            type: f1
            value: 0.9064
          - name: Test Macro F1
            type: f1
            value: 0.8134

PhayaThaiBERT Thai POS Tagger

Fine-tuned PhayaThaiBERT for Universal POS (UPOS) tagging of Thai sentences.
Trained on the UD_Thai-TUD treebank following Universal Dependencies conventions.


Model Description

  • Base model: PhayaThaiBERT
  • Task: Token Classification (POS tagging)
  • Dataset: UD_Thai-TUD
  • Tags (15 UPOS):
    ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB

This model predicts word-level POS tags for Thai text. For best performance, use pre-segmented Thai words.
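
The 15-tag set above can be held in a small lookup table for post-processing. The index order below is illustrative only; the authoritative mapping is in `model.config.id2label` after loading the checkpoint:

```python
# Illustrative tag table; real ids come from model.config.id2label.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB",
]
id2label = dict(enumerate(UPOS_TAGS))          # hypothetical index order
label2id = {tag: i for i, tag in id2label.items()}
print(len(UPOS_TAGS), label2id["NOUN"])
```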


Usage

Load the model

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("sandpapat/phayathaibert-thai-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("sandpapat/phayathaibert-thai-pos-tagger")

model.eval()

Option 1: Raw Thai text (automatic word segmentation)

from pythainlp.tokenize import word_tokenize

def predict_pos(text: str):
    """POS tag Thai text (raw string)."""

    # 1. Word segmentation
    words = word_tokenize(text)

    # 2. Tokenize with alignment
    encoded = tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        return_tensors="pt"
    )
    word_ids = encoded.word_ids()

    with torch.no_grad():
        outputs = model(**encoded)
        preds = outputs.logits.argmax(dim=-1)[0]

    # 3. Align subwords → words
    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:
            continue
        if w_id != prev:
            label = model.config.id2label[preds[idx].item()]
            results.append((words[w_id], label))
        prev = w_id

    return results

# Example
text = "ฉันกินข้าวที่ร้านอาหาร"
for w, p in predict_pos(text):
    print(f"{w:15s} {p}")
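
The alignment loop above keeps only the first subword's prediction for each word. Its behavior can be sanity-checked offline with a toy `word_ids` sequence; no model is needed, and the ids, labels, and segmentation below are made up for illustration:

```python
# Toy reproduction of the first-subword alignment in predict_pos.
# None marks special tokens; repeated indices mark extra subwords.
words = ["ฉัน", "กิน", "ข้าว"]
word_ids = [None, 0, 0, 1, 2, 2, None]   # made-up tokenizer output
pred_ids = [3, 9, 9, 14, 6, 6, 3]        # made-up model predictions
id2label = {3: "PUNCT", 9: "PRON", 14: "VERB", 6: "NOUN"}  # toy mapping

results = []
prev = None
for idx, w_id in enumerate(word_ids):
    if w_id is None:
        continue                          # skip special tokens
    if w_id != prev:                      # first subword of a new word
        results.append((words[w_id], id2label[pred_ids[idx]]))
    prev = w_id

print(results)  # → [('ฉัน', 'PRON'), ('กิน', 'VERB'), ('ข้าว', 'NOUN')]
```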

Option 2: Pre-segmented words (recommended; better accuracy)

def predict_pos_from_words(words):
    """POS tag a list of pre-segmented Thai words."""

    encoded = tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        return_tensors="pt"
    )
    word_ids = encoded.word_ids()

    with torch.no_grad():
        outputs = model(**encoded)
        preds = outputs.logits.argmax(dim=-1)[0]

    results = []
    prev = None
    for idx, w_id in enumerate(word_ids):
        if w_id is None:
            continue
        if w_id != prev:
            label = model.config.id2label[preds[idx].item()]
            results.append((words[w_id], label))
        prev = w_id

    return results

# Example
words = ["ฉัน", "กิน", "ข้าว", "ที่", "ร้านอาหาร"]
for w, p in predict_pos_from_words(words):
    print(f"{w:15s} {p}")

Example Output

Input: "ฉันกินข้าวที่ร้านอาหาร"

ฉัน: PRON
กิน: VERB
ข้าว: NOUN
ที่: ADP
ร้านอาหาร: NOUN
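
Evaluation

On the UD_Thai-TUD test set the model reaches 0.9064 accuracy and micro F1, and 0.8134 macro F1 (see the model-index above). Micro-averaged F1 equals accuracy when every token receives exactly one label, which is why those two figures coincide; macro F1 is lower because rare tags weigh equally. A quick illustration with scikit-learn on made-up tag sequences:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up gold and predicted tag sequences, for illustration only.
y_true = ["NOUN", "VERB", "NOUN", "ADP", "NOUN", "PART"]
y_pred = ["NOUN", "VERB", "NOUN", "ADP", "VERB", "NOUN"]

acc   = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
print(acc, micro, macro)  # micro F1 == accuracy for single-label tagging
```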

Training Details

Dataset

  • Source: UD_Thai-TUD
  • Training Set: 2,902 sentences
  • Development Set: 362 sentences
  • Test Set: 363 sentences

Training Configuration

  • Epochs: 3
  • Batch Size: 16
  • Learning Rate: 3e-5
  • Optimizer: AdamW
  • Warmup Ratio: 0.1
  • Weight Decay: 0.01
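
A minimal sketch of how these hyperparameters map onto `transformers.TrainingArguments`; the output directory is a placeholder, and AdamW is the Trainer's default optimizer, so it needs no explicit argument:

```python
from transformers import TrainingArguments

# Hyperparameters from the list above; output_dir is a placeholder.
args = TrainingArguments(
    output_dir="phayathaibert-thai-pos-tagger",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    fp16=True,  # mixed precision, as noted under Hardware (requires CUDA)
)
```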

Hardware

  • Trained on GPU (CUDA)
  • Mixed precision training (FP16)

Limitations

  • The model was trained on Universal Dependencies data, which may not cover all domains
  • Performance may vary on informal text, social media, or specialized domains
  • Word segmentation quality affects accuracy (use consistent segmentation)
  • Limited to 15 UPOS tags (coarse-grained POS categories)

Ethical Considerations

  • The model should not be used as the sole basis for critical decisions
  • Performance may vary across different text types and domains
  • Users should validate outputs for their specific use cases

Citation

If you use this model, please cite WangchanBERTa, the model on which PhayaThaiBERT builds:

@inproceedings{lowphansirikul2021wangchanberta,
  title={WangchanBERTa: Pretraining transformer-based Thai Language Models},
  author={Lowphansirikul, Lalita and Polpanumas, Charin and Rutherford, Attapol T and Nutanong, Sarana},
  booktitle={2021 16th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP)},
  pages={1--6},
  year={2021},
  organization={IEEE}
}

License

MIT License