|
|
--- |
|
|
license: "cc-by-sa-4.0" |
|
|
language: "mni" |
|
|
tags: |
|
|
- masked-language-modeling |
|
|
- transformer |
|
|
- roberta |
|
|
- meitei |
|
|
- manipuri |
|
|
- bengali-script |
|
|
- low-resource |
|
|
datasets: |
|
|
- MWirelabs/meitei-monolingual-corpus |
|
|
model-index: |
|
|
- name: MWirelabs/meitei-roberta |
|
|
results: |
|
|
- task: |
|
|
type: fill-mask |
|
|
name: Masked Language Modeling |
|
|
dataset: |
|
|
type: MWirelabs/meitei-monolingual-corpus |
|
|
name: Meitei Monolingual Corpus |
|
|
metrics: |
|
|
- name: Training Loss |
|
|
type: training_loss |
|
|
value: 4.185500 |
|
|
path: training_loss_history.csv |
|
|
- name: Perplexity |
|
|
type: perplexity |
|
|
value: 65.89 |
|
|
--- |
|
|
|
|
|
# Meitei-RoBERTa-Base (Monolingual, Bengali Script) |
|
|
|
|
|
The **Meitei-RoBERTa-Base** model is a monolingual transformer encoder pre-trained from scratch on the full **Meitei Monolingual Corpus** (MWirelabs/meitei-monolingual-corpus). It follows the RoBERTa pre-training methodology and provides a **foundational language representation** for Meitei (Manipuri) written in the Bengali script.
|
|
|
|
|
This model serves as a robust backbone for accelerating downstream NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis, and Text Classification in Meitei. |
|
|
|
|
|
## Model Architecture & Details |
|
|
|
|
|
The architecture follows the standard **RoBERTa Base** configuration, balancing representational capacity against computational cost.
|
|
|
|
|
### Configuration |
|
|
|
|
|
| Component | Value | Notes |
| :--- | :--- | :--- |
| **Architecture** | RoBERTa Base encoder | 12 layers, 12 attention heads |
| **Hidden Dimension** | 768 | Standard Base size |
| **Total Parameters** | ≈125 million (125,000,000) | Standard RoBERTa Base scale |
| **Max Context Length** | 512 tokens | Maximum sequence length (chosen to fit GPU memory) |
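
For reference, the configuration in the table can be expressed in code. The sketch below is illustrative, assuming `transformers` is installed; hyperparameters not listed in the table fall back to standard RoBERTa Base defaults.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Illustrative config matching the table above; values not listed there
# (feed-forward size, dropout, etc.) keep standard RoBERTa Base defaults.
config = RobertaConfig(
    vocab_size=52_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,  # 512 usable tokens plus RoBERTa's 2 offset positions
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # roughly 125M with this vocabulary size
```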
|
|
|
|
|
### Tokenizer Details |
|
|
|
|
|
| Component | Value | Notes |
| :--- | :--- | :--- |
| **Tokenizer Type** | Byte-Level Byte Pair Encoding (BPE) | Handles the complex morphology and rare characters of Indic scripts without unknown-token failures. |
| **Vocabulary Size** | 52,000 tokens | Custom-trained on the corpus for good subword efficiency and a low out-of-vocabulary (OOV) rate. |
| **Special Tokens** | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` | RoBERTa standard. |
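
The released tokenizer ships with the model, so it never needs to be retrained. Purely as an illustration, a byte-level BPE tokenizer with these settings could be trained with the `tokenizers` library roughly as follows (the corpus path is a placeholder):

```python
from tokenizers import ByteLevelBPETokenizer

# Illustrative sketch only -- the released tokenizer is already included in the repo.
# "meitei_corpus.txt" is a placeholder path to raw Bengali-script Meitei text.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["meitei_corpus.txt"],
    vocab_size=52_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("meitei-tokenizer")  # writes vocab.json and merges.txt
```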
|
|
|
|
|
## Pre-training Details |
|
|
|
|
|
The model was trained from a randomly initialized state on the full corpus using the Masked Language Modeling (MLM) objective. |
|
|
|
|
|
### Training Parameters |
|
|
|
|
|
| Parameter | Value | Notes |
| :--- | :--- | :--- |
| **Training Corpus** | MWirelabs/meitei-monolingual-corpus (train split) | High-quality corpus of an estimated 76M+ words. |
| **Training Task** | Masked Language Modeling (MLM) | RoBERTa's core objective (15% masking probability). |
| **Data Size (Chunks)** | 353,123 blocks of 512 tokens | Full utilization of the available corpus data. |
| **Effective Batch Size** | 256 | Large batches for high-throughput pre-training. |
| **Learning Rate** | 6e-4 | Standard RoBERTa-style peak rate with warmup and linear decay. |
| **Total Epochs** | 3 | Three full passes over the corpus. |
| **Final Training Loss** | 4.1855 | Substantial reduction from the initial loss of a randomly initialized model. |
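
A condensed sketch of this setup with the `transformers` Trainer is shown below. It assumes `model`, `tokenizer`, and a `train_dataset` already chunked into 512-token blocks exist; the warmup ratio and per-device batch size are illustrative, since only the effective batch size of 256 is reported above.

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# MLM collator with RoBERTa's standard 15% masking probability.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="meitei-roberta",
    per_device_train_batch_size=32,   # illustrative split of the effective batch size
    gradient_accumulation_steps=8,    # 32 x 8 = 256 effective batch size
    learning_rate=6e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                # assumption: the exact warmup schedule is not reported
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # 353,123 blocks of 512 tokens
    data_collator=data_collator,
)
trainer.train()
```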
|
|
|
|
|
## Training Metrics & Loss Curve |
|
|
|
|
|
The model demonstrates strong convergence, successfully learning the grammatical and semantic structure of Meitei. |
|
|
|
|
|
The full log history, including training loss and learning rate evolution, is available in the repository for detailed analysis: |
|
|
|
|
|
* **Log File:** `training_loss_history.csv` |
|
|
|
|
|
* **Metric:** Training Loss (`loss`) against Training Step (`step`). |
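
For example, the curve can be plotted directly from the CSV (assuming `pandas` and `matplotlib` are available and the file exposes the `step` and `loss` columns described above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot training loss against training step from the repository log file.
history = pd.read_csv("training_loss_history.csv")
plt.plot(history["step"], history["loss"])
plt.xlabel("Training step")
plt.ylabel("Training loss")
plt.title("Meitei-RoBERTa pre-training loss curve")
plt.show()
```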
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Metrics |
|
|
|
|
|
| Metric | Value | Description |
| :--- | :--- | :--- |
| **Final Training Loss** | 4.1855 | Loss recorded at the final training step. |
| **Perplexity (PPL)** | **65.89** | Computed on a held-out validation set. |
| **PPL vs. Baselines** | **≈5.4× lower** | Perplexity is roughly 5.4 times lower than MuRIL's (65.89 vs. 355.65), indicating that custom monolingual pre-training is far better matched to Meitei text. |
|
|
|
|
|
### Comparative Performance |
|
|
|
|
|
| Model | Evaluation Loss | Perplexity (PPL) |
| :--- | :--- | :--- |
| **Meitei-RoBERTa (Custom)** | **4.1880** | **65.89** |
| mBERT (Baseline) | 5.8335 | 341.56 |
| MuRIL (Baseline) | 5.8740 | 355.65 |
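
Perplexity here is simply the exponential of the per-token evaluation loss, so the PPL column follows (to within rounding of the reported losses) from the evaluation losses:

```python
import math

# PPL = exp(evaluation loss); reproduces the table above to within rounding.
eval_losses = {
    "Meitei-RoBERTa (Custom)": 4.1880,
    "mBERT (Baseline)": 5.8335,
    "MuRIL (Baseline)": 5.8740,
}
for name, loss in eval_losses.items():
    print(f"{name}: PPL = {math.exp(loss):.2f}")
# prints ≈ 65.89, 341.5, 355.7
```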
|
|
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 💡 How to Use (For Inference and Fine-tuning) |
|
|
|
|
|
This model can be loaded directly into any Hugging Face pipeline or used as the encoder in a custom model for fine-tuning. |
|
|
|
|
|
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Repository ID on the Hugging Face Hub
REPO_ID = "MWirelabs/meitei-roberta"

# 1. Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForMaskedLM.from_pretrained(REPO_ID)

# 2. Example: tokenize text
# NOTE: Input text must be in the Bengali script, as the model was trained only on this script.
meitei_text = "আমি গতকাল স্কুল থেকে ফিরেছি। এই বইটি পড়তে ভাল লাগে।"
inputs = tokenizer(meitei_text, return_tensors="pt")

# 3. Example: fill-mask pipeline (a quick fluency check)
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Mask a word in a Bengali-script sentence and inspect the top predictions
results = unmasker("আমাদের দেশে <mask> অনেক সমস্যা আছে।")
for prediction in results:
    print(prediction["token_str"], round(prediction["score"], 4))
```
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is primarily intended for research purposes and as a foundational encoder for downstream Meitei NLP tasks, including:

* Fine-tuning on sequence classification, token classification (NER), and summarization.
* Feature extraction to generate high-quality Meitei text embeddings.

The model is not intended for deployment in applications that require safety-critical decision-making without further domain-specific fine-tuning and validation.
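
For instance, here is a minimal fine-tuning sketch for a hypothetical Meitei text-classification task; the dataset variables and label count are placeholders, not part of this release.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

REPO_ID = "MWirelabs/meitei-roberta"
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
# Load the pre-trained encoder with a fresh classification head (2 labels as an example).
model = AutoModelForSequenceClassification.from_pretrained(REPO_ID, num_labels=2)

args = TrainingArguments(
    output_dir="meitei-roberta-classifier",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
# `train_ds` and `eval_ds` are placeholders for a tokenized, labelled Meitei dataset.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```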
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
* **Script Dependence:** The model was trained exclusively on Meitei written in the Bengali script and will perform poorly on Meitei text written in the Meitei Mayek (Meetei Mayek) script.
* **Monolingual Focus:** The model is not suitable for cross-lingual tasks without further fine-tuning.
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
If you use this model or the Meitei corpus in your work, please cite it as: |
|
|
|
|
|
```bibtex |
|
|
@misc{mwirelabs_meitei_roberta_2025, |
|
|
title = {Meitei-RoBERTa-Base (Bengali Script) Model}, |
|
|
author = {MWire Labs}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
howpublished = {\url{https://huggingface.co/MWirelabs/meitei-roberta}},
|
|
note = {RoBERTa Base pre-trained from scratch on the Meitei Monolingual Corpus} |
|
|
} |
|
|
``` |
|
|
|
|
|
## About MWire Labs |
|
|
|
|
|
MWire Labs builds ethical, region-first AI infrastructure for Northeast India—focusing on low-resource languages and public accessibility. |
|
|
|
|
|
Learn more at [www.mwirelabs.com](https://www.mwirelabs.com) |
|
|
|
|
|
--- |
|
|
|
|
|
## Contributions & Feedback |
|
|
|
|
|
We welcome feedback, contributions, and civic collaborations. |
|
|
Reach out via [Hugging Face](https://huggingface.co/MWirelabs). |