|
|
--- |
|
|
license: "cc-by-sa-4.0" |
|
|
language: "mni" |
|
|
tags: |
|
|
- masked-language-modeling |
|
|
- transformer |
|
|
- roberta |
|
|
- meitei |
|
|
- manipuri |
|
|
- bengali-script |
|
|
- low-resource |
|
|
datasets: |
|
|
- MWirelabs/meitei-monolingual-corpus |
|
|
model-index: |
|
|
- name: MWirelabs/meitei-roberta |
|
|
results: |
|
|
- task: |
|
|
type: fill-mask |
|
|
name: Masked Language Modeling |
|
|
dataset: |
|
|
type: MWirelabs/meitei-monolingual-corpus |
|
|
name: Meitei Monolingual Corpus |
|
|
metrics: |
|
|
- name: Training Loss |
|
|
type: training_loss |
|
|
value: 4.185500 |
|
|
path: training_loss_history.csv |
|
|
- name: Perplexity |
|
|
type: perplexity |
|
|
value: 65.89 |
|
|
--- |
|
|
|
|
|
# Meitei-RoBERTa-Base (Monolingual, Bengali Script) |
|
|
|
|
|
The **Meitei-RoBERTa-Base** model is a monolingual transformer encoder pre-trained from scratch on the full **Meitei Monolingual Corpus** (MWirelabs/meitei-monolingual-corpus). It follows the RoBERTa pre-training methodology and provides a **foundational language representation** for Meitei (Manipuri) written in the Bengali script.
|
|
|
|
|
This model serves as a robust backbone for accelerating downstream NLP tasks such as Named Entity Recognition (NER), Sentiment Analysis, and Text Classification in Meitei. |
|
|
|
|
|
## Model Architecture & Details |
|
|
|
|
|
The architecture follows the standard **RoBERTa Base** configuration, balancing representational capacity against computational cost.
|
|
|
|
|
### Configuration |
|
|
|
|
|
| Component | Value | Notes |
| :--- | :--- | :--- |
| **Architecture** | RoBERTa Base encoder | 12 layers, 12 attention heads |
| **Hidden Dimension** | 768 | Standard Base size |
| **Total Parameters** | ≈125 million (125,000,000) | Standard RoBERTa Base scale |
| **Max Context Length** | 512 tokens | Maximum sequence length (chosen to fit GPU memory) |
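
For reference, the configuration in the table can be expressed in code. The sketch below is illustrative, assuming `transformers` is installed; hyperparameters not listed in the table fall back to standard RoBERTa Base defaults.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Illustrative config matching the table above; values not listed there
# (feed-forward size, dropout, etc.) keep standard RoBERTa Base defaults.
config = RobertaConfig(
    vocab_size=52_000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=514,  # 512 usable tokens plus RoBERTa's 2 offset positions
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")  # roughly 125M with this vocabulary size
```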
|
|
|
|
|
### Tokenizer Details |
|
|
|
|
|
| Component | Value | Notes |
| :--- | :--- | :--- |
| **Tokenizer Type** | Byte-Level Byte Pair Encoding (BPE) | Handles the complex morphology and rare characters of Indic scripts without unknown-token failures. |
| **Vocabulary Size** | 52,000 tokens | Custom-trained on the corpus for good subword efficiency and a low out-of-vocabulary (OOV) rate. |
| **Special Tokens** | `<s>`, `</s>`, `<unk>`, `<pad>`, `<mask>` | RoBERTa standard. |
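
The released tokenizer ships with the model, so it never needs to be retrained. Purely as an illustration, a byte-level BPE tokenizer with these settings could be trained with the `tokenizers` library roughly as follows (the corpus path is a placeholder):

```python
from tokenizers import ByteLevelBPETokenizer

# Illustrative sketch only -- the released tokenizer is already included in the repo.
# "meitei_corpus.txt" is a placeholder path to raw Bengali-script Meitei text.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["meitei_corpus.txt"],
    vocab_size=52_000,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("meitei-tokenizer")  # writes vocab.json and merges.txt
```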
|
|
|
|
|
## Pre-training Details |
|
|
|
|
|
The model was trained from a randomly initialized state on the full corpus using the Masked Language Modeling (MLM) objective. |
|
|
|
|
|
### Training Parameters |
|
|
|
|
|
| Parameter | Value | Notes |
| :--- | :--- | :--- |
| **Training Corpus** | MWirelabs/meitei-monolingual-corpus (train split) | High-quality corpus of an estimated 76M+ words. |
| **Training Task** | Masked Language Modeling (MLM) | RoBERTa's core objective (15% masking probability). |
| **Data Size (Chunks)** | 353,123 blocks of 512 tokens | Full utilization of the available corpus data. |
| **Effective Batch Size** | 256 | Large batches for high-throughput pre-training. |
| **Learning Rate** | 6e-4 | Standard RoBERTa-style peak rate with warmup and linear decay. |
| **Total Epochs** | 3 | Three full passes over the corpus. |
| **Final Training Loss** | 4.1855 | Substantial reduction from the initial loss of a randomly initialized model. |
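
A condensed sketch of this setup with the `transformers` Trainer is shown below. It assumes `model`, `tokenizer`, and a `train_dataset` already chunked into 512-token blocks exist; the warmup ratio and per-device batch size are illustrative, since only the effective batch size of 256 is reported above.

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# MLM collator with RoBERTa's standard 15% masking probability.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="meitei-roberta",
    per_device_train_batch_size=32,   # illustrative split of the effective batch size
    gradient_accumulation_steps=8,    # 32 x 8 = 256 effective batch size
    learning_rate=6e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.06,                # assumption: the exact warmup schedule is not reported
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # 353,123 blocks of 512 tokens
    data_collator=data_collator,
)
trainer.train()
```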
|
|
|
|
|
## Training Metrics & Loss Curve |
|
|
|
|
|
The model demonstrates strong convergence, successfully learning the grammatical and semantic structure of Meitei. |
|
|
|
|
|
The full log history, including training loss and learning rate evolution, is available in the repository for detailed analysis: |
|
|
|
|
|
* **Log File:** `training_loss_history.csv` |
|
|
|
|
|
* **Metric:** Training Loss (`loss`) against Training Step (`step`). |
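
For example, the curve can be plotted directly from the CSV (assuming `pandas` and `matplotlib` are available and the file exposes the `step` and `loss` columns described above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plot training loss against training step from the repository log file.
history = pd.read_csv("training_loss_history.csv")
plt.plot(history["step"], history["loss"])
plt.xlabel("Training step")
plt.ylabel("Training loss")
plt.title("Meitei-RoBERTa pre-training loss curve")
plt.show()
```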
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Metrics |
|
|
|
|
|
| Metric | Value | Description |
| :--- | :--- | :--- |
| **Final Training Loss** | 4.1855 | Loss recorded at the final training step. |
| **Perplexity (PPL)** | **65.89** | Computed on a held-out validation set. |
| **PPL vs. Baselines** | **≈5.4× lower** | Perplexity is roughly 5.4 times lower than MuRIL's (65.89 vs. 355.65), indicating that custom monolingual pre-training is far better matched to Meitei text. |
|
|
|
|
|
### Comparative Performance |
|
|
|
|
|
| Model | Evaluation Loss | Perplexity (PPL) |
| :--- | :--- | :--- |
| **Meitei-RoBERTa (Custom)** | **4.1880** | **65.89** |
| mBERT (Baseline) | 5.8335 | 341.56 |
| MuRIL (Baseline) | 5.8740 | 355.65 |
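
Perplexity here is simply the exponential of the per-token evaluation loss, so the PPL column follows (to within rounding of the reported losses) from the evaluation losses:

```python
import math

# PPL = exp(evaluation loss); reproduces the table above to within rounding.
eval_losses = {
    "Meitei-RoBERTa (Custom)": 4.1880,
    "mBERT (Baseline)": 5.8335,
    "MuRIL (Baseline)": 5.8740,
}
for name, loss in eval_losses.items():
    print(f"{name}: PPL = {math.exp(loss):.2f}")
# prints ≈ 65.89, 341.5, 355.7
```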
|
|
|
|
|
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## 💡 How to Use (For Inference and Fine-tuning) |
|
|
|
|
|
This model can be loaded directly into any Hugging Face pipeline or used as the encoder in a custom model for fine-tuning. |
|
|
|
|
|
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Repository ID on the Hugging Face Hub
REPO_ID = "MWirelabs/meitei-roberta"

# 1. Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForMaskedLM.from_pretrained(REPO_ID)

# 2. Example: tokenize text
# NOTE: Input text must be in the Bengali script, as the model was trained only on this script.
meitei_text = "আমি গতকাল স্কুল থেকে ফিরেছি। এই বইটি পড়তে ভাল লাগে।"
inputs = tokenizer(meitei_text, return_tensors="pt")

# 3. Example: fill-mask pipeline (a quick fluency check)
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Mask a word in a Bengali-script sentence and inspect the top predictions
results = unmasker("আমাদের দেশে <mask> অনেক সমস্যা আছে।")
for prediction in results:
    print(prediction["token_str"], round(prediction["score"], 4))
```
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is primarily intended for research purposes and as a foundational encoder for downstream Meitei NLP tasks, including:

* Fine-tuning on sequence classification, token classification (NER), and summarization.
* Feature extraction to generate high-quality Meitei text embeddings.

The model is not intended for deployment in applications that require safety-critical decision-making without further domain-specific fine-tuning and validation.
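
For instance, here is a minimal fine-tuning sketch for a hypothetical Meitei text-classification task; the dataset variables and label count are placeholders, not part of this release.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

REPO_ID = "MWirelabs/meitei-roberta"
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
# Load the pre-trained encoder with a fresh classification head (2 labels as an example).
model = AutoModelForSequenceClassification.from_pretrained(REPO_ID, num_labels=2)

args = TrainingArguments(
    output_dir="meitei-roberta-classifier",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
# `train_ds` and `eval_ds` are placeholders for a tokenized, labelled Meitei dataset.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```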
|
|
|
|
|
## Limitations and Bias |
|
|
|
|
|
* **Script Dependence:** The model was trained exclusively on Meitei written in the Bengali script and will perform poorly on Meitei text written in the Meitei Mayek (Meetei Mayek) script.
* **Monolingual Focus:** The model is not suitable for cross-lingual tasks without further fine-tuning.
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
If you use this model or the Meitei corpus in your work, please cite it as: |
|
|
|
|
|
```bibtex |
|
|
@misc{mwirelabs_meitei_roberta_2025, |
|
|
title = {Meitei-RoBERTa-Base (Bengali Script) Model}, |
|
|
author = {MWire Labs}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
howpublished = {\url{https://huggingface.co/MWirelabs/meitei-roberta}},
|
|
note = {RoBERTa Base pre-trained from scratch on the Meitei Monolingual Corpus} |
|
|
} |
|
|
``` |
|
|
|
|
|
## About MWire Labs |
|
|
|
|
|
MWire Labs builds ethical, region-first AI infrastructure for Northeast India—focusing on low-resource languages and public accessibility. |
|
|
|
|
|
Learn more at [www.mwirelabs.com](https://www.mwirelabs.com) |
|
|
|
|
|
--- |
|
|
|
|
|
## Contributions & Feedback |
|
|
|
|
|
We welcome feedback, contributions, and civic collaborations. |
|
|
Reach out via [Hugging Face](https://huggingface.co/MWirelabs). |