🌍 Multilingual Spam Analysis for Social Media
Indonesian 🇮🇩 | English | Binary Classification | XLM-RoBERTa Base
A fine-tuned xlm-roberta-base model for Spam Analysis on noisy social media text.
This model is optimized for multilingual informal content commonly found on:
- Twitter / X
- TikTok
- Online forums
It supports Bahasa Indonesia, English, Arabic, Portuguese, and many other languages, making it suitable for moderation systems, social listening, and content intelligence pipelines.
🔍 Model Overview
- Architecture: FacebookAI/xlm-roberta-base
- Task: Text Classification (Spam Analysis)
- Languages: Indonesian, English, Arabic, Portuguese, and others
- Domain: Informal & Social Media Text
- Training Date: 2026-03-05
🏷️ Supported Labels
This model distinguishes the following classes:
| Label | Description |
|---|---|
| LABEL_0 | Ham (Not Spam) |
| LABEL_1 | Spam |
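The pipeline returns the raw `LABEL_0` / `LABEL_1` identifiers; a small helper (hypothetical, mirroring the table above) can map them to integers or readable names:

```python
# Mapping between the pipeline's raw label ids and readable names,
# following the table above (LABEL_0 = ham, LABEL_1 = spam).
id2label = {0: "ham", 1: "spam"}
label2id = {name: idx for idx, name in id2label.items()}

def to_binary(pred_label: str) -> int:
    """Convert a pipeline label string like 'LABEL_1' to 0/1."""
    return int(pred_label.split("_")[-1])

print(to_binary("LABEL_1"), id2label[to_binary("LABEL_1")])  # 1 spam
```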
📊 Model Performance
Evaluated on held-out validation dataset:
| Metric | Score |
|---|---|
| F1 Score | 0.96 |
| Precision | 0.96 |
| Recall | 0.96 |
| Training Loss | 0.082100 |
| Validation Loss | 0.202610 |
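Since precision and recall are equal, the F1 score (their harmonic mean) necessarily comes out the same, which can be checked directly:

```python
precision, recall = 0.96, 0.96
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(f1, 2))  # 0.96
```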
🏗️ Training Configuration
| Parameter | Value |
|---|---|
| Base Model | xlm-roberta-base |
| Training Samples | 77,456 |
| Epochs | 3 |
| Learning Rate | 2e-5 |
| Batch Size | 16 (train), 32 (eval) |
| Optimizer | AdamW |
| Framework | Hugging Face Transformers |
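The table above maps onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch: the `output_dir` and `optim` values are assumptions, not taken from the original training script.

```python
# Hyperparameters from the table above, as keyword arguments
# for transformers.TrainingArguments (output_dir is hypothetical).
training_kwargs = dict(
    output_dir="./xlmr-spam",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    optim="adamw_torch",  # AdamW optimizer
)

# With transformers installed:
# from transformers import TrainingArguments
# args = TrainingArguments(**training_kwargs)
print(training_kwargs["learning_rate"])
```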
🚀 Usage
Preprocessing Configuration
```python
import re

def clean_text(text):
    if not isinstance(text, str):
        return text
    text = text.replace("#", "")  # strip hashtag markers, keep the word
    text = re.sub(r"https?://\S+|www\.\S+", "<link>", text)        # mask URLs
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)  # mask emails
    text = re.sub(r"@\w+", "<user>", text)                         # mask mentions
    text = text.replace('"', "").replace("'", "")
    text = text.replace("\n", " ")
    text = text.replace("\\n", " ")
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text
```
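Applied to a noisy post, the helper masks URLs, mentions, and emails with placeholders. For example (the function is repeated here so the snippet runs standalone; the sample input is invented):

```python
import re

def clean_text(text):
    if not isinstance(text, str):
        return text
    text = text.replace("#", "")
    text = re.sub(r"https?://\S+|www\.\S+", "<link>", text)
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)
    text = re.sub(r"@\w+", "<user>", text)
    text = text.replace('"', "").replace("'", "")
    text = text.replace("\n", " ").replace("\\n", " ")
    text = re.sub(r"\s+", " ", text).strip()
    return text

raw = "Check this out https://spam.example/win @user1 #free\nContact: win@spam.example"
print(clean_text(raw))  # Check this out <link> <user> free Contact: <email>
```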
Quick Inference (Single Text)
```python
from transformers import pipeline
import torch

clf = pipeline(
    "text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",
    tokenizer="nahiar/spam-detection-xlm-roberta-v3",
    device=0 if torch.cuda.is_available() else -1,
)

text = """
Gran película, pero el final sí está en otra categoría. El mejor final que vi este año, y ese le va a dar todos los premios. Irán tiene historias bien poderosas, esta vale muchísimo la pena, y creo que recién llegó a cines. #itwasjustanaccident #iran #cine #jafarpanahi
"""

text = clean_text(text)
print(text)
print(clf(text))
```
Quick Inference (Batch)
```python
from transformers import pipeline
import torch

clf = pipeline(
    "text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",
    tokenizer="nahiar/spam-detection-xlm-roberta-v3",
    device=0 if torch.cuda.is_available() else -1,
)

batch_size = 32
results = []
texts = [
    "اللحظة التي توقف فيها البث! 😱 صاروخ يصيب استديو الإيرانية وهي تقدم الخبر! شاهد ما حدث في الثواني الأخيرة! 👇🔥 #إيران | #طهران | #عاجل | #اكسبلور | #fyp إيران | طهران | انفجار | قصف | مذيعة | أخبار عاجلة | بوز | روينة | تصعيد | 2026 | فيديو صادم | امير_ | Iran | Tehran | Attack",
    "ตึงเครียด! IRGC ยิงขีปนาวุธใส่เรือ USS Abraham Lincoln ของสหรัฐฯ",
    "#HubunganBilateral #Indonesia #Brasil #BeritaTerkini #InformasiPublik #BergerakBerdampak #SetahunBerdampak https://t.co/9aGPb5ACIo",
    "Gran película, pero el final sí está en otra categoría. El mejor final que vi este año, y ese le va a dar todos los premios. Irán tiene historias bien poderosas, esta vale muchísimo la pena, y creo que recién llegó a cines. #itwasjustanaccident #iran #cine #jafarpanahi"
]

# Run the classifier in fixed-size batches to keep memory bounded.
for i in range(0, len(texts), batch_size):
    batch = texts[i:i + batch_size]
    preds = clf(batch, truncation=True)
    results.extend(preds)

# Map LABEL_1 / "spam" to 1, everything else (ham) to 0.
labels = [
    1 if p["label"] in ["LABEL_1", "spam"] else 0
    for p in results
]
output = [
    {"text": t, "spam": l}
    for t, l in zip(texts, labels)
]
print(output)
```
🎯 Intended Use Cases
- Social media spam analysis
- Comment & post filtering
- Content moderation assistance
- Political monitoring
- Brand & organization tracking
- Multilingual content intelligence systems
⚠️ Limitations
- Supports only the binary label set: `['ham', 'spam']`
- Not optimized for:
- Formal academic/legal documents
- Extremely short or ambiguous messages
- Heavy slang or sarcastic expressions
- Performance may degrade on highly code-mixed sentences
- The model may inherit bias from training data
⚖️ Ethical Considerations
This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.
It is not intended to replace human judgment in high-risk or sensitive decision-making systems.
Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.
🖥️ Hardware Recommendations
- Recommended: GPU (≥ 8GB VRAM) for optimal performance
- CPU inference supported but slower
- Compatible with FP16 mixed precision for faster inference
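A minimal sketch of FP16 loading on GPU, using the same pipeline API as the usage examples above (falls back to FP32 on CPU; this loading configuration is an assumption, not the card author's stated setup):

```python
from transformers import pipeline
import torch

# Load weights in half precision when a CUDA device is available.
clf = pipeline(
    "text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device=0 if torch.cuda.is_available() else -1,
)
```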
📜 License
Released under the Apache 2.0 License.
Free for commercial and research use.
📚 Citation
```bibtex
@misc{purba2026multilingualspamanalysis,
  author    = {M. Iqbal Purba},
  title     = {Multilingual Spam Analysis for Social Media},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/nahiar/spam-detection-xlm-roberta-v3}
}
```
🙌 Acknowledgements
- Hugging Face Transformers
- Facebook AI Research — XLM-RoBERTa
- Open-source NLP community
- Contributors and dataset annotators
Model tree for nahiar/spam-detection-xlm-roberta-v3
- Base model: FacebookAI/xlm-roberta-base