🌍 Multilingual Spam Analysis for Social Media

Indonesian 🇮🇩 | English | Binary Classification | XLM-RoBERTa Base

A fine-tuned xlm-roberta-base model for Spam Analysis on noisy social media text.

This model is optimized for multilingual informal content commonly found on:

  • Twitter / X
  • Instagram
  • TikTok
  • Facebook
  • Online forums

It supports Bahasa Indonesia, English, Arabic, Portuguese, and many other languages, making it suitable for moderation systems, social listening, and content-intelligence pipelines.


🔍 Model Overview

  • Architecture: FacebookAI/xlm-roberta-base
  • Task: Text Classification (Spam Analysis)
  • Languages: Indonesian, English, Arabic, Portuguese, and others
  • Parameters: ~0.3B (F32 safetensors)
  • Domain: Informal & Social Media Text
  • Training Date: 2026-03-05

🏷️ Supported Labels

This model distinguishes two classes:

| Label   | Description    |
|---------|----------------|
| LABEL_0 | Ham (Not Spam) |
| LABEL_1 | Spam           |
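The pipeline returns the raw `LABEL_0`/`LABEL_1` ids unless `id2label` is set in the model config. A minimal sketch for turning them into readable names (the `ID2NAME` mapping is an assumption based on the table above, not part of the model config):

```python
# Hypothetical mapping from raw pipeline label ids to readable names,
# derived from the label table above.
ID2NAME = {"LABEL_0": "ham", "LABEL_1": "spam"}

def readable(pred):
    """Turn one pipeline prediction dict into a (name, score) tuple."""
    return ID2NAME.get(pred["label"], pred["label"]), pred["score"]
```

For example, `readable({"label": "LABEL_1", "score": 0.98})` returns `("spam", 0.98)`.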

📊 Model Performance

Evaluated on a held-out validation set:

| Metric          | Score    |
|-----------------|----------|
| F1 Score        | 0.96     |
| Precision       | 0.96     |
| Recall          | 0.96     |
| Training Loss   | 0.082100 |
| Validation Loss | 0.202610 |
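For reference, positive-class precision, recall, and F1 can be reproduced from raw predictions with a few lines of plain Python (a sketch; the actual evaluation script is not published with this card):

```python
def binary_prf(y_true, y_pred):
    """Precision, recall, and F1 for the positive (spam = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```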

🏗️ Training Configuration

| Parameter        | Value                     |
|------------------|---------------------------|
| Base Model       | xlm-roberta-base          |
| Training Samples | 77,456                    |
| Epochs           | 3                         |
| Learning Rate    | 2e-5                      |
| Batch Size       | 16 (train), 32 (eval)     |
| Optimizer        | AdamW                     |
| Framework        | Hugging Face Transformers |
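Collected as code, the table above corresponds roughly to the following hyperparameter set (a sketch; the exact training script is not published, and dataset preparation is omitted):

```python
# Hyperparameters from the table above, as keyword arguments that could be
# passed to transformers.TrainingArguments(output_dir=..., **training_args).
training_args = dict(
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    optim="adamw_torch",  # AdamW optimizer
)
```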

🚀 Usage

Preprocessing

import re

def clean_text(text):
    """Normalize noisy social-media text before tokenization."""
    if not isinstance(text, str):
        return text
    text = text.replace("#", "")                                    # drop hashtag marks, keep the word
    text = re.sub(r"https?://\S+|www\.\S+", "<link>", text)         # mask URLs
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)   # mask email addresses
    text = re.sub(r"@\w+", "<user>", text)                          # mask @mentions
    text = text.replace('"', "").replace("'", "")                   # strip quote characters
    text = text.replace("\n", " ").replace("\\n", " ")              # newlines (real and escaped) to spaces
    text = re.sub(r"\s+", " ", text).strip()                        # collapse whitespace
    return text

# Example:
# clean_text("Visit https://bit.ly/win now! @promo_bot #giveaway\n")
# → "Visit <link> now! <user> giveaway"

Quick Inference (Single Text)

from transformers import pipeline
import torch
clf = pipeline(
    "text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",      # repo id from this card; swap in your own checkpoint if needed
    tokenizer="nahiar/spam-detection-xlm-roberta-v3",
    device=0 if torch.cuda.is_available() else -1
)


# Spanish-language movie-review post, as a sample of noisy multilingual input
text = """
Gran película, pero el final sí está en otra categoría. El mejor final que vi este año, y ese le va a dar todos los premios. Irán tiene historias bien poderosas, esta vale muchísimo la pena, y creo que recién llegó a cines. #itwasjustanaccident #iran #cine #jafarpanahi
"""

text = clean_text(text)
print(text)
print(clf(text))

Batch Inference

from transformers import pipeline
import torch
clf = pipeline(
    "text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",      # repo id from this card; swap in your own checkpoint if needed
    tokenizer="nahiar/spam-detection-xlm-roberta-v3",
    device=0 if torch.cuda.is_available() else -1
)

batch_size = 32
results = []

texts = [
    "اللحظة التي توقف فيها البث! 😱 صاروخ يصيب استديو الإيرانية وهي تقدم الخبر! شاهد ما حدث في الثواني الأخيرة! 👇🔥 #إيران | #طهران | #عاجل | #اكسبلور | #fyp إيران | طهران | انفجار | قصف | مذيعة | أخبار عاجلة | بوز | روينة | تصعيد | 2026 | فيديو صادم | امير_ | Iran | Tehran | Attack",  # Arabic clickbait-style post
    "ตึงเครียด! IRGC ยิงขีปนาวุธใส่เรือ USS Abraham Lincoln ของสหรัฐฯ",  # Thai news headline
    "#HubunganBilateral #Indonesia #Brasil #BeritaTerkini #InformasiPublik #BergerakBerdampak #SetahunBerdampak https://t.co/9aGPb5ACIo",  # Indonesian hashtag-only post with a shortened link
    "Gran película, pero el final sí está en otra categoría. El mejor final que vi este año, y ese le va a dar todos los premios. Irán tiene historias bien poderosas, esta vale muchísimo la pena, y creo que recién llegó a cines. #itwasjustanaccident #iran #cine #jafarpanahi"  # Spanish movie review
]

for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    preds = clf(
        batch,
        truncation=True
    )
    results.extend(preds)

labels = [
    1 if p["label"] in ["LABEL_1", "spam"] else 0
    for p in results
]

output = [
    {"text": t, "spam": l}
    for t, l in zip(texts, labels)
]
print(output)
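Each prediction also carries a confidence score, so a deployment can apply its own threshold instead of trusting the argmax label. A sketch (the 0.8 threshold is illustrative, not a recommendation from this card — tune it on your own validation data):

```python
def apply_threshold(preds, threshold=0.8):
    """Flag a text as spam only when LABEL_1 is predicted with high confidence.

    `preds` is a list of pipeline outputs like {"label": "LABEL_1", "score": 0.97}.
    The default threshold of 0.8 is an illustrative assumption.
    """
    return [
        1 if p["label"] == "LABEL_1" and p["score"] >= threshold else 0
        for p in preds
    ]
```

Raising the threshold trades recall for precision, which is usually the right direction for moderation queues where false positives are costly.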

🎯 Intended Use Cases

  • Social media spam analysis
  • Comment & post filtering
  • Content moderation assistance
  • Political monitoring
  • Brand & organization tracking
  • Multilingual content intelligence systems

⚠️ Limitations

  • Supports only the binary label set: ['ham', 'spam']
  • Not optimized for:
    • Formal academic/legal documents
    • Extremely short or ambiguous messages
    • Heavy slang or sarcastic expressions
  • Performance may degrade on highly code-mixed sentences
  • The model may inherit bias from training data

⚖️ Ethical Considerations

This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.

It is not intended to replace human judgment in high-risk or sensitive decision-making systems.

Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.


🖥️ Hardware Recommendations

  • Recommended: GPU (≥ 8GB VRAM) for optimal performance
  • CPU inference supported but slower
  • Compatible with FP16 mixed precision for faster inference
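As a sketch of the FP16 option, recent `transformers` versions accept a `torch_dtype` argument on `pipeline(...)`, including the string form shown here (check your installed version; this is an assumption, not a tested configuration):

```python
# Keyword arguments for transformers.pipeline(**fp16_kwargs) with half precision.
# "float16" halves weight memory and speeds up inference on supporting GPUs;
# keep the default float32 for CPU inference.
fp16_kwargs = dict(
    task="text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",  # repo id from this card
    torch_dtype="float16",
    device=0,  # first CUDA GPU; use device=-1 for CPU (and drop torch_dtype)
)
```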

📜 License

Released under the Apache 2.0 License.
Free for commercial and research use.


📚 Citation

@misc{purba2026multilingualspamanalysis,
  author    = {M. Iqbal Purba},
  title     = {Multilingual Spam Analysis for Social Media},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/iqbalpurba26/dev-emot-indobert}
}

🙌 Acknowledgements

  • Hugging Face Transformers
  • Facebook AI Research — XLM-RoBERTa
  • Open-source NLP community
  • Contributors and dataset annotators