🌍 Multilingual Spam Analysis for Social Media

Indonesian 🇮🇩 | English | Binary Classification | XLM-RoBERTa Base

A fine-tuned xlm-roberta-base model for Spam Analysis on noisy social media text.

This model is optimized for multilingual informal content commonly found on:

  • Twitter / X
  • Instagram
  • TikTok
  • Facebook
  • Online forums

It supports Bahasa Indonesia, English, Arabic, Portuguese, and many other languages, making it suitable for moderation systems, social listening, and content-intelligence pipelines.


🔍 Model Overview

  • Architecture: FacebookAI/xlm-roberta-base
  • Task: Text Classification (Spam Analysis)
  • Languages: Indonesian, English, Arabic, Portuguese, and others
  • Parameters: ~0.3B (F32 safetensors)
  • Domain: Informal & Social Media Text
  • Training Date: 2026-03-05

🏷️ Supported Labels

This model distinguishes two classes:

| Label   | Description    |
|---------|----------------|
| LABEL_0 | Ham (Not Spam) |
| LABEL_1 | Spam           |
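The pipeline returns the raw `LABEL_0`/`LABEL_1` ids unless `id2label` is set in the model config. A minimal sketch for turning them into readable names (the `ID2NAME` mapping is an assumption based on the table above, not part of the model config):

```python
# Hypothetical mapping from raw pipeline label ids to readable names,
# derived from the label table above.
ID2NAME = {"LABEL_0": "ham", "LABEL_1": "spam"}

def readable(pred):
    """Turn one pipeline prediction dict into a (name, score) tuple."""
    return ID2NAME.get(pred["label"], pred["label"]), pred["score"]
```

For example, `readable({"label": "LABEL_1", "score": 0.98})` returns `("spam", 0.98)`.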

📊 Model Performance

Evaluated on a held-out validation set:

| Metric          | Score    |
|-----------------|----------|
| F1 Score        | 0.96     |
| Precision       | 0.96     |
| Recall          | 0.96     |
| Training Loss   | 0.082100 |
| Validation Loss | 0.202610 |
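For reference, positive-class precision, recall, and F1 can be reproduced from raw predictions with a few lines of plain Python (a sketch; the actual evaluation script is not published with this card):

```python
def binary_prf(y_true, y_pred):
    """Precision, recall, and F1 for the positive (spam = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```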

🏗️ Training Configuration

| Parameter        | Value                     |
|------------------|---------------------------|
| Base Model       | xlm-roberta-base          |
| Training Samples | 77,456                    |
| Epochs           | 3                         |
| Learning Rate    | 2e-5                      |
| Batch Size       | 16 (train), 32 (eval)     |
| Optimizer        | AdamW                     |
| Framework        | Hugging Face Transformers |
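Collected as code, the table above corresponds roughly to the following hyperparameter set (a sketch; the exact training script is not published, and dataset preparation is omitted):

```python
# Hyperparameters from the table above, as keyword arguments that could be
# passed to transformers.TrainingArguments(output_dir=..., **training_args).
training_args = dict(
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    optim="adamw_torch",  # AdamW optimizer
)
```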

🚀 Usage

Preprocessing

import re

def clean_text(text):
    """Normalize noisy social-media text before tokenization."""
    if not isinstance(text, str):
        return text
    text = text.replace("#", "")                                    # drop hashtag marks, keep the word
    text = re.sub(r"https?://\S+|www\.\S+", "<link>", text)         # mask URLs
    text = re.sub(r"\b[\w\.-]+@[\w\.-]+\.\w+\b", "<email>", text)   # mask email addresses
    text = re.sub(r"@\w+", "<user>", text)                          # mask @mentions
    text = text.replace('"', "").replace("'", "")                   # strip quote characters
    text = text.replace("\n", " ").replace("\\n", " ")              # newlines (real and escaped) to spaces
    text = re.sub(r"\s+", " ", text).strip()                        # collapse whitespace
    return text

# Example:
# clean_text("Visit https://bit.ly/win now! @promo_bot #giveaway\n")
# → "Visit <link> now! <user> giveaway"

Quick Inference (Single Text)

from transformers import pipeline
import torch
clf = pipeline(
    "text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",      # repo id from this card; swap in your own checkpoint if needed
    tokenizer="nahiar/spam-detection-xlm-roberta-v3",
    device=0 if torch.cuda.is_available() else -1
)


# Spanish-language movie-review post, as a sample of noisy multilingual input
text = """
Gran película, pero el final sí está en otra categoría. El mejor final que vi este año, y ese le va a dar todos los premios. Irán tiene historias bien poderosas, esta vale muchísimo la pena, y creo que recién llegó a cines. #itwasjustanaccident #iran #cine #jafarpanahi
"""

text = clean_text(text)
print(text)
print(clf(text))

Batch Inference

from transformers import pipeline
import torch
clf = pipeline(
    "text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",      # repo id from this card; swap in your own checkpoint if needed
    tokenizer="nahiar/spam-detection-xlm-roberta-v3",
    device=0 if torch.cuda.is_available() else -1
)

batch_size = 32
results = []

texts = [
    "اللحظة التي توقف فيها البث! 😱 صاروخ يصيب استديو الإيرانية وهي تقدم الخبر! شاهد ما حدث في الثواني الأخيرة! 👇🔥 #إيران | #طهران | #عاجل | #اكسبلور | #fyp إيران | طهران | انفجار | قصف | مذيعة | أخبار عاجلة | بوز | روينة | تصعيد | 2026 | فيديو صادم | امير_ | Iran | Tehran | Attack",  # Arabic clickbait-style post
    "ตึงเครียด! IRGC ยิงขีปนาวุธใส่เรือ USS Abraham Lincoln ของสหรัฐฯ",  # Thai news headline
    "#HubunganBilateral #Indonesia #Brasil #BeritaTerkini #InformasiPublik #BergerakBerdampak #SetahunBerdampak https://t.co/9aGPb5ACIo",  # Indonesian hashtag-only post with a shortened link
    "Gran película, pero el final sí está en otra categoría. El mejor final que vi este año, y ese le va a dar todos los premios. Irán tiene historias bien poderosas, esta vale muchísimo la pena, y creo que recién llegó a cines. #itwasjustanaccident #iran #cine #jafarpanahi"  # Spanish movie review
]

for i in range(0, len(texts), batch_size):
    batch = texts[i:i+batch_size]
    preds = clf(
        batch,
        truncation=True
    )
    results.extend(preds)

labels = [
    1 if p["label"] in ["LABEL_1", "spam"] else 0
    for p in results
]

output = [
    {"text": t, "spam": l}
    for t, l in zip(texts, labels)
]
print(output)
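Each prediction also carries a confidence score, so a deployment can apply its own threshold instead of trusting the argmax label. A sketch (the 0.8 threshold is illustrative, not a recommendation from this card — tune it on your own validation data):

```python
def apply_threshold(preds, threshold=0.8):
    """Flag a text as spam only when LABEL_1 is predicted with high confidence.

    `preds` is a list of pipeline outputs like {"label": "LABEL_1", "score": 0.97}.
    The default threshold of 0.8 is an illustrative assumption.
    """
    return [
        1 if p["label"] == "LABEL_1" and p["score"] >= threshold else 0
        for p in preds
    ]
```

Raising the threshold trades recall for precision, which is usually the right direction for moderation queues where false positives are costly.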

🎯 Intended Use Cases

  • Social media spam analysis
  • Comment & post filtering
  • Content moderation assistance
  • Political monitoring
  • Brand & organization tracking
  • Multilingual content intelligence systems

⚠️ Limitations

  • Supports only the binary label set: ['ham', 'spam']
  • Not optimized for:
    • Formal academic/legal documents
    • Extremely short or ambiguous messages
    • Heavy slang or sarcastic expressions
  • Performance may degrade on highly code-mixed sentences
  • The model may inherit bias from training data

⚖️ Ethical Considerations

This model may reflect demographic, geopolitical, or cultural biases present in the training dataset.

It is not intended to replace human judgment in high-risk or sensitive decision-making systems.

Human-in-the-loop review is strongly recommended for moderation or governance-related deployments.


🖥️ Hardware Recommendations

  • Recommended: GPU (≥ 8GB VRAM) for optimal performance
  • CPU inference supported but slower
  • Compatible with FP16 mixed precision for faster inference
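As a sketch of the FP16 option, recent `transformers` versions accept a `torch_dtype` argument on `pipeline(...)`, including the string form shown here (check your installed version; this is an assumption, not a tested configuration):

```python
# Keyword arguments for transformers.pipeline(**fp16_kwargs) with half precision.
# "float16" halves weight memory and speeds up inference on supporting GPUs;
# keep the default float32 for CPU inference.
fp16_kwargs = dict(
    task="text-classification",
    model="nahiar/spam-detection-xlm-roberta-v3",  # repo id from this card
    torch_dtype="float16",
    device=0,  # first CUDA GPU; use device=-1 for CPU (and drop torch_dtype)
)
```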

📜 License

Released under the Apache 2.0 License.
Free for commercial and research use.


📚 Citation

@misc{purba2026multilingualspamanalysis,
  author    = {M. Iqbal Purba},
  title     = {Multilingual Spam Analysis for Social Media},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/iqbalpurba26/dev-emot-indobert}
}

🙌 Acknowledgements

  • Hugging Face Transformers
  • Facebook AI Research — XLM-RoBERTa
  • Open-source NLP community
  • Contributors and dataset annotators