---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:2964
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: Omartificial-Intelligence-Space/SA-BERT-V1
widget:
  - source_sentence: كم تكلفة رحلة بحرية ليوم؟
    sentences:
      - الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق.
      - أبي محلات فيها بضاعة عالمية مشهورة.
      - بكم أسعار الجولات البحرية اليومية؟
  - source_sentence: الفعاليات الشعبية تختلف حسب المناسبات.
    sentences:
      - معطف المختبر حقي اختفى وأبي أشتري بديل
      - بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب
      - بعض المناطق تتميز بطرق احتفال خاصة بها.
  - source_sentence: الأسوار القديمة كانت تحمي المدن زمان.
    sentences:
      - بجلس أصلّحها قبل أرسلها
      - بعض المدن احتفظت بأبوابها التاريخية.
      - بجلس أشتغل عليها وسط اليوم
  - source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي
    sentences:
      - أبغى أشارك بجولة سفاري بالربع الخالي
      - هذا التمرين ضروري لنحت منطقة البطن والخصر.
      - ودي أعرف عن فنادق فخمة بالدمام
  - source_sentence: أبي طرحة جديدة لونها سماوي فاتح.
    sentences:
      - المشاوي عندهم متبلة صح وتحسها طازجة
      - ريحة المعطرات هذي قوية وتقعد في الغرف؟
      - أدور شيلة لونها أزرق فاتح زي السماء.
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: cc-by-nc-4.0
gated: true
extra_gated_prompt: Please provide the required information to access this model
extra_gated_fields:
  Full Name:
    type: text
  Affiliation / Company:
    type: text
  Email Address:
    type: text
  Intended Use:
    type: select
    options:
      - Research
      - Education
      - Commercial Exploration
      - Academic Project
      - Other
extra_gated_heading: "Access Request: Provide Required Information"
extra_gated_description: Before accessing this model, please complete the form below.
extra_gated_button_content: Submit Access Request
datasets:
  - Omartificial-Intelligence-Space/SaudiDialect-Triplet-21
language:
  - ar
metrics:
  - mse
  - mae
---

# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1)


## 🧩 Summary

SABER-v0.1 (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi-dialect semantic embedding model, fine-tuned from SA-BERT with MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning over a high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.

SABER transforms a standard Masked Language Model (MLM) into a powerful semantic encoder capable of capturing deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects. The model achieves state-of-the-art results on both long-paragraph STS evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.

## 🏗️ Architecture & Build Pipeline

SABER is built in a two-stage optimization pipeline: first, MARBERT-V2 was adapted via Masked Language Modeling (MLM) on 500k Saudi sentences to create the domain-specialized SA-BERT; then SA-BERT underwent deep semantic optimization with MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning on curated triplets to produce the final embedding model.

*SABER Training Pipeline*

SABER is designed for:

  • Semantic search
  • Retrieval-Augmented Generation (RAG)
  • Clustering
  • Intent detection
  • Semantic similarity
  • Document & paragraph embedding
  • Ranking and re-ranking systems
  • Multi-domain Saudi-language applications
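The retrieval-style use cases above all reduce to ranking corpus embeddings by cosine similarity against a query embedding. A minimal NumPy sketch of that ranking step (the toy 2-d vectors stand in for real `model.encode(...)` outputs, which for SABER are 768-dimensional):

```python
import numpy as np

def top_k(query_emb, corpus_embs, k=2):
    """Return indices and scores of the k most cosine-similar corpus rows."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                   # cosine similarity of each corpus row vs. the query
    idx = np.argsort(-scores)[:k]    # best-first
    return idx, scores[idx]

# Toy 3-document corpus in a 2-d embedding space
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.1])
idx, scores = top_k(query, corpus, k=2)
print(idx)  # nearest documents first
```

In practice `query` and `corpus` would come from `model.encode`, and a vector index (e.g. FAISS) would replace the brute-force sort at scale.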

This release is v0.1 — the first public version of SABER.

## 📌 Model Details

  • Model Name: SABER (Saudi Semantic Embedding)
  • Version: v0.1
  • Base Model: SA-BERT-V1 (MARBERT-V2 adapted to Saudi data via MLM)
  • Language: Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
  • Task: Sentence Embeddings, Semantic Similarity, Retrieval
  • Training Objective: MNRL + Matryoshka Loss
  • Embedding Dimension: 768
  • License: CC BY-NC 4.0
  • Maintainer: Omartificial-Intelligence-Space

## 🧠 Motivation

Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by:

  1. Training specifically on Saudi-dialect triplet data.
  2. Leveraging modern contrastive learning.
  3. Creating robust embeddings suitable for production and research.

This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.


## ⚠️ Limitations

  1. Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
  2. Scope: Embeddings focus on semantic similarity, not syntax or classification.
  3. Input Length: Long multi-document retrieval requires chunking.
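For the input-length limitation, a naive whitespace chunker illustrates the idea; production systems usually chunk by tokenizer tokens rather than words, and the function name and window sizes below are illustrative only:

```python
def chunk_text(text, max_words=100, overlap=20):
    """Split text into overlapping word windows so each chunk fits the encoder."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Each chunk is then embedded separately, and retrieval scores are aggregated (e.g. max over chunks) per document.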

## 📚 Training Data

SABER was trained on Omartificial-Intelligence-Space/SaudiDialect-Triplet-21, which contains:

  • 2964 triplets (Anchor, Positive, Negative)
  • 21 domains, including:
    • Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
  • Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
  • Real-world conversational phrasing
  • Carefully curated positive/negative pairs

The dataset includes natural variations in:

  • Word choice
  • Dialect morphology
  • Sentence structure
  • Discourse context
  • Multi-sentence reasoning
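Each training record is an (Anchor, Positive, Negative) triplet. The sketch below shows one such record built from this card's own widget examples; the dict keys are illustrative, not necessarily the dataset's actual column names:

```python
# One illustrative triplet (field names assumed, not the dataset's schema)
triplet = {
    "anchor":   "ودي أجرب رحلة سفاري بصحراء الربع الخالي",  # "I'd like to try a safari trip in the Empty Quarter"
    "positive": "أبغى أشارك بجولة سفاري بالربع الخالي",      # paraphrase: same intent, different dialect phrasing
    "negative": "ودي أعرف عن فنادق فخمة بالدمام",            # same domain (travel) but a different intent
}

def is_well_formed(t):
    """Basic sanity check: all three fields present and mutually distinct."""
    return {"anchor", "positive", "negative"} <= t.keys() and len(set(t.values())) == 3
```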

## 🔧 Training Methodology

SABER was fine-tuned using:

  1. MultipleNegativesRankingLoss (MNRL)

    • Transforms the embedding space so similar pairs cluster tightly.
    • Each batch uses in-batch negatives, dramatically improving separation.
  2. Matryoshka Representation Learning

    • Ensures embeddings remain meaningful across different vector truncation sizes.
  3. Triplet Ranking Optimization

    • Anchor–Positive similarity maximized.
    • Anchor–Negative similarity minimized.
    • Margin-based structure preserved.
  4. Optimizer & Hyperparameters

| Hyperparameter | Value |
|---|---|
| Batch Size | 16 |
| Epochs | 3 |
| Loss | MNRL + Matryoshka |
| Precision | FP16 |
| Negative Sampling | In-batch |
| Gradient Clipping | Stable defaults |
| Warmup Ratio | 0.1 |
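The in-batch negative mechanism of step 1 can be sketched in NumPy: scaled cosine similarities form a batch×batch matrix, and cross-entropy pushes each diagonal (anchor, positive) pair above its off-diagonal in-batch negatives. This is a sketch of the idea, not the sentence-transformers implementation, and the scale of 20 is an assumed common default:

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """In-batch MNRL: each anchor's own positive is the target class;
    every other positive in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                    # (batch, batch) cosine similarities
    sims -= sims.max(axis=1, keepdims=True)     # numerical stability for softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())  # correct pairs lie on the diagonal
```

With perfectly aligned pairs the loss is near zero; shuffling the positives against their anchors makes it large, which is exactly the separation pressure MNRL applies.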

## 🧪 Evaluation

SABER was evaluated on two benchmarks:

### A) STS Evaluation (Saudi Paragraph-Level Dataset)

Dataset: 1,000 sentence pairs in Saudi dialect, each scored 0–5 for similarity.

| Metric | Score |
|---|---|
| Pearson | 0.9189 |
| Spearman | 0.9045 |
| MAE | 1.69 |
| MSE | 3.82 |

These results surpass: ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.

### B) Triplet Evaluation

Triplets were derived from the STS data: pairs scored ≥ 3 served as positives and pairs scored ≤ 1 as negatives.

| Metric | Score |
|---|---|
| Basic Accuracy | 0.9899 |
| Margin > 0.05 | 0.9845 |
| Margin > 0.10 | 0.9781 |
| Margin > 0.20 | 0.9609 |

Excellent separation across strict thresholds.
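The margin metrics above count the fraction of triplets whose anchor–positive similarity beats the anchor–negative similarity by a given threshold. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def margin_accuracy(anchors, positives, negatives, margin=0.0):
    """Fraction of triplets where cos(anchor, positive) exceeds
    cos(anchor, negative) by more than `margin`."""
    def cos(x, y):
        return (x * y).sum(axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    return float((cos(anchors, positives) - cos(anchors, negatives) > margin).mean())
```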


## 🔍 Usage Example

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Define sentences (Saudi Dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"

# Encode
e1 = model.encode([s1])
e2 = model.encode([s2])

# Calculate similarity
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```

## Training Details

### Training Dataset

  • Dataset: Omartificial-Intelligence-Space/SaudiDialect-Triplet-21 (loaded from CSV)
  • Size: 2,964 training samples
  • Columns: text1 and text2
  • Approximate statistics based on the first 1000 samples:

|  | text1 | text2 |
|---|---|---|
| type | string | string |
| min | 5 tokens | 4 tokens |
| mean | 10.36 tokens | 10.28 tokens |
| max | 22 tokens | 19 tokens |
  • Samples:

| text1 | text2 |
|---|---|
| هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
| ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
| عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |
  • Loss: MatryoshkaLoss with these parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768
    ],
    "matryoshka_weights": [
        1
    ],
    "n_dims_per_step": -1
}
```
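Matryoshka training applies the inner loss to prefix slices of the embedding and sums the results with per-dimension weights; with `matryoshka_dims` of `[768]` and weight 1 as configured above, it reduces to plain MNRL at full dimension. A NumPy sketch of the weighting scheme, where the inner loss is a simplified stand-in for MNRL:

```python
import numpy as np

def matched_pair_loss(a, p):
    """Stand-in inner loss: 1 - mean cosine similarity of matched pairs
    (the actual recipe uses MultipleNegativesRankingLoss here)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    return 1.0 - (a * p).sum(axis=1).mean()

def matryoshka_loss(a, p, dims=(768,), weights=(1.0,)):
    """Evaluate the inner loss on each prefix slice and combine with weights."""
    return sum(w * matched_pair_loss(a[:, :d], p[:, :d])
               for d, w in zip(dims, weights))
```

Training with several `dims` (e.g. 768 and 256) is what makes truncated embeddings remain usable; this release trains the 768-dimensional slice only.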
    

## 📌 Commercial Use

Commercial use of this model is not permitted under the CC BY-NC 4.0 license.
For commercial licensing, partnerships, or enterprise use, please contact:

📩 [email protected]

## Citation

If you use this model in academic work, please cite:

```bibtex
@inproceedings{nacar-saber-2025,
    title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
    author = "Nacar, Omer",
    year = "2025",
    url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```

### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```