---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:2964
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Omartificial-Intelligence-Space/SA-BERT-V1
widget:
- source_sentence: كم تكلفة رحلة بحرية ليوم؟
  sentences:
  - الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق.
  - أبي محلات فيها بضاعة عالمية مشهورة.
  - بكم أسعار الجولات البحرية اليومية؟
- source_sentence: الفعاليات الشعبية تختلف حسب المناسبات.
  sentences:
  - معطف المختبر حقي اختفى وأبي أشتري بديل
  - بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب
  - بعض المناطق تتميز بطرق احتفال خاصة بها.
- source_sentence: الأسوار القديمة كانت تحمي المدن زمان.
  sentences:
  - بجلس أصلّحها قبل أرسلها
  - بعض المدن احتفظت بأبوابها التاريخية.
  - بجلس أشتغل عليها وسط اليوم
- source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي
  sentences:
  - أبغى أشارك بجولة سفاري بالربع الخالي
  - هذا التمرين ضروري لنحت منطقة البطن والخصر.
  - ودي أعرف عن فنادق فخمة بالدمام
- source_sentence: أبي طرحة جديدة لونها سماوي فاتح.
  sentences:
  - المشاوي عندهم متبلة صح وتحسها طازجة
  - ريحة المعطرات هذي قوية وتقعد في الغرف؟
  - أدور شيلة لونها أزرق فاتح زي السماء.
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: cc-by-nc-4.0
gated: true
extra_gated_prompt: Please provide the required information to access this model
extra_gated_fields:
  Full Name:
    type: text
  Affiliation / Company:
    type: text
  Email Address:
    type: text
  Intended Use:
    type: select
    options:
    - Research
    - Education
    - Commercial Exploration
    - Academic Project
    - Other
extra_gated_heading: Access Request – Provide Required Information
extra_gated_description: Before accessing this model, please complete the form below.
extra_gated_button_content: Submit Access Request
datasets:
- Omartificial-Intelligence-Space/SaudiDialect-Triplet-21
language:
- ar
metrics:
- mse
- mae
---

# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1)

![Black Elegant Minimalist Profile LinkedIn Banner](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/mWfh710vwZ_TsW7IXenNf.png)

## 🧩 Summary

**SABER-v0.1** (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi-dialect semantic embedding model. It was fine-tuned from SA-BERT with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** on a large, high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.

**SABER** transforms a standard Masked Language Model (MLM) into a powerful semantic encoder that captures deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects. The model achieves state-of-the-art results on both long-paragraph STS evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.

## 🏗️ Architecture & Build Pipeline

SABER is built with a rigorous two-stage optimization pipeline. First, we adapted **MARBERT-V2** via Masked Language Modeling (MLM) on **500k Saudi sentences** to create the domain-specialized **SA-BERT**. We then applied deep semantic optimization with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** on curated triplets to produce the final embedding model.

*SABER Training Pipeline*
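The first of these two stages is standard domain-adaptive MLM. As a rough illustration, the sketch below continues masked-language pretraining over a plain-text Saudi corpus with Hugging Face `transformers`. The corpus file name (`saudi_sentences.txt`) and all hyperparameters are illustrative assumptions, not the exact recipe used to build SA-BERT.

```python
# Stage 1 (illustrative): domain-adaptive MLM on Saudi text.
# The real run used 500k Saudi sentences; the file below is a stand-in.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# Hypothetical corpus: one Saudi sentence per line.
corpus = load_dataset("text", data_files={"train": "saudi_sentences.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Randomly mask 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sa-bert-mlm", num_train_epochs=1,
                           per_device_train_batch_size=32, fp16=True),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```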
---

**SABER is designed for:**

* Semantic search
* Retrieval-Augmented Generation (RAG)
* Clustering
* Intent detection
* Semantic similarity
* Document & paragraph embedding
* Ranking and re-ranking systems
* Multi-domain Saudi-language applications

*This release, v0.1, is the first public version of SABER.*

## 📌 Model Details

* **Model Name:** SABER (Saudi Semantic Embedding)
* **Version:** v0.1
* **Base Model:** SA-BERT-V1 (MARBERT-V2 adapted to Saudi data via MLM)
* **Language:** Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
* **Task:** Sentence Embeddings, Semantic Similarity, Retrieval
* **Training Objective:** MNRL + Matryoshka Loss
* **Embedding Dimension:** 768
* **License:** CC BY-NC 4.0
* **Maintainer:** Omartificial-Intelligence-Space

---

## 🧠 Motivation

Saudi dialect NLP remains an underdeveloped space. Most embedding models struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning.

SABER was designed to fill this gap by:

1. Training specifically on Saudi-dialect triplet data.
2. Leveraging modern contrastive learning.
3. Creating robust embeddings suitable for production and research.

This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.

---

#### ⚠️ Limitations

1. **Regional scope:** Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
2. **Task scope:** Embeddings are optimized for semantic similarity, not syntax or classification.
3. **Input length:** Long multi-document retrieval requires chunking.

---

## 📚 Training Data

**SABER** was trained on [Omartificial-Intelligence-Space/SaudiDialect-Triplet-21](https://huggingface.co/datasets/Omartificial-Intelligence-Space/SaudiDialect-Triplet-21), which contains:

* **2,964 triplets** (Anchor, Positive, Negative)
* **21 domains**, including:
  * Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
* Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
* Real-world conversational phrasing
* Carefully curated positive/negative pairs

**The dataset includes natural variations in:**

* Word choice
* Dialect morphology
* Sentence structure
* Discourse context
* Multi-sentence reasoning
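To get a feel for the triplet data, it can be loaded straight from the Hub. A minimal sketch follows; the split and column names printed by this snippet should be checked against the dataset card rather than assumed.

```python
# Minimal sketch: load and inspect the SaudiDialect-Triplet-21 dataset.
from datasets import load_dataset

ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21")
print(ds)              # available splits, column names, and sizes
print(ds["train"][0])  # one (anchor, positive, negative) triplet
```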
---

## 🔧 Training Methodology

SABER was fine-tuned using:

1. **MultipleNegativesRankingLoss (MNRL)**
   * Transforms the embedding space so similar pairs cluster tightly.
   * Each batch uses in-batch negatives, dramatically improving separation.
2. **Matryoshka Representation Learning**
   * Ensures embeddings remain meaningful across different vector truncation sizes.
3. **Triplet Ranking Optimization**
   * Anchor–Positive similarity is maximized.
   * Anchor–Negative similarity is minimized.
   * Margin-based structure is preserved.
4. **Optimizer & Hyperparameters**

| Hyperparameter | Value |
| :--- | :--- |
| **Batch Size** | 16 |
| **Epochs** | 3 |
| **Loss** | MNRL + Matryoshka |
| **Precision** | FP16 |
| **Negative Sampling** | In-batch |
| **Gradient Clip** | Library default |
| **Warmup Ratio** | 0.1 |
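The following is a minimal sketch of this fine-tuning stage using the `sentence-transformers` trainer API, with MNRL wrapped in `MatryoshkaLoss` and the hyperparameters from the table above. It assumes the triplet dataset exposes anchor/positive/negative columns in that order; treat it as an outline under those assumptions, not the exact training script.

```python
# Minimal sketch of the SABER fine-tuning stage:
# MultipleNegativesRankingLoss wrapped in MatryoshkaLoss.
from datasets import load_dataset
from sentence_transformers import (SentenceTransformer,
                                   SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import (MatryoshkaLoss,
                                          MultipleNegativesRankingLoss)

model = SentenceTransformer("Omartificial-Intelligence-Space/SA-BERT-V1")
train_ds = load_dataset(
    "Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train"
)

# In-batch negatives: every other example in the batch acts as a negative.
base_loss = MultipleNegativesRankingLoss(model)
# Matryoshka: also supervise truncated prefixes of the embedding
# (here only the full 768 dims, matching the released config).
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,
)
trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_ds, loss=loss
)
trainer.train()
```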
---

## 🧪 Evaluation

SABER was evaluated on two benchmarks:

### A) STS Evaluation (Saudi Paragraph-Level Dataset)

**Dataset:** 1,000 Saudi-dialect samples scored for similarity on a 0–5 scale.

| Metric | Score |
| :--- | :--- |
| **Pearson** | **0.9189** |
| **Spearman** | **0.9045** |
| **MAE** | 1.69 |
| **MSE** | 3.82 |

*These results surpass: ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.*

### B) Triplet Evaluation

Triplets were derived from the STS data by taking pairs with score ≥ 3 as positives and pairs with score ≤ 1 as negatives.

| Metric | Score |
| :--- | :--- |
| **Basic Accuracy** | 0.9899 |
| **Margin > 0.05** | 0.9845 |
| **Margin > 0.10** | 0.9781 |
| **Margin > 0.20** | 0.9609 |

*Excellent separation across strict thresholds.*
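As a sketch of how the margin metrics above can be computed: a triplet counts as correct when the anchor-positive similarity exceeds the anchor-negative similarity by more than the margin (margin 0 corresponds to basic accuracy). The toy triplet below is a placeholder; the real evaluation used triplets derived from the STS set as described above.

```python
# Minimal sketch of the triplet margin evaluation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1"
)

# Hypothetical triplets: (anchor, positive, negative) strings.
triplets = [
    ("ودي أسافر للرياض", "أفكر أروح الرياض قريب", "الجو اليوم حار مرة"),
]
anchors, positives, negatives = map(list, zip(*triplets))

# normalize_embeddings=True makes dot products equal cosine similarities.
ea = model.encode(anchors, normalize_embeddings=True)
ep = model.encode(positives, normalize_embeddings=True)
en = model.encode(negatives, normalize_embeddings=True)

sim_pos = (ea * ep).sum(axis=1)
sim_neg = (ea * en).sum(axis=1)

# margin 0.0 reproduces "Basic Accuracy"; the others match the table rows.
for margin in (0.0, 0.05, 0.10, 0.20):
    acc = float(np.mean(sim_pos - sim_neg > margin))
    print(f"margin > {margin:.2f}: accuracy = {acc:.4f}")
```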
---

## 🔍 Usage Example

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Define sentences (Saudi dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"

# Encode each sentence into a 768-dimensional vector
e1 = model.encode([s1])
e2 = model.encode([s2])

# Calculate cosine similarity between the two embeddings
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```

## Training Details

### Training Dataset

#### csv

* Dataset: csv
* Size: 2,964 training samples
* Columns: text1 and text2
* Approximate statistics based on the first 1000 samples:

| | text1 | text2 |
|:-----|:-------|:-------|
| type | string | string |

* Samples:

| text1 | text2 |
|:------|:------|
| هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
| ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
| عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |

* Loss: [MatryoshkaLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768
    ],
    "matryoshka_weights": [
        1
    ],
    "n_dims_per_step": -1
}
```

## 📌 Commercial Use

Commercial use of this model is **not permitted** under the CC BY-NC 4.0 license.

For commercial licensing, partnerships, or enterprise use, please contact:

📩 **eng.omarnj@gmail.com**

## Citation

If you use this model in academic work, please cite:

```bibtex
@inproceedings{nacar-saber-2025,
  title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
  author = "Nacar, Omer",
  year = "2025",
  url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss

```bibtex
@misc{kusupati2022matryoshka,
  title = {Matryoshka Representation Learning},
  author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
  year = {2022},
  eprint = {2205.13147},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
  title = {Efficient Natural Language Response Suggestion for Smart Reply},
  author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
  year = {2017},
  eprint = {1705.00652},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```