|
|
--- |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- sentence-similarity |
|
|
- feature-extraction |
|
|
- dense |
|
|
- generated_from_trainer |
|
|
- dataset_size:2964 |
|
|
- loss:MatryoshkaLoss |
|
|
- loss:MultipleNegativesRankingLoss |
|
|
base_model: Omartificial-Intelligence-Space/SA-BERT-V1 |
|
|
widget: |
|
|
- source_sentence: كم تكلفة رحلة بحرية ليوم؟ |
|
|
sentences: |
|
|
- الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق. |
|
|
- أبي محلات فيها بضاعة عالمية مشهورة. |
|
|
- بكم أسعار الجولات البحرية اليومية؟ |
|
|
- source_sentence: الفعاليات الشعبية تختلف حسب المناسبات. |
|
|
sentences: |
|
|
- معطف المختبر حقي اختفى وأبي أشتري بديل |
|
|
- بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب |
|
|
- بعض المناطق تتميز بطرق احتفال خاصة بها. |
|
|
- source_sentence: الأسوار القديمة كانت تحمي المدن زمان. |
|
|
sentences: |
|
|
- بجلس أصلّحها قبل أرسلها |
|
|
- بعض المدن احتفظت بأبوابها التاريخية. |
|
|
- بجلس أشتغل عليها وسط اليوم |
|
|
- source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي |
|
|
sentences: |
|
|
- أبغى أشارك بجولة سفاري بالربع الخالي |
|
|
- هذا التمرين ضروري لنحت منطقة البطن والخصر. |
|
|
- ودي أعرف عن فنادق فخمة بالدمام |
|
|
- source_sentence: أبي طرحة جديدة لونها سماوي فاتح. |
|
|
sentences: |
|
|
- المشاوي عندهم متبلة صح وتحسها طازجة |
|
|
- ريحة المعطرات هذي قوية وتقعد في الغرف؟ |
|
|
- أدور شيلة لونها أزرق فاتح زي السماء. |
|
|
pipeline_tag: feature-extraction |
|
|
library_name: sentence-transformers |
|
|
license: cc-by-nc-4.0 |
|
|
gated: true |
|
|
extra_gated_prompt: Please provide the required information to access this model |
|
|
extra_gated_fields: |
|
|
Full Name: |
|
|
type: text |
|
|
Affiliation / Company: |
|
|
type: text |
|
|
Email Address: |
|
|
type: text |
|
|
Intended Use: |
|
|
type: select |
|
|
options: |
|
|
- Research |
|
|
- Education |
|
|
- Commercial Exploration |
|
|
- Academic Project |
|
|
- Other |
|
|
extra_gated_heading: Access Request – Provide Required Information |
|
|
extra_gated_description: Before accessing this model, please complete the form below. |
|
|
extra_gated_button_content: Submit Access Request |
|
|
datasets: |
|
|
- Omartificial-Intelligence-Space/SaudiDialect-Triplet-21 |
|
|
language: |
|
|
- ar |
|
|
metrics: |
|
|
- mse |
|
|
- mae |
|
|
--- |
|
|
|
|
|
|
|
|
# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1) |
|
|
|
|
|
 |
|
|
|
|
|
## 🧩 Summary |
|
|
|
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/x1FdbE8bYGVQFfY01f1Nk.png" width="175" align="left"/> |
|
|
|
|
|
**SABER-v0.1** (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi dialect semantic embedding model, fine-tuned from SA-BERT using **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** over a high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.
|
|
|
|
|
|
|
|
**SABER** transforms a standard Masked Language Model (MLM) into a powerful semantic encoder capable of capturing deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects.

The model achieves state-of-the-art results on both long-paragraph `STS` evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.
|
|
|
|
|
## 🏗️ Architecture & Build Pipeline |
|
|
|
|
|
SABER is built with a rigorous two-stage optimization pipeline: first, we adapted **MARBERT-V2** via Masked Language Modeling (MLM) on **500k Saudi sentences** to create the domain-specialized **SA-BERT**; we then applied deep semantic optimization with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** on curated triplets to produce the final embedding model.
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/zC5JCPbsnIz-jflmTWae8.png" alt="SABER Training Pipeline" width="500"/> |
|
|
</div> |
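For illustration, here is a minimal sketch of the stage-one MLM adaptation using Hugging Face `transformers`. Only the base model (MARBERT-V2) and the MLM objective come from the description above; the corpus file name, sequence length, masking probability, and training hyperparameters are placeholder assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# "saudi_sentences.txt" is a placeholder standing in for the 500k-sentence corpus.
ds = load_dataset("text", data_files={"train": "saudi_sentences.txt"}, split="train")
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

# Standard 15% random masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="sa-bert-v1", per_device_train_batch_size=32,
                         num_train_epochs=1, fp16=True)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
```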
|
|
|
|
|
--- |
|
|
|
|
|
**SABER is designed for:** |
|
|
* Semantic search |
|
|
* Retrieval-Augmented Generation (RAG) |
|
|
* Clustering |
|
|
* Intent detection |
|
|
* Semantic similarity |
|
|
* Document & paragraph embedding |
|
|
* Ranking and re-ranking systems |
|
|
* Multi-domain Saudi-language applications |
|
|
|
|
|
*This release is v0.1 — the first public version of SABER.* |
|
|
|
|
|
## 📌 Model Details |
|
|
|
|
|
* **Model Name:** SABER (Saudi Semantic Embedding) |
|
|
* **Version:** v0.1 |
|
|
* **Base Model:** SA-BERT-V1 (MARBERT-V2 adapted to Saudi data via MLM)
|
|
* **Language:** Arabic (Saudi Dialects: Najdi, Hijazi, Gulf) |
|
|
* **Task:** Sentence Embeddings, Semantic Similarity, Retrieval |
|
|
* **Training Objective:** MNRL + Matryoshka Loss
|
|
* **Embedding Dimension:** 768 |
|
|
* **License:** CC BY-NC 4.0
|
|
* **Maintainer:** Omartificial-Intelligence-Space |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Motivation |
|
|
|
|
|
Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by: |
|
|
|
|
|
1. Training specifically on Saudi-dialect triplet data. |
|
|
2. Leveraging modern contrastive learning. |
|
|
3. Creating robust embeddings suitable for production and research. |
|
|
|
|
|
The model was validated through extensive evaluation across STS, triplet, and domain-specific tests.
|
|
|
|
|
--- |
|
|
## ⚠️ Limitations
|
|
|
|
|
1. Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.

2. Task Scope: Embeddings focus on semantic similarity, not syntax or classification.

3. Input Length: Long multi-document retrieval requires chunking (see the sketch below).
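As a minimal sketch of the chunking mentioned in limitation 3 (the window and overlap sizes are illustrative assumptions, not values used in training):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

def chunk(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split a long document into overlapping word windows."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

long_document = "..."  # any long Saudi-dialect document
query = "كم تكلفة رحلة بحرية ليوم؟"

chunk_embs = model.encode(chunk(long_document), normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Score the document by its best-matching chunk (max over chunk similarities).
best_score = float((chunk_embs @ query_emb).max())
print(best_score)
```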
|
|
--- |
|
|
|
|
|
## 📚 Training Data |
|
|
|
|
|
**SABER** was trained on [Omartificial-Intelligence-Space/SaudiDialect-Triplet-21](https://huggingface.co/datasets/Omartificial-Intelligence-Space/SaudiDialect-Triplet-21), which contains: |
|
|
|
|
|
* **2,964 triplets** (Anchor, Positive, Negative)
|
|
* **21 domains**, including: |
|
|
* Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc. |
|
|
* Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf) |
|
|
* Real-world conversational phrasing |
|
|
* Carefully curated positive/negative pairs |
|
|
|
|
|
**The dataset includes natural variations in:** |
|
|
* Word choice |
|
|
* Dialect morphology |
|
|
* Sentence structure |
|
|
* Discourse context |
|
|
* Multi-sentence reasoning |
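For reference, a minimal sketch for loading and inspecting the triplets with `datasets` (the split name is an assumption):

```python
from datasets import load_dataset

ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")
print(ds)     # expected: roughly 2,964 rows
print(ds[0])  # one (Anchor, Positive, Negative) triplet
```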
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Training Methodology |
|
|
|
|
|
SABER was fine-tuned using the components below; a minimal training sketch combining them follows the hyperparameter table:
|
|
|
|
|
1. **MultipleNegativesRankingLoss (MNRL)**
|
|
* Transforms the embedding space so similar pairs cluster tightly. |
|
|
* Each batch uses in-batch negatives, dramatically improving separation. |
|
|
|
|
|
2. **Matryoshka Representation Learning** |
|
|
* Ensures embeddings remain meaningful across different vector truncation sizes. |
|
|
|
|
|
3. **Triplet Ranking Optimization** |
|
|
* Anchor–Positive similarity maximized. |
|
|
* Anchor–Negative similarity minimized. |
|
|
* Margin-based structure preserved. |
|
|
|
|
|
4. **Optimizer & Hyperparameters** |
|
|
|
|
|
| Hyperparameter | Value | |
|
|
| :--- | :--- | |
|
|
| **Batch Size** | 16 | |
|
|
| **Epochs** | 3 | |
|
|
| **Loss** | MNRL + Matryoshka |
|
|
| **Precision** | FP16 | |
|
|
| **Negative Sampling** | In-batch | |
|
|
| **Gradient Clipping** | Library default |
|
|
| **Warmup Ratio** | 0.1 | |
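A minimal stage-two training sketch using the `sentence-transformers` v3 trainer, combining the losses and the hyperparameters from the table above. The output directory is a placeholder, and any argument not listed in the table is an assumption:

```python
from datasets import load_dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Omartificial-Intelligence-Space/SA-BERT-V1")
train_ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")

# MNRL with in-batch negatives, wrapped in MatryoshkaLoss (dims from the config below).
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",          # placeholder
    num_train_epochs=3,               # from the table above
    per_device_train_batch_size=16,   # from the table above
    warmup_ratio=0.1,                 # from the table above
    fp16=True,                        # from the table above
)

SentenceTransformerTrainer(model=model, args=args, train_dataset=train_ds, loss=loss).train()
```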
|
|
|
|
|
--- |
|
|
|
|
|
## 🧪 Evaluation |
|
|
|
|
|
SABER was evaluated on two benchmarks: |
|
|
|
|
|
### A) STS Evaluation (Saudi Paragraph-Level Dataset) |
|
|
**Dataset:** 1,000 generated Saudi-dialect samples scored for similarity on a 0–5 scale.
|
|
|
|
|
| Metric | Score | |
|
|
| :--- | :--- | |
|
|
| **Pearson** | **0.9189** | |
|
|
| **Spearman** | **0.9045** | |
|
|
| **MAE** | 1.69 | |
|
|
| **MSE** | 3.82 | |
|
|
|
|
|
*These results surpass ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.*
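A sketch of how these metrics can be computed on your own data, assuming sentence pairs with gold 0–5 scores. The two pairs below are illustrative, and rescaling cosine similarity to the 0–5 range for MAE/MSE is our assumption about how the reported errors were derived:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Illustrative pairs with gold scores on the 0-5 scale (not the real benchmark).
pairs = [
    ("ودي أسافر للرياض الأسبوع الجاي", "أفكر أروح الرياض قريب عشان مشوار مهم", 4.5),
    ("ودي أسافر للرياض الأسبوع الجاي", "المشاوي عندهم متبلة صح وتحسها طازجة", 0.5),
]
gold = np.array([p[2] for p in pairs])

e1 = model.encode([p[0] for p in pairs], normalize_embeddings=True)
e2 = model.encode([p[1] for p in pairs], normalize_embeddings=True)
cos = (e1 * e2).sum(axis=1)  # cosine similarity of each pair

print("Pearson: ", pearsonr(cos, gold)[0])
print("Spearman:", spearmanr(cos, gold)[0])

# MAE/MSE assume cosine similarity rescaled to the 0-5 range (an assumption).
pred = np.clip(cos, 0, 1) * 5.0
print("MAE:", np.abs(pred - gold).mean())
print("MSE:", ((pred - gold) ** 2).mean())
```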
|
|
|
|
|
### B) Triplet Evaluation |
|
|
Triplets were derived from the STS data by treating pairs scored ≥ 3 as positives and pairs scored ≤ 1 as negatives.
|
|
|
|
|
| Metric | Score | |
|
|
| :--- | :--- | |
|
|
| **Basic Accuracy** | 0.9899 | |
|
|
| **Margin > 0.05** | 0.9845 | |
|
|
| **Margin > 0.10** | 0.9781 | |
|
|
| **Margin > 0.20** | 0.9609 | |
|
|
|
|
|
*Excellent separation across strict thresholds.* |
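A sketch of how these margin accuracies can be computed, using one illustrative triplet taken from the widget examples above rather than the actual benchmark:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Illustrative (anchor, positive, negative) triplets, not the real evaluation set.
triplets = [
    ("كم تكلفة رحلة بحرية ليوم؟",
     "بكم أسعار الجولات البحرية اليومية؟",
     "أبي محلات فيها بضاعة عالمية مشهورة."),
]

def margin_accuracy(triplets, margin=0.0):
    """Fraction of triplets where sim(anchor, pos) - sim(anchor, neg) > margin."""
    hits = 0
    for anchor, pos, neg in triplets:
        a, p, n = model.encode([anchor, pos, neg], normalize_embeddings=True)
        hits += (np.dot(a, p) - np.dot(a, n)) > margin
    return hits / len(triplets)

for m in (0.0, 0.05, 0.10, 0.20):
    print(f"margin > {m:.2f}: {margin_accuracy(triplets, m):.4f}")
```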
|
|
|
|
|
--- |
|
|
|
|
|
## 🔍 Usage Example |
|
|
|
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
from sklearn.metrics.pairwise import cosine_similarity |
|
|
|
|
|
# Load the model |
|
|
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1") |
|
|
|
|
|
# Define sentences (Saudi Dialect) |
|
|
s1 = "ودي أسافر للرياض الأسبوع الجاي" |
|
|
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم" |
|
|
|
|
|
# Encode |
|
|
e1 = model.encode([s1]) |
|
|
e2 = model.encode([s2]) |
|
|
|
|
|
# Calculate similarity |
|
|
sim = cosine_similarity(e1, e2)[0][0] |
|
|
print("Cosine Similarity:", sim) |
|
|
``` |
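For retrieval use cases such as semantic search or RAG, a minimal sketch over a small in-memory corpus (the corpus sentences are illustrative, taken from the widget examples above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

corpus = [
    "بكم أسعار الجولات البحرية اليومية؟",
    "أبي فندق قريب من المطار",
    "المشاوي عندهم متبلة صح وتحسها طازجة",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode(["كم تكلفة رحلة بحرية ليوم؟"],
                         convert_to_tensor=True, normalize_embeddings=True)

# Retrieve the top-2 most similar corpus sentences for the query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```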
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Dataset |
|
|
|
|
|
#### csv |
|
|
|
|
|
* Dataset: csv (a local export of [SaudiDialect-Triplet-21](https://huggingface.co/datasets/Omartificial-Intelligence-Space/SaudiDialect-Triplet-21))
|
|
* Size: 2,964 training samples |
|
|
* Columns: <code>text1</code> and <code>text2</code> |
|
|
* Approximate statistics based on the first 1000 samples: |
|
|
| | text1 | text2 | |
|
|
|:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------| |
|
|
| type | string | string | |
|
|
| details | <ul><li>min: 5 tokens</li><li>mean: 10.36 tokens</li><li>max: 22 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 10.28 tokens</li><li>max: 19 tokens</li></ul> | |
|
|
* Samples: |
|
|
| text1 | text2 | |
|
|
|:-------------------------------------------------------|:----------------------------------------------------| |
|
|
| <code>هل فيه رحلات بحرية للأطفال في جدة؟</code> | <code>ودي أعرف عن جولات بحرية للأطفال في جدة</code> | |
|
|
| <code>ودي أحجز تذكرة طيران للرياض الأسبوع الجاي</code> | <code>ناوي أشتري تذكرة للرياض الأسبوع الجاي</code> | |
|
|
| <code>عطوني أفضل فندق قريب من مطار جدة</code> | <code>أبي فندق قريب من المطار</code> | |
|
|
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters: |
|
|
```json |
|
|
{ |
|
|
"loss": "MultipleNegativesRankingLoss", |
|
|
"matryoshka_dims": [ |
|
|
768 |
|
|
], |
|
|
"matryoshka_weights": [ |
|
|
1 |
|
|
], |
|
|
"n_dims_per_step": -1 |
|
|
} |
|
|
``` |
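This run lists 768 as the only Matryoshka dimension, so lower-dimensional truncation was not explicitly optimized for this checkpoint. For reference, a minimal sketch of how Matryoshka-style truncation is used at inference time in `sentence-transformers` (the 256-dimension value is purely illustrative):

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first N embedding dimensions at encode time.
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1",
    truncate_dim=256,  # illustrative; this checkpoint was trained with dims=[768] only
)
emb = model.encode(["ودي أسافر للرياض الأسبوع الجاي"])
print(emb.shape)  # (1, 256)
```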
|
|
|
|
|
## 📌 License & Commercial Use
|
|
|
|
|
|
|
Commercial use of this model is **not permitted** under the CC BY-NC 4.0 license. |
|
|
For commercial licensing, partnerships, or enterprise use, please contact: |
|
|
|
|
|
📩 **[email protected]** |
|
|
|
|
|
## Citation

If you use this model in academic work, please cite:
|
|
|
|
|
```bibtex |
|
|
@misc{nacar-saber-2025,
    title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
    author = "Nacar, Omer",
    year = "2025",
    url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
|
|
``` |
|
|
|
|
|
|
|
|
#### Sentence Transformers |
|
|
```bibtex |
|
|
@inproceedings{reimers-2019-sentence-bert, |
|
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
|
month = "11", |
|
|
year = "2019", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
url = "https://arxiv.org/abs/1908.10084", |
|
|
} |
|
|
``` |
|
|
|
|
|
#### MatryoshkaLoss |
|
|
```bibtex |
|
|
@misc{kusupati2024matryoshka, |
|
|
title={Matryoshka Representation Learning}, |
|
|
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi}, |
|
|
year={2024}, |
|
|
eprint={2205.13147}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG} |
|
|
} |
|
|
``` |
|
|
|
|
|
#### MultipleNegativesRankingLoss |
|
|
```bibtex |
|
|
@misc{henderson2017efficient, |
|
|
title={Efficient Natural Language Response Suggestion for Smart Reply}, |
|
|
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
|
|
year={2017}, |
|
|
eprint={1705.00652}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL} |
|
|
} |
|
|
``` |