---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:2964
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Omartificial-Intelligence-Space/SA-BERT-V1
widget:
- source_sentence: كم تكلفة رحلة بحرية ليوم؟
  sentences:
  - الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق.
  - أبي محلات فيها بضاعة عالمية مشهورة.
  - بكم أسعار الجولات البحرية اليومية؟
- source_sentence: الفعاليات الشعبية تختلف حسب المناسبات.
  sentences:
  - معطف المختبر حقي اختفى وأبي أشتري بديل
  - بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب
  - بعض المناطق تتميز بطرق احتفال خاصة بها.
- source_sentence: الأسوار القديمة كانت تحمي المدن زمان.
  sentences:
  - بجلس أصلّحها قبل أرسلها
  - بعض المدن احتفظت بأبوابها التاريخية.
  - بجلس أشتغل عليها وسط اليوم
- source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي
  sentences:
  - أبغى أشارك بجولة سفاري بالربع الخالي
  - هذا التمرين ضروري لنحت منطقة البطن والخصر.
  - ودي أعرف عن فنادق فخمة بالدمام
- source_sentence: أبي طرحة جديدة لونها سماوي فاتح.
  sentences:
  - المشاوي عندهم متبلة صح وتحسها طازجة
  - ريحة المعطرات هذي قوية وتقعد في الغرف؟
  - أدور شيلة لونها أزرق فاتح زي السماء.
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: cc-by-nc-4.0
gated: true
extra_gated_prompt: Please provide the required information to access this model
extra_gated_fields:
  Full Name:
    type: text
  Affiliation / Company:
    type: text
  Email Address:
    type: text
  Intended Use:
    type: select
    options:
    - Research
    - Education
    - Commercial Exploration
    - Academic Project
    - Other
extra_gated_heading: Access Request – Provide Required Information
extra_gated_description: Before accessing this model, please complete the form below.
extra_gated_button_content: Submit Access Request
datasets:
- Omartificial-Intelligence-Space/SaudiDialect-Triplet-21
language:
- ar
metrics:
- mse
- mae
---
# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1)

## 🧩 Summary
**SABER-v0.1** (Saudi Arabic BERT Embeddings for Retrieval) is a Saudi-dialect semantic embedding model, fine-tuned from SA-BERT with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** over a large, high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.
**SABER** transforms a standard Masked Language Model (MLM) into a semantic encoder that captures deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects.
The model achieves state-of-the-art results on both long-paragraph `STS` evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.
## 🏗️ Architecture & Build Pipeline
SABER is built in a two-stage optimization pipeline: first, **MARBERT-V2** is adapted via Masked Language Modeling (MLM) on **500k Saudi sentences** to create the domain-specialized **SA-BERT**; then, deep semantic optimization with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** on curated triplets produces the final embedding model.
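A minimal retrieval sketch of how such an embedding model is typically used, assuming the standard `sentence-transformers` API: the repo id below is a placeholder for this model's actual Hub path (the model is gated, so an approved access request and an HF token are required), and `rank_by_similarity` is a helper introduced here for illustration.

```python
import numpy as np


def rank_by_similarity(query_emb: np.ndarray, doc_embs: np.ndarray) -> list[int]:
    """Return document indices sorted by descending cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    return list(np.argsort(-(d @ q)))


if __name__ == "__main__":
    # Gated model: request access on the Hub first, then authenticate with your HF token.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("Omartificial-Intelligence-Space/SABER")  # placeholder repo id
    query = "كم تكلفة رحلة بحرية ليوم؟"
    docs = [
        "بكم أسعار الجولات البحرية اليومية؟",
        "أبي محلات فيها بضاعة عالمية مشهورة.",
    ]
    embs = model.encode([query] + docs)
    print(rank_by_similarity(embs[0], embs[1:]))  # most similar document first
```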
## 📚 Training Dataset
* Columns: `text1` and `text2`
* Approximate statistics based on the first 1000 samples:
| | text1 | text2 |
|:---------|:----------------------------------------------|:----------------------------------------------|
| type | string | string |
| examples | هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
| | ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
| | عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |
* Loss: [MatryoshkaLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768],
    "matryoshka_weights": [1],
    "n_dims_per_step": -1
}
```
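For intuition: MNRL treats every other in-batch `text2` entry as a negative for a given `text1`, optimizing a softmax over scaled cosine similarities; with `matryoshka_dims: [768]`, MatryoshkaLoss applies this objective only at the full embedding width (smaller dims would add the same loss over truncated prefixes). A toy NumPy sketch of the objective — an illustration, not the library's implementation; `scale=20.0` mirrors the common sentence-transformers default:

```python
import numpy as np


def mnrl_loss(anchors: np.ndarray, positives: np.ndarray, scale: float = 20.0) -> float:
    """In-batch MultipleNegativesRankingLoss: for row i, column i is the true
    positive and every other column in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # (batch, batch) scaled cosine similarities
    # cross-entropy with the diagonal as the target class for each row
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

A well-trained model drives the diagonal similarities toward 1 and the off-diagonal ones down, shrinking this loss.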
## 📌 License & Commercial Use
Commercial use of this model is **not permitted** under the CC BY-NC 4.0 license.
For commercial licensing, partnerships, or enterprise use, please contact:
📩 **eng.omarnj@gmail.com**
## Citation
If you use this model in academic work, please cite:
```bibtex
@inproceedings{nacar-saber-2025,
    title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
    author = "Nacar, Omer",
    year = "2025",
    url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title = {Matryoshka Representation Learning},
    author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year = {2024},
    eprint = {2205.13147},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title = {Efficient Natural Language Response Suggestion for Smart Reply},
    author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year = {2017},
    eprint = {1705.00652},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```