---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:2964
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Omartificial-Intelligence-Space/SA-BERT-V1
widget:
- source_sentence: كم تكلفة رحلة بحرية ليوم؟
sentences:
- الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق.
- أبي محلات فيها بضاعة عالمية مشهورة.
- بكم أسعار الجولات البحرية اليومية؟
- source_sentence: الفعاليات الشعبية تختلف حسب المناسبات.
sentences:
- معطف المختبر حقي اختفى وأبي أشتري بديل
- بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب
- بعض المناطق تتميز بطرق احتفال خاصة بها.
- source_sentence: الأسوار القديمة كانت تحمي المدن زمان.
sentences:
- بجلس أصلّحها قبل أرسلها
- بعض المدن احتفظت بأبوابها التاريخية.
- بجلس أشتغل عليها وسط اليوم
- source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي
sentences:
- أبغى أشارك بجولة سفاري بالربع الخالي
- هذا التمرين ضروري لنحت منطقة البطن والخصر.
- ودي أعرف عن فنادق فخمة بالدمام
- source_sentence: أبي طرحة جديدة لونها سماوي فاتح.
sentences:
- المشاوي عندهم متبلة صح وتحسها طازجة
- ريحة المعطرات هذي قوية وتقعد في الغرف؟
- أدور شيلة لونها أزرق فاتح زي السماء.
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: cc-by-nc-4.0
gated: true
extra_gated_prompt: Please provide the required information to access this model
extra_gated_fields:
Full Name:
type: text
Affiliation / Company:
type: text
Email Address:
type: text
Intended Use:
type: select
options:
- Research
- Education
- Commercial Exploration
- Academic Project
- Other
extra_gated_heading: "Access Request: Provide Required Information"
extra_gated_description: Before accessing this model, please complete the form below.
extra_gated_button_content: Submit Access Request
datasets:
- Omartificial-Intelligence-Space/SaudiDialect-Triplet-21
language:
- ar
metrics:
- mse
- mae
---
# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1)
![SABER banner](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/mWfh710vwZ_TsW7IXenNf.png)
## 🧩 Summary
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/x1FdbE8bYGVQFfY01f1Nk.png" width="175" align="left"/>
**SABER-v0.1** (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi-dialect semantic embedding model, fine-tuned from SA-BERT with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** over a large, high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.
**SABER** turns a standard Masked Language Model (MLM) into a semantic encoder that captures deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects.
The model achieves state-of-the-art results on both long-paragraph `STS` evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.
## 🏗️ Architecture & Build Pipeline
SABER utilizes a rigorous two-stage optimization pipeline: first, we adapted **MARBERT-V2** via Masked Language Modeling (MLM) on **500k Saudi sentences** to create the domain-specialized **SA-BERT**, followed by deep semantic optimization using **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** on curated triplets to produce the final state-of-the-art embedding model.
<div align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/zC5JCPbsnIz-jflmTWae8.png" alt="SABER Training Pipeline" width="500"/>
</div>
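The first stage is ordinary continued pretraining. As a rough sketch of what that stage looks like with Hugging Face `transformers` (the `UBC-NLP/MARBERTv2` checkpoint name, the `saudi_sentences.txt` file, and all hyperparameters here are illustrative assumptions, not the exact recipe used for SA-BERT):
```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Stage 1 (sketch): adapt MARBERT-V2 to Saudi text with masked language modeling.
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# "saudi_sentences.txt" is a placeholder for the 500k-sentence Saudi corpus.
dataset = load_dataset("text", data_files={"train": "saudi_sentences.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# The collator masks 15% of tokens on the fly to create MLM targets.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="SA-BERT-V1", per_device_train_batch_size=32, fp16=True)
Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
model.save_pretrained("SA-BERT-V1")
```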
---
**SABER is designed for:**
* Semantic search
* Retrieval-Augmented Generation (RAG)
* Clustering
* Intent detection
* Semantic similarity
* Document & paragraph embedding
* Ranking and re-ranking systems
* Multi-domain Saudi-language applications
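As a quick illustration of the retrieval use case, here is a minimal semantic-search sketch (the corpus sentences below are invented placeholders for a real document store):
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Toy corpus of Saudi-dialect sentences (placeholders)
corpus = [
    "ودي أحجز تذكرة طيران للرياض",
    "أبغى فندق قريب من مطار جدة",
    "وش أفضل مطعم مشاوي بالدمام؟",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "أبي أسافر للرياض"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus sentences by cosine similarity to the query
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(f"{corpus[hit['corpus_id']]}  (score={hit['score']:.3f})")
```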
*This release is v0.1 — the first public version of SABER.*
## 📌 Model Details
* **Model Name:** SABER (Saudi Semantic Embedding)
* **Version:** v0.1
* **Base Model:** SA-BERT-V1 (MARBERT-V2 adapted to Saudi data via MLM)
* **Language:** Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
* **Task:** Sentence Embeddings, Semantic Similarity, Retrieval
* **Training Objective:** MNRL + Matryoshka Loss
* **Embedding Dimension:** 768
* **License:** CC BY-NC 4.0
* **Maintainer:** Omartificial-Intelligence-Space
---
## 🧠 Motivation
Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by:
1. Training specifically on Saudi-dialect triplet data.
2. Leveraging modern contrastive learning.
3. Creating robust embeddings suitable for production and research.
This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.
---
## ⚠️ Limitations
1. Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
2. Scope: Embeddings focus on semantic similarity, not syntax or classification.
3. Input Length: Long multi-document retrieval requires chunking; a minimal chunking sketch follows.
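One common workaround for the input-length limitation is to split long documents into overlapping word windows and embed each chunk separately. A minimal sketch (the window and stride sizes are arbitrary choices, not tuned values):
```python
def chunk_words(text: str, window: int = 100, stride: int = 50) -> list[str]:
    """Split a long document into overlapping word windows so each
    chunk fits comfortably within the encoder's input length."""
    words = text.split()
    if len(words) <= window:
        return [text]
    return [
        " ".join(words[i:i + window])
        for i in range(0, len(words) - window + stride, stride)
    ]
```
Each chunk is then encoded on its own, and retrieval scores can be aggregated per document (e.g., by taking the maximum chunk similarity).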
---
## 📚 Training Data
**SABER** was trained on [Omartificial-Intelligence-Space/SaudiDialect-Triplet-21](https://huggingface.co/datasets/Omartificial-Intelligence-Space/SaudiDialect-Triplet-21), which contains:
* **2,964 triplets** (Anchor, Positive, Negative)
* **21 domains**, including:
* Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
* Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
* Real-world conversational phrasing
* Carefully curated positive/negative pairs
**The dataset includes natural variations in:**
* Word choice
* Dialect morphology
* Sentence structure
* Discourse context
* Multi-sentence reasoning
---
## 🔧 Training Methodology
SABER was fine-tuned using:
1. **MultipleNegativesRankingLoss (MNRL)**
* Transforms the embedding space so similar pairs cluster tightly.
* Each batch uses in-batch negatives, dramatically improving separation.
2. **Matryoshka Representation Learning**
* Ensures embeddings remain meaningful across different vector truncation sizes.
3. **Triplet Ranking Optimization**
* Anchor–Positive similarity maximized.
* Anchor–Negative similarity minimized.
* Margin-based structure preserved.
4. **Optimizer & Hyperparameters**
| Hyperparameter | Value |
| :--- | :--- |
| **Batch Size** | 16 |
| **Epochs** | 3 |
| **Loss** | MNRL + Matryoshka |
| **Precision** | FP16 |
| **Negative Sampling** | In-batch |
| **Gradient Clipping** | Framework defaults |
| **Warmup Ratio** | 0.1 |
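Putting these pieces together, here is a condensed sentence-transformers training sketch consistent with the table above (a sketch, not the exact script: the output path is illustrative, and the dataset's text columns are assumed to line up with what MNRL expects; the auto-generated training details below list `text1` and `text2`):
```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Omartificial-Intelligence-Space/SA-BERT-V1")
train_dataset = load_dataset(
    "Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train"
)

# MNRL uses in-batch negatives; MatryoshkaLoss re-applies it at each listed dimension.
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",
    num_train_epochs=3,              # epochs from the table above
    per_device_train_batch_size=16,  # batch size from the table above
    warmup_ratio=0.1,
    fp16=True,
)
SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_dataset, loss=loss
).train()
```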
---
## 🧪 Evaluation
SABER was evaluated on two benchmarks:
### A) STS Evaluation (Saudi Paragraph-Level Dataset)
**Dataset:** 1,000 Saudi-dialect sentence pairs scored 0–5 for similarity.
| Metric | Score |
| :--- | :--- |
| **Pearson** | **0.9189** |
| **Spearman** | **0.9045** |
| **MAE** | 1.69 |
| **MSE** | 3.82 |
*These results surpass: ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.*
### B) Triplet Evaluation
Triplets were derived from the STS set by treating pairs with score ≥ 3 as positives and pairs with score ≤ 1 as negatives.
| Metric | Score |
| :--- | :--- |
| **Basic Accuracy** | 0.9899 |
| **Margin > 0.05** | 0.9845 |
| **Margin > 0.10** | 0.9781 |
| **Margin > 0.20** | 0.9609 |
*Excellent separation across strict thresholds.*
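These metrics are straightforward to reproduce: basic accuracy is the fraction of triplets where sim(anchor, positive) exceeds sim(anchor, negative), and the margin variants require that gap to exceed a threshold. A minimal sketch (the triplets shown are placeholders for the real evaluation set):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Placeholder triplets; the actual evaluation uses the STS-derived set.
anchors   = ["ودي أسافر للرياض"]
positives = ["أفكر أروح الرياض قريب"]
negatives = ["المشاوي عندهم متبلة صح"]

a = model.encode(anchors, normalize_embeddings=True)
p = model.encode(positives, normalize_embeddings=True)
n = model.encode(negatives, normalize_embeddings=True)

# With normalized embeddings, the dot product equals the cosine similarity.
margins = (a * p).sum(axis=1) - (a * n).sum(axis=1)
for t in [0.0, 0.05, 0.10, 0.20]:
    print(f"accuracy @ margin > {t:.2f}: {(margins > t).mean():.4f}")
```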
---
## 🔍 Usage Example
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")
# Define sentences (Saudi Dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"
# Encode
e1 = model.encode([s1])
e2 = model.encode([s2])
# Calculate similarity
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```
## Training Details
### Training Dataset
#### SaudiDialect-Triplet-21 (loaded from CSV)
* Dataset: [SaudiDialect-Triplet-21](https://huggingface.co/datasets/Omartificial-Intelligence-Space/SaudiDialect-Triplet-21), loaded from CSV
* Size: 2,964 training samples
* Columns: <code>text1</code> and <code>text2</code>
* Approximate statistics based on the first 1000 samples:
| | text1 | text2 |
|:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|
| type | string | string |
| details | <ul><li>min: 5 tokens</li><li>mean: 10.36 tokens</li><li>max: 22 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 10.28 tokens</li><li>max: 19 tokens</li></ul> |
* Samples:
| text1 | text2 |
|:-------------------------------------------------------|:----------------------------------------------------|
| <code>هل فيه رحلات بحرية للأطفال في جدة؟</code> | <code>ودي أعرف عن جولات بحرية للأطفال في جدة</code> |
| <code>ودي أحجز تذكرة طيران للرياض الأسبوع الجاي</code> | <code>ناوي أشتري تذكرة للرياض الأسبوع الجاي</code> |
| <code>عطوني أفضل فندق قريب من مطار جدة</code> | <code>أبي فندق قريب من المطار</code> |
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
```json
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768
],
"matryoshka_weights": [
1
],
"n_dims_per_step": -1
}
```
## 📌 Commercial Use
Commercial use of this model is **not permitted** under the CC BY-NC 4.0 license.
For commercial licensing, partnerships, or enterprise use, please contact:
📩 **[email protected]**
## Citation
If you use this model in academic work, please cite:
```bibtex
@misc{nacar-saber-2025,
  title  = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
  author = "Nacar, Omer",
  year   = "2025",
  url    = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```