---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:2964
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Omartificial-Intelligence-Space/SA-BERT-V1
widget:
- source_sentence: كم تكلفة رحلة بحرية ليوم؟
sentences:
- الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق.
- أبي محلات فيها بضاعة عالمية مشهورة.
- بكم أسعار الجولات البحرية اليومية؟
- source_sentence: الفعاليات الشعبية تختلف حسب المناسبات.
sentences:
- معطف المختبر حقي اختفى وأبي أشتري بديل
- بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب
- بعض المناطق تتميز بطرق احتفال خاصة بها.
- source_sentence: الأسوار القديمة كانت تحمي المدن زمان.
sentences:
- بجلس أصلّحها قبل أرسلها
- بعض المدن احتفظت بأبوابها التاريخية.
- بجلس أشتغل عليها وسط اليوم
- source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي
sentences:
- أبغى أشارك بجولة سفاري بالربع الخالي
- هذا التمرين ضروري لنحت منطقة البطن والخصر.
- ودي أعرف عن فنادق فخمة بالدمام
- source_sentence: أبي طرحة جديدة لونها سماوي فاتح.
sentences:
- المشاوي عندهم متبلة صح وتحسها طازجة
- ريحة المعطرات هذي قوية وتقعد في الغرف؟
- أدور شيلة لونها أزرق فاتح زي السماء.
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: cc-by-nc-4.0
gated: true
extra_gated_prompt: Please provide the required information to access this model
extra_gated_fields:
  Full Name:
    type: text
  Affiliation / Company:
    type: text
  Email Address:
    type: text
  Intended Use:
    type: select
    options:
      - Research
      - Education
      - Commercial Exploration
      - Academic Project
      - Other
extra_gated_heading: Access Request – Provide Required Information
extra_gated_description: Before accessing this model, please complete the form below.
extra_gated_button_content: Submit Access Request
datasets:
- Omartificial-Intelligence-Space/SaudiDialect-Triplet-21
language:
- ar
metrics:
- mse
- mae
---

# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1)

## 🧩 Summary
SABER-v0.1 (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi-dialect semantic embedding model, fine-tuned from SA-BERT with MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning over a large, high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.
SABER transforms a standard Masked Language Model (MLM) into a powerful semantic encoder that captures deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects.
The model achieves state-of-the-art results on both long-paragraph STS evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.
## 🏗️ Architecture & Build Pipeline
SABER is built with a two-stage optimization pipeline: first, MARBERT-V2 was adapted via Masked Language Modeling (MLM) on 500k Saudi sentences to create the domain-specialized SA-BERT; then deep semantic optimization with MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning on curated triplets produced the final embedding model.
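A minimal sketch of the first stage (MLM adaptation), assuming the public `UBC-NLP/MARBERTv2` checkpoint as the starting point and a stand-in `saudi_sentences` corpus; the real 500k-sentence corpus and the stage-one hyperparameters are not published here:

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Stand-in corpus: the real run used ~500k Saudi-dialect sentences.
saudi_sentences = Dataset.from_dict({"text": ["ودي أسافر للرياض الأسبوع الجاي"]})

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")  # assumed base checkpoint
mlm_model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

tokenized = saudi_sentences.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="sa-bert-mlm"),
    train_dataset=tokenized,
    # Standard 15% random token masking for the MLM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
```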
SABER is designed for the following use cases (a semantic-search sketch follows the list):
- Semantic search
- Retrieval-Augmented Generation (RAG)
- Clustering
- Intent detection
- Semantic similarity
- Document & paragraph embedding
- Ranking and re-ranking systems
- Multi-domain Saudi-language applications
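As an illustration of the semantic-search use case, here is a minimal sketch using `util.semantic_search` from sentence-transformers; the toy corpus reuses widget sentences from this card:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Toy corpus reusing widget sentences from this card.
corpus = [
    "بكم أسعار الجولات البحرية اليومية؟",
    "بعض المدن احتفظت بأبوابها التاريخية.",
    "أبغى أشارك بجولة سفاري بالربع الخالي",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("كم تكلفة رحلة بحرية ليوم؟", convert_to_tensor=True)

# Rank corpus sentences by cosine similarity and keep the top 2.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```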
This release is v0.1 — the first public version of SABER.
## 📌 Model Details
- Model Name: SABER (Saudi Semantic Embedding)
- Version: v0.1
- Base Model: SA-BERT-V1 (MARBERT-V2 adapted to Saudi data via MLM)
- Language: Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
- Task: Sentence Embeddings, Semantic Similarity, Retrieval
- Training Objective: MNRL + Matryoshka Loss
- Embedding Dimension: 768
- License: CC BY-NC 4.0
- Maintainer: Omartificial-Intelligence-Space
## 🧠 Motivation
Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by:
- Training specifically on Saudi-dialect triplet data.
- Leveraging modern contrastive learning.
- Creating robust embeddings suitable for production and research.
This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.
## ⚠️ Limitations
- Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
- Scope: Embeddings focus on semantic similarity, not syntax or classification.
- Input Length: Long multi-document retrieval requires chunking (see the sketch below).
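For the input-length limitation, a naive whitespace chunker such as the sketch below can split long documents before encoding; the window and overlap sizes are illustrative, not tuned values:

```python
def chunk_text(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows so each chunk fits the encoder."""
    words = text.split()
    chunks, step = [], max_words - overlap
    for start in range(0, max(len(words), 1), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# Encode each chunk separately, then search over the chunk embeddings:
# chunk_embeddings = model.encode(chunk_text(long_document))
```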
## 📚 Training Data
SABER was trained on Omartificial-Intelligence-Space/SaudiDialect-Triplet-21 (a loading sketch follows the lists below), which contains:
- 2964 triplets (Anchor, Positive, Negative)
- 21 domains, including:
- Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
- Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
- Real-world conversational phrasing
- Carefully curated positive/negative pairs
The dataset includes natural variations in:
- Word choice
- Dialect morphology
- Sentence structure
- Discourse context
- Multi-sentence reasoning
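A loading sketch for the training data; the split name and column layout are assumptions, so check the dataset card for the exact schema:

```python
from datasets import load_dataset

# Split name and column layout are assumptions; inspect the dataset card to confirm.
triplets = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")
print(triplets.column_names)
print(triplets[0])  # expected to hold one (anchor, positive, negative) example
```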
## 🔧 Training Methodology
SABER was fine-tuned using:
### MultipleNegativesRankingLoss (MNRL)
- Transforms the embedding space so similar pairs cluster tightly.
- Each batch uses in-batch negatives, dramatically improving separation.
### Matryoshka Representation Learning
- Ensures embeddings remain meaningful across different vector truncation sizes.
### Triplet Ranking Optimization
- Anchor–Positive similarity maximized.
- Anchor–Negative similarity minimized.
- Margin-based structure preserved.
### Optimizer & Hyperparameters
| Hyperparameter | Value |
|---|---|
| Batch Size | 16 |
| Epochs | 3 |
| Loss | MNRL + Matryoshka |
| Precision | FP16 |
| Negative Sampling | In-batch |
| Gradient Clip | Library default |
| Warmup Ratio | 0.1 |
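A minimal fine-tuning sketch with the sentence-transformers v3 trainer, mirroring the losses and the table above; the dataset split, its column layout, and the exact `matryoshka_dims` schedule are assumptions (the exported loss config later in this card lists `[768]`):

```python
from datasets import load_dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Omartificial-Intelligence-Space/SA-BERT-V1")
train_ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")

# MNRL treats the other in-batch examples as negatives; MatryoshkaLoss wraps it
# so embeddings stay meaningful when truncated.
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,
)

SentenceTransformerTrainer(model=model, args=args, train_dataset=train_ds, loss=loss).train()
```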
## 🧪 Evaluation
SABER was evaluated on two benchmarks:
### A) STS Evaluation (Saudi Paragraph-Level Dataset)
Dataset: 1,000 samples generated in Saudi dialect, each pair scored 0–5 for similarity.
| Metric | Score |
|---|---|
| Pearson | 0.9189 |
| Spearman | 0.9045 |
| MAE | 1.69 |
| MSE | 3.82 |
These results surpass: ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.
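A sketch of how these STS metrics can be computed; the three pairs below reuse card sentences with made-up gold scores, and rescaling cosine similarity to 0–5 for MAE/MSE is an assumption rather than the authors' documented protocol:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Toy eval pairs with made-up gold scores; the real benchmark has 1,000 scored pairs.
sents1 = ["ودي أسافر للرياض الأسبوع الجاي", "كم تكلفة رحلة بحرية ليوم؟", "أبي طرحة جديدة لونها سماوي فاتح."]
sents2 = ["أفكر أروح الرياض قريب عشان مشوار مهم", "بعض المدن احتفظت بأبوابها التاريخية.", "أدور شيلة لونها أزرق فاتح زي السماء."]
gold = np.array([4.5, 0.5, 4.0])

emb1, emb2 = model.encode(sents1), model.encode(sents2)

# Row-wise cosine similarity between paired embeddings.
cos = np.sum(emb1 * emb2, axis=1) / (np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))

print("Pearson:", pearsonr(cos, gold)[0])
print("Spearman:", spearmanr(cos, gold)[0])

# Assumption: map cosine from [-1, 1] onto the 0-5 gold scale before MAE/MSE.
pred = (cos + 1.0) / 2.0 * 5.0
print("MAE:", np.mean(np.abs(pred - gold)))
print("MSE:", np.mean((pred - gold) ** 2))
```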
### B) Triplet Evaluation
Triplets were derived from the STS set: pairs scoring ≥ 3 serve as positives and pairs scoring ≤ 1 as negatives.
| Metric | Score |
|---|---|
| Basic Accuracy | 0.9899 |
| Margin > 0.05 | 0.9845 |
| Margin > 0.10 | 0.9781 |
| Margin > 0.20 | 0.9609 |
Excellent separation across strict thresholds.
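The margin metrics above can be reproduced with a sketch like this one; the two toy triplets reuse widget sentences and stand in for the full derived evaluation set:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Toy triplets reusing widget sentences; the real evaluation uses the full derived set.
anchors = ["كم تكلفة رحلة بحرية ليوم؟", "ودي أجرب رحلة سفاري بصحراء الربع الخالي"]
positives = ["بكم أسعار الجولات البحرية اليومية؟", "أبغى أشارك بجولة سفاري بالربع الخالي"]
negatives = ["أبي محلات فيها بضاعة عالمية مشهورة.", "ودي أعرف عن فنادق فخمة بالدمام"]

def row_cos(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity of corresponding rows."""
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

pos = row_cos(model.encode(anchors), model.encode(positives))
neg = row_cos(model.encode(anchors), model.encode(negatives))

# Fraction of triplets where the positive beats the negative by more than each margin.
for margin in (0.0, 0.05, 0.10, 0.20):
    print(f"margin > {margin:.2f}: {np.mean(pos - neg > margin):.4f}")
```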
## 🔍 Usage Example

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Define sentences (Saudi dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"

# Encode each sentence into a 768-dimensional embedding
e1 = model.encode([s1])
e2 = model.encode([s2])

# Calculate cosine similarity
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```
## Training Details

### Training Dataset

- Dataset: csv
- Size: 2,964 training samples
- Columns: `text1` and `text2`
- Approximate statistics based on the first 1000 samples:

| | text1 | text2 |
|---|---|---|
| type | string | string |
| details | min: 5 tokens, mean: 10.36 tokens, max: 22 tokens | min: 4 tokens, mean: 10.28 tokens, max: 19 tokens |

- Samples:

| text1 | text2 |
|---|---|
| هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
| ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
| عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |

- Loss: `MatryoshkaLoss` with these parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768],
    "matryoshka_weights": [1],
    "n_dims_per_step": -1
}
```
## 📌 Commercial Use

Commercial use of this model is not permitted under the CC BY-NC 4.0 license.
For commercial licensing, partnerships, or enterprise use, please contact the maintainer (Omartificial-Intelligence-Space).

## Citation

If you use this model in academic work, please cite:
```bibtex
@misc{nacar-saber-2025,
  title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
  author = "Nacar, Omer",
  year = "2025",
  url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```
### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```
### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
  title = {Matryoshka Representation Learning},
  author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
  year = {2024},
  eprint = {2205.13147},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}
```
### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
  title = {Efficient Natural Language Response Suggestion for Smart Reply},
  author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
  year = {2017},
  eprint = {1705.00652},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```
