---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:2964
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Omartificial-Intelligence-Space/SA-BERT-V1
widget:
- source_sentence: كم تكلفة رحلة بحرية ليوم؟
  sentences:
  - الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق.
  - أبي محلات فيها بضاعة عالمية مشهورة.
  - بكم أسعار الجولات البحرية اليومية؟
- source_sentence: الفعاليات الشعبية تختلف حسب المناسبات.
  sentences:
  - معطف المختبر حقي اختفى وأبي أشتري بديل
  - بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب
  - بعض المناطق تتميز بطرق احتفال خاصة بها.
- source_sentence: الأسوار القديمة كانت تحمي المدن زمان.
  sentences:
  - بجلس أصلّحها قبل أرسلها
  - بعض المدن احتفظت بأبوابها التاريخية.
  - بجلس أشتغل عليها وسط اليوم
- source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي
  sentences:
  - أبغى أشارك بجولة سفاري بالربع الخالي
  - هذا التمرين ضروري لنحت منطقة البطن والخصر.
  - ودي أعرف عن فنادق فخمة بالدمام
- source_sentence: أبي طرحة جديدة لونها سماوي فاتح.
  sentences:
  - المشاوي عندهم متبلة صح وتحسها طازجة
  - ريحة المعطرات هذي قوية وتقعد في الغرف؟
  - أدور شيلة لونها أزرق فاتح زي السماء.
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: cc-by-nc-4.0
gated: true
extra_gated_prompt: Please provide the required information to access this model
extra_gated_fields:
  Full Name:
    type: text
  Affiliation / Company:
    type: text
  Email Address:
    type: text
  Intended Use:
    type: select
    options:
    - Research
    - Education
    - Commercial Exploration
    - Academic Project
    - Other
extra_gated_heading: Access Request – Provide Required Information
extra_gated_description: Before accessing this model, please complete the form below.
extra_gated_button_content: Submit Access Request
datasets:
- Omartificial-Intelligence-Space/SaudiDialect-Triplet-21
language:
- ar
metrics:
- mse
- mae
---

# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1)

![Black Elegant Minimalist Profile LinkedIn Banner](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/mWfh710vwZ_TsW7IXenNf.png)

## 🧩 Summary

**SABER-v0.1** (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi-dialect semantic embedding model. It was fine-tuned from SA-BERT with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** on a large, high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.

**SABER** transforms a standard Masked Language Model (MLM) into a powerful semantic encoder that captures deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects. The model achieves state-of-the-art results on both long-paragraph STS evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.

## 🏗️ Architecture & Build Pipeline

SABER is built with a rigorous two-stage optimization pipeline. First, we adapted **MARBERT-V2** via Masked Language Modeling (MLM) on **500k Saudi sentences** to create the domain-specialized **SA-BERT**. We then applied deep semantic optimization with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** on curated triplets to produce the final embedding model.

*SABER Training Pipeline*
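The first of these two stages is standard domain-adaptive MLM. As a rough illustration, the sketch below continues masked-language pretraining over a plain-text Saudi corpus with Hugging Face `transformers`. The corpus file name (`saudi_sentences.txt`) and all hyperparameters are illustrative assumptions, not the exact recipe used to build SA-BERT.

```python
# Stage 1 (illustrative): domain-adaptive MLM on Saudi text.
# The real run used 500k Saudi sentences; the file below is a stand-in.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# Hypothetical corpus: one Saudi sentence per line.
corpus = load_dataset("text", data_files={"train": "saudi_sentences.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

# Randomly mask 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sa-bert-mlm", num_train_epochs=1,
                           per_device_train_batch_size=32, fp16=True),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```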
---

**SABER is designed for:**

* Semantic search
* Retrieval-Augmented Generation (RAG)
* Clustering
* Intent detection
* Semantic similarity
* Document & paragraph embedding
* Ranking and re-ranking systems
* Multi-domain Saudi-language applications

*This release, v0.1, is the first public version of SABER.*

## 📌 Model Details

* **Model Name:** SABER (Saudi Semantic Embedding)
* **Version:** v0.1
* **Base Model:** SA-BERT-V1 (MARBERT-V2 adapted to Saudi data via MLM)
* **Language:** Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
* **Task:** Sentence Embeddings, Semantic Similarity, Retrieval
* **Training Objective:** MNRL + Matryoshka Loss
* **Embedding Dimension:** 768
* **License:** CC BY-NC 4.0
* **Maintainer:** Omartificial-Intelligence-Space

---

## 🧠 Motivation

Saudi dialect NLP remains an underdeveloped space. Most embedding models struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning.

SABER was designed to fill this gap by:

1. Training specifically on Saudi-dialect triplet data.
2. Leveraging modern contrastive learning.
3. Creating robust embeddings suitable for production and research.

This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.

---

#### ⚠️ Limitations

1. **Regional scope:** Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
2. **Task scope:** Embeddings are optimized for semantic similarity, not syntax or classification.
3. **Input length:** Long multi-document retrieval requires chunking.

---

## 📚 Training Data

**SABER** was trained on [Omartificial-Intelligence-Space/SaudiDialect-Triplet-21](https://huggingface.co/datasets/Omartificial-Intelligence-Space/SaudiDialect-Triplet-21), which contains:

* **2,964 triplets** (Anchor, Positive, Negative)
* **21 domains**, including:
  * Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
* Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
* Real-world conversational phrasing
* Carefully curated positive/negative pairs

**The dataset includes natural variations in:**

* Word choice
* Dialect morphology
* Sentence structure
* Discourse context
* Multi-sentence reasoning
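To get a feel for the triplet data, it can be loaded straight from the Hub. A minimal sketch follows; the split and column names printed by this snippet should be checked against the dataset card rather than assumed.

```python
# Minimal sketch: load and inspect the SaudiDialect-Triplet-21 dataset.
from datasets import load_dataset

ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21")
print(ds)              # available splits, column names, and sizes
print(ds["train"][0])  # one (anchor, positive, negative) triplet
```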
---

## 🔧 Training Methodology

SABER was fine-tuned using:

1. **MultipleNegativesRankingLoss (MNRL)**
   * Transforms the embedding space so similar pairs cluster tightly.
   * Each batch uses in-batch negatives, dramatically improving separation.
2. **Matryoshka Representation Learning**
   * Ensures embeddings remain meaningful across different vector truncation sizes.
3. **Triplet Ranking Optimization**
   * Anchor–Positive similarity is maximized.
   * Anchor–Negative similarity is minimized.
   * Margin-based structure is preserved.
4. **Optimizer & Hyperparameters**

| Hyperparameter | Value |
| :--- | :--- |
| **Batch Size** | 16 |
| **Epochs** | 3 |
| **Loss** | MNRL + Matryoshka |
| **Precision** | FP16 |
| **Negative Sampling** | In-batch |
| **Gradient Clip** | Library default |
| **Warmup Ratio** | 0.1 |
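The following is a minimal sketch of this fine-tuning stage using the `sentence-transformers` trainer API, with MNRL wrapped in `MatryoshkaLoss` and the hyperparameters from the table above. It assumes the triplet dataset exposes anchor/positive/negative columns in that order; treat it as an outline under those assumptions, not the exact training script.

```python
# Minimal sketch of the SABER fine-tuning stage:
# MultipleNegativesRankingLoss wrapped in MatryoshkaLoss.
from datasets import load_dataset
from sentence_transformers import (SentenceTransformer,
                                   SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import (MatryoshkaLoss,
                                          MultipleNegativesRankingLoss)

model = SentenceTransformer("Omartificial-Intelligence-Space/SA-BERT-V1")
train_ds = load_dataset(
    "Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train"
)

# In-batch negatives: every other example in the batch acts as a negative.
base_loss = MultipleNegativesRankingLoss(model)
# Matryoshka: also supervise truncated prefixes of the embedding
# (here only the full 768 dims, matching the released config).
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_ratio=0.1,
    fp16=True,
)
trainer = SentenceTransformerTrainer(
    model=model, args=args, train_dataset=train_ds, loss=loss
)
trainer.train()
```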
---

## 🧪 Evaluation

SABER was evaluated on two benchmarks:

### A) STS Evaluation (Saudi Paragraph-Level Dataset)

**Dataset:** 1,000 Saudi-dialect samples scored for similarity on a 0–5 scale.

| Metric | Score |
| :--- | :--- |
| **Pearson** | **0.9189** |
| **Spearman** | **0.9045** |
| **MAE** | 1.69 |
| **MSE** | 3.82 |

*These results surpass: ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.*

### B) Triplet Evaluation

Triplets were derived from the STS data by taking pairs with score ≥ 3 as positives and pairs with score ≤ 1 as negatives.

| Metric | Score |
| :--- | :--- |
| **Basic Accuracy** | 0.9899 |
| **Margin > 0.05** | 0.9845 |
| **Margin > 0.10** | 0.9781 |
| **Margin > 0.20** | 0.9609 |

*Excellent separation across strict thresholds.*
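As a sketch of how the margin metrics above can be computed: a triplet counts as correct when the anchor-positive similarity exceeds the anchor-negative similarity by more than the margin (margin 0 corresponds to basic accuracy). The toy triplet below is a placeholder; the real evaluation used triplets derived from the STS set as described above.

```python
# Minimal sketch of the triplet margin evaluation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1"
)

# Hypothetical triplets: (anchor, positive, negative) strings.
triplets = [
    ("ودي أسافر للرياض", "أفكر أروح الرياض قريب", "الجو اليوم حار مرة"),
]
anchors, positives, negatives = map(list, zip(*triplets))

# normalize_embeddings=True makes dot products equal cosine similarities.
ea = model.encode(anchors, normalize_embeddings=True)
ep = model.encode(positives, normalize_embeddings=True)
en = model.encode(negatives, normalize_embeddings=True)

sim_pos = (ea * ep).sum(axis=1)
sim_neg = (ea * en).sum(axis=1)

# margin 0.0 reproduces "Basic Accuracy"; the others match the table rows.
for margin in (0.0, 0.05, 0.10, 0.20):
    acc = float(np.mean(sim_pos - sim_neg > margin))
    print(f"margin > {margin:.2f}: accuracy = {acc:.4f}")
```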
---

## 🔍 Usage Example

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Define sentences (Saudi dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"

# Encode each sentence into a 768-dimensional vector
e1 = model.encode([s1])
e2 = model.encode([s2])

# Calculate cosine similarity between the two embeddings
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```

## Training Details

### Training Dataset

#### csv

* Dataset: csv
* Size: 2,964 training samples
* Columns: text1 and text2
* Approximate statistics based on the first 1000 samples:

| | text1 | text2 |
|:-----|:-------|:-------|
| type | string | string |

* Samples:

| text1 | text2 |
|:------|:------|
| هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
| ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
| عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |

* Loss: [MatryoshkaLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768
    ],
    "matryoshka_weights": [
        1
    ],
    "n_dims_per_step": -1
}
```

## 📌 Commercial Use

Commercial use of this model is **not permitted** under the CC BY-NC 4.0 license.

For commercial licensing, partnerships, or enterprise use, please contact:

📩 **eng.omarnj@gmail.com**

## Citation

If you use this model in academic work, please cite:

```bibtex
@inproceedings{nacar-saber-2025,
  title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
  author = "Nacar, Omer",
  year = "2025",
  url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss

```bibtex
@misc{kusupati2022matryoshka,
  title = {Matryoshka Representation Learning},
  author = {Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
  year = {2022},
  eprint = {2205.13147},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
  title = {Efficient Natural Language Response Suggestion for Smart Reply},
  author = {Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
  year = {2017},
  eprint = {1705.00652},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}
```