---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:2964
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: Omartificial-Intelligence-Space/SA-BERT-V1
widget:
  - source_sentence: كم تكلفة رحلة بحرية ليوم؟
    sentences:
      - الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق.
      - أبي محلات فيها بضاعة عالمية مشهورة.
      - بكم أسعار الجولات البحرية اليومية؟
  - source_sentence: الفعاليات الشعبية تختلف حسب المناسبات.
    sentences:
      - معطف المختبر حقي اختفى وأبي أشتري بديل
      - بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب
      - بعض المناطق تتميز بطرق احتفال خاصة بها.
  - source_sentence: الأسوار القديمة كانت تحمي المدن زمان.
    sentences:
      - بجلس أصلّحها قبل أرسلها
      - بعض المدن احتفظت بأبوابها التاريخية.
      - بجلس أشتغل عليها وسط اليوم
  - source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي
    sentences:
      - أبغى أشارك بجولة سفاري بالربع الخالي
      - هذا التمرين ضروري لنحت منطقة البطن والخصر.
      - ودي أعرف عن فنادق فخمة بالدمام
  - source_sentence: أبي طرحة جديدة لونها سماوي فاتح.
    sentences:
      - المشاوي عندهم متبلة صح وتحسها طازجة
      - ريحة المعطرات هذي قوية وتقعد في الغرف؟
      - أدور شيلة لونها أزرق فاتح زي السماء.
pipeline_tag: feature-extraction
library_name: sentence-transformers
license: cc-by-nc-4.0
gated: true
extra_gated_prompt: Please provide the required information to access this model
extra_gated_fields:
  Full Name:
    type: text
  Affiliation / Company:
    type: text
  Email Address:
    type: text
  Intended Use:
    type: select
    options:
      - Research
      - Education
      - Commercial Exploration
      - Academic Project
      - Other
extra_gated_heading: "Access Request: Provide Required Information"
extra_gated_description: Before accessing this model, please complete the form below.
extra_gated_button_content: Submit Access Request
datasets:
  - Omartificial-Intelligence-Space/SaudiDialect-Triplet-21
language:
  - ar
metrics:
  - mse
  - mae
---

# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1)


## 🧩 Summary

SABER-v0.1 (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi-dialect semantic embedding model, fine-tuned from SA-BERT with MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning over a high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.

SABER transforms a standard Masked Language Model (MLM) into a powerful semantic encoder capable of capturing deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects. The model achieves state-of-the-art results on both long-paragraph STS evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.

## 🏗️ Architecture & Build Pipeline

SABER is built in a two-stage optimization pipeline: first, MARBERT-V2 was adapted via Masked Language Modeling (MLM) on 500k Saudi sentences to create the domain-specialized SA-BERT; then SA-BERT underwent deep semantic optimization with MultipleNegativesRankingLoss (MNRL) and Matryoshka Representation Learning on curated triplets to produce the final embedding model.

*SABER Training Pipeline*

SABER is designed for:

  • Semantic search
  • Retrieval-Augmented Generation (RAG)
  • Clustering
  • Intent detection
  • Semantic similarity
  • Document & paragraph embedding
  • Ranking and re-ranking systems
  • Multi-domain Saudi-language applications
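The retrieval-style use cases above all reduce to ranking corpus embeddings by cosine similarity against a query embedding. A minimal NumPy sketch of that ranking step (the toy 2-d vectors stand in for real `model.encode(...)` outputs, which for SABER are 768-dimensional):

```python
import numpy as np

def top_k(query_emb, corpus_embs, k=2):
    """Return indices and scores of the k most cosine-similar corpus rows."""
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                   # cosine similarity of each corpus row vs. the query
    idx = np.argsort(-scores)[:k]    # best-first
    return idx, scores[idx]

# Toy 3-document corpus in a 2-d embedding space
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.1])
idx, scores = top_k(query, corpus, k=2)
print(idx)  # nearest documents first
```

In practice `query` and `corpus` would come from `model.encode`, and a vector index (e.g. FAISS) would replace the brute-force sort at scale.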

This release is v0.1 — the first public version of SABER.

## 📌 Model Details

  • Model Name: SABER (Saudi Semantic Embedding)
  • Version: v0.1
  • Base Model: SA-BERT-V1 (MARBERT-V2 adapted to Saudi data via MLM)
  • Language: Arabic (Saudi Dialects: Najdi, Hijazi, Gulf)
  • Task: Sentence Embeddings, Semantic Similarity, Retrieval
  • Training Objective: MNRL + Matryoshka Loss
  • Embedding Dimension: 768
  • License: CC BY-NC 4.0
  • Maintainer: Omartificial-Intelligence-Space

## 🧠 Motivation

Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by:

  1. Training specifically on Saudi-dialect triplet data.
  2. Leveraging modern contrastive learning.
  3. Creating robust embeddings suitable for production and research.

This model is the result of extensive evaluation across STS, triplets, and domain-specific tests.


## ⚠️ Limitations

  1. Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.
  2. Scope: Embeddings focus on semantic similarity, not syntax or classification.
  3. Input Length: Long multi-document retrieval requires chunking.
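For the input-length limitation, a naive whitespace chunker illustrates the idea; production systems usually chunk by tokenizer tokens rather than words, and the function name and window sizes below are illustrative only:

```python
def chunk_text(text, max_words=100, overlap=20):
    """Split text into overlapping word windows so each chunk fits the encoder."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Each chunk is then embedded separately, and retrieval scores are aggregated (e.g. max over chunks) per document.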

## 📚 Training Data

SABER was trained on Omartificial-Intelligence-Space/SaudiDialect-Triplet-21, which contains:

  • 2964 triplets (Anchor, Positive, Negative)
  • 21 domains, including:
    • Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc.
  • Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf)
  • Real-world conversational phrasing
  • Carefully curated positive/negative pairs

The dataset includes natural variations in:

  • Word choice
  • Dialect morphology
  • Sentence structure
  • Discourse context
  • Multi-sentence reasoning
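Each training record is an (Anchor, Positive, Negative) triplet. The sketch below shows one such record built from this card's own widget examples; the dict keys are illustrative, not necessarily the dataset's actual column names:

```python
# One illustrative triplet (field names assumed, not the dataset's schema)
triplet = {
    "anchor":   "ودي أجرب رحلة سفاري بصحراء الربع الخالي",  # "I'd like to try a safari trip in the Empty Quarter"
    "positive": "أبغى أشارك بجولة سفاري بالربع الخالي",      # paraphrase: same intent, different dialect phrasing
    "negative": "ودي أعرف عن فنادق فخمة بالدمام",            # same domain (travel) but a different intent
}

def is_well_formed(t):
    """Basic sanity check: all three fields present and mutually distinct."""
    return {"anchor", "positive", "negative"} <= t.keys() and len(set(t.values())) == 3
```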

## 🔧 Training Methodology

SABER was fine-tuned using:

  1. MultipleNegativesRankingLoss (MNRL)

    • Transforms the embedding space so similar pairs cluster tightly.
    • Each batch uses in-batch negatives, dramatically improving separation.
  2. Matryoshka Representation Learning

    • Ensures embeddings remain meaningful across different vector truncation sizes.
  3. Triplet Ranking Optimization

    • Anchor–Positive similarity maximized.
    • Anchor–Negative similarity minimized.
    • Margin-based structure preserved.
  4. Optimizer & Hyperparameters

| Hyperparameter | Value |
|---|---|
| Batch Size | 16 |
| Epochs | 3 |
| Loss | MNRL + Matryoshka |
| Precision | FP16 |
| Negative Sampling | In-batch |
| Gradient Clipping | Stable defaults |
| Warmup Ratio | 0.1 |
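The in-batch negative mechanism of step 1 can be sketched in NumPy: scaled cosine similarities form a batch×batch matrix, and cross-entropy pushes each diagonal (anchor, positive) pair above its off-diagonal in-batch negatives. This is a sketch of the idea, not the sentence-transformers implementation, and the scale of 20 is an assumed common default:

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    """In-batch MNRL: each anchor's own positive is the target class;
    every other positive in the batch serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                    # (batch, batch) cosine similarities
    sims -= sims.max(axis=1, keepdims=True)     # numerical stability for softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())  # correct pairs lie on the diagonal
```

With perfectly aligned pairs the loss is near zero; shuffling the positives against their anchors makes it large, which is exactly the separation pressure MNRL applies.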

## 🧪 Evaluation

SABER was evaluated on two benchmarks:

### A) STS Evaluation (Saudi Paragraph-Level Dataset)

Dataset: 1,000 sentence pairs in Saudi dialect, each scored 0–5 for similarity.

| Metric | Score |
|---|---|
| Pearson | 0.9189 |
| Spearman | 0.9045 |
| MAE | 1.69 |
| MSE | 3.82 |

These results surpass: ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.

### B) Triplet Evaluation

Triplets were derived from the STS data: pairs scored ≥ 3 served as positives and pairs scored ≤ 1 as negatives.

| Metric | Score |
|---|---|
| Basic Accuracy | 0.9899 |
| Margin > 0.05 | 0.9845 |
| Margin > 0.10 | 0.9781 |
| Margin > 0.20 | 0.9609 |

Excellent separation across strict thresholds.
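The margin metrics above count the fraction of triplets whose anchor–positive similarity beats the anchor–negative similarity by a given threshold. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def margin_accuracy(anchors, positives, negatives, margin=0.0):
    """Fraction of triplets where cos(anchor, positive) exceeds
    cos(anchor, negative) by more than `margin`."""
    def cos(x, y):
        return (x * y).sum(axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    return float((cos(anchors, positives) - cos(anchors, negatives) > margin).mean())
```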


## 🔍 Usage Example

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the model
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Define sentences (Saudi Dialect)
s1 = "ودي أسافر للرياض الأسبوع الجاي"
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم"

# Encode
e1 = model.encode([s1])
e2 = model.encode([s2])

# Calculate similarity
sim = cosine_similarity(e1, e2)[0][0]
print("Cosine Similarity:", sim)
```

## Training Details

### Training Dataset

  • Dataset: Omartificial-Intelligence-Space/SaudiDialect-Triplet-21 (loaded from CSV)
  • Size: 2,964 training samples
  • Columns: text1 and text2
  • Approximate statistics based on the first 1000 samples:

|  | text1 | text2 |
|---|---|---|
| type | string | string |
| min | 5 tokens | 4 tokens |
| mean | 10.36 tokens | 10.28 tokens |
| max | 22 tokens | 19 tokens |
  • Samples:

| text1 | text2 |
|---|---|
| هل فيه رحلات بحرية للأطفال في جدة؟ | ودي أعرف عن جولات بحرية للأطفال في جدة |
| ودي أحجز تذكرة طيران للرياض الأسبوع الجاي | ناوي أشتري تذكرة للرياض الأسبوع الجاي |
| عطوني أفضل فندق قريب من مطار جدة | أبي فندق قريب من المطار |
  • Loss: MatryoshkaLoss with these parameters:

```json
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768
    ],
    "matryoshka_weights": [
        1
    ],
    "n_dims_per_step": -1
}
```
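Matryoshka training applies the inner loss to prefix slices of the embedding and sums the results with per-dimension weights; with `matryoshka_dims` of `[768]` and weight 1 as configured above, it reduces to plain MNRL at full dimension. A NumPy sketch of the weighting scheme, where the inner loss is a simplified stand-in for MNRL:

```python
import numpy as np

def matched_pair_loss(a, p):
    """Stand-in inner loss: 1 - mean cosine similarity of matched pairs
    (the actual recipe uses MultipleNegativesRankingLoss here)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    return 1.0 - (a * p).sum(axis=1).mean()

def matryoshka_loss(a, p, dims=(768,), weights=(1.0,)):
    """Evaluate the inner loss on each prefix slice and combine with weights."""
    return sum(w * matched_pair_loss(a[:, :d], p[:, :d])
               for d, w in zip(dims, weights))
```

Training with several `dims` (e.g. 768 and 256) is what makes truncated embeddings remain usable; this release trains the 768-dimensional slice only.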
    

## 📌 Commercial Use

Commercial use of this model is not permitted under the CC BY-NC 4.0 license.
For commercial licensing, partnerships, or enterprise use, please contact:

📩 [email protected]

## Citation

If you use this model in academic work, please cite:

```bibtex
@inproceedings{nacar-saber-2025,
    title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
    author = "Nacar, Omer",
    year = "2025",
    url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
```

### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

### MatryoshkaLoss

```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```