|
|
--- |
|
|
tags: |
|
|
- sentence-transformers |
|
|
- sentence-similarity |
|
|
- feature-extraction |
|
|
- dense |
|
|
- generated_from_trainer |
|
|
- dataset_size:2964 |
|
|
- loss:MatryoshkaLoss |
|
|
- loss:MultipleNegativesRankingLoss |
|
|
base_model: Omartificial-Intelligence-Space/SA-BERT-V1 |
|
|
widget: |
|
|
- source_sentence: كم تكلفة رحلة بحرية ليوم؟ |
|
|
sentences: |
|
|
- الباحثين يحلّلوا تأثير البيئة على اختلاف العادات بين المناطق. |
|
|
- أبي محلات فيها بضاعة عالمية مشهورة. |
|
|
- بكم أسعار الجولات البحرية اليومية؟ |
|
|
- source_sentence: الفعاليات الشعبية تختلف حسب المناسبات. |
|
|
sentences: |
|
|
- معطف المختبر حقي اختفى وأبي أشتري بديل |
|
|
- بحط كل البنود المطلوبة وبراجع الميزانية عشان نرفع الطلب |
|
|
- بعض المناطق تتميز بطرق احتفال خاصة بها. |
|
|
- source_sentence: الأسوار القديمة كانت تحمي المدن زمان. |
|
|
sentences: |
|
|
- بجلس أصلّحها قبل أرسلها |
|
|
- بعض المدن احتفظت بأبوابها التاريخية. |
|
|
- بجلس أشتغل عليها وسط اليوم |
|
|
- source_sentence: ودي أجرب رحلة سفاري بصحراء الربع الخالي |
|
|
sentences: |
|
|
- أبغى أشارك بجولة سفاري بالربع الخالي |
|
|
- هذا التمرين ضروري لنحت منطقة البطن والخصر. |
|
|
- ودي أعرف عن فنادق فخمة بالدمام |
|
|
- source_sentence: أبي طرحة جديدة لونها سماوي فاتح. |
|
|
sentences: |
|
|
- المشاوي عندهم متبلة صح وتحسها طازجة |
|
|
- ريحة المعطرات هذي قوية وتقعد في الغرف؟ |
|
|
- أدور شيلة لونها أزرق فاتح زي السماء. |
|
|
pipeline_tag: feature-extraction |
|
|
library_name: sentence-transformers |
|
|
license: cc-by-nc-4.0 |
|
|
gated: true |
|
|
extra_gated_prompt: Please provide the required information to access this model |
|
|
extra_gated_fields: |
|
|
Full Name: |
|
|
type: text |
|
|
Affiliation / Company: |
|
|
type: text |
|
|
Email Address: |
|
|
type: text |
|
|
Intended Use: |
|
|
type: select |
|
|
options: |
|
|
- Research |
|
|
- Education |
|
|
- Commercial Exploration |
|
|
- Academic Project |
|
|
- Other |
|
|
extra_gated_heading: Access Request – Provide Required Information |
|
|
extra_gated_description: Before accessing this model, please complete the form below. |
|
|
extra_gated_button_content: Submit Access Request |
|
|
datasets: |
|
|
- Omartificial-Intelligence-Space/SaudiDialect-Triplet-21 |
|
|
language: |
|
|
- ar |
|
|
metrics: |
|
|
- mse |
|
|
- mae |
|
|
--- |
|
|
|
|
|
|
|
|
# 🏷️ SABER: Saudi Semantic Embedding Model (v0.1) |
|
|
|
|
|
 |
|
|
|
|
|
## 🧩 Summary |
|
|
|
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/x1FdbE8bYGVQFfY01f1Nk.png" width="175" align="left"/> |
|
|
|
|
|
**SABER-v0.1** (Saudi Arabic BERT Embeddings for Retrieval) is a state-of-the-art Saudi dialect semantic embedding model, fine-tuned from SA-BERT using **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** over a high-quality Saudi triplet dataset spanning 21 real-life Saudi domains.
|
|
|
|
|
|
|
|
**SABER** transforms a standard Masked Language Model (MLM) into a powerful semantic encoder capable of capturing deep contextual meaning across Najdi, Hijazi, Gulf-influenced, and mixed Saudi dialects.

The model achieves state-of-the-art results on both long-paragraph `STS` evaluation and triplet margin separation, significantly outperforming strong baselines such as ATM2, GATE, LaBSE, mE5-base, MarBERT, and MiniLM.
|
|
|
|
|
## 🏗️ Architecture & Build Pipeline |
|
|
|
|
|
SABER is built with a rigorous two-stage optimization pipeline: first, we adapted **MARBERT-V2** via Masked Language Modeling (MLM) on **500k Saudi sentences** to create the domain-specialized **SA-BERT**; we then applied deep semantic optimization with **MultipleNegativesRankingLoss (MNRL)** and **Matryoshka Representation Learning** on curated triplets to produce the final embedding model.
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/zC5JCPbsnIz-jflmTWae8.png" alt="SABER Training Pipeline" width="500"/> |
|
|
</div> |
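For illustration, here is a minimal sketch of the stage-one MLM adaptation using Hugging Face `transformers`. Only the base model (MARBERT-V2) and the MLM objective come from the description above; the corpus file name, sequence length, masking probability, and training hyperparameters are placeholder assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# "saudi_sentences.txt" is a placeholder standing in for the 500k-sentence corpus.
ds = load_dataset("text", data_files={"train": "saudi_sentences.txt"}, split="train")
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

# Standard 15% random masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="sa-bert-v1", per_device_train_batch_size=32,
                         num_train_epochs=1, fp16=True)
Trainer(model=model, args=args, train_dataset=ds, data_collator=collator).train()
```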
|
|
|
|
|
--- |
|
|
|
|
|
**SABER is designed for:** |
|
|
* Semantic search |
|
|
* Retrieval-Augmented Generation (RAG) |
|
|
* Clustering |
|
|
* Intent detection |
|
|
* Semantic similarity |
|
|
* Document & paragraph embedding |
|
|
* Ranking and re-ranking systems |
|
|
* Multi-domain Saudi-language applications |
|
|
|
|
|
*This release is v0.1 — the first public version of SABER.* |
|
|
|
|
|
## 📌 Model Details |
|
|
|
|
|
* **Model Name:** SABER (Saudi Semantic Embedding) |
|
|
* **Version:** v0.1 |
|
|
* **Base Model:** SA-BERT-V1 (MARBERT-V2 adapted to Saudi data via MLM)
|
|
* **Language:** Arabic (Saudi Dialects: Najdi, Hijazi, Gulf) |
|
|
* **Task:** Sentence Embeddings, Semantic Similarity, Retrieval |
|
|
* **Training Objective:** MNRL + Matryoshka Loss
|
|
* **Embedding Dimension:** 768 |
|
|
* **License:** CC BY-NC 4.0
|
|
* **Maintainer:** Omartificial-Intelligence-Space |
|
|
|
|
|
--- |
|
|
|
|
|
## 🧠 Motivation |
|
|
|
|
|
Saudi dialect NLP remains an underdeveloped space. Most embeddings struggle with dialectal variation, idiomatic expressions, and multi-sentence reasoning. SABER was designed to fill this gap by: |
|
|
|
|
|
1. Training specifically on Saudi-dialect triplet data. |
|
|
2. Leveraging modern contrastive learning. |
|
|
3. Creating robust embeddings suitable for production and research. |
|
|
|
|
|
The model was validated through extensive evaluation across STS, triplet, and domain-specific tests.
|
|
|
|
|
--- |
|
|
## ⚠️ Limitations
|
|
|
|
|
1. Regional Scope: Performance may degrade on Levantine, Egyptian, or Maghrebi dialects.

2. Task Scope: Embeddings focus on semantic similarity, not syntax or classification.

3. Input Length: Long multi-document retrieval requires chunking (see the sketch below).
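As a minimal sketch of the chunking mentioned in limitation 3 (the window and overlap sizes are illustrative assumptions, not values used in training):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

def chunk(text: str, max_words: int = 100, overlap: int = 20) -> list[str]:
    """Split a long document into overlapping word windows."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

long_document = "..."  # any long Saudi-dialect document
query = "كم تكلفة رحلة بحرية ليوم؟"

chunk_embs = model.encode(chunk(long_document), normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# Score the document by its best-matching chunk (max over chunk similarities).
best_score = float((chunk_embs @ query_emb).max())
print(best_score)
```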
|
|
--- |
|
|
|
|
|
## 📚 Training Data |
|
|
|
|
|
**SABER** was trained on [Omartificial-Intelligence-Space/SaudiDialect-Triplet-21](https://huggingface.co/datasets/Omartificial-Intelligence-Space/SaudiDialect-Triplet-21), which contains: |
|
|
|
|
|
* **2,964 triplets** (Anchor, Positive, Negative)
|
|
* **21 domains**, including: |
|
|
* Travel, Food, Shopping, Work & Office, Education, Culture, Weather, Sports, Technology, Medical, Government, Social Events, Anthropology, etc. |
|
|
* Mixed Saudi dialect sentences (Najdi + Hijazi + Gulf) |
|
|
* Real-world conversational phrasing |
|
|
* Carefully curated positive/negative pairs |
|
|
|
|
|
**The dataset includes natural variations in:** |
|
|
* Word choice |
|
|
* Dialect morphology |
|
|
* Sentence structure |
|
|
* Discourse context |
|
|
* Multi-sentence reasoning |
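For reference, a minimal sketch for loading and inspecting the triplets with `datasets` (the split name is an assumption):

```python
from datasets import load_dataset

ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")
print(ds)     # expected: roughly 2,964 rows
print(ds[0])  # one (Anchor, Positive, Negative) triplet
```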
|
|
|
|
|
--- |
|
|
|
|
|
## 🔧 Training Methodology |
|
|
|
|
|
SABER was fine-tuned using the components below; a minimal training sketch combining them follows the hyperparameter table:
|
|
|
|
|
1. **MultipleNegativesRankingLoss (MNRL)**
|
|
* Transforms the embedding space so similar pairs cluster tightly. |
|
|
* Each batch uses in-batch negatives, dramatically improving separation. |
|
|
|
|
|
2. **Matryoshka Representation Learning** |
|
|
* Ensures embeddings remain meaningful across different vector truncation sizes. |
|
|
|
|
|
3. **Triplet Ranking Optimization** |
|
|
* Anchor–Positive similarity maximized. |
|
|
* Anchor–Negative similarity minimized. |
|
|
* Margin-based structure preserved. |
|
|
|
|
|
4. **Optimizer & Hyperparameters** |
|
|
|
|
|
| Hyperparameter | Value | |
|
|
| :--- | :--- | |
|
|
| **Batch Size** | 16 | |
|
|
| **Epochs** | 3 | |
|
|
| **Loss** | MNRL + Matryoshka |
|
|
| **Precision** | FP16 | |
|
|
| **Negative Sampling** | In-batch | |
|
|
| **Gradient Clipping** | Library default |
|
|
| **Warmup Ratio** | 0.1 | |
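A minimal stage-two training sketch using the `sentence-transformers` v3 trainer, combining the losses and the hyperparameters from the table above. The output directory is a placeholder, and any argument not listed in the table is an assumption:

```python
from datasets import load_dataset
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Omartificial-Intelligence-Space/SA-BERT-V1")
train_ds = load_dataset("Omartificial-Intelligence-Space/SaudiDialect-Triplet-21", split="train")

# MNRL with in-batch negatives, wrapped in MatryoshkaLoss (dims from the config below).
loss = MatryoshkaLoss(model, MultipleNegativesRankingLoss(model), matryoshka_dims=[768])

args = SentenceTransformerTrainingArguments(
    output_dir="saber-v0.1",          # placeholder
    num_train_epochs=3,               # from the table above
    per_device_train_batch_size=16,   # from the table above
    warmup_ratio=0.1,                 # from the table above
    fp16=True,                        # from the table above
)

SentenceTransformerTrainer(model=model, args=args, train_dataset=train_ds, loss=loss).train()
```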
|
|
|
|
|
--- |
|
|
|
|
|
## 🧪 Evaluation |
|
|
|
|
|
SABER was evaluated on two benchmarks: |
|
|
|
|
|
### A) STS Evaluation (Saudi Paragraph-Level Dataset) |
|
|
**Dataset:** 1,000 generated Saudi-dialect samples scored for similarity on a 0–5 scale.
|
|
|
|
|
| Metric | Score | |
|
|
| :--- | :--- | |
|
|
| **Pearson** | **0.9189** | |
|
|
| **Spearman** | **0.9045** | |
|
|
| **MAE** | 1.69 | |
|
|
| **MSE** | 3.82 | |
|
|
|
|
|
*These results surpass ATM2, GATE, LaBSE, MarBERT, mE5-base, and MiniLM.*
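A sketch of how these metrics can be computed on your own data, assuming sentence pairs with gold 0–5 scores. The two pairs below are illustrative, and rescaling cosine similarity to the 0–5 range for MAE/MSE is our assumption about how the reported errors were derived:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Illustrative pairs with gold scores on the 0-5 scale (not the real benchmark).
pairs = [
    ("ودي أسافر للرياض الأسبوع الجاي", "أفكر أروح الرياض قريب عشان مشوار مهم", 4.5),
    ("ودي أسافر للرياض الأسبوع الجاي", "المشاوي عندهم متبلة صح وتحسها طازجة", 0.5),
]
gold = np.array([p[2] for p in pairs])

e1 = model.encode([p[0] for p in pairs], normalize_embeddings=True)
e2 = model.encode([p[1] for p in pairs], normalize_embeddings=True)
cos = (e1 * e2).sum(axis=1)  # cosine similarity of each pair

print("Pearson: ", pearsonr(cos, gold)[0])
print("Spearman:", spearmanr(cos, gold)[0])

# MAE/MSE assume cosine similarity rescaled to the 0-5 range (an assumption).
pred = np.clip(cos, 0, 1) * 5.0
print("MAE:", np.abs(pred - gold).mean())
print("MSE:", ((pred - gold) ** 2).mean())
```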
|
|
|
|
|
### B) Triplet Evaluation |
|
|
Triplets were derived from the STS data by treating pairs scored ≥ 3 as positives and pairs scored ≤ 1 as negatives.
|
|
|
|
|
| Metric | Score | |
|
|
| :--- | :--- | |
|
|
| **Basic Accuracy** | 0.9899 | |
|
|
| **Margin > 0.05** | 0.9845 | |
|
|
| **Margin > 0.10** | 0.9781 | |
|
|
| **Margin > 0.20** | 0.9609 | |
|
|
|
|
|
*Excellent separation across strict thresholds.* |
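A sketch of how these margin accuracies can be computed, using one illustrative triplet taken from the widget examples above rather than the actual benchmark:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

# Illustrative (anchor, positive, negative) triplets, not the real evaluation set.
triplets = [
    ("كم تكلفة رحلة بحرية ليوم؟",
     "بكم أسعار الجولات البحرية اليومية؟",
     "أبي محلات فيها بضاعة عالمية مشهورة."),
]

def margin_accuracy(triplets, margin=0.0):
    """Fraction of triplets where sim(anchor, pos) - sim(anchor, neg) > margin."""
    hits = 0
    for anchor, pos, neg in triplets:
        a, p, n = model.encode([anchor, pos, neg], normalize_embeddings=True)
        hits += (np.dot(a, p) - np.dot(a, n)) > margin
    return hits / len(triplets)

for m in (0.0, 0.05, 0.10, 0.20):
    print(f"margin > {m:.2f}: {margin_accuracy(triplets, m):.4f}")
```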
|
|
|
|
|
--- |
|
|
|
|
|
## 🔍 Usage Example |
|
|
|
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
from sklearn.metrics.pairwise import cosine_similarity |
|
|
|
|
|
# Load the model |
|
|
model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1") |
|
|
|
|
|
# Define sentences (Saudi Dialect) |
|
|
s1 = "ودي أسافر للرياض الأسبوع الجاي" |
|
|
s2 = "أفكر أروح الرياض قريب عشان مشوار مهم" |
|
|
|
|
|
# Encode |
|
|
e1 = model.encode([s1]) |
|
|
e2 = model.encode([s2]) |
|
|
|
|
|
# Calculate similarity |
|
|
sim = cosine_similarity(e1, e2)[0][0] |
|
|
print("Cosine Similarity:", sim) |
|
|
``` |
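For retrieval use cases such as semantic search or RAG, a minimal sketch over a small in-memory corpus (the corpus sentences are illustrative, taken from the widget examples above):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1")

corpus = [
    "بكم أسعار الجولات البحرية اليومية؟",
    "أبي فندق قريب من المطار",
    "المشاوي عندهم متبلة صح وتحسها طازجة",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

query_emb = model.encode(["كم تكلفة رحلة بحرية ليوم؟"],
                         convert_to_tensor=True, normalize_embeddings=True)

# Retrieve the top-2 most similar corpus sentences for the query.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```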
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Dataset |
|
|
|
|
|
#### csv |
|
|
|
|
|
* Dataset: csv (a local export of [SaudiDialect-Triplet-21](https://huggingface.co/datasets/Omartificial-Intelligence-Space/SaudiDialect-Triplet-21))
|
|
* Size: 2,964 training samples |
|
|
* Columns: <code>text1</code> and <code>text2</code> |
|
|
* Approximate statistics based on the first 1000 samples: |
|
|
| | text1 | text2 | |
|
|
|:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------| |
|
|
| type | string | string | |
|
|
| details | <ul><li>min: 5 tokens</li><li>mean: 10.36 tokens</li><li>max: 22 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 10.28 tokens</li><li>max: 19 tokens</li></ul> | |
|
|
* Samples: |
|
|
| text1 | text2 | |
|
|
|:-------------------------------------------------------|:----------------------------------------------------| |
|
|
| <code>هل فيه رحلات بحرية للأطفال في جدة؟</code> | <code>ودي أعرف عن جولات بحرية للأطفال في جدة</code> | |
|
|
| <code>ودي أحجز تذكرة طيران للرياض الأسبوع الجاي</code> | <code>ناوي أشتري تذكرة للرياض الأسبوع الجاي</code> | |
|
|
| <code>عطوني أفضل فندق قريب من مطار جدة</code> | <code>أبي فندق قريب من المطار</code> | |
|
|
* Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters: |
|
|
```json |
|
|
{ |
|
|
"loss": "MultipleNegativesRankingLoss", |
|
|
"matryoshka_dims": [ |
|
|
768 |
|
|
], |
|
|
"matryoshka_weights": [ |
|
|
1 |
|
|
], |
|
|
"n_dims_per_step": -1 |
|
|
} |
|
|
``` |
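This run lists 768 as the only Matryoshka dimension, so lower-dimensional truncation was not explicitly optimized for this checkpoint. For reference, a minimal sketch of how Matryoshka-style truncation is used at inference time in `sentence-transformers` (the 256-dimension value is purely illustrative):

```python
from sentence_transformers import SentenceTransformer

# truncate_dim keeps only the first N embedding dimensions at encode time.
model = SentenceTransformer(
    "Omartificial-Intelligence-Space/Saudi-Semantic-Embedding-v0.1",
    truncate_dim=256,  # illustrative; this checkpoint was trained with dims=[768] only
)
emb = model.encode(["ودي أسافر للرياض الأسبوع الجاي"])
print(emb.shape)  # (1, 256)
```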
|
|
|
|
|
## 📌 License & Commercial Use
|
|
|
|
|
|
|
Commercial use of this model is **not permitted** under the CC BY-NC 4.0 license. |
|
|
For commercial licensing, partnerships, or enterprise use, please contact: |
|
|
|
|
|
📩 **[email protected]** |
|
|
|
|
|
## Citation

If you use this model in academic work, please cite:
|
|
|
|
|
```bibtex |
|
|
@misc{nacar-saber-2025,
    title = "Saudi Arabic Embedding Model for Semantic Similarity and Retrieval",
    author = "Nacar, Omer",
    year = "2025",
    url = "https://huggingface.co/Omartificial-Intelligence-Space/SA-STS-Embeddings-0.2B",
}
|
|
``` |
|
|
|
|
|
|
|
|
#### Sentence Transformers |
|
|
```bibtex |
|
|
@inproceedings{reimers-2019-sentence-bert, |
|
|
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", |
|
|
author = "Reimers, Nils and Gurevych, Iryna", |
|
|
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", |
|
|
month = "11", |
|
|
year = "2019", |
|
|
publisher = "Association for Computational Linguistics", |
|
|
url = "https://arxiv.org/abs/1908.10084", |
|
|
} |
|
|
``` |
|
|
|
|
|
#### MatryoshkaLoss |
|
|
```bibtex |
|
|
@misc{kusupati2024matryoshka, |
|
|
title={Matryoshka Representation Learning}, |
|
|
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi}, |
|
|
year={2024}, |
|
|
eprint={2205.13147}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG} |
|
|
} |
|
|
``` |
|
|
|
|
|
#### MultipleNegativesRankingLoss |
|
|
```bibtex |
|
|
@misc{henderson2017efficient, |
|
|
title={Efficient Natural Language Response Suggestion for Smart Reply}, |
|
|
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, |
|
|
year={2017}, |
|
|
eprint={1705.00652}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL} |
|
|
} |
|
|
``` |