Overview
sindhinltk is a pure-Python NLP library built specifically for the Sindhi language (Arabic script). Sindhi is spoken by ~80 million people but has had virtually no open-source NLP tooling — until now.
pip install sindhinltk
Zero dependencies. No model downloads. Works immediately.
What's Inside
| Module | Description |
|---|---|
| SindhiTokenizer | Regex-based word & sentence tokenizer with full Unicode Arabic range support |
| SindhiNormalizer | NFC normalization, diacritic (harakat) removal, whitespace cleanup |
| SindhiStemmer | Rule-based suffix stripper with 20+ Sindhi morphological rules, longest-match |
| SindhiStopwords | 143 stopwords across 10 semantic categories |
| SindhiSentiment | Lexicon-based sentiment with intensifier & negator handling, Sindhi labels |
| SindhiDatasets | Load bundled data assets (stopwords, sentiment lexicon) |
Quick Start
from sindhinltk.tokenizer import SindhiTokenizer
from sindhinltk.normalizer import SindhiNormalizer
from sindhinltk.stemmer import SindhiStemmer
from sindhinltk.stopwords import SindhiStopwords
from sindhinltk.sentiment import SindhiSentiment
text = "سنڌي ٻولي تمام سٺي ۽ قديم آهي"
# Tokenize
tok = SindhiTokenizer()
tokens = tok.tokenize(text)
# → ['سنڌي', 'ٻولي', 'تمام', 'سٺي', '۽', 'قديم', 'آهي']
# Normalize (strip diacritics)
norm = SindhiNormalizer()
clean = norm.normalize("ھوُ ھر روزَ", remove_diacritics=True)
# → 'ھو ھر روز'
# Remove stopwords
sw = SindhiStopwords()
content = sw.remove_stopwords(tokens)
# → ['سنڌي', 'ٻولي', 'سٺي', 'قديم']
# Stem
stm = SindhiStemmer()
stems = stm.stem_tokens(content)
# → ['سنڌي', 'ٻولي', 'سٺي', 'قديم']
# Sentiment
sa = SindhiSentiment()
sa.analyze(text) # → 'مثبت' (positive)
sa.score(text) # → 2.0
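For readers curious what a regex-based tokenizer over the Arabic Unicode range can look like, here is a minimal standalone sketch. It does not use sindhinltk, and the pattern, the chosen ranges, and the name `simple_tokenize` are assumptions for illustration; the library's actual pattern may differ.

```python
import re

# Assumed ranges: Arabic (U+0600-U+06FF) and Arabic Supplement (U+0750-U+077F).
# The negative lookahead skips a few common punctuation marks that live inside
# the Arabic block (comma, semicolon, question mark, full stop).
ARABIC_WORD = re.compile(
    r"(?:(?![\u060C\u061B\u061F\u06D4])[\u0600-\u06FF\u0750-\u077F])+"
)

def simple_tokenize(text):
    """Return runs of Arabic-script characters as word tokens (hypothetical helper)."""
    return ARABIC_WORD.findall(text)

tokens = simple_tokenize("سنڌي ٻولي تمام سٺي ۽ قديم آهي")
print(tokens)
```

On the sample sentence this yields seven tokens, with the Sindhi conjunction sign ۽ (U+06FD) kept as its own token because it falls inside the Arabic block.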
Sentiment Labels
| Label | Meaning | Score |
|---|---|---|
| مثبت | Positive | > 0 |
| منفي | Negative | < 0 |
| غير جانبدار | Neutral | = 0 |
Handles intensifiers (تمام, ڏاڍو) and negators (نه, ناهي) automatically:
sa.score("سٺو") # → 1.0
sa.score("تمام سٺو") # → 2.0 (intensified)
sa.score("نه سٺو") # → -1.0 (negated)
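The scores above are consistent with a simple scheme in which an intensifier doubles the next sentiment word and a negator flips its sign. The sketch below is a toy illustration of that idea only: the lexicon entries, the multiplier of 2.0, and the names `toy_score` and `toy_label` are assumptions, not the library's internals, and it only handles modifiers that come before the sentiment word.

```python
# Hypothetical toy lexicon and modifier tables (assumed values).
LEXICON = {"سٺو": 1.0, "خراب": -1.0}
INTENSIFIERS = {"تمام": 2.0, "ڏاڍو": 2.0}
NEGATORS = {"نه", "ناهي"}

def toy_score(text):
    """Sum lexicon scores, applying any preceding intensifier or negator."""
    total = 0.0
    multiplier = 1.0
    for token in text.split():
        if token in INTENSIFIERS:
            multiplier *= INTENSIFIERS[token]
        elif token in NEGATORS:
            multiplier *= -1.0
        elif token in LEXICON:
            total += multiplier * LEXICON[token]
            multiplier = 1.0
        else:
            # An unrelated word cancels any pending modifier.
            multiplier = 1.0
    return total

def toy_label(text):
    """Map a score to the Sindhi labels from the table above."""
    score = toy_score(text)
    if score > 0:
        return "مثبت"
    if score < 0:
        return "منفي"
    return "غير جانبدار"

print(toy_score("تمام سٺو"), toy_label("نه سٺو"))
```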
Stopword Categories
from sindhinltk.stopwords import SindhiStopwords
sw = SindhiStopwords()
sw.get_categories()
# → ['pronouns', 'demonstratives', 'postpositions', 'conjunctions',
# 'question_words', 'auxiliaries', 'negation', 'quantifiers',
# 'adverbs', 'particles']
sw.get_stopwords(category="pronouns")
# → {'مان', 'تون', 'هو', 'هوءَ', 'اسان', 'توهان', ...}
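One plausible way to organize categorized stopwords is a dictionary of sets keyed by category name. The sketch below is standalone and hypothetical: the category names come from the list above, but which words belong to which category, and the helper names `toy_get_stopwords` and `toy_remove_stopwords`, are assumptions.

```python
# Hypothetical mini-inventory; the real library ships 143 words in 10 categories.
TOY_STOPWORDS = {
    "pronouns": {"مان", "تون", "هو", "اسان", "توهان"},
    "conjunctions": {"۽"},
    "auxiliaries": {"آهي"},
    "quantifiers": {"تمام"},
}

def toy_get_stopwords(category=None):
    """Return one category's words, or the union of all categories."""
    if category is not None:
        return set(TOY_STOPWORDS[category])
    return set().union(*TOY_STOPWORDS.values())

def toy_remove_stopwords(tokens):
    """Drop every token that appears in any category, preserving order."""
    stop = toy_get_stopwords()
    return [t for t in tokens if t not in stop]

tokens = ["سنڌي", "ٻولي", "تمام", "سٺي", "۽", "قديم", "آهي"]
content = toy_remove_stopwords(tokens)
print(content)
```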
Stemmer Rules
The stemmer uses a longest-first rule set tuned for Sindhi morphology:
| Suffix | Example | Stem |
|---|---|---|
| يندڙ | ڪاوڙيندڙ | ڪاوڙ |
| ندي | هلندي | هل |
| ندا | وڃندا | وڃ |
| ندو | ڪندو | ڪ |
| يائين | ڪيائين | ڪ |
| پڻ | سچپڻ | سچ |
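Longest-first suffix stripping can be sketched in a few lines. This standalone example uses only the six suffixes from the table; the function name `toy_stem` and the rule that a suffix is never stripped when it would leave an empty stem are assumptions, and the library's full rule set is larger.

```python
# Suffixes taken from the table above (a subset of the library's rules).
SUFFIXES = ["يندڙ", "ندي", "ندا", "ندو", "يائين", "پڻ"]

def toy_stem(word, suffixes=SUFFIXES):
    """Strip the longest matching suffix, keeping at least one stem character."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

for example in ["ڪاوڙيندڙ", "هلندي", "وڃندا", "سچپڻ", "سنڌي"]:
    print(example, "->", toy_stem(example))
```

Trying longer suffixes first matters once the rule set contains a suffix that is itself the tail of a longer one; otherwise the shorter rule would fire early and leave part of the ending attached to the stem.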
Linguistic Background
- Script: Naskh Arabic with Sindhi-specific letters (ڄ ڃ ٻ ڦ ڳ ڱ ڻ ڏ ڊ ٺ ٽ ڇ)
- Morphology: Agglutinative verb system; one root generates 40+ surface forms
- Diacritics: Full harakat support (U+064B–U+065F, U+0670)
- RTL: Fully right-to-left, bidirectional safe
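The diacritic ranges listed above translate directly into a character class. The following standalone sketch combines NFC normalization, harakat removal, and whitespace cleanup using only the standard library; the name `toy_normalize` and the order of the steps are assumptions rather than the library's implementation.

```python
import re
import unicodedata

# Harakat ranges as listed above: U+064B-U+065F and U+0670.
HARAKAT = re.compile("[\u064B-\u065F\u0670]")

def toy_normalize(text, remove_diacritics=False):
    """NFC-normalize, optionally strip harakat, then collapse whitespace."""
    # NFC runs first so that decomposed letters such as alef + madda (U+0653)
    # are composed into a single code point before any marks are removed.
    text = unicodedata.normalize("NFC", text)
    if remove_diacritics:
        text = HARAKAT.sub("", text)
    return " ".join(text.split())

print(toy_normalize("ھوُ  ھر روزَ", remove_diacritics=True))
```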
Related Resources
| Resource | Description |
|---|---|
| sindhi-corpus-505m | 742K docs · ~505M tokens · largest open Sindhi corpus |
| SindhiLM-Tokenizer-v1 | BPE tokenizer merged into Qwen2.5-0.5B |
| GitHub | Source code, issues, contributions |
| PyPI | Install page |
Citation
If you use sindhinltk in research, please cite:
@software{meghwar2025sindhinltk,
author = {Aakash Meghwar},
title = {sindhinltk: A Natural Language Toolkit for Sindhi},
year = {2025},
publisher = {PyPI},
url = {https://pypi.org/project/sindhinltk/},
}
Author
Aakash Meghwar — Computational Linguist · NLP Engineer
MIT License · Building NLP tools for 80 million Sindhi speakers