🇵🇰 SindhiNLTK

The first open-source NLP toolkit for the Sindhi language

Overview

sindhinltk is a pure-Python NLP library built specifically for Sindhi written in Arabic script. Sindhi is spoken by roughly 80 million people, yet until now it has had virtually no open-source NLP tooling.

pip install sindhinltk

Zero dependencies. No model downloads. Works immediately.

What's Inside

Module	Description
`SindhiTokenizer`	Regex-based word & sentence tokenizer with full Unicode Arabic range support
`SindhiNormalizer`	NFC normalization, diacritic (harakat) removal, whitespace cleanup
`SindhiStemmer`	Rule-based suffix stripper — 20+ Sindhi morphological rules, longest-match
`SindhiStopwords`	143 stopwords across 10 semantic categories
`SindhiSentiment`	Lexicon-based sentiment with intensifier & negator handling, Sindhi labels
`SindhiDatasets`	Load bundled data assets (stopwords, sentiment lexicon)
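
To make the "regex-based" description concrete, here is a minimal standalone sketch of word and sentence tokenization for Arabic-script Sindhi using only the standard library. The names `word_tokenize` and `sent_tokenize`, the exact pattern, and the set of sentence-ending marks are assumptions for illustration; they are not taken from the sindhinltk source, whose rules may differ.

```python
import re

# Harakat (combining marks) are listed explicitly: Python's \w does not
# match combining marks, so a vocalised word would otherwise split at each one.
HARAKAT = "\u064B-\u065F\u0670"

# A token is either a run of word characters and harakat, or a single
# non-space symbol (punctuation, or the Sindhi "and" sign U+06FD).
TOKEN_RE = re.compile(rf"[\w{HARAKAT}]+|[^\w\s]")

# Assumed sentence enders: '.', '!', '?', Arabic question mark (U+061F)
# and Arabic full stop (U+06D4), followed by whitespace.
SENT_END_RE = re.compile(r"(?<=[.!?\u061F\u06D4])\s+")


def word_tokenize(text):
    """Return word and symbol tokens in reading order."""
    return TOKEN_RE.findall(text)


def sent_tokenize(text):
    """Split text after sentence-ending punctuation."""
    return [s for s in SENT_END_RE.split(text.strip()) if s]


if __name__ == "__main__":
    print(word_tokenize("سنڌي ٻولي تمام سٺي ۽ قديم آهي"))
```

Run on the Quick Start sentence, this sketch yields seven tokens, one per space-separated item. Punctuation such as the Arabic comma is split off as its own token because it is not a word character.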

Quick Start

from sindhinltk.tokenizer import SindhiTokenizer
from sindhinltk.normalizer import SindhiNormalizer
from sindhinltk.stemmer import SindhiStemmer
from sindhinltk.stopwords import SindhiStopwords
from sindhinltk.sentiment import SindhiSentiment

text = "سنڌي ٻولي تمام سٺي ۽ قديم آهي"

# Tokenize
tok = SindhiTokenizer()
tokens = tok.tokenize(text)
# → ['سنڌي', 'ٻولي', 'تمام', 'سٺي', '۽', 'قديم', 'آهي']

# Normalize (strip diacritics)
norm = SindhiNormalizer()
clean = norm.normalize("ھوُ ھر روزَ", remove_diacritics=True)
# → 'ھو ھر روز'

# Remove stopwords
sw = SindhiStopwords()
content = sw.remove_stopwords(tokens)
# → ['سنڌي', 'ٻولي', 'سٺي', 'قديم']

# Stem
stm = SindhiStemmer()
stems = stm.stem_tokens(content)
# → ['سنڌي', 'ٻولي', 'سٺي', 'قديم']

# Sentiment
sa = SindhiSentiment()
sa.analyze(text)   # → 'مثبت'   (positive)
sa.score(text)     # → 2.0

Sentiment Labels

Label	Meaning	Score
`مثبت`	Positive	> 0
`منفي`	Negative	< 0
`غير جانبدار`	Neutral	= 0

The analyzer handles intensifiers (تمام, ڏاڍو) and negators (نه, ناهي) automatically:

sa.score("سٺو")        # → 1.0
sa.score("تمام سٺو")   # → 2.0   (intensified)
sa.score("نه سٺو")     # → -1.0  (negated)
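
The three calls above are consistent with a simple left-to-right lexicon walk, sketched below as standalone code. The toy lexicon, the 2.0 intensifier factor, and the function names `score` and `label` are assumptions chosen to reproduce the example outputs; the library's real lexicon and weighting are not documented here and may differ. The sketch also only handles modifiers that precede the word they modify, so a trailing negator (as in a sentence ending with ناهي) is not covered.

```python
# Toy polarity lexicon (hypothetical entries for illustration only).
LEXICON = {"سٺو": 1.0, "سٺي": 1.0, "خراب": -1.0}
INTENSIFIERS = {"تمام", "ڏاڍو"}
NEGATORS = {"نه", "ناهي"}
INTENSIFIER_FACTOR = 2.0  # assumed so that an intensified +1.0 becomes +2.0


def score(text):
    """Sum lexicon polarities, applying any preceding modifiers."""
    total = 0.0
    multiplier = 1.0
    for token in text.split():
        if token in INTENSIFIERS:
            multiplier *= INTENSIFIER_FACTOR
        elif token in NEGATORS:
            multiplier *= -1.0
        elif token in LEXICON:
            total += multiplier * LEXICON[token]
            multiplier = 1.0  # modifiers apply to one sentiment word only
        else:
            multiplier = 1.0  # an unrelated word cancels pending modifiers
    return total


def label(text):
    """Map the score to the Sindhi labels from the table above."""
    s = score(text)
    if s > 0:
        return "مثبت"
    if s < 0:
        return "منفي"
    return "غير جانبدار"


if __name__ == "__main__":
    for sample in ("سٺو", "تمام سٺو", "نه سٺو"):
        print(sample, score(sample), label(sample))
```

Resetting the multiplier after every non-modifier token keeps an intensifier or negator from leaking onto a sentiment word several tokens later, which is one reasonable design choice among several.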

Stopword Categories

from sindhinltk.stopwords import SindhiStopwords
sw = SindhiStopwords()

sw.get_categories()
# → ['pronouns', 'demonstratives', 'postpositions', 'conjunctions',
#    'question_words', 'auxiliaries', 'negation', 'quantifiers',
#    'adverbs', 'particles']

sw.get_stopwords(category="pronouns")
# → {'مان', 'تون', 'هو', 'هوءَ', 'اسان', 'توهان', ...}
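
One plausible way to organise categorised stopwords is a mapping from category name to a set of words, sketched below as standalone code. The three categories shown reuse names from the list above, but the word-to-category assignments for ۽ and آهي, and the module-level functions themselves, are illustrative assumptions rather than the library's bundled data or implementation.

```python
# Hypothetical miniature stopword table: category name -> set of words.
STOPWORDS = {
    "pronouns": {"مان", "تون", "هو", "اسان", "توهان"},
    "conjunctions": {"۽"},
    "auxiliaries": {"آهي"},
}


def get_categories():
    """Return the category names in insertion order."""
    return list(STOPWORDS)


def get_stopwords(category=None):
    """Return one category's words, or the union of all categories."""
    if category is None:
        return set().union(*STOPWORDS.values())
    return set(STOPWORDS[category])


def remove_stopwords(tokens, category=None):
    """Drop tokens found in the selected stopword set, keeping order."""
    stop = get_stopwords(category)
    return [t for t in tokens if t not in stop]


if __name__ == "__main__":
    tokens = ["مان", "سنڌي", "۽", "ٻولي"]
    print(remove_stopwords(tokens))
    print(remove_stopwords(tokens, category="pronouns"))
```

Keeping each category as a set makes membership tests constant-time and lets a caller filter with only the categories relevant to a task, for example keeping negation words when preparing text for sentiment analysis.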

Stemmer Rules

The stemmer tries its suffix rules longest-first, so the longest matching suffix is the one stripped. A sample of the rules:

Suffix	Example	Stem
`يندڙ`	ڪاوڙيندڙ	ڪاوڙ
`ندي`	هلندي	هل
`ندا`	وڃندا	وڃ
`ندو`	ڪندو	ڪ
`يائين`	ڪيائين	ڪ
`پڻ`	سچپڻ	سچ
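
Longest-match suffix stripping over the six suffixes in the table can be sketched as follows. This is a standalone illustration: the function name `stem`, the `min_stem_len` guard, and the restriction to these six rules are assumptions, and the library's full rule set of 20+ suffixes is not reproduced here.

```python
# The six suffixes from the table above, sorted so longer ones are tried first.
SUFFIXES = sorted(
    ["يندڙ", "ندي", "ندا", "ندو", "يائين", "پڻ"],
    key=len,
    reverse=True,
)


def stem(word, min_stem_len=1):
    """Strip the longest matching suffix, leaving at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def stem_tokens(tokens):
    """Stem each token independently."""
    return [stem(t) for t in tokens]


if __name__ == "__main__":
    for w in ("ڪاوڙيندڙ", "هلندي", "وڃندا", "ڪندو", "ڪيائين", "سچپڻ"):
        print(w, stem(w))
```

Sorting by length matters once one suffix is the tail of another, since the shorter rule would otherwise fire first and leave part of the longer suffix attached. The `min_stem_len` guard keeps a word that consists only of a suffix, such as the standalone word پڻ, from being reduced to an empty string.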

Linguistic Background

Script: Naskh Arabic with Sindhi-specific letters (ڄ ڃ ٻ ڦ ڳ ڱ ڻ ڏ ڊ ٺ ٽ ڇ)
Morphology: Agglutinative verb system — one root generates 40+ surface forms
Diacritics: Full harakat support (U+064B–U+065F, U+0670)
RTL: Fully right-to-left, bidirectional safe
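
The diacritic ranges listed above are enough to sketch the normalization step with the standard library alone. The function below is an assumption-labelled illustration, not the library's implementation; only the `remove_diacritics` parameter name is borrowed from the Quick Start call.

```python
import re
import unicodedata

# Harakat ranges quoted above: U+064B to U+065F, plus U+0670.
DIACRITICS_RE = re.compile("[\u064B-\u065F\u0670]")
WHITESPACE_RE = re.compile(r"\s+")


def normalize(text, remove_diacritics=False):
    """Apply NFC, optionally strip harakat, then collapse whitespace."""
    # NFC first, so a base letter plus a combining mark that has a
    # precomposed form (alef + madda -> U+0622) is composed before stripping.
    text = unicodedata.normalize("NFC", text)
    if remove_diacritics:
        text = DIACRITICS_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize("ھوُ  ھر   روزَ", remove_diacritics=True))
```

Running NFC before stripping is a deliberate ordering: the range includes the combining madda and hamza marks (U+0653 to U+0655), so stripping first would turn a decomposed آ into a bare ا. Marks in that sub-range that have no precomposed form are still removed, which may or may not be desirable for a given task.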

Related Resources

Resource	Description
sindhi-corpus-505m	742K docs · ~505M tokens · largest open Sindhi corpus
SindhiLM-Tokenizer-v1	BPE tokenizer merged into Qwen2.5-0.5B
GitHub	Source code, issues, contributions
PyPI	Install page

Citation

If you use sindhinltk in research, please cite:

@software{meghwar2025sindhinltk,
  author    = {Aakash Meghwar},
  title     = {sindhinltk: A Natural Language Toolkit for Sindhi},
  year      = {2025},
  publisher = {PyPI},
  url       = {https://pypi.org/project/sindhinltk/},
}

Author

Aakash Meghwar — Computational Linguist · NLP Engineer

GitHub · HuggingFace

MIT License · Building NLP tools for 80 million Sindhi speakers
