🇵🇰 SindhiNLTK

The first open-source NLP toolkit for the Sindhi language



Overview

sindhinltk is a pure-Python NLP library built specifically for the Sindhi language (Arabic script). Sindhi is spoken by ~80 million people but has had virtually no open-source NLP tooling — until now.

pip install sindhinltk

Zero dependencies. No model downloads. Works immediately.


What's Inside

| Module | Description |
| --- | --- |
| SindhiTokenizer | Regex-based word and sentence tokenizer with full Unicode Arabic range support |
| SindhiNormalizer | NFC normalization, diacritic (harakat) removal, whitespace cleanup |
| SindhiStemmer | Rule-based suffix stripper: 20+ Sindhi morphological rules, longest-match |
| SindhiStopwords | 143 stopwords across 10 semantic categories |
| SindhiSentiment | Lexicon-based sentiment with intensifier and negator handling, Sindhi labels |
| SindhiDatasets | Loads bundled data assets (stopwords, sentiment lexicon) |
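
To make the "regex-based" description of the tokenizer concrete, here is a minimal sketch of how such a tokenizer could work. It is not the library's implementation: the names `tokenize_words` and `split_sentences`, the exact patterns, and the set of sentence-final marks are assumptions chosen for illustration.

```python
import re

# Assumption: a "word" is a run of Unicode word characters plus harakat
# (U+064B-U+065F, U+0670), so diacritics stay attached to their base letter.
# Any other non-space character (punctuation, the Sindhi ampersand) becomes
# a token of its own.
_WORD = re.compile(r"[\w\u064B-\u065F\u0670]+|[^\w\s]")

# Assumption: sentences end at the Arabic full stop (U+06D4), the Arabic
# question mark (U+061F), "!" or ".", followed by whitespace.
_SENT_END = re.compile(r"(?<=[\u06D4\u061F!.])\s+")


def tokenize_words(text):
    """Return word and punctuation tokens in reading order."""
    return _WORD.findall(text)


def split_sentences(text):
    """Split text after sentence-final punctuation."""
    return [s for s in _SENT_END.split(text.strip()) if s]


if __name__ == "__main__":
    print(tokenize_words("سنڌي ٻولي تمام سٺي ۽ قديم آهي"))
```

The harakat range is listed explicitly because Python's `\w` does not match combining marks; without it, a vocalized word would be split at every diacritic.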

Quick Start

from sindhinltk.tokenizer  import SindhiTokenizer
from sindhinltk.normalizer import SindhiNormalizer
from sindhinltk.stemmer    import SindhiStemmer
from sindhinltk.stopwords  import SindhiStopwords
from sindhinltk.sentiment  import SindhiSentiment

text = "سنڌي ٻولي تمام سٺي ۽ قديم آهي"

# Tokenize
tok = SindhiTokenizer()
tokens = tok.tokenize(text)
# → ['سنڌي', 'ٻولي', 'تمام', 'سٺي', '۽', 'قديم', 'آهي']

# Normalize (strip diacritics)
norm = SindhiNormalizer()
clean = norm.normalize("ھوُ ھر روزَ", remove_diacritics=True)
# → 'ھو ھر روز'

# Remove stopwords
sw = SindhiStopwords()
content = sw.remove_stopwords(tokens)
# → ['سنڌي', 'ٻولي', 'سٺي', 'قديم']

# Stem
stm = SindhiStemmer()
stems = stm.stem_tokens(content)
# → ['سنڌي', 'ٻولي', 'سٺي', 'قديم']

# Sentiment
sa = SindhiSentiment()
sa.analyze(text)   # → 'مثبت'   (positive)
sa.score(text)     # → 2.0

Sentiment Labels

| Label | Meaning | Score |
| --- | --- | --- |
| مثبت | Positive | > 0 |
| منفي | Negative | < 0 |
| غير جانبدار | Neutral | = 0 |

Handles intensifiers (تمام, ڏاڍو) and negators (نه, ناهي) automatically:

sa.score("سٺو")        # → 1.0
sa.score("تمام سٺو")   # → 2.0   (intensified)
sa.score("نه سٺو")     # → -1.0  (negated)
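
The scores above are consistent with a simple scheme in which an intensifier doubles the next sentiment word and a negator flips its sign. The sketch below illustrates that scheme under stated assumptions; it is not the library's code. The tiny lexicon, the 2.0 multiplier, and the names `score_text` and `label_for` are hypothetical, and only modifiers that precede the sentiment word are handled.

```python
# Hypothetical mini-lexicon; the real bundled lexicon is not shown in this README.
LEXICON = {"سٺو": 1.0, "خراب": -1.0}
# Assumption: each intensifier doubles the following sentiment word.
INTENSIFIERS = {"تمام": 2.0, "ڏاڍو": 2.0}
NEGATORS = {"نه", "ناهي"}


def score_text(text):
    """Sum lexicon scores, applying any modifiers seen just before each word."""
    total = 0.0
    multiplier = 1.0
    for token in text.split():
        if token in INTENSIFIERS:
            multiplier *= INTENSIFIERS[token]
        elif token in NEGATORS:
            multiplier *= -1.0
        elif token in LEXICON:
            total += multiplier * LEXICON[token]
            multiplier = 1.0  # modifiers apply to one sentiment word only
        else:
            multiplier = 1.0  # an unrelated word breaks the modifier chain
    return total


def label_for(score):
    """Map a numeric score to the Sindhi labels from the table above."""
    if score > 0:
        return "مثبت"
    if score < 0:
        return "منفي"
    return "غير جانبدار"


if __name__ == "__main__":
    for phrase in ("سٺو", "تمام سٺو", "نه سٺو"):
        print(phrase, score_text(phrase), label_for(score_text(phrase)))
```

A negator that follows its target would need a look-back step, which this sketch leaves out for brevity.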

Stopword Categories

from sindhinltk.stopwords import SindhiStopwords
sw = SindhiStopwords()

sw.get_categories()
# → ['pronouns', 'demonstratives', 'postpositions', 'conjunctions',
#    'question_words', 'auxiliaries', 'negation', 'quantifiers',
#    'adverbs', 'particles']

sw.get_stopwords(category="pronouns")
# → {'مان', 'تون', 'هو', 'هوءَ', 'اسان', 'توهان', ...}

Stemmer Rules

The stemmer uses a longest-first rule set tuned for Sindhi morphology:

| Suffix | Example | Stem |
| --- | --- | --- |
| يندڙ | ڪاوڙيندڙ | ڪاوڙ |
| ندي | هلندي | هل |
| ندا | وڃندا | وڃ |
| ندو | ڪندو | ڪ |
| يائين | ڪيائين | ڪ |
| پڻ | سچپڻ | سچ |
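
Longest-first suffix stripping can be sketched in a few lines. The example below is an illustration only, not the library's implementation: it uses just the six suffixes listed above, and the function name `strip_suffix` and the `min_stem_len` parameter are assumptions.

```python
# The six suffixes shown in the table; the real rule set is larger (20+ rules).
SUFFIXES = ["يندڙ", "ندي", "ندا", "ندو", "يائين", "پڻ"]

# Try longer suffixes first so a short suffix never pre-empts a longer one.
_RULES = sorted(SUFFIXES, key=len, reverse=True)


def strip_suffix(word, min_stem_len=1):
    """Remove the longest matching suffix, keeping at least min_stem_len characters."""
    for suffix in _RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no rule applies; return the word unchanged


if __name__ == "__main__":
    for word in ("ڪاوڙيندڙ", "هلندي", "وڃندا", "ڪندو", "ڪيائين", "سچپڻ", "سنڌي"):
        print(word, "->", strip_suffix(word))
```

The minimum stem length defaults to 1 here because the table includes one-letter stems; a stricter value would reduce over-stemming at the cost of missing those cases.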

Linguistic Background

  • Script: Naskh Arabic with Sindhi-specific letters (ڄ ڃ ٻ ڦ ڳ ڱ ڻ ڏ ڊ ٺ ٽ ڇ)
  • Morphology: Agglutinative verb system — one root generates 40+ surface forms
  • Diacritics: Full harakat support (U+064B–U+065F, U+0670)
  • RTL: Fully right-to-left, bidirectional safe
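
The diacritic ranges listed above are enough to sketch the normalization step. The following is a minimal illustration, assuming NFC composition is applied before harakat removal. The function name `normalize_text` is hypothetical and this is not the library's code.

```python
import re
import unicodedata

# Harakat ranges quoted above: U+064B-U+065F and U+0670.
_DIACRITICS = re.compile("[\u064B-\u065F\u0670]")


def normalize_text(text, remove_diacritics=False):
    """NFC-normalize, optionally strip harakat, and collapse whitespace."""
    # Compose first: alef + maddah (U+0627 U+0653) becomes U+0622, so the
    # maddah is not stripped as if it were an optional diacritic.
    text = unicodedata.normalize("NFC", text)
    if remove_diacritics:
        text = _DIACRITICS.sub("", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_text("ھوُ  ھر روزَ", remove_diacritics=True))
```

Order matters in this sketch: stripping before composing would turn a decomposed آ into a plain ا.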

Related Resources

| Resource | Description |
| --- | --- |
| sindhi-corpus-505m | 742K docs · ~505M tokens · largest open Sindhi corpus |
| SindhiLM-Tokenizer-v1 | BPE tokenizer merged into Qwen2.5-0.5B |
| GitHub | Source code, issues, contributions |
| PyPI | Install page |

Citation

If you use sindhinltk in research, please cite:

@software{meghwar2025sindhinltk,
  author    = {Aakash Meghwar},
  title     = {sindhinltk: A Natural Language Toolkit for Sindhi},
  year      = {2025},
  publisher = {PyPI},
  url       = {https://pypi.org/project/sindhinltk/},
}

Author

Aakash Meghwar — Computational Linguist · NLP Engineer

GitHub · HuggingFace


MIT License · Building NLP tools for 80 million Sindhi speakers
