# Dataset Quality LM 🧪
DatasetQualityLM is an AI system that evaluates datasets for bias, data leakage, noise, and real-world deployability before model training.
It helps teams detect hidden dataset risks early, improving model reliability, fairness, and production readiness.
## What Problem Does It Solve?
Many ML failures come from bad data, not bad models.
DatasetQualityLM answers:
- Is this dataset biased?
- Does it contain data leakage?
- How noisy or inconsistent is it?
- Is it safe to deploy models trained on it?
## Key Features
- Bias detection (demographic & distributional)
- Target & feature leakage detection
- Noise and missing-value analysis
- Deployability scoring (single quality score)
- Explainable, rule-based analysis
- Hugging Face-ready pipeline
- Gradio demo included
- Unit-tested core components
## Project Structure

```
dataset-quality-lm/
├── config/
├── data/
├── src/
├── training/
├── pipelines/
├── scripts/
├── tests/
├── notebooks/
├── app.py
├── README.md
├── model_card.md
├── requirements.txt
└── LICENSE
```
## Installation

```bash
pip install -r requirements.txt
```
## Quick Usage
```python
from src.inference import DatasetQualityPipeline

# Load the quality-analysis pipeline and run it on a sample dataset.
pipeline = DatasetQualityPipeline()
result = pipeline("data/samples/clean_dataset.json")
print(result)
```
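To screen several datasets at once, the same pipeline can be looped over a folder. This is a minimal sketch, not part of the shipped scripts; in particular, the `deployability_score` field of the returned report is an assumption about the report's structure, not a documented key.

```python
# Hypothetical batch run over a folder of datasets.
from pathlib import Path

from src.inference import DatasetQualityPipeline

pipeline = DatasetQualityPipeline()
for path in sorted(Path("data/samples").glob("*.json")):
    report = pipeline(str(path))
    # "deployability_score" is an assumed report field; fall back to the full report.
    print(path.name, report.get("deployability_score", report))
```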
## Gradio Demo

Launch the interactive demo locally:

```bash
python app.py
```
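The repository already ships `app.py`; the sketch below only shows roughly how such a Gradio app can wrap the pipeline in case you want to adapt it. The exact components, labels, and layout of the shipped demo may differ, and it assumes the pipeline returns a JSON-serializable report.

```python
# Rough sketch of a Gradio wrapper around the pipeline (not the shipped app.py).
import gradio as gr

from src.inference import DatasetQualityPipeline

pipeline = DatasetQualityPipeline()


def analyze(file_path: str) -> dict:
    # Run the quality checks on the uploaded dataset file.
    return pipeline(file_path)


demo = gr.Interface(
    fn=analyze,
    inputs=gr.File(type="filepath", label="Dataset (JSON)"),
    outputs=gr.JSON(label="Quality report"),
    title="DatasetQualityLM",
)

if __name__ == "__main__":
    demo.launch()
```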
## How It Works
1. **Dataset loading**: the input dataset is read and prepared for analysis.
2. **Bias detection**: demographic and distributional skew is checked.
3. **Leakage detection**: target and feature leakage is flagged.
4. **Noise analysis**: missing values and inconsistencies are measured.
5. **Deployability scoring**: the individual checks are combined into a single quality score (see the sketch below).
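The actual, more complete rule-based checks live under `src/`. Purely as an illustration of the kind of analysis each step performs, a toy version might look like this; the function names, thresholds, and scoring formula are assumptions, not the repository's real rules.

```python
# Illustrative only: toy versions of bias, leakage, noise, and scoring checks.
import pandas as pd


def distribution_skew(df: pd.DataFrame, column: str) -> float:
    """Imbalance of a (demographic) column: 0 = balanced, close to 1 = one group dominates."""
    shares = df[column].value_counts(normalize=True)
    return float(shares.max() - 1.0 / len(shares))


def leakage_suspects(df: pd.DataFrame, target: str, threshold: float = 0.98) -> list[str]:
    """Flag numeric features whose absolute correlation with the target is suspiciously high."""
    numeric = df.select_dtypes("number").drop(columns=[target], errors="ignore")
    corr = numeric.corrwith(df[target]).abs()
    return corr[corr > threshold].index.tolist()


def noise_ratio(df: pd.DataFrame) -> float:
    """Fraction of missing cells across the whole table."""
    return float(df.isna().to_numpy().mean())


def deployability_score(df: pd.DataFrame, target: str, group_col: str) -> float:
    """Toy aggregation: start at 1.0 and subtract a penalty for each detected issue."""
    score = 1.0
    score -= 0.5 * distribution_skew(df, group_col)    # bias penalty
    score -= 0.3 * len(leakage_suspects(df, target))   # penalty per suspected leaky feature
    score -= noise_ratio(df)                           # penalty proportional to missingness
    return max(score, 0.0)
```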