Space: ml-jku/tox21_rf_classifier


antoniaebner committed on Sep 4

Commit

d84754c

1 Parent(s): 75c7791

move RF source files; add README


Files changed (8)

README.md +93 -2
predict.py +2 -2
requirements.txt +1 -1
src/__init__.py +0 -0
data.py → src/data.py +1 -1
model.py → src/model.py +1 -1
train.py → src/train.py +3 -3
utils.py → src/utils.py +0 -0

README.md CHANGED

@@ -6,7 +6,98 @@ colorTo: purple
 sdk: docker
 pinned: false
 license: apache-2.0
-short_description: This is a rf classifier for the Tox21 test dataset
+short_description: This is a RF classifier for the Tox21 test dataset
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Tox21 Random Forest Classifier
+This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/tschouis/tox21_leaderboard).
+In this example, we train a Random Forest classifier on the Tox21 targets and save the trained model in the `assets/` folder.
+**Important:** For leaderboard submission, your Space does not need to include training code. It only needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it takes a list of SMILES strings as input and returns a nested prediction dictionary, keyed first by SMILES string and then by target name. Any preprocessing of SMILES strings must therefore run on the fly during inference.
+# Repository Structure
+- `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
+- `app.py` - FastAPI application wrapper (can be used as-is).
+- `src/` - Core model & preprocessing logic:
+    - `data.py` - SMILES preprocessing pipeline
+    - `model.py` - Random Forest classifier wrapper
+    - `train.py` - Script to train the classifier
+    - `utils.py` - Constants and helper functions
+# Quickstart with Spaces
+You can easily adapt this project in your own Hugging Face account:
+- Open this Space on Hugging Face.
+- Click "Duplicate this Space" (top-right corner).
+- Modify `src/` for your preprocessing pipeline and model class.
+- Modify `predict()` inside `predict.py` to perform model inference, keeping the function skeleton unchanged so that it remains compatible with the leaderboard.
+That's it. Your model will be available as an API endpoint for the Tox21 Leaderboard.
+# Installation
+To run (and train) the random forest, clone the repository and install dependencies:
+```bash
+git clone https://huggingface.co/spaces/tschouis/tox21_rf_classifier
+cd tox21_rf_classifier
+conda create -n tox21_rf_cls python=3.11
+conda activate tox21_rf_cls
+pip install -r requirements.txt
+```
+# Training
+To train the Random Forest model from scratch:
+```bash
+python -m src.train
+```
+This will:
+1. Load and preprocess the Tox21 training dataset.
+2. Train a Random Forest classifier.
+3. Save the trained model to the `assets/` folder.
+4. Evaluate the trained Random Forest classifier on the validation split.
+# Inference
+For inference, you only need `predict.py`.
+Example usage inside Python:
+```python
+from predict import predict
+smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
+results = predict(smiles_list)
+print(results)
+```
+The output will be a nested dictionary in the format:
+```python
+{
+    "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
+    "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
+    "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
+}
+```
+# Notes
+- For leaderboard submission, you only need to adapt `predict.py` to your model's inference.
+- Training (`src/train.py`) is provided for reproducibility.
+- Preprocessing (here inside `src/data.py`) must be applied at inference time, not just during training.
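
The README added in this commit describes the `predict()` contract: a list of SMILES strings goes in, and a nested dictionary keyed by SMILES and then by target comes out. A minimal sketch of that contract follows. The names `dummy_predict`, `has_expected_shape`, and the placeholder target names are hypothetical and not part of the repository, and the constant output stands in for a real model.

```python
from collections import defaultdict

# Hypothetical placeholder names; the real task names are defined in the repository.
TARGETS = [f"target{i}" for i in range(1, 13)]


def dummy_predict(smiles_list):
    """Return {smiles: {target: prediction}} with a constant prediction per target."""
    predictions = defaultdict(dict)
    for smiles in smiles_list:
        for target in TARGETS:
            # A real implementation would preprocess the SMILES and query the model here.
            predictions[smiles][target] = 0
    return dict(predictions)


def has_expected_shape(smiles_list, predictions):
    """Check that every input SMILES has exactly one entry per target."""
    if set(predictions) != set(smiles_list):
        return False
    return all(set(predictions[s]) == set(TARGETS) for s in smiles_list)


smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
results = dummy_predict(smiles_list)
print(has_expected_shape(smiles_list, results))
```

A shape check like this can be run locally against a real `predict()` before submitting, assuming the target names are swapped for the ones the leaderboard expects.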

predict.py CHANGED

@@ -8,8 +8,8 @@ SMILES and target names as keys.
 # Dependencies
 from collections import defaultdict
-from data import preprocess_molecules
-from model import Tox21RFClassifier
+from src.data import preprocess_molecules
+from src.model import Tox21RFClassifier
 # ---------------------------------------------------------------------------------------

requirements.txt CHANGED

@@ -3,7 +3,7 @@ uvicorn[standard]
 statsmodels
 rdkit
 numpy
-scikit-learn
+scikit-learn==1.7.1
 joblib
 tabulate
 datasets
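
The commit does not say why `scikit-learn` is now pinned. One plausible reason (an assumption) is that the trained model is persisted with `joblib`, and scikit-learn does not guarantee that a persisted model loads correctly under a different library version. The sketch below shows one way to compare the installed version against a pin before loading a saved model. The helper name `check_pin` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package, pinned):
    """Describe whether the installed version of `package` matches `pinned`."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed"
    if installed != pinned:
        return f"{package} {installed} differs from pinned {pinned}"
    return f"{package} {installed} matches the pin"


# The pinned version is taken from requirements.txt in this commit.
print(check_pin("scikit-learn", "1.7.1"))
```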

src/__init__.py ADDED

File without changes

data.py → src/data.py RENAMED

@@ -17,7 +17,7 @@ from rdkit import Chem, DataStructs
 from rdkit.Chem import Descriptors, rdFingerprintGenerator
 from rdkit.Chem.rdchem import Mol
-from utils import USED_200_DESCR, Standardizer, load_pickle, write_pickle
+from .utils import USED_200_DESCR, Standardizer, load_pickle, write_pickle
 def preprocess_molecules(

model.py → src/model.py RENAMED

@@ -12,7 +12,7 @@ import joblib
 import numpy as np
 from sklearn.ensemble import RandomForestClassifier
-from utils import TASKS
+from .utils import TASKS
 # ---------------------------------------------------------------------------------------

train.py → src/train.py RENAMED

@@ -10,9 +10,9 @@ from tabulate import tabulate
 from datasets import load_dataset
 from sklearn.metrics import roc_auc_score
-from data import preprocess_molecules
-from model import Tox21RFClassifier
-from utils import HF_TOKEN
+from .data import preprocess_molecules
+from .model import Tox21RFClassifier
+from .utils import HF_TOKEN
 parser = argparse.ArgumentParser(description="RF Trainig script for Tox21 dataset")
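
Because the moved modules now use package-relative imports (`from .utils import ...`), they load only as part of the `src` package. In practice the training script has to be started with `python -m src.train` from the repository root rather than by file path. The sketch below reproduces this with a throwaway package in a temporary directory. The package name `demo_src` and its contents are hypothetical stand-ins, not the repository's files.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run a module with a relative import via `-m` and directly as a script."""
    with tempfile.TemporaryDirectory() as root:
        pkg = Path(root) / "demo_src"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "utils.py").write_text("TASKS = ['task_a', 'task_b']\n")
        (pkg / "train.py").write_text("from .utils import TASKS\nprint(len(TASKS))\n")

        # Launched as a module: Python knows the parent package, so `.utils` resolves.
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_src.train"],
            cwd=root, capture_output=True, text=True,
        )
        # Launched as a file: there is no parent package, so the relative import fails.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "train.py")],
            cwd=root, capture_output=True, text=True,
        )
    return as_module, as_script


as_module, as_script = run_both_ways()
print("module run exit code:", as_module.returncode)
print("script run exit code:", as_script.returncode)
```

The same reasoning explains why `predict.py`, which stays at the repository root, switches to absolute imports such as `from src.data import preprocess_molecules`.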

utils.py → src/utils.py RENAMED

File without changes