antoniaebner committed
Commit d84754c · 1 Parent(s): 75c7791

move RF source files; add README

README.md CHANGED
@@ -6,7 +6,98 @@ colorTo: purple
  sdk: docker
  pinned: false
  license: apache-2.0
- short_description: This is a rf classifier for the Tox21 test dataset
+ short_description: This is a RF classifier for the Tox21 test dataset
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Tox21 Random Forest Classifier
+
+ This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/tschouis/tox21_leaderboard).
+
+ In this example, we train a Random Forest classifier on the Tox21 targets and save the trained model in the `assets/` folder.
+
+ **Important:** For leaderboard submission, your Space does not need to include training code. It only needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it takes a list of SMILES strings as input and returns a nested prediction dictionary keyed first by SMILES string and then by target name. Therefore, any preprocessing of SMILES strings must be executed on-the-fly during inference.
+
+ # Repository Structure
+ - `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
+ - `app.py` - FastAPI application wrapper (can be used as-is).
+
+ - `src/` - Core model & preprocessing logic:
+   - `data.py` - SMILES preprocessing pipeline
+   - `model.py` - Random Forest classifier wrapper
+   - `train.py` - Script to train the classifier
+   - `utils.py` - Constants and helper functions
+
+ # Quickstart with Spaces
+
+ You can easily adapt this project in your own Hugging Face account:
+
+ - Open this Space on Hugging Face.
+
+ - Click "Duplicate this Space" (top-right corner).
+
+ - Modify `src/` for your preprocessing pipeline and model class.
+
+ - Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.
+
+ That’s it: your model will then be available as an API endpoint for the Tox21 Leaderboard.
+
+ # Installation
+ To run (and train) the random forest, clone the repository and install dependencies:
+
+ ```bash
+ git clone https://huggingface.co/spaces/tschouis/tox21_rf_classifier
+ cd tox21_rf_classifier
+
+ conda create -n tox21_rf_cls python=3.11
+ conda activate tox21_rf_cls
+ pip install -r requirements.txt
+ ```
+
+ # Training
+
+ To train the Random Forest model from scratch:
+
+ ```bash
+ python -m src.train
+ ```
+
+ This will:
+
+ 1. Load and preprocess the Tox21 training dataset.
+ 2. Train a Random Forest classifier.
+ 3. Save the trained model to the `assets/` folder.
+ 4. Evaluate the trained Random Forest classifier on the validation split.
+
+
+ # Inference
+
+ For inference, you only need `predict.py`.
+
+ Example usage in Python:
+
+ ```python
+ from predict import predict
+
+ smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
+ results = predict(smiles_list)
+
+ print(results)
+ ```
+
+ The output will be a nested dictionary in the format:
+
+ ```python
+ {
+     "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
+     "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
+     "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
+ }
+ ```
+
+ # Notes
+
+ - For leaderboard submission, only `predict.py` needs to be adapted to your model's inference.
+
+ - Training (`src/train.py`) is provided for reproducibility.
+
+ - Preprocessing (here inside `src/data.py`) must be applied at inference time, not just during training.
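The `predict()` contract described in the README can be sketched as follows. This is a minimal illustration under stated assumptions, not the Space's actual implementation: the target names `target1` to `target12` are the placeholders from the README's example output, and `dummy_score` is a hypothetical stand-in for the trained model. The real `predict.py` calls `preprocess_molecules` and `Tox21RFClassifier`, whose signatures are not visible in this diff.

```python
from collections import defaultdict

# Placeholder target names mirroring the README's example output (assumption).
TARGETS = [f"target{i}" for i in range(1, 13)]


def dummy_score(smiles: str, target: str) -> int:
    """Hypothetical stand-in for a trained model: returns a deterministic 0/1 label."""
    return (len(smiles) + len(target)) % 2


def predict(smiles_list: list[str]) -> dict[str, dict[str, int]]:
    """Map each input SMILES string to a {target name: prediction} dictionary."""
    predictions: dict[str, dict[str, int]] = defaultdict(dict)
    for smiles in smiles_list:
        # Any preprocessing would happen here, on the fly, for every request.
        for target in TARGETS:
            predictions[smiles][target] = dummy_score(smiles, target)
    # Return a plain dict so callers do not depend on defaultdict behavior.
    return dict(predictions)


results = predict(["CCO", "c1ccccc1", "CC(=O)O"])
```

The outer keys are the input SMILES strings and each inner dictionary has one entry per target, which is the shape the README's example output shows.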
predict.py CHANGED
@@ -8,8 +8,8 @@ SMILES and target names as keys.
  # Dependencies
  from collections import defaultdict
 
- from data import preprocess_molecules
- from model import Tox21RFClassifier
+ from src.data import preprocess_molecules
+ from src.model import Tox21RFClassifier
 
  # ---------------------------------------------------------------------------------------
 
requirements.txt CHANGED
@@ -3,7 +3,7 @@ uvicorn[standard]
  statsmodels
  rdkit
  numpy
- scikit-learn
+ scikit-learn==1.7.1
  joblib
  tabulate
  datasets
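The commit does not say why `scikit-learn` is now pinned. One common reason, stated here as an assumption, is that estimators saved with `joblib` may not load reliably under a different scikit-learn version than the one used for training. A minimal sketch of a check that compares installed versions against exact `==` pins follows; `parse_pin` and `find_mismatches` are hypothetical helpers and not part of this repository.

```python
from importlib import metadata
from typing import Callable, Optional


def parse_pin(line: str) -> Optional[tuple[str, str]]:
    """Return (package, version) for an exact `name==version` pin, else None."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    if "==" not in line:
        return None
    name, version = line.split("==", 1)
    return name.strip(), version.strip()


def find_mismatches(
    lines: list[str],
    installed_version: Callable[[str], str] = metadata.version,
) -> dict[str, tuple[str, str]]:
    """Map each pinned package to (pinned, installed) where the two differ."""
    mismatches = {}
    for line in lines:
        pin = parse_pin(line)
        if pin is None:
            continue  # unpinned requirement such as `numpy`
        name, pinned = pin
        installed = installed_version(name)
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Calling `find_mismatches` on the lines of `requirements.txt` returns an empty dictionary when every pinned package matches. With the default lookup, `importlib.metadata.version` raises `PackageNotFoundError` for a package that is not installed.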
src/__init__.py ADDED
File without changes
data.py → src/data.py RENAMED
@@ -17,7 +17,7 @@ from rdkit import Chem, DataStructs
  from rdkit.Chem import Descriptors, rdFingerprintGenerator
  from rdkit.Chem.rdchem import Mol
 
- from utils import USED_200_DESCR, Standardizer, load_pickle, write_pickle
+ from .utils import USED_200_DESCR, Standardizer, load_pickle, write_pickle
 
 
  def preprocess_molecules(
model.py → src/model.py RENAMED
@@ -12,7 +12,7 @@ import joblib
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
 
- from utils import TASKS
+ from .utils import TASKS
 
 
  # ---------------------------------------------------------------------------------------
train.py → src/train.py RENAMED
@@ -10,9 +10,9 @@ from tabulate import tabulate
  from datasets import load_dataset
  from sklearn.metrics import roc_auc_score
 
- from data import preprocess_molecules
- from model import Tox21RFClassifier
- from utils import HF_TOKEN
+ from .data import preprocess_molecules
+ from .model import Tox21RFClassifier
+ from .utils import HF_TOKEN
 
  parser = argparse.ArgumentParser(description="RF training script for Tox21 dataset")
 
utils.py → src/utils.py RENAMED
File without changes
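Because the renamed modules now use relative imports (`from .utils import ...`), `src/train.py` has to be run as a module from the repository root rather than as a plain script. The sketch below demonstrates this on a throwaway package built in a temporary directory. The file contents and the `MESSAGE` constant are made up for illustration and only mirror the import style of this commit.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways() -> tuple[subprocess.CompletedProcess, subprocess.CompletedProcess]:
    """Build a throwaway `src` package and run its `train` module two ways."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "src"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "utils.py").write_text("MESSAGE = 'imported'\n")
        # Same import style as the renamed modules in this commit.
        (pkg / "train.py").write_text("from .utils import MESSAGE\nprint(MESSAGE)\n")

        # Run as a module from the package's parent directory: `src` is the parent package.
        as_module = subprocess.run(
            [sys.executable, "-m", "src.train"],
            cwd=tmp, capture_output=True, text=True,
        )
        # Run as a plain script: there is no parent package, so the relative import fails.
        as_script = subprocess.run(
            [sys.executable, str(pkg / "train.py")],
            cwd=tmp, capture_output=True, text=True,
        )
    return as_module, as_script


as_module, as_script = run_both_ways()
```

Here `as_module` exits successfully, while `as_script` fails with an `ImportError` about a relative import with no known parent package. The same reasoning explains why `predict.py`, which sits at the repository root, uses the absolute form `from src.data import ...`.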