---
tags:
- furry
- e621
- not-for-all-audiences
pipeline_tag: image-classification
base_model: google/siglip2-so400m-patch16-naflex
language:
- en
license: apache-2.0
---
# JTP-3 Hydra
JTP-3 Hydra is a finetune of the SigLIP2 image classifier with a custom classifier head, designed to predict 7,504 popular tags from [e621](https://e621.net).
A public demo of the model is available here: https://huggingface.co/spaces/RedRocket/JTP-3-Demo
Jump to section:
- [Downloading](#downloading)
- [Easy Windows Installation and Usage](#easy-windows-installation-and-usage)
- [Advanced Windows Installation and Usage](#advanced-windows-installation-and-usage)
- [Linux Installation and Usage](#linux-installation-and-usage)
- [Using inference.py/inference.bat](#using-inferencepy-or-inferencebat)
- [Usage Notes](#usage-notes)
- [Calibration](#using-calibratebat-or-easy-mode-calibration)
- [Using Extensions](#using-extensions)
- [Training Extensions](#training-extensions)
- [Technical Notes](#technical-notes)
- [Credits / Citations](#credits)
## Downloading
If you have Git+LFS installed, download the repository using ``git clone https://huggingface.co/RedRocket/JTP-3``.
If you are unable to do this, manually download all the `.py` files, `requirements.txt`, `models/jtp-3-hydra.safetensors`, and `data/jtp-3-hydra-tags.csv`.
If you are on Windows, also download the `.bat` files and follow the instructions below for easy installation.
If you want to run calibration, you also need `data/jtp-3-hydra-val.csv`.
## Easy Windows Installation and Usage
For Windows, ensure you have at least Python 3.11 [installed](https://www.python.org/downloads/windows/) and available on your path.
If you are unsure about your version of Python, you can run `easy.bat` and it will let you know.
**For Windows, double-click ``easy.bat`` to run easy mode.**
Easy mode walks you through all the commands.
When easy mode asks you for a file or folder, you can drag and drop it onto the easy mode window and press enter, copy and paste the path, or type it yourself.
## Advanced Windows Installation and Usage
Double-click ``install.bat`` to run installation, which will create a virtual environment for all the requirements and install them.
You can check your version of Python by opening a command prompt and typing ``python -V``.
You can run the WebUI by double-clicking ``app.bat`` and navigating your browser to the URL it shows. The WebUI runs locally; the link is not shared publicly.
On the command line, you can use ``inference.bat`` to do bulk operations such as tagging entire directories. Run ``inference.bat --help`` for help using the command line.
If you provide a path to a file or directory, it will write ``.txt`` caption files beside each image using the default threshold of ``0.5``.
Instead of using a fixed threshold, you can run the calibration wizard with ``calibrate.bat``.
## Linux Installation and Usage
You can check your version of Python with ``python -V``. If your OS Python install is not 3.11 or above, install a more recent version of Python according to your distribution's instructions and use that ``python`` to create the venv.
To install:
```sh
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
To run the WebUI:
```sh
source venv/bin/activate
python app.py
```
To run the command-line tagger:
```sh
source venv/bin/activate
python inference.py --help
```
## Using inference.py (or inference.bat)
If you do not have a calibration file, the default threshold of ``0.5`` is conservative. If you plan on manually reviewing the tags, consider using ``-t 0.2`` or ``-t 0.1``.
```
$ python inference.py --help
usage: inference.py [-h] [-t THRESHOLD_OR_PATH] [-i MODE] [-x CATEGORY] [-p PREFIX] [-o PATH] [-O] [-M PATH] [-m PATH] [-e PATH] [-E] [-b BATCH_SIZE] [-w N_WORKERS] [--no-shm] [-S SEQLEN] [-d TORCH_DEVICE] [-r] [PATH ...]
positional arguments:
  PATH                  Paths to files and directories to classify. If none are specified, run interactively.

options:
  -h, --help            show this help message and exit
  -r, --recursive       Classify directories recursively. Dotfiles will be ignored.

classification:
  -t, --threshold THRESHOLD_OR_PATH
                        Classification threshold -1.0 to 1.0. Or, a path to a CSV calibration file. (Default: calibration.csv)
  -i, --implications MODE
                        Automatically apply implications. Requires tag metadata. (Default: inherit)
  -x, --exclude CATEGORY
                        Exclude the specified category of tags. May be specified multiple times. Requires tag metadata.

output:
  -p, --prefix PREFIX   Prefix all .txt caption files with the specified text. If the prefix matches a tag, the tag will not be repeated.
  -o, --output PATH     Path for CSV output, or '-' for standard output. If not specified, individual .txt caption files are written.
  -O, --original-tags   Do not rewrite tags for compatibility with diffusion models.

model:
  -M, --model PATH      Path to model file.
  -m, --metadata PATH   Path to CSV file with additional tag metadata. (Default: data/jtp-3-hydra-tags.csv)
  -e, --extension PATH  Path to extension. May be specified multiple times. If a directory is specified, all extensions in the specified directory are loaded. (Default: extensions/jtp-3-hydra)
  -E, --no-default-extensions
                        Do not load extensions by default.

execution:
  -b, --batch BATCH_SIZE
                        Batch size.
  -w, --workers N_WORKERS
                        Number of dataloader workers. (Default: number of cores)
  --no-shm              Disable shared memory between workers.
  -S, --seqlen SEQLEN   NaFlex sequence length. (Default: 1024)
  -d, --device TORCH_DEVICE
                        Torch device. (Default: cuda)

MODE:
  inherit               Tags inherit the highest probability of the more specific tags that imply them.
  constrain             Tags are constrained to the lowest probability of the more general tags they imply.
  remove                Exclude implied tags from output.
  constrain-remove      Combination of constrain followed by remove.
  off                   No implications are applied.

CATEGORY:
  general artist copyright character species meta lore
```
Try to avoid running multiple copies of ``inference.py`` at once, as each copy will load the entire model.
If you are tagging only a few images, run with ``-w 0`` to use in-process dataloading.
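If you write CSV output with ``-o``, it is easy to post-process with a few lines of Python. This is a hedged sketch: the column names ``path``, ``tag``, and ``probability`` are hypothetical placeholders, so check the header of your actual output file and adjust.

```python
import csv
import io

def tags_above(csv_text: str, threshold: float) -> dict[str, list[str]]:
    """Group tags by image path, keeping only rows at or above the threshold.

    NOTE: The column names 'path', 'tag', and 'probability' are hypothetical;
    inspect the header of your actual inference.py CSV output and adjust.
    """
    result: dict[str, list[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["probability"]) >= threshold:
            result.setdefault(row["path"], []).append(row["tag"])
    return result

sample = """path,tag,probability
a.png,dragon,0.91
a.png,dinosaur,0.12
b.png,gryphon,0.77
"""
print(tags_above(sample, 0.5))
```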
### Interactive Mode
If you do not provide a list of files or directories to classify, ``inference.py`` will launch in an interactive mode where you can provide files one-at-a-time.
```
$ python inference.py
JTP-3 Hydra Interactive Classifier
Type 'q' to quit, or 'h' for help.
For bulk operations, quit and run again with a path, or '-h' for help.
> h
Provide a file path to classify, or one of the following commands:
  threshold NUM        (-1.0 to 1.0, 0.2 to 0.8 recommended)
  calibration [PATH]   (load calibration csv file)
  exclude CATEGORY     (general copyright character species meta lore)
  include CATEGORY     (general copyright character species meta lore)
  implications MODE    (inherit constrain remove constrain-remove off)
  seqlen LEN           (64 to 2048, 1024 recommended)
  quit                 (or 'q', 'exit')
```
## Usage Notes
The model predicts 7,501 e621 tags, as well as the added rating meta-tags ``safe``, ``questionable``, and ``explicit``.
The model is trained with implications, but its raw predictions are not constrained.
If you use the inference script, it will leverage the tag metadata, if available, to automatically apply implications unless you specify otherwise with ``-i off``.
For example, with implications ``off`` it's possible the model can say ``tyrannosaurus rex`` is more likely than ``dinosaur``.
In the default ``inherit`` mode, it will instead say that ``dinosaur`` is as likely as ``tyrannosaurus rex``.
In the ``constrain`` mode, it will say that ``tyrannosaurus rex`` is as likely as ``dinosaur``.
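As a simplified illustration of the ``inherit`` and ``constrain`` behaviors described above (a sketch, not the actual implementation in ``inference.py``; the real implication data comes from `data/jtp-3-hydra-tags.csv`):

```python
# Maps a specific tag to the more general tags it implies.
implies = {"tyrannosaurus_rex": ["dinosaur"]}

def apply_inherit(probs: dict[str, float]) -> dict[str, float]:
    """General tags inherit the highest probability of tags that imply them."""
    out = dict(probs)
    for specific, generals in implies.items():
        for general in generals:
            out[general] = max(out.get(general, 0.0), out[specific])
    return out

def apply_constrain(probs: dict[str, float]) -> dict[str, float]:
    """Specific tags are capped at the probability of the tags they imply."""
    out = dict(probs)
    for specific, generals in implies.items():
        for general in generals:
            out[specific] = min(out[specific], out.get(general, 1.0))
    return out

raw = {"tyrannosaurus_rex": 0.9, "dinosaur": 0.4}
print(apply_inherit(raw))    # dinosaur is raised to match tyrannosaurus_rex
print(apply_constrain(raw))  # tyrannosaurus_rex is lowered to match dinosaur
```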
The model is trained on images on e621 only, and not on photographs of people or real animals.
While it has retained some ability to classify photos, this is not in any way supported.
The interactive interfaces use a threshold convention of -100% to 100%.
This is different from other classifier models that generally range from 0% to 100%.
The model sees all transparency as a black background.
## Using calibrate.bat (or Easy Mode calibration)
You can just press ``ENTER`` to get the default calibration until it asks you for a list of tags to exclude.
If you don't want to exclude any tags, press ``ENTER`` again and answer ``y`` to get the default calibration.
Members of the [Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) may have created their own calibration files for you to try out, too.
Be cautious if anyone offers you a custom calibration file that ends in `.py` and tells you to run it. However, `.csv` calibration files are always safe.
## Using Extensions
JTP-3 Hydra supports adding and replacing tags with extensions, which are simple `.safetensors` files similar in spirit to LoRAs.
By default, `.safetensors` files placed in `extensions/jtp-3-hydra` will be loaded as extensions.
Members of the [Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) may have created their own extension files for you to try out, too.
JTP-3 Hydra extension files are always safe.
If you are using calibrations, be sure to re-calibrate after adding an extension.
## Training Extensions
In order to train extensions, you will need to have some basic familiarity with the command line.
`.bat` wrapper files which load the virtual environment are intentionally not provided.
You will need around 1.5 GB of free VRAM to train, or more if you use higher batch sizes.
Training is generally very quick. You can expect a run to complete in under 10 minutes unless your dataset is many thousands of images.
If you are training on Windows, you will need [triton-windows](https://github.com/woct0rdho/triton-windows).
Run `pip install triton-windows` with the virtual environment active.
CPU training is not supported due to the dependency on [Triton](https://github.com/triton-lang/triton).
If you really want to train on a platform not supported by Triton, manually replace the optimizer, perhaps with AdamW in float32.
### Step 1 – Dataset
To train new extensions for JTP-3, create the following directories inside the `train` directory (or elsewhere):
```
tag_name/
positive/
negative/
```
Add at least 100 example images that have the tag to the `positive` directory.
For best results, manually review every image to confirm it has the tag you are trying to train.
Try to use a diverse set of images with the tag. Don't just use your favorite images, especially if they are from a single artist.
Add a similar number of images that do not have the tag to the `negative` directory.
For best results, manually review every image to confirm it does not have the tag you are trying to train.
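A minimal standard-library sketch for creating this layout (`example_tag` is a placeholder; substitute your own tag name and use `root="train"` in practice):

```python
import tempfile
from pathlib import Path

def make_dataset_dirs(root: str, tag_name: str) -> Path:
    """Create <root>/<tag_name>/{positive,negative} directories."""
    base = Path(root) / tag_name
    for split in ("positive", "negative"):
        (base / split).mkdir(parents=True, exist_ok=True)
    return base

# Demonstrated in a temporary directory; in practice use root="train".
base = make_dataset_dirs(tempfile.mkdtemp(), "example_tag")
print(base)
```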
#### Dataset Tips
At least half of the negative dataset should be random images not having the tag.
For better results, the other half should be hand-selected images that contain concepts that might be easily confused with your tag.
It's fine to filter out images with unsavory content that you absolutely don't want to see while reviewing.
So, for example, let's say you're training `dragon_on_top_gryphon_on_bottom` with 200 positive examples. Your negative set might look like:
- 100 random images that don't have the tag
- 30 images of gryphons and dragons in other scenarios
- 20 images of a dragon on top with another species
- 20 images of a gryphon on bottom with another species
- 30 images of a gryphon on top with a dragon on bottom
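One way to plan a negative set following the guidance above (a rough planning helper; the half-random, half-confusable split is a recommendation, not a requirement, and how you subdivide the confusable half is a judgment call):

```python
def plan_negative_set(n_positive: int) -> dict[str, int]:
    """Suggest a negative-set composition: half random images, half
    hand-selected images containing easily-confused concepts."""
    n_random = n_positive // 2
    return {"random": n_random, "confusable": n_positive - n_random}

print(plan_negative_set(200))  # {'random': 100, 'confusable': 100}
```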
### Step 2 – Training
Run `python train_extension.py --help` to familiarize yourself with the options provided by the training script.
At a minimum, you will want to:
- Adjust the batch sizes (`-b/-B`) and/or gradient accumulation (`-a`) to match your available VRAM.
- Adjust the size of your validation set. Try to target 5%-10% of your available data, but never less than `-v 20`.
(Note that the `-v` option reserves an equal number of positive and negative examples. The default `-v 20` reserves 40 total examples, 20 positive and 20 negative.)
- Set the checkpoint interval or maximum number of epochs (`-c/-e`).
Please resist the urge to tweak hyperparameters until you have first succeeded with the defaults.
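A small helper for picking ``-v`` following the guidance above. This is an assumption-laden sketch: it targets roughly 7.5% of the data (the middle of the 5%-10% range), accounts for ``-v`` reserving an equal number of positive and negative examples, and enforces the floor of 20 per class.

```python
def suggest_v(n_total: int, target_fraction: float = 0.075) -> int:
    """Suggest a -v value. Since -v reserves an equal number of positive and
    negative examples, the total reserved is 2*v; we target roughly
    target_fraction of the dataset, with a floor of 20 per class."""
    return max(20, round(n_total * target_fraction / 2))

print(suggest_v(400))   # small dataset: the floor of 20 applies
print(suggest_v(4000))  # 150 per class, 300 total (~7.5%)
```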
Training begins by building a feature cache for the dataset. This should only take a few minutes, but be aware that the feature cache for each dataset item consumes about 2.3 MB of disk space.
Here's an example training run. It took about 2 minutes on an RTX 3090 with 200 total examples. (Yes, it's that fast.)
```sh
$ python train_extension.py -c 0 example_tag
```
```
Loading 'models/jtp-3-hydra.safetensors' ... 7504 tags
caching: 100%|█████████| 200/200 [00:19<00:00, 11.02it/s]
...
EPOCH 1 VALIDATION: loss=0.6758, cti=0.5556, thr=0.4501
EPOCH 2 VALIDATION: loss=0.6633, cti=0.5556, thr=0.4501
EPOCH 3 VALIDATION: loss=0.6320, cti=0.5882, thr=0.4800
EPOCH 4 VALIDATION: loss=0.5922, cti=0.6923, thr=0.5499
...
EPOCH 65 VALIDATION: loss=0.0106, cti=1.0000, thr=0.0804
EPOCH 66 VALIDATION: loss=0.0105, cti=1.0000, thr=0.0804
EPOCH 67 VALIDATION: loss=0.0112, cti=1.0000, thr=0.0901
EPOCH 68 VALIDATION: loss=0.0115, cti=1.0000, thr=0.0995
EPOCH 69 VALIDATION: loss=0.0116, cti=1.0000, thr=0.1097
EPOCH 70 VALIDATION: loss=0.0113, cti=1.0000, thr=0.0995
...
```
In this case, selecting epoch 66 seems reasonable, which would have been saved as `train/example_tag/checkpoints/_e66.pt`.
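Picking the best epoch by eye is easy for short runs; for longer logs, a sketch like the following (parsing the validation-line format shown above and choosing the lowest loss) can help:

```python
import re

def best_epoch(log: str) -> int:
    """Return the epoch with the lowest validation loss from training output."""
    best = None
    for match in re.finditer(r"EPOCH (\d+) VALIDATION: loss=([\d.]+)", log):
        epoch, loss = int(match.group(1)), float(match.group(2))
        if best is None or loss < best[1]:
            best = (epoch, loss)
    assert best is not None, "no validation lines found"
    return best[0]

log = """EPOCH 65 VALIDATION: loss=0.0106, cti=1.0000, thr=0.0804
EPOCH 66 VALIDATION: loss=0.0105, cti=1.0000, thr=0.0804
EPOCH 67 VALIDATION: loss=0.0112, cti=1.0000, thr=0.0901
"""
print(best_epoch(log))  # 66
```

Note that lowest loss is only one reasonable criterion; you may also want to weigh `cti` and threshold stability as in the example above.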
### Step 3 – Build Extension
Run `python build_extension.py --help` to familiarize yourself with the options provided by the extension builder.
The extension builder converts PyTorch checkpoints in training mode to inference-ready safetensors files with additional metadata, some of which is essential.
Continuing with the example above:
```sh
$ python build_extension.py -a "Project RedRocket" train/example_tag/checkpoints/_e66.pt example_tag general
```
```
Loading checkpoint 'train/example_tag/checkpoints/_e66.pt'...
Preparing metadata...
modelspec.sai_model_spec: '1.0.0'
modelspec.architecture: 'naflexvit_so400m_patch16_siglip+rr_hydra'
modelspec.implementation: 'redrocket.extension.label.v1'
modelspec.description: 'This is an extension for the RedRocket JTP-3 Hydra image classifier. You can find usage instructions at https://huggingface.co/RedRocket/JTP-3.'
modelspec.date: ''
modelspec.tags: 'Image Classification'
classifier.label: 'example_tag'
classifier.label.category: 'general'
modelspec.title: 'JTP-3 Hydra Extension: example_tag'
modelspec.author: 'Project RedRocket'
modelspec.license: 'MIT'
modelspec.language: 'en/US'
Building extension...
Apply optimizer state: attn_pool.q
Apply optimizer state: attn_pool.out_proj.weight
Normalize: attn_pool.q
Saving extension 'extensions/jtp-3-hydra/example_tag.safetensors'...
```
### Safetensors Metadata Editor
A simple metadata editor for `.safetensors` files is included as `edit_metadata.py`. You can use this to view and edit already-built extensions, perhaps to change the tag name or add implications.
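If you just want to inspect metadata without the editor, the safetensors header is plain JSON behind an 8-byte little-endian length prefix, so the standard library suffices. A minimal read-only sketch (the demo file it writes is a stand-in for a real extension such as `extensions/jtp-3-hydra/example_tag.safetensors`):

```python
import json
import struct
import tempfile

def read_safetensors_metadata(path: str) -> dict:
    """Read the __metadata__ block from a .safetensors file header.

    The header is an 8-byte little-endian length followed by that many
    bytes of JSON; user metadata lives under the "__metadata__" key.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})

# Self-contained demo: write a minimal header-only file and read it back.
demo = tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False)
header = json.dumps({"__metadata__": {"classifier.label": "example_tag"}}).encode()
demo.write(struct.pack("<Q", len(header)) + header)
demo.close()
meta = read_safetensors_metadata(demo.name)
print(meta["classifier.label"])  # example_tag
```

Pointed at a real extension, this returns the ``classifier.*`` and ``modelspec.*`` keys shown in the build output above.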
## Technical Notes
The model consists of [SigLIP2 So400m Patch16 NAFlex](https://huggingface.co/google/siglip2-so400m-patch16-naflex) followed by a custom cross-attention transformer block with learned per-tag queries, SwiGLU feedforward, and per-tag SwiGLU output heads.
The per-tag cross attention mechanism is the origin of the moniker "hydra".
Subject to the preprocessing mentioned below, the initial set of training tags was all general tags with at least 1,200 examples, all species and character tags with at least 500 examples, a semi-automated selection of copyright and meta tags, and a handful of manually-selected lore tags which are sometimes discernible from the image.
This resulted in 8,067 tags. After training, tags with very poor validation performance were pruned, resulting in the final set of 7,504 tags.
Extensive semi-manual dataset curation was used to improve the quality of the training data.
The dataset preprocessing code consists of over 12,000 lines of code and data files.
In addition to correcting implications, manually-defined rules are used to detect common scenarios of missing, incomplete, or contradictory tagging and to selectively mask individual tags on a per-dataset-item basis.
This is responsible for JTP-3's excellent performance in detecting colors and "combo tags" such as `male_feral`.
Margin-focal cross entropy loss based on ASL was used to mitigate the effects of inconsistent labeling on e621 and the extreme class imbalance.
The dataset was sampled in mini-epochs according to a self-entropy metric.
Loss weight for negative labels was logarithmically redistributed from images with few tags to those with many tags.
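The exact margin-focal loss used in training is not published here, but the asymmetric-loss building blocks from the cited ASL paper look roughly like this (a generic plain-Python sketch of ASL per Ben-Baruch et al., not the model's actual variant):

```python
import math

def asl_term(p: float, positive: bool, gamma_pos: float = 0.0,
             gamma_neg: float = 4.0, margin: float = 0.05) -> float:
    """One ASL term: positives get a mild focal exponent; negatives get a
    strong one plus a probability margin that zeroes the loss for easy
    negatives. (Default gammas follow the ASL paper, not JTP-3.)"""
    if positive:
        return -((1.0 - p) ** gamma_pos) * math.log(p)
    p_shifted = max(p - margin, 0.0)
    return -(p_shifted ** gamma_neg) * math.log(1.0 - p_shifted)

# A confident wrong negative (p=0.9) is penalized; an easy negative
# (p=0.04, below the margin) contributes zero loss.
print(asl_term(0.9, positive=False))
print(asl_term(0.04, positive=False))
```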
Raw validation performance metrics and tag lists are available in the ``data`` folder.
These can be used to create P/R curves, compute CTI or F1 scores, or select automated thresholds for each tag.
The list of supported tags is also embedded in the safetensors metadata as ``classifier.labels``.
Internally, the model operates on logits as normal and classification thresholds are expressed in the interval from 0.0 to 1.0.
This is reflected in the ``data`` files and csv output of ``inference.py``.
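For instance, once you have derived per-tag true/false positive and false negative counts from the validation data at a chosen threshold, an F1 score follows from the standard formula (a generic helper, not project code):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(80, 10, 20))  # equivalent to 2*TP / (2*TP + FP + FN)
```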
## Credits
RedHotTensors — Architecture design, dataset curation, infrastructure and training, testing, and release.
DrHead — WebUI, multi-layer CAM, testing, and additional code.
Thessalo — Advice and testing.
[Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) — Feedback and compatibility fixes.
Google Gemini — Hero image.
### Citations
Michael Tschannen, et al. [SigLIP 2.](https://arxiv.org/abs/2502.14786)
Emanuel Ben-Baruch, et al. [Asymmetric Loss For Multi-Label Classification.](https://arxiv.org/abs/2009.14119)
Noam Shazeer. [GLU Variants Improve Transformer.](https://arxiv.org/abs/2002.05202)
Pedram Zamirai, et al. [Revisiting BFloat16 Training.](https://arxiv.org/pdf/2010.06192)