---
tags:
- furry
- e621
- not-for-all-audiences
pipeline_tag: image-classification
base_model: google/siglip2-so400m-patch16-naflex
language:
- en
license: apache-2.0
---

# JTP-3 Hydra

e621 Image Classifier by Project RedRocket
JTP-3 Hydra is a finetune of the SigLIP2 image classifier with a custom classifier head, designed to predict 7,504 popular tags from [e621](https://e621.net). A public demo of the model is available here: https://huggingface.co/spaces/RedRocket/JTP-3-Demo Jump to section:
- [Downloading](#downloading)
- [Easy Windows Installation and Usage](#easy-windows-installation-and-usage)
- [Advanced Windows Installation and Usage](#advanced-windows-installation-and-usage)
- [Linux Installation and Usage](#linux-installation-and-usage)
- [Using inference.py/inference.bat](#using-inferencepy-or-inferencebat)
- [Usage Notes](#usage-notes)
- [Calibration](#using-calibratebat-or-easy-mode-calibration)
- [Using Extensions](#using-extensions)
- [Training Extensions](#training-extensions)
- [Technical Notes](#technical-notes)
- [Credits / Citations](#credits)

## Downloading

If you have Git+LFS installed, download the repository using ``git clone https://huggingface.co/RedRocket/JTP-3``. If you are unable to do this, manually download all the `.py` files, `requirements.txt`, `models/jtp-3-hydra.safetensors`, and `data/jtp-3-hydra-tags.csv`.
If you are on Windows, also download the `.bat` files and follow the instructions below for easy installation.
If you want to run calibration, you also need `data/jtp-3-hydra-val.csv`.

## Easy Windows Installation and Usage

For Windows, ensure you have at least Python 3.11 [installed](https://www.python.org/downloads/windows/) and available on your path. If you are unsure about your version of Python, you can run `easy.bat` and it will let you know.

**For Windows, double-click ``easy.bat`` to run easy mode.** Easy mode walks you through all the commands. When easy mode asks you for a file or folder, you can drag and drop it onto the easy mode window and press enter, copy and paste the path, or type it yourself.

## Advanced Windows Installation and Usage

Double-click ``install.bat`` to run installation, which will create a virtual environment for all the requirements and install them. You can check your version of Python by opening a command prompt and typing ``python -V``.

You can run the WebUI by double-clicking ``app.bat`` and navigating your browser to the URL it shows. The link is not shared publicly.

On the command line, you can use ``inference.bat`` to do bulk operations such as tagging entire directories. Run ``inference.bat --help`` for help using the command line. If you provide a path to a file or directory, it will write ``.txt`` caption files beside each image using the default threshold of ``0.5``. Instead of using a fixed threshold, you can run the calibration wizard with ``calibrate.bat``.

## Linux Installation and Usage

If your OS Python install is not 3.11 or above, install a more recent version of Python according to your distribution's instructions and use that ``python`` to create the venv. You can check your version of Python with ``python -V``.
To install:

```sh
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

To run the WebUI:

```sh
source venv/bin/activate
python app.py
```

To run command-line inference:

```sh
source venv/bin/activate
python inference.py --help
```

## Using inference.py (or inference.bat)

If you do not have a calibration file, the default threshold of ``0.5`` is conservative. If you plan on manually reviewing the tags, consider using ``-t 0.2`` or ``-t 0.1``.

```
$ python inference.py --help
usage: inference.py [-h] [-t THRESHOLD_OR_PATH] [-i MODE] [-x CATEGORY] [-p PREFIX] [-o PATH] [-O]
                    [-M PATH] [-m PATH] [-e PATH] [-E] [-b BATCH_SIZE] [-w N_WORKERS] [--no-shm]
                    [-S SEQLEN] [-d TORCH_DEVICE] [-r]
                    [PATH ...]

positional arguments:
  PATH                  Paths to files and directories to classify. If none are specified,
                        run interactively.

options:
  -h, --help            show this help message and exit
  -r, --recursive       Classify directories recursively. Dotfiles will be ignored.

classification:
  -t, --threshold THRESHOLD_OR_PATH
                        Classification threshold -1.0 to 1.0. Or, a path to a CSV calibration
                        file. (Default: calibration.csv)
  -i, --implications MODE
                        Automatically apply implications. Requires tag metadata.
                        (Default: inherit)
  -x, --exclude CATEGORY
                        Exclude the specified category of tags. May be specified multiple
                        times. Requires tag metadata.

output:
  -p, --prefix PREFIX   Prefix all .txt caption files with the specified text. If the prefix
                        matches a tag, the tag will not be repeated.
  -o, --output PATH     Path for CSV output, or '-' for standard output. If not specified,
                        individual .txt caption files are written.
  -O, --original-tags   Do not rewrite tags for compatibility with diffusion models.

model:
  -M, --model PATH      Path to model file.
  -m, --metadata PATH   Path to CSV file with additional tag metadata.
                        (Default: data/jtp-3-hydra-tags.csv)
  -e, --extension PATH  Path to extension. May be specified multiple times. If a directory is
                        specified, all extensions in the specified directory are loaded.
                        (Default: extensions/jtp-3-hydra)
  -E, --no-default-extensions
                        Do not load extensions by default.

execution:
  -b, --batch BATCH_SIZE
                        Batch size.
  -w, --workers N_WORKERS
                        Number of dataloader workers. (Default: number of cores)
  --no-shm              Disable shared memory between workers.
  -S, --seqlen SEQLEN   NaFlex sequence length. (Default: 1024)
  -d, --device TORCH_DEVICE
                        Torch device. (Default: cuda)

MODE:
  inherit               Tags inherit the highest probability of the more specific tags that
                        imply them.
  constrain             Tags are constrained to the lowest probability of the more general
                        tags they imply.
  remove                Exclude implied tags from output.
  constrain-remove      Combination of constrain followed by remove.
  off                   No implications are applied.

CATEGORY:
  general artist copyright character species meta lore
```

Try to avoid running multiple copies of ``inference.py`` at once, as each copy will load the entire model. If you are tagging only a few images, run with ``-w 0`` to use in-process dataloading.

### Interactive Mode

If you do not provide a list of files or directories to classify, ``inference.py`` will launch in an interactive mode where you can provide files one at a time.

```
$ python inference.py
JTP-3 Hydra Interactive Classifier
Type 'q' to quit, or 'h' for help.
For bulk operations, quit and run again with a path, or '-h' for help.
> h
Provide a file path to classify, or one of the following commands:
  threshold NUM (-1.0 to 1.0, 0.2 to 0.8 recommended)
  calibration [PATH] (load calibration csv file)
  exclude CATEGORY (general copyright character species meta lore)
  include CATEGORY (general copyright character species meta lore)
  implications MODE (inherit constrain remove constrain-remove off)
  seqlen LEN (64 to 2048, 1024 recommended)
  quit (or 'q', 'exit')
```

## Usage Notes

The model predicts 7,501 e621 tags, as well as the added rating meta-tags ``safe``, ``questionable``, and ``explicit``. The model is trained with implications, but its raw predictions are not constrained.
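The ``inherit`` and ``constrain`` implication modes listed in the inference options can be sketched in a few lines. This is a simplified illustration with a hypothetical two-tag implication map, not the script's actual code:

```python
# Hypothetical implication map: each tag lists the more general tags it implies.
IMPLIES = {"tyrannosaurus_rex": ["dinosaur"], "dinosaur": []}

def inherit(probs):
    """Each tag inherits the highest probability of the more specific tags that imply it."""
    out = dict(probs)
    for tag, parents in IMPLIES.items():  # assumes specific tags are visited before general ones
        for parent in parents:
            out[parent] = max(out[parent], out[tag])
    return out

def constrain(probs):
    """Each tag is constrained to the lowest probability of the more general tags it implies."""
    out = dict(probs)
    for tag, parents in IMPLIES.items():
        for parent in parents:
            out[tag] = min(out[tag], out[parent])
    return out

probs = {"tyrannosaurus_rex": 0.9, "dinosaur": 0.4}
print(inherit(probs))    # dinosaur is raised to 0.9
print(constrain(probs))  # tyrannosaurus_rex is capped at 0.4
```

In a full implementation the implication graph is a DAG and tags must be processed in topological order; the two-tag case above sidesteps that.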
If you use the inference script, it will leverage the tag metadata, if available, to automatically apply implications unless you specify otherwise with ``-i off``. For example, with implications ``off`` the model might say ``tyrannosaurus rex`` is more likely than ``dinosaur``. In the default ``inherit`` mode, it will instead say that ``dinosaur`` is as likely as ``tyrannosaurus rex``. In the ``constrain`` mode, it will say that ``tyrannosaurus rex`` is as likely as ``dinosaur``.

The model is trained on images from e621 only, and not on photographs of people or real animals. While it has retained some ability to classify photos, this is not in any way supported.

The interactive interfaces use a threshold convention of -100% to 100%. This is different from other classifier models, which generally range from 0% to 100%.

The model sees all transparency as a black background.

## Using calibrate.bat (or Easy Mode calibration)

You can just press ``ENTER`` to get the default calibration until it asks you for a list of tags to exclude. If you don't want to exclude any tags, press ``ENTER`` again and answer ``y`` to get the default calibration.

Members of the [Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) may have created their own calibration files for you to try out, too. Be cautious if anyone offers you a custom calibration file that ends in `.py` and tells you to run it. However, `.csv` calibration files are always safe.

## Using Extensions

JTP-3 Hydra supports adding and replacing tags with extensions, which are simple `.safetensors` files similar in spirit to LoRAs. By default, `.safetensors` files placed in `extensions/jtp-3-hydra` will be loaded as extensions.

Members of the [Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) may have created their own extension files for you to try out, too. JTP-3 Hydra extension files are always safe.
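If you are curious what metadata an extension carries, the safetensors header is plain JSON and can be read with the standard library alone. A minimal sketch, which demonstrates on an in-memory header since no real extension file is assumed to be present:

```python
import io
import json
import struct

def read_safetensors_metadata(f):
    # A safetensors file starts with an 8-byte little-endian header length,
    # followed by a JSON header; free-form metadata lives under "__metadata__".
    (header_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})

# Minimal in-memory example (a header with metadata and no tensors):
header = json.dumps({"__metadata__": {"classifier.label": "example_tag"}}).encode()
blob = io.BytesIO(struct.pack("<Q", len(header)) + header)
print(read_safetensors_metadata(blob))  # {'classifier.label': 'example_tag'}
```

To inspect a real extension, open the `.safetensors` file in binary mode and pass the file object.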
If you are using calibrations, be sure to re-calibrate after adding an extension.

## Training Extensions

In order to train extensions, you will need to have some basic familiarity with the command line. `.bat` wrapper files which load the virtual environment are intentionally not provided.

You will need around 1.5 GB of free VRAM to train, or more if you use higher batch sizes. Training is generally very quick. You can expect a run to complete in under 10 minutes unless your dataset is many thousands of images.

If you are training on Windows, you will need [triton-windows](https://github.com/woct0rdho/triton-windows). Run `pip install triton-windows` with the virtual environment active.

CPU training is not supported due to the dependency on [Triton](https://github.com/triton-lang/triton). If you really want to train on a platform not supported by Triton, manually replace the optimizer, perhaps with AdamW in float32.

### Step 1 – Dataset

To train new extensions for JTP-3, create the following directories inside the `train` directory (or elsewhere):

```
tag_name/
  positive/
  negative/
```

Add at least 100 example images having the tag to the `positive` directory. For best results, you must manually review every image to ensure it has the tag you are trying to train. Try to use a diverse set of images having the tag. Don't just use your favorite images, especially if they are from a single artist.

Add a similar number of images not having the tag to the `negative` directory. For best results, you must manually review every image to ensure it does not have the tag you are trying to train.

#### Dataset Tips

At least half of the negative dataset should be random images not having the tag. For better results, the other half should be hand-selected images that contain concepts that might be easily confused with your tag. It's fine to filter out images with unsavory content that you absolutely don't want to see while reviewing.
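Before training, it can help to sanity-check your directory layout. A small hypothetical helper (not part of the repo) that counts images and flags undersized or lopsided datasets, assuming a few common image extensions:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}  # assumed common formats

def check_dataset(root):
    """Count images in positive/ and negative/ and return (counts, warnings)."""
    root = Path(root)
    counts = {}
    for split in ("positive", "negative"):
        counts[split] = sum(
            1 for p in (root / split).glob("*") if p.suffix.lower() in IMAGE_EXTS
        )
    warnings = []
    if counts["positive"] < 100:
        warnings.append("fewer than 100 positive examples")
    # "A similar number" of negatives: flag anything more than 2x off (a heuristic).
    if counts["negative"] < counts["positive"] / 2 or counts["negative"] > counts["positive"] * 2:
        warnings.append("positive/negative counts are badly imbalanced")
    return counts, warnings
```

For example, `check_dataset("train/tag_name")` on a folder with 3 positives and 1 negative returns both warnings.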
So, for example, let's say you're training `dragon_on_top_gryphon_on_bottom` with 200 positive examples. Your negative set might look like:

- 100 random images that don't have the tag
- 30 images of gryphons and dragons in other scenarios
- 20 images of a dragon on top with another species
- 20 images of a gryphon on bottom with another species
- 30 images of a gryphon on top with a dragon on bottom

### Step 2 – Training

Run `python train_extension.py --help` to familiarize yourself with the options provided by the training script. At a minimum, you will want to:

- Adjust the batch sizes (`-b`/`-B`) and/or gradient accumulation (`-a`) to match your available VRAM.
- Adjust the size of your validation set. Try to target 5%-10% of your available data, but never less than `-v 20`. (Note that the `-v` option reserves an equal number of positive and negative examples. The default `-v 20` reserves 40 total examples, 20 positive and 20 negative.)
- Set the checkpoint interval or maximum number of epochs (`-c`/`-e`).

Please resist the urge to tweak hyperparameters until you have first succeeded with the defaults.

Training begins by building a feature cache for the dataset. This should only take a few minutes, but be aware that the feature cache for each dataset item consumes about 2.3 MB of disk space.

Here's an example training run. This took about 2 minutes on an RTX 3090 with 200 total examples. (Yes, it's that fast.)

```sh
$ python train_extension.py -c 0 example_tag
```

```
Loading 'models/jtp-3-hydra.safetensors' ... 7504 tags
caching: 100%|█████████| 200/200 [00:19<00:00, 11.02it/s]
...
EPOCH 1 VALIDATION: loss=0.6758, cti=0.5556, thr=0.4501
EPOCH 2 VALIDATION: loss=0.6633, cti=0.5556, thr=0.4501
EPOCH 3 VALIDATION: loss=0.6320, cti=0.5882, thr=0.4800
EPOCH 4 VALIDATION: loss=0.5922, cti=0.6923, thr=0.5499
...
EPOCH 65 VALIDATION: loss=0.0106, cti=1.0000, thr=0.0804
EPOCH 66 VALIDATION: loss=0.0105, cti=1.0000, thr=0.0804
EPOCH 67 VALIDATION: loss=0.0112, cti=1.0000, thr=0.0901
EPOCH 68 VALIDATION: loss=0.0115, cti=1.0000, thr=0.0995
EPOCH 69 VALIDATION: loss=0.0116, cti=1.0000, thr=0.1097
EPOCH 70 VALIDATION: loss=0.0113, cti=1.0000, thr=0.0995
...
```

In this case, selecting epoch 66 seems reasonable, which would have been saved as `train/example_tag/checkpoints/_e66.pt`.

### Step 3 – Build Extension

Run `python build_extension.py --help` to familiarize yourself with the options provided by the extension builder. The extension builder converts PyTorch checkpoints in training mode to inference-ready safetensors files with additional metadata, some of which is essential.

Continuing with the example above:

```sh
$ python build_extension.py -a "Project RedRocket" train/example_tag/checkpoints/_e66.pt example_tag general
```

```
Loading checkpoint 'train/example_tag/checkpoints/_e66.pt'...
Preparing metadata...
modelspec.sai_model_spec: '1.0.0'
modelspec.architecture: 'naflexvit_so400m_patch16_siglip+rr_hydra'
modelspec.implementation: 'redrocket.extension.label.v1'
modelspec.description: 'This is an extension for the RedRocket JTP-3 Hydra image classifier. You can find usage instructions at https://huggingface.co/RedRocket/JTP-3.'
modelspec.date: ''
modelspec.tags: 'Image Classification'
classifier.label: 'example_tag'
classifier.label.category: 'general'
modelspec.title: 'JTP-3 Hydra Extension: example_tag'
modelspec.author: 'Project RedRocket'
modelspec.license: 'MIT'
modelspec.language: 'en/US'
Building extension...
Apply optimizer state: attn_pool.q
Apply optimizer state: attn_pool.out_proj.weight
Normalize: attn_pool.q
Saving extension 'extensions/jtp-3-hydra/example_tag.safetensors'...
```

### Safetensors Metadata Editor

A simple metadata editor for `.safetensors` files is included as `edit_metadata.py`.
You can use this to view and edit already-built extensions, perhaps to change the tag name or add implications.

## Technical Notes

The model consists of [SigLIP2 So400m Patch16 NAFlex](https://huggingface.co/google/siglip2-so400m-patch16-naflex) followed by a custom cross-attention transformer block with learned per-tag queries, SwiGLU feedforward, and per-tag SwiGLU output heads. The per-tag cross-attention mechanism is the origin of the moniker "hydra".

Subject to the preprocessing mentioned below, the initial set of training tags was all general tags with at least 1,200 examples, all species and character tags with at least 500 examples, a semi-automated selection of copyright and meta tags, and a handful of manually-selected lore tags which are sometimes discernible from the image. This resulted in 8,067 tags. After training, tags with very poor validation performance were pruned, resulting in the final set of 7,504 tags.

Extensive semi-manual dataset curation was used to improve the quality of the training data. The dataset preprocessing code consists of over 12,000 lines of code and data files. In addition to correcting implications, manually-defined rules are used to detect common scenarios of missing, incomplete, or contradictory tagging and to selectively mask individual tags on a per-dataset-item basis. This is responsible for JTP-3's excellent performance in detecting colors and "combo tags" such as `male_feral`.

Margin-focal cross-entropy loss based on ASL was used to mitigate the effects of inconsistent labeling on e621 and the extreme class imbalance. The dataset was sampled in mini-epochs according to a self-entropy metric. Loss weight for negative labels was logarithmically redistributed from images with few tags to those with many tags.

Raw validation performance metrics and tag lists are available in the ``data`` folder. These can be used to create P/R curves, compute CTI or F1 scores, or select automated thresholds for each tag.
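As an illustration of automated threshold selection from such metrics, here is a hedged sketch; the `(threshold, precision, recall)` rows below are made up, and the real files in ``data`` may use a different layout:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(rows):
    """Pick the threshold with the highest F1 from (threshold, precision, recall) rows."""
    return max(rows, key=lambda r: f1(r[1], r[2]))[0]

# Hypothetical P/R measurements for one tag at three candidate thresholds:
rows = [(0.2, 0.50, 0.90), (0.5, 0.80, 0.80), (0.8, 0.95, 0.40)]
print(best_threshold(rows))  # 0.5
```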
The list of supported tags is also embedded in the safetensors metadata as ``classifier.labels``.

Internally, the model operates on logits as normal, and classification thresholds are expressed in the interval from 0.0 to 1.0. This is reflected in the ``data`` files and the CSV output of ``inference.py``.

## Credits

- RedHotTensors — Architecture design, dataset curation, infrastructure and training, testing, and release.
- DrHead — WebUI, multi-layer CAM, testing, and additional code.
- Thessalo — Advice and testing.
- [Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) — Feedback and compatibility fixes.
- Google Gemini — Hero image.

### Citations

- Michael Tschannen, et al. [SigLIP 2.](https://arxiv.org/abs/2502.14786)
- Emanuel Ben-Baruch, et al. [Asymmetric Loss For Multi-Label Classification.](https://arxiv.org/abs/2009.14119)
- Noam Shazeer. [GLU Variants Improve Transformer.](https://arxiv.org/abs/2002.05202)
- Pedram Zamirai, et al. [Revisiting BFloat16 Training.](https://arxiv.org/pdf/2010.06192)