---
tags:
- furry
- e621
- not-for-all-audiences
pipeline_tag: image-classification
base_model: google/siglip2-so400m-patch16-naflex
language:
- en
license: apache-2.0
---

# JTP-3 Hydra

e621 Image Classifier by Project RedRocket
JTP-3 Hydra is a finetune of the SigLIP2 image classifier with a custom classifier head, designed to predict 7,504 popular tags from [e621](https://e621.net). A public demo of the model is available here: https://huggingface.co/spaces/RedRocket/JTP-3-Demo Jump to section:
- [Downloading](#downloading)
- [Easy Windows Installation and Usage](#easy-windows-installation-and-usage)
- [Advanced Windows Installation and Usage](#advanced-windows-installation-and-usage)
- [Linux Installation and Usage](#linux-installation-and-usage)
- [Using inference.py/inference.bat](#using-inferencepy-or-inferencebat)
- [Usage Notes](#usage-notes)
- [Calibration](#using-calibratebat-or-easy-mode-calibration)
- [Using Extensions](#using-extensions)
- [Training Extensions](#training-extensions)
- [Technical Notes](#technical-notes)
- [Credits / Citations](#credits)

## Downloading

If you have Git+LFS installed, download the repository using ``git clone https://huggingface.co/RedRocket/JTP-3``. If you are unable to do this, manually download all the `.py` files, `requirements.txt`, `models/jtp-3-hydra.safetensors`, and `data/jtp-3-hydra-tags.csv`.
If you are on Windows, also download the `.bat` files and follow the instructions below for easy installation.
If you want to run calibration, you also need `data/jtp-3-hydra-val.csv`.

## Easy Windows Installation and Usage

For Windows, ensure you have at least Python 3.11 [installed](https://www.python.org/downloads/windows/) and available on your path. If you are unsure about your version of Python, you can run `easy.bat` and it will let you know.

**For Windows, double-click ``easy.bat`` to run easy mode.** Easy mode walks you through all the commands. When easy mode asks you for a file or folder, you can drag and drop it onto the easy mode window and press enter, copy and paste the path, or type it yourself.

## Advanced Windows Installation and Usage

Double-click ``install.bat`` to run installation, which will create a virtual environment for all the requirements and install them. You can check your version of Python by opening a command prompt and typing ``python -V``.

You can run the WebUI by double-clicking ``app.bat`` and navigating your browser to the URL it shows. The link is not shared publicly.

On the command line, you can use ``inference.bat`` to do bulk operations such as tagging entire directories. Run ``inference.bat --help`` for help using the command line. If you provide a path to a file or directory, it will write ``.txt`` caption files beside each image using the default threshold of ``0.5``. Instead of using a fixed threshold, you can run the calibration wizard with ``calibrate.bat``.

## Linux Installation and Usage

If your OS Python install is not 3.11 or above, install a more recent version of Python according to your distribution's instructions and use that ``python`` to create the venv. You can check your version of Python with ``python -V``.
To install:

```sh
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

To run the WebUI:

```sh
source venv/bin/activate
python app.py
```

To run command-line inference:

```sh
source venv/bin/activate
python inference.py --help
```

## Using inference.py (or inference.bat)

If you do not have a calibration file, the default threshold of ``0.5`` is conservative. If you plan on manually reviewing the tags, consider using ``-t 0.2`` or ``-t 0.1``.

```
$ python inference.py --help
usage: inference.py [-h] [-t THRESHOLD_OR_PATH] [-i MODE] [-x CATEGORY] [-p PREFIX] [-o PATH] [-O]
                    [-M PATH] [-m PATH] [-e PATH] [-E] [-b BATCH_SIZE] [-w N_WORKERS] [--no-shm]
                    [-S SEQLEN] [-d TORCH_DEVICE] [-r]
                    [PATH ...]

positional arguments:
  PATH                  Paths to files and directories to classify. If none are specified,
                        run interactively.

options:
  -h, --help            show this help message and exit
  -r, --recursive       Classify directories recursively. Dotfiles will be ignored.

classification:
  -t, --threshold THRESHOLD_OR_PATH
                        Classification threshold -1.0 to 1.0. Or, a path to a CSV calibration
                        file. (Default: calibration.csv)
  -i, --implications MODE
                        Automatically apply implications. Requires tag metadata.
                        (Default: inherit)
  -x, --exclude CATEGORY
                        Exclude the specified category of tags. May be specified multiple
                        times. Requires tag metadata.

output:
  -p, --prefix PREFIX   Prefix all .txt caption files with the specified text. If the prefix
                        matches a tag, the tag will not be repeated.
  -o, --output PATH     Path for CSV output, or '-' for standard output. If not specified,
                        individual .txt caption files are written.
  -O, --original-tags   Do not rewrite tags for compatibility with diffusion models.

model:
  -M, --model PATH      Path to model file.
  -m, --metadata PATH   Path to CSV file with additional tag metadata.
                        (Default: data/jtp-3-hydra-tags.csv)
  -e, --extension PATH  Path to extension. May be specified multiple times. If a directory is
                        specified, all extensions in the specified directory are loaded.
                        (Default: extensions/jtp-3-hydra)
  -E, --no-default-extensions
                        Do not load extensions by default.

execution:
  -b, --batch BATCH_SIZE
                        Batch size.
  -w, --workers N_WORKERS
                        Number of dataloader workers. (Default: number of cores)
  --no-shm              Disable shared memory between workers.
  -S, --seqlen SEQLEN   NaFlex sequence length. (Default: 1024)
  -d, --device TORCH_DEVICE
                        Torch device. (Default: cuda)

MODE:
  inherit               Tags inherit the highest probability of the more specific tags that
                        imply them.
  constrain             Tags are constrained to the lowest probability of the more general
                        tags they imply.
  remove                Exclude implied tags from output.
  constrain-remove      Combination of constrain followed by remove.
  off                   No implications are applied.

CATEGORY:
  general artist copyright character species meta lore
```

Try to avoid running multiple copies of ``inference.py`` at once, as each copy will load the entire model. If you are tagging only a few images, run with ``-w 0`` to use in-process dataloading.

### Interactive Mode

If you do not provide a list of files or directories to classify, ``inference.py`` will launch in an interactive mode where you can provide files one at a time.

```
$ python inference.py
JTP-3 Hydra Interactive Classifier
Type 'q' to quit, or 'h' for help.
For bulk operations, quit and run again with a path, or '-h' for help.
> h
Provide a file path to classify, or one of the following commands:
  threshold NUM (-1.0 to 1.0, 0.2 to 0.8 recommended)
  calibration [PATH] (load calibration csv file)
  exclude CATEGORY (general copyright character species meta lore)
  include CATEGORY (general copyright character species meta lore)
  implications MODE (inherit constrain remove constrain-remove off)
  seqlen LEN (64 to 2048, 1024 recommended)
  quit (or 'q', 'exit')
```

## Usage Notes

The model predicts 7,501 e621 tags, as well as the added rating meta-tags ``safe``, ``questionable``, and ``explicit``. The model is trained with implications, but its raw predictions are not constrained.
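The ``inherit`` and ``constrain`` implication modes listed in the inference options can be sketched in a few lines. This is a simplified illustration with a hypothetical two-tag implication map, not the script's actual code:

```python
# Hypothetical implication map: each tag lists the more general tags it implies.
IMPLIES = {"tyrannosaurus_rex": ["dinosaur"], "dinosaur": []}

def inherit(probs):
    """Each tag inherits the highest probability of the more specific tags that imply it."""
    out = dict(probs)
    for tag, parents in IMPLIES.items():  # assumes specific tags are visited before general ones
        for parent in parents:
            out[parent] = max(out[parent], out[tag])
    return out

def constrain(probs):
    """Each tag is constrained to the lowest probability of the more general tags it implies."""
    out = dict(probs)
    for tag, parents in IMPLIES.items():
        for parent in parents:
            out[tag] = min(out[tag], out[parent])
    return out

probs = {"tyrannosaurus_rex": 0.9, "dinosaur": 0.4}
print(inherit(probs))    # dinosaur is raised to 0.9
print(constrain(probs))  # tyrannosaurus_rex is capped at 0.4
```

In a full implementation the implication graph is a DAG and tags must be processed in topological order; the two-tag case above sidesteps that.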
If you use the inference script, it will leverage the tag metadata, if available, to automatically apply implications unless you specify otherwise with ``-i off``. For example, with implications ``off`` the model might say ``tyrannosaurus rex`` is more likely than ``dinosaur``. In the default ``inherit`` mode, it will instead say that ``dinosaur`` is as likely as ``tyrannosaurus rex``. In the ``constrain`` mode, it will say that ``tyrannosaurus rex`` is as likely as ``dinosaur``.

The model is trained on images from e621 only, and not on photographs of people or real animals. While it has retained some ability to classify photos, this is not in any way supported.

The interactive interfaces use a threshold convention of -100% to 100%. This is different from other classifier models, which generally range from 0% to 100%.

The model sees all transparency as a black background.

## Using calibrate.bat (or Easy Mode calibration)

You can just press ``ENTER`` to get the default calibration until it asks you for a list of tags to exclude. If you don't want to exclude any tags, press ``ENTER`` again and answer ``y`` to get the default calibration.

Members of the [Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) may have created their own calibration files for you to try out, too. Be cautious if anyone offers you a custom calibration file that ends in `.py` and tells you to run it. However, `.csv` calibration files are always safe.

## Using Extensions

JTP-3 Hydra supports adding and replacing tags with extensions, which are simple `.safetensors` files similar in spirit to LoRAs. By default, `.safetensors` files placed in `extensions/jtp-3-hydra` will be loaded as extensions.

Members of the [Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) may have created their own extension files for you to try out, too. JTP-3 Hydra extension files are always safe.
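If you are curious what metadata an extension carries, the safetensors header is plain JSON and can be read with the standard library alone. A minimal sketch, which demonstrates on an in-memory header since no real extension file is assumed to be present:

```python
import io
import json
import struct

def read_safetensors_metadata(f):
    # A safetensors file starts with an 8-byte little-endian header length,
    # followed by a JSON header; free-form metadata lives under "__metadata__".
    (header_len,) = struct.unpack("<Q", f.read(8))
    header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})

# Minimal in-memory example (a header with metadata and no tensors):
header = json.dumps({"__metadata__": {"classifier.label": "example_tag"}}).encode()
blob = io.BytesIO(struct.pack("<Q", len(header)) + header)
print(read_safetensors_metadata(blob))  # {'classifier.label': 'example_tag'}
```

To inspect a real extension, open the `.safetensors` file in binary mode and pass the file object.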
If you are using calibrations, be sure to re-calibrate after adding an extension.

## Training Extensions

In order to train extensions, you will need to have some basic familiarity with the command line. `.bat` wrapper files which load the virtual environment are intentionally not provided.

You will need around 1.5 GB of free VRAM to train, or more if you use higher batch sizes. Training is generally very quick. You can expect a run to complete in under 10 minutes unless your dataset is many thousands of images.

If you are training on Windows, you will need [triton-windows](https://github.com/woct0rdho/triton-windows). Run `pip install triton-windows` with the virtual environment active.

CPU training is not supported due to the dependency on [Triton](https://github.com/triton-lang/triton). If you really want to train on a platform not supported by Triton, manually replace the optimizer, perhaps with AdamW in float32.

### Step 1 – Dataset

To train new extensions for JTP-3, create the following directories inside the `train` directory (or elsewhere):

```
tag_name/
  positive/
  negative/
```

Add at least 100 example images having the tag to the `positive` directory. For best results, you must manually review every image to ensure it has the tag you are trying to train. Try to use a diverse set of images having the tag. Don't just use your favorite images, especially if they are from a single artist.

Add a similar number of images not having the tag to the `negative` directory. For best results, you must manually review every image to ensure it does not have the tag you are trying to train.

#### Dataset Tips

At least half of the negative dataset should be random images not having the tag. For better results, the other half should be hand-selected images that contain concepts that might be easily confused with your tag. It's fine to filter out images with unsavory content that you absolutely don't want to see while reviewing.
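Before training, it can help to sanity-check your directory layout. A small hypothetical helper (not part of the repo) that counts images and flags undersized or lopsided datasets, assuming a few common image extensions:

```python
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}  # assumed common formats

def check_dataset(root):
    """Count images in positive/ and negative/ and return (counts, warnings)."""
    root = Path(root)
    counts = {}
    for split in ("positive", "negative"):
        counts[split] = sum(
            1 for p in (root / split).glob("*") if p.suffix.lower() in IMAGE_EXTS
        )
    warnings = []
    if counts["positive"] < 100:
        warnings.append("fewer than 100 positive examples")
    # "A similar number" of negatives: flag anything more than 2x off (a heuristic).
    if counts["negative"] < counts["positive"] / 2 or counts["negative"] > counts["positive"] * 2:
        warnings.append("positive/negative counts are badly imbalanced")
    return counts, warnings
```

For example, `check_dataset("train/tag_name")` on a folder with 3 positives and 1 negative returns both warnings.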
So, for example, let's say you're training `dragon_on_top_gryphon_on_bottom` with 200 positive examples. Your negative set might look like:

- 100 random images that don't have the tag
- 30 images of gryphons and dragons in other scenarios
- 20 images of a dragon on top with another species
- 20 images of a gryphon on bottom with another species
- 30 images of a gryphon on top with a dragon on bottom

### Step 2 – Training

Run `python train_extension.py --help` to familiarize yourself with the options provided by the training script. At a minimum, you will want to:

- Adjust the batch sizes (`-b`/`-B`) and/or gradient accumulation (`-a`) to match your available VRAM.
- Adjust the size of your validation set. Try to target 5%-10% of your available data, but never less than `-v 20`. (Note that the `-v` option reserves an equal number of positive and negative examples. The default `-v 20` reserves 40 total examples, 20 positive and 20 negative.)
- Set the checkpoint interval or maximum number of epochs (`-c`/`-e`).

Please resist the urge to tweak hyperparameters until you have first succeeded with the defaults.

Training begins by building a feature cache for the dataset. This should only take a few minutes, but be aware that the feature cache for each dataset item consumes about 2.3 MB of disk space.

Here's an example training run. This took about 2 minutes on an RTX 3090 with 200 total examples. (Yes, it's that fast.)

```sh
$ python train_extension.py -c 0 example_tag
```

```
Loading 'models/jtp-3-hydra.safetensors' ... 7504 tags
caching: 100%|█████████| 200/200 [00:19<00:00, 11.02it/s]
...
EPOCH 1 VALIDATION: loss=0.6758, cti=0.5556, thr=0.4501
EPOCH 2 VALIDATION: loss=0.6633, cti=0.5556, thr=0.4501
EPOCH 3 VALIDATION: loss=0.6320, cti=0.5882, thr=0.4800
EPOCH 4 VALIDATION: loss=0.5922, cti=0.6923, thr=0.5499
...
EPOCH 65 VALIDATION: loss=0.0106, cti=1.0000, thr=0.0804
EPOCH 66 VALIDATION: loss=0.0105, cti=1.0000, thr=0.0804
EPOCH 67 VALIDATION: loss=0.0112, cti=1.0000, thr=0.0901
EPOCH 68 VALIDATION: loss=0.0115, cti=1.0000, thr=0.0995
EPOCH 69 VALIDATION: loss=0.0116, cti=1.0000, thr=0.1097
EPOCH 70 VALIDATION: loss=0.0113, cti=1.0000, thr=0.0995
...
```

In this case, selecting epoch 66 seems reasonable, which would have been saved as `train/example_tag/checkpoints/_e66.pt`.

### Step 3 – Build Extension

Run `python build_extension.py --help` to familiarize yourself with the options provided by the extension builder. The extension builder converts PyTorch checkpoints in training mode to inference-ready safetensors files with additional metadata, some of which is essential.

Continuing with the example above:

```sh
$ python build_extension.py -a "Project RedRocket" train/example_tag/checkpoints/_e66.pt example_tag general
```

```
Loading checkpoint 'train/example_tag/checkpoints/_e66.pt'...
Preparing metadata...
modelspec.sai_model_spec: '1.0.0'
modelspec.architecture: 'naflexvit_so400m_patch16_siglip+rr_hydra'
modelspec.implementation: 'redrocket.extension.label.v1'
modelspec.description: 'This is an extension for the RedRocket JTP-3 Hydra image classifier. You can find usage instructions at https://huggingface.co/RedRocket/JTP-3.'
modelspec.date: ''
modelspec.tags: 'Image Classification'
classifier.label: 'example_tag'
classifier.label.category: 'general'
modelspec.title: 'JTP-3 Hydra Extension: example_tag'
modelspec.author: 'Project RedRocket'
modelspec.license: 'MIT'
modelspec.language: 'en/US'
Building extension...
Apply optimizer state: attn_pool.q
Apply optimizer state: attn_pool.out_proj.weight
Normalize: attn_pool.q
Saving extension 'extensions/jtp-3-hydra/example_tag.safetensors'...
```

### Safetensors Metadata Editor

A simple metadata editor for `.safetensors` files is included as `edit_metadata.py`.
You can use this to view and edit already-built extensions, perhaps to change the tag name or add implications.

## Technical Notes

The model consists of [SigLIP2 So400m Patch16 NAFlex](https://huggingface.co/google/siglip2-so400m-patch16-naflex) followed by a custom cross-attention transformer block with learned per-tag queries, SwiGLU feedforward, and per-tag SwiGLU output heads. The per-tag cross-attention mechanism is the origin of the moniker "hydra".

Subject to the preprocessing mentioned below, the initial set of training tags was all general tags with at least 1,200 examples, all species and character tags with at least 500 examples, a semi-automated selection of copyright and meta tags, and a handful of manually-selected lore tags which are sometimes discernible from the image. This resulted in 8,067 tags. After training, tags with very poor validation performance were pruned, resulting in the final set of 7,504 tags.

Extensive semi-manual dataset curation was used to improve the quality of the training data. The dataset preprocessing code consists of over 12,000 lines of code and data files. In addition to correcting implications, manually-defined rules are used to detect common scenarios of missing, incomplete, or contradictory tagging and to selectively mask individual tags on a per-dataset-item basis. This is responsible for JTP-3's excellent performance in detecting colors and "combo tags" such as `male_feral`.

Margin-focal cross-entropy loss based on ASL was used to mitigate the effects of inconsistent labeling on e621 and the extreme class imbalance. The dataset was sampled in mini-epochs according to a self-entropy metric. Loss weight for negative labels was logarithmically redistributed from images with few tags to those with many tags.

Raw validation performance metrics and tag lists are available in the ``data`` folder. These can be used to create P/R curves, compute CTI or F1 scores, or select automated thresholds for each tag.
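As an illustration of automated threshold selection from such metrics, here is a hedged sketch; the `(threshold, precision, recall)` rows below are made up, and the real files in ``data`` may use a different layout:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(rows):
    """Pick the threshold with the highest F1 from (threshold, precision, recall) rows."""
    return max(rows, key=lambda r: f1(r[1], r[2]))[0]

# Hypothetical P/R measurements for one tag at three candidate thresholds:
rows = [(0.2, 0.50, 0.90), (0.5, 0.80, 0.80), (0.8, 0.95, 0.40)]
print(best_threshold(rows))  # 0.5
```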
The list of supported tags is also embedded in the safetensors metadata as ``classifier.labels``.

Internally, the model operates on logits as normal, and classification thresholds are expressed in the interval from 0.0 to 1.0. This is reflected in the ``data`` files and the CSV output of ``inference.py``.

## Credits

- RedHotTensors — Architecture design, dataset curation, infrastructure and training, testing, and release.
- DrHead — WebUI, multi-layer CAM, testing, and additional code.
- Thessalo — Advice and testing.
- [Furry Diffusion Community](https://discord.com/channels/1019133813105905664/1254974507819733017) — Feedback and compatibility fixes.
- Google Gemini — Hero image.

### Citations

- Michael Tschannen, et al. [SigLIP 2.](https://arxiv.org/abs/2502.14786)
- Emanuel Ben-Baruch, et al. [Asymmetric Loss For Multi-Label Classification.](https://arxiv.org/abs/2009.14119)
- Noam Shazeer. [GLU Variants Improve Transformer.](https://arxiv.org/abs/2002.05202)
- Pedram Zamirai, et al. [Revisiting BFloat16 Training.](https://arxiv.org/pdf/2010.06192)