Phishing Email Detector V5.1

Phishing Email Detector V5.1 is a multi-task DistilBERT model trained on merged phishing email corpora with auxiliary phishing-type supervision and calibrated probability outputs.

Developed by members of l3ak, a cybersecurity / CTF team

👉 Demo: https://huggingface.co/spaces/mikaelnurminen/phishing-detector-app


Model Overview

  • Base model: distilbert-base-uncased
  • Task: phishing vs safe classification
  • Auxiliary task: phishing type classification (6 classes)
  • Input: preprocessed email text
  • Output: phishing probability (0–1)
  • Loss: weighted cross-entropy (CE) with label smoothing
  • Calibration: temperature scaling
  • Threshold modes: balanced / high-recall / high-precision
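
As a rough illustration of the calibration step, temperature scaling divides the logits by a scalar T before the softmax. The sketch below assumes a two-logit head ordered [safe, phishing]; that ordering, the function name, and T = 2.0 are illustrative assumptions, not values taken from the released model.

```python
import math


def phishing_probability(logits, temperature=1.0):
    """Return P(phishing) from [safe, phishing] logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing logits by T > 1 softens the distribution; T < 1 sharpens it.
    scaled = [z / temperature for z in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    return exps[1] / sum(exps)


logits = [-2.0, 2.0]                                        # hypothetical model output
raw = phishing_probability(logits)                          # T = 1, uncalibrated
calibrated = phishing_probability(logits, temperature=2.0)  # T = 2 is illustrative
print(round(raw, 4), round(calibrated, 4))
```

In practice T is usually fitted on a held-out validation set by minimizing negative log-likelihood. The fitting procedure used for this model is not described in the card.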

Training Data

The model is trained on a merged and deduplicated dataset from two public phishing email sources:

| Dataset | Size | Notes |
|---|---|---|
| zefang-liu/phishing-email-dataset | ~18.6k | Primary dataset (used in V4/V5) |
| ealvaradob/phishing-dataset (emails config) | ~4.8k | Additional phishing/ham emails |

After merge + deduplication: ~30k+ emails


Data Processing Pipeline

  • HTML → text stripping
  • URL + email masking
  • Cross-dataset deduplication
  • Stratified split
  • Feature extraction (URLs, domains, tokens, etc.)
  • Pseudo-labelling of phishing type
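
The first three pipeline steps could look roughly like the sketch below. It uses only the standard library. The regular expressions, the `[URL]` / `[EMAIL]` mask tokens, and exact-match hashing for deduplication are assumptions made for illustration, since the card does not specify the actual implementation.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                               # naive HTML tag matcher
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WS_RE = re.compile(r"\s+")


def preprocess(raw):
    """Strip HTML, mask URLs and email addresses, and normalize whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)            # decode entities such as &amp;
    text = URL_RE.sub("[URL]", text)      # hypothetical mask token
    text = EMAIL_RE.sub("[EMAIL]", text)  # hypothetical mask token
    return WS_RE.sub(" ", text).strip()


def deduplicate(texts):
    """Keep the first occurrence of each text, compared case-insensitively."""
    seen = set()
    unique = []
    for text in texts:
        key = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


sample = "<p>Verify at http://evil.example/login or mail help@bank.example</p>"
cleaned = preprocess(sample)
merged = deduplicate([cleaned, cleaned.upper(), "Meeting moved to 3pm"])
```

Running deduplication on the merged corpus before the stratified split keeps identical emails from landing in both the training and test sets.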

Phishing Type Auxiliary Labels

Phishing emails receive one of 6 heuristic types:

  1. Credential harvesting
  2. Financial / invoice
  3. Account suspension
  4. Malware / attachment
  5. Brand impersonation
  6. Generic spam

Safe emails are assigned type = −1, which is masked out of the auxiliary loss.
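
A minimal sketch of how heuristic type labels and a masked auxiliary loss might fit together is shown below. The keyword rules, the zero-based type ids, and the function names are all hypothetical, because the card does not list the actual heuristics. Only the six type names and the −1 sentinel come from the text above.

```python
import math

IGNORE_TYPE = -1  # sentinel for safe emails, as stated in the card

# Hypothetical keyword rules, checked in order; NOT the card's real heuristics.
TYPE_RULES = [
    (0, ("password", "login", "verify your account")),  # credential harvesting
    (1, ("invoice", "payment", "wire transfer")),        # financial / invoice
    (2, ("suspended", "deactivated", "locked")),         # account suspension
    (3, ("attachment", ".zip", ".exe")),                 # malware / attachment
    (4, ("paypal", "microsoft", "amazon")),              # brand impersonation
]
GENERIC_SPAM = 5  # fallback type when no rule matches


def pseudo_label(text, is_phishing):
    """Assign a phishing type id, or IGNORE_TYPE for safe emails."""
    if not is_phishing:
        return IGNORE_TYPE
    lowered = text.lower()
    for type_id, keywords in TYPE_RULES:
        if any(keyword in lowered for keyword in keywords):
            return type_id
    return GENERIC_SPAM


def masked_type_loss(type_logits, type_labels):
    """Mean cross-entropy over rows whose label is not IGNORE_TYPE."""
    losses = []
    for logits, label in zip(type_logits, type_labels):
        if label == IGNORE_TYPE:
            continue  # safe emails contribute nothing to the auxiliary loss
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_z - logits[label])
    return sum(losses) / len(losses) if losses else 0.0


labels = [pseudo_label("Your invoice is attached", True),
          pseudo_label("Lunch tomorrow?", False)]
loss = masked_type_loss([[0.0] * 6, [0.0] * 6], labels)
```

In PyTorch the same masking is available through `torch.nn.CrossEntropyLoss(ignore_index=-1)`. Whether the released training code uses that option is not stated.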

This auxiliary supervision is intended to improve representation learning and recall.


Performance (Test Set)

| Metric | Score |
|---|---|
| Accuracy | 0.9855 |
| F1 | 0.9806 |
| ROC-AUC | 0.9989 |
| PR-AUC | 0.9981 |
| FP / 1k | 8.06 |
| FN / 1k | 25.23 |
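
For reference, the count-based metrics can be derived from a confusion matrix as sketched below. The card does not define the denominators of the per-1k rates. This sketch assumes false positives per 1,000 safe emails and false negatives per 1,000 phishing emails, which is one plausible reading. The counts are made up for illustration and are not the model's test results.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute accuracy, F1, and per-1k error rates from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "f1": 2 * precision * recall / (precision + recall),
        # Assumed denominators: safe emails for FP, phishing emails for FN.
        "fp_per_1k": 1000 * fp / (fp + tn),
        "fn_per_1k": 1000 * fn / (fn + tp),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = confusion_metrics(tp=90, fp=5, tn=995, fn=10)
```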

Decision Thresholds

| Mode | Threshold | Use |
|---|---|---|
| balanced | 0.3635 | default filtering |
| high_recall | 0.0100 | SOC triage |
| high_precision | 0.9546 | auto-blocking |
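
Applying a threshold is a single comparison against the calibrated phishing probability. The sketch below assumes an email is flagged when its probability is greater than or equal to the threshold. The comparison direction, the function name, and the example scores are assumptions. It also shows the general trade-off: raising the threshold flags fewer emails (favoring precision), and lowering it flags more (favoring recall).

```python
BALANCED_THRESHOLD = 0.3635  # "balanced" mode threshold from the table above


def flag_phishing(probabilities, threshold):
    """Return True for each email whose calibrated probability meets the threshold."""
    return [p >= threshold for p in probabilities]


scores = [0.02, 0.30, 0.40, 0.97]        # hypothetical calibrated outputs
balanced = flag_phishing(scores, BALANCED_THRESHOLD)
stricter = flag_phishing(scores, 0.90)   # illustrative higher threshold
looser = flag_phishing(scores, 0.05)     # illustrative lower threshold
```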

Training Configuration

  • Max length: MAX_LENGTH (DistilBERT input)
  • Effective batch: 64 (grad accumulation)
  • Scheduler: linear
  • Optimizer: AdamW
  • Epochs: V5.1 training regime
  • GPU: CUDA (mixed precision)
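
The effective batch size and linear schedule can be made concrete with a little arithmetic. In the sketch below, the per-step batch of 16, the 4 accumulation steps, the base learning rate of 2e-5, and the optional warmup are illustrative assumptions. The card states only the effective batch of 64 and a linear scheduler.

```python
def effective_batch_size(per_step_batch, accumulation_steps):
    """Samples contributing to each optimizer update under gradient accumulation."""
    return per_step_batch * accumulation_steps


def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batch = effective_batch_size(per_step_batch=16, accumulation_steps=4)  # hypothetical split
midpoint_lr = linear_lr(step=50, total_steps=100, base_lr=2e-5)        # hypothetical base LR
```

With Hugging Face Transformers, a schedule of this shape is provided by `get_linear_schedule_with_warmup`. Whether the V5.1 training run used that helper is not stated.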
