Phishing Email Detector V5.1
Phishing Email Detector V5.1 is a multi-task DistilBERT model trained on merged phishing email corpora with auxiliary phishing-type supervision and calibrated probability outputs.
Developed by members of l3ak, a cybersecurity / CTF team.
👉 Demo: https://huggingface.co/spaces/mikaelnurminen/phishing-detector-app
Model Overview
- Base model: distilbert-base-uncased
- Task: phishing vs safe classification
- Auxiliary task: phishing type classification (6 classes)
- Input: preprocessed email text
- Output: phishing probability (0–1)
- Loss: weighted CE + label smoothing
- Calibration: temperature scaling
- Threshold modes: balanced / high-recall / high-precision
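The calibration step can be illustrated with a minimal, standard-library sketch of temperature scaling. It assumes the binary head emits two logits ordered `[safe, phishing]`, and the temperature value used here is purely illustrative: this card does not state the fitted temperature or the logit order.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def phishing_probability(logits, temperature=1.0):
    """Return P(phishing) from [safe_logit, phishing_logit] (assumed order).

    Dividing the logits by a temperature > 1 softens over-confident
    outputs; a temperature of 1 leaves them unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    return softmax(scaled)[1]

# Hypothetical logits and temperature, for illustration only.
raw = phishing_probability([-2.0, 2.0], temperature=1.0)  # about 0.982
cal = phishing_probability([-2.0, 2.0], temperature=2.0)  # about 0.881
```

Temperature scaling rescales confidence without changing which class has the larger logit, so it affects the probability that the decision thresholds are compared against rather than the ranking of emails.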
Training Data
The model is trained on a merged and deduplicated dataset from two public phishing email sources:
| Dataset | Size | Notes |
|---|---|---|
| zefang-liu/phishing-email-dataset | ~18.6k | Primary dataset (used in V4/V5) |
| ealvaradob/phishing-dataset (emails config) | ~4.8k | Additional phishing/ham emails |
After merge + deduplication: ~30k+ emails
Data Processing Pipeline
- HTML → text stripping
- URL + email masking
- Cross-dataset deduplication
- Stratified split
- Feature extraction (URLs, domains, tokens, etc.)
- Pseudo-labelling of phishing type
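The first three steps of the pipeline above could look roughly like the following sketch. The regular expressions, the `[URL]` and `[EMAIL]` placeholder tokens, and the hash-based exact-match deduplication are assumptions made for illustration; the card does not specify the actual patterns, mask tokens, or deduplication key.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
WS_RE = re.compile(r"\s+")

def preprocess(raw):
    """Strip HTML tags, mask email addresses and URLs, collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    # Mask addresses first so "user@www.example.com" is not split by URL_RE.
    text = EMAIL_RE.sub("[EMAIL]", text)   # hypothetical mask token
    text = URL_RE.sub("[URL]", text)       # hypothetical mask token
    return WS_RE.sub(" ", text).strip()

def deduplicate(texts):
    """Keep the first occurrence of each text, compared case-insensitively."""
    seen = set()
    unique = []
    for text in texts:
        key = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

sample = ('<p>Verify at <a href="http://evil.example/login">'
          'http://evil.example/login</a> or mail support@bank.example</p>')
cleaned = preprocess(sample)
```

Running deduplication after masking means that emails differing only in their URLs or addresses collapse to one entry, which is one plausible way to catch duplicates across the two source datasets.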
Phishing Type Auxiliary Labels
Phishing emails receive one of 6 heuristic types:
- Credential harvesting
- Financial / invoice
- Account suspension
- Malware / attachment
- Brand impersonation
- Generic spam
Safe emails → type = −1 (masked in loss)
This auxiliary supervision improves representation learning and recall.
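To show what "masked in loss" means in practice, here is a minimal standard-library sketch of a cross-entropy over the 6 type classes that skips rows labelled −1. It illustrates the masking idea only; the function names are hypothetical, and how the auxiliary term is weighted against the main phishing/safe loss is not stated in this card.

```python
import math

IGNORE_TYPE = -1  # label given to safe emails, per the description above

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def masked_type_loss(type_logits, type_labels):
    """Mean cross-entropy over rows whose label is not IGNORE_TYPE.

    Safe emails contribute nothing to the auxiliary loss; if a batch
    contains no phishing emails the loss is defined as 0.0.
    """
    losses = [
        -log_softmax(logits)[label]
        for logits, label in zip(type_logits, type_labels)
        if label != IGNORE_TYPE
    ]
    return sum(losses) / len(losses) if losses else 0.0

# One phishing email (type index 2) and one safe email (masked).
uniform = [0.0] * 6
loss = masked_type_loss([uniform, uniform], [2, IGNORE_TYPE])
```

In PyTorch, the same effect is commonly obtained by passing `ignore_index=-1` to `torch.nn.functional.cross_entropy`.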
Performance (Test Set)
| Metric | Score |
|---|---|
| Accuracy | 0.9855 |
| F1 | 0.9806 |
| ROC-AUC | 0.9989 |
| PR-AUC | 0.9981 |
| FP / 1k | 8.06 |
| FN / 1k | 25.23 |
Decision Thresholds
| Mode | Threshold | Use |
|---|---|---|
| balanced | 0.3635 | default filtering |
| high_recall | 0.0100 | SOC triage |
| high_precision | 0.9546 | auto-blocking |
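Applying a decision threshold to the calibrated phishing probability can be sketched as below. The `classify` helper and the use of `>=` at the boundary are assumptions; only the balanced threshold value is taken from the table.

```python
BALANCED_THRESHOLD = 0.3635  # "balanced" mode from the table above

def classify(phishing_prob, threshold=BALANCED_THRESHOLD):
    """Label an email from its calibrated phishing probability.

    An email is flagged when its probability reaches the threshold
    (the inclusive comparison is an assumption of this sketch).
    """
    if not 0.0 <= phishing_prob <= 1.0:
        raise ValueError("phishing_prob must be in [0, 1]")
    return "phishing" if phishing_prob >= threshold else "safe"

flagged = classify(0.40)  # "phishing"
passed = classify(0.30)   # "safe"
```

Because the threshold is compared against the phishing probability, a lower threshold flags more emails and a higher one flags fewer, which is the trade-off the three modes expose.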
Training Configuration
- Max length: MAX_LENGTH (DistilBERT input)
- Effective batch: 64 (grad accumulation)
- Scheduler: linear
- Optimizer: AdamW
- Epochs: V5.1 training regime
- GPU: CUDA (mixed precision)
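As a rough illustration of two items in this configuration, the sketch below shows how an effective batch of 64 relates to gradient accumulation and how a linear learning-rate schedule behaves. The per-step batch size of 16, the base learning rate, and the step counts are hypothetical values chosen for the example; none of them is given in this card, and the card does not say whether warmup is used.

```python
def accumulation_steps(effective_batch, micro_batch):
    """Number of forward/backward passes per optimizer step."""
    if micro_batch <= 0 or effective_batch % micro_batch != 0:
        raise ValueError("effective_batch must be a multiple of micro_batch")
    return effective_batch // micro_batch

def linear_lr(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical micro-batch of 16 gives 4 accumulation steps for batch 64.
steps = accumulation_steps(64, 16)

# Hypothetical schedule: 100 optimizer steps, no warmup, base LR 2e-5.
lr_start = linear_lr(0, 100, 2e-5)
lr_mid = linear_lr(50, 100, 2e-5)
lr_end = linear_lr(100, 100, 2e-5)
```

With gradient accumulation, the loss for each micro-batch is typically divided by the number of accumulation steps so that the summed gradient matches what a single batch of 64 would produce.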