Financial Document Classifier (Jina-v3 Hybrid)
A production-grade classifier for regulatory financial documents, designed to categorize filings into 29 high-value classes.
This model utilizes a Hybrid Architecture:
- Semantic Encoder: A fine-tuned jina-embeddings-v3 (8192-token context) trained via Contrastive Learning (Triplet Loss).
- Classification Head: An XGBoost classifier trained on semantic vectors + metadata features (Document Length).
Repository: FinancialReports/jina-v3-financial-classifier
Performance Metrics
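This model is designed for high-stakes financial environments where precision matters more than full automation. It supports a Confidence Gating workflow: predictions with confidence above 0.75 are accepted automatically, and all other documents are routed to manual review.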
| Metric | Score | Description |
|---|---|---|
| Production Reliability | 90.0% | Accuracy on documents with confidence > 75%. |
| Automation Coverage | 89.7% | Percentage of documents handled automatically. |
| Raw Accuracy | 85.44% | Baseline accuracy across all 29 classes. |
| Macro F1 | 0.86 | Balanced performance across rare and common classes. |
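The gating metrics in the table can be recomputed on any labelled evaluation set. The following is a minimal sketch, assuming that "Production Reliability" means accuracy over documents above the threshold and "Automation Coverage" means the share of documents above it; the function name `gating_report` and the toy inputs are hypothetical and not part of this repository.

```python
import numpy as np

def gating_report(confidences, is_correct, threshold=0.75):
    """Summarise a confidence-gated workflow on a labelled evaluation set.

    confidences: top-class probability for each document
    is_correct:  whether the top-class prediction matched the true label
    """
    conf = np.asarray(confidences, dtype=float)
    correct = np.asarray(is_correct, dtype=bool)
    accepted = conf > threshold  # same rule as the "accept" status in the inference script

    coverage = float(accepted.mean())  # share of documents handled automatically
    # Accuracy restricted to accepted documents; NaN if nothing passes the gate.
    reliability = float(correct[accepted].mean()) if accepted.any() else float("nan")
    raw_accuracy = float(correct.mean())  # accuracy with no gating at all
    return {"coverage": coverage, "reliability": reliability, "raw_accuracy": raw_accuracy}

# Toy data for illustration only (not the model's real evaluation set).
report = gating_report(
    confidences=[0.95, 0.80, 0.60, 0.76, 0.40],
    is_correct=[True, True, False, False, True],
)
print(report)
```

Raising the threshold generally trades coverage for reliability, so the 0.75 cut-off is worth re-tuning on your own documents.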
Key Class Performance
| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Director's Dealing | 96% | 98% | 0.97 |
| Remuneration Info | 98% | 90% | 0.94 |
| Net Asset Value | 94% | 99% | 0.96 |
| Voting Results | 95% | 92% | 0.93 |
| Annual Report | 90% | 86% | 0.88 |
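The F1 column is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the rounded percentages in the table; the helper name `f1_score` is illustrative, and small differences from the reported values could arise because the table inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above.
rows = {
    "Director's Dealing": (0.96, 0.98),
    "Remuneration Info": (0.98, 0.90),
    "Net Asset Value": (0.94, 0.99),
    "Voting Results": (0.95, 0.92),
    "Annual Report": (0.90, 0.86),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```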
Usage (Python)
Because this is a Hybrid Model (Transformer + Gradient Boosted Tree), you must load the encoder and the classification head separately.
Installation
pip install sentence-transformers xgboost scikit-learn joblib huggingface_hub numpy
Inference Script
import joblib
import xgboost as xgb
import numpy as np
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer
class FinancialClassifier:
    """Hybrid classifier: Jina encoder for embeddings, XGBoost head for labels."""

    def __init__(self, repo_id="FinancialReports/jina-v3-financial-classifier"):
        print(f"Loading model from {repo_id}...")
        # 1. Load the Brain (Jina Encoder)
        self.encoder = SentenceTransformer(repo_id, trust_remote_code=True)
        self.encoder.max_seq_length = 8192  # Full context window
        # 2. Download & Load the Head (XGBoost)
        classifier_path = hf_hub_download(repo_id=repo_id, filename="financial_classifier_final_v1.json")
        self.classifier = xgb.XGBClassifier()
        self.classifier.load_model(classifier_path)
        # 3. Download & Load the Label Decoder
        decoder_path = hf_hub_download(repo_id=repo_id, filename="label_decoder_final_v1.pkl")
        self.id2label = joblib.load(decoder_path)
        print("System Ready.")

    def predict(self, text):
        # 1. Feature Extraction (Text + Log-Length)
        # We inject document length to distinguish short Earnings Releases from long Interim Reports.
        embedding = self.encoder.encode([text])[0]
        length_feature = np.log1p(len(text))
        length_norm = length_feature / 12.0  # Normalized scale
        # Combine features into one vector: [embedding..., length_norm]
        features = np.hstack([embedding, [length_norm]])
        # 2. Inference
        probs = self.classifier.predict_proba([features])[0]
        pred_id = int(np.argmax(probs))  # plain int, usable as a dict key or list index
        confidence = float(np.max(probs))
        label = self.id2label[pred_id]
        return {
            "label": label,
            "confidence": round(confidence, 4),
            # Confidence gating: low-confidence predictions go to manual review
            "status": "accept" if confidence > 0.75 else "manual_review"
        }
# Example
clf = FinancialClassifier()
doc = "We are pleased to announce the acquisition of..."
result = clf.predict(doc)
print(result)
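The length feature built inside `predict` can be examined on its own, without downloading the model. The sketch below reproduces only that step; the helper names `length_feature` and `build_features` and the placeholder embedding are hypothetical, while the `log1p` transform and the divisor of 12.0 are taken from the inference script.

```python
import numpy as np

LENGTH_SCALE = 12.0  # divisor used in predict() in the inference script

def length_feature(text):
    """Log-scaled character count, normalised the same way as in predict()."""
    return float(np.log1p(len(text)) / LENGTH_SCALE)

def build_features(embedding, text):
    """Append the length feature to an embedding, giving a vector of len(embedding) + 1."""
    return np.hstack([np.asarray(embedding, dtype=float), [length_feature(text)]])

# Placeholder embedding for illustration; a real one comes from the encoder.
dummy_embedding = np.zeros(8)

short_doc = "x" * 2_000      # stand-in for a short announcement
long_doc = "x" * 200_000     # stand-in for a long report

print(length_feature(short_doc))
print(length_feature(long_doc))
print(build_features(dummy_embedding, short_doc).shape)
```

Because the count is in characters rather than tokens, and the divisor is fixed, documents longer than roughly 160,000 characters (about e^12) produce a value above 1. The feature is a soft scaling and not a strict 0 to 1 normalisation.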
Taxonomy (29 Classes)
The model assigns each document to exactly one category. The categories include:
- Financial Reporting: Annual Report, Earnings Release, Interim / Quarterly Report, Periodic Financial Results.
- Transactions: M&A Activity, Transaction in Own Shares, Share Issue/Capital Change, Capital/Financing Update.
- Governance: Board/Management Information, Director's Dealing, Remuneration Information, Governance Information, Proxy Solicitation.
- Shareholder Meetings: AGM Information, Declaration of Voting Results.
- Funds: Net Asset Value, Fund Information / Factsheet, Notice of Dividend Amount.
- Legal/Compliance: Regulatory Filings, Legal Proceedings Report, Delisting Announcement.
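For downstream routing it can help to map a predicted label back to its group. The following is a small sketch covering only the categories named in the list above; the exact label strings emitted by the label decoder are an assumption here, and `TAXONOMY`, `LABEL_TO_GROUP` and `group_of` are hypothetical names.

```python
# Groups and categories as listed above. The spelling of each label is assumed
# to match the decoder output and should be checked against the real artifact.
TAXONOMY = {
    "Financial Reporting": [
        "Annual Report", "Earnings Release",
        "Interim / Quarterly Report", "Periodic Financial Results",
    ],
    "Transactions": [
        "M&A Activity", "Transaction in Own Shares",
        "Share Issue/Capital Change", "Capital/Financing Update",
    ],
    "Governance": [
        "Board/Management Information", "Director's Dealing",
        "Remuneration Information", "Governance Information", "Proxy Solicitation",
    ],
    "Shareholder Meetings": ["AGM Information", "Declaration of Voting Results"],
    "Funds": ["Net Asset Value", "Fund Information / Factsheet", "Notice of Dividend Amount"],
    "Legal/Compliance": ["Regulatory Filings", "Legal Proceedings Report", "Delisting Announcement"],
}

# Reverse lookup: category label -> group name.
LABEL_TO_GROUP = {label: group for group, labels in TAXONOMY.items() for label in labels}

def group_of(label):
    """Return the taxonomy group for a label, or "Unknown" if it is not listed."""
    return LABEL_TO_GROUP.get(label, "Unknown")

print(group_of("Net Asset Value"))
print(group_of("Director's Dealing"))
```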
Training Details
- Base Model: jinaai/jina-embeddings-v3
- Context Window: 8192 Tokens (Truncation Strategy: Tail)
- Training Data: 27,000+ synthetic and augmented financial filings.
- Objective: Batch All Triplet Loss (Contrastive Learning).
- Hardware: Trained on NVIDIA A100 (40GB).
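To make the training objective concrete, here is a minimal NumPy sketch of a batch-all triplet loss. It is an illustration and not the training code used for this model: Euclidean distance, the margin value, and averaging over triplets with non-zero loss are all assumptions. The sentence-transformers library ships a `BatchAllTripletLoss` class, but this card does not state which implementation or settings were used.

```python
import numpy as np

def batch_all_triplet_loss(embeddings, labels, margin=1.0):
    """Batch-all triplet loss over every valid (anchor, positive, negative) triple.

    A triple is valid when anchor and positive share a label (and are different
    items) while the negative has a different label.
    """
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances, shape (n, n).
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # loss[a, p, n] = d(a, p) - d(a, n) + margin
    loss = dist[:, :, None] - dist[:, None, :] + margin

    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    # valid[a, p, n]: same label for (a, p), a != p, different label for (a, n)
    valid = (same & not_self)[:, :, None] & (~same)[:, None, :]

    # Hinge, then zero out invalid triples.
    loss = np.maximum(loss, 0.0) * valid

    # Average over "active" triples only (assumed convention).
    num_active = np.count_nonzero(loss > 1e-16)
    return float(loss.sum() / max(num_active, 1))

# Toy batch: two documents of class 0 close together, one of class 1 far away.
toy_embeddings = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
toy_labels = [0, 0, 1]
print(batch_all_triplet_loss(toy_embeddings, toy_labels, margin=1.0))
print(batch_all_triplet_loss(toy_embeddings, toy_labels, margin=3.0))
```

With the smaller margin the classes are already separated by more than the margin, so no triple contributes. With the larger margin both valid triples are penalised, which pushes same-class filings closer together relative to other classes.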
Limitations
- Management Reports: The model occasionally confuses generic "Management Reports" with "Periodic Financial Results" due to high semantic overlap. Confidence gating is recommended.
- Hybrid Requirement: This model cannot be loaded with AutoModelForSequenceClassification alone; it requires the accompanying XGBoost artifacts found in this repository.