Financial Document Classifier (Jina-v3 Hybrid)

A production-grade classifier for regulatory financial documents, designed to categorize filings into 29 high-value classes.

This model utilizes a Hybrid Architecture:

  1. Semantic Encoder: A fine-tuned jina-embeddings-v3 (8192 token context) trained via Contrastive Learning (Triplet Loss).
  2. Classification Head: An XGBoost classifier trained on semantic vectors + metadata features (Document Length).

Repository: FinancialReports/jina-v3-financial-classifier

📊 Performance Metrics

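This model is designed for high-stakes financial environments where precision is paramount. It supports a Confidence Gating workflow: predictions with confidence above 0.75 are accepted automatically, and everything else is routed to manual review.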

| Metric | Score | Description |
|---|---|---|
| Production Reliability | 90.0% | Accuracy on documents with confidence > 75%. |
| Automation Coverage | 89.7% | Percentage of documents handled automatically. |
| Raw Accuracy | 85.44% | Baseline accuracy across all 29 classes. |
| Macro F1 | 0.86 | Balanced performance across rare and common classes. |
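
The gated metrics above can be reproduced on your own labeled data. The following is a minimal sketch, not the evaluation code used for this model: `gating_report` is a hypothetical helper, the toy probabilities are invented for illustration, and the 0.75 threshold is taken from the inference script.

```python
import numpy as np

def gating_report(probs, y_true, threshold=0.75):
    """Summarize a confidence-gated workflow for a batch of predictions.

    probs:  (n_docs, n_classes) array of predicted class probabilities.
    y_true: (n_docs,) array of true class ids.
    """
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true)

    y_pred = probs.argmax(axis=1)        # predicted class per document
    confidence = probs.max(axis=1)       # probability of the predicted class
    accepted = confidence > threshold    # documents handled automatically
    correct = y_pred == y_true

    return {
        # Accuracy over every document, gated or not.
        "raw_accuracy": float(correct.mean()),
        # Share of documents that clear the threshold.
        "coverage": float(accepted.mean()),
        # Accuracy restricted to the accepted documents.
        "accepted_accuracy": float(correct[accepted].mean()) if accepted.any() else float("nan"),
    }

# Toy data (illustrative only): 4 documents, 3 classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
]
y_true = [0, 1, 1, 2]
report = gating_report(probs, y_true)
print(report)
```

Raising the threshold generally trades coverage for accuracy on the accepted set, so it is worth sweeping a few values on a held-out sample before fixing one.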

Key Class Performance

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Director's Dealing | 96% | 98% | 0.97 |
| Remuneration Info | 98% | 90% | 0.94 |
| Net Asset Value | 94% | 99% | 0.96 |
| Voting Results | 95% | 92% | 0.93 |
| Annual Report | 90% | 86% | 0.88 |
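
For readers who want to check the table, F1 is the harmonic mean of precision and recall. The short sketch below recomputes the F1 column from the precision and recall values listed above; the helper name `f1_score` is ours and is not part of this repository.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above.
rows = {
    "Director's Dealing": (0.96, 0.98),
    "Remuneration Info": (0.98, 0.90),
    "Net Asset Value": (0.94, 0.99),
    "Voting Results": (0.95, 0.92),
    "Annual Report": (0.90, 0.86),
}

f1_by_class = {name: round(f1_score(p, r), 2) for name, (p, r) in rows.items()}
for name, f1 in f1_by_class.items():
    print(f"{name}: {f1:.2f}")
```

Macro F1 is the unweighted mean of this per-class score over all 29 classes, which is why it is sensitive to performance on rare classes.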

🚀 Usage (Python)

Because this is a Hybrid Model (Transformer + Gradient Boosted Tree), you must load the encoder and the classification head separately.

Installation

pip install sentence-transformers xgboost scikit-learn joblib huggingface_hub numpy

Inference Script

import joblib
import xgboost as xgb
import numpy as np
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

class FinancialClassifier:
    def __init__(self, repo_id="FinancialReports/jina-v3-financial-classifier"):
        print(f"Loading model from {repo_id}...")
        
        # 1. Load the Brain (Jina Encoder)
        self.encoder = SentenceTransformer(repo_id, trust_remote_code=True)
        self.encoder.max_seq_length = 8192 # Full context window
        
        # 2. Download & Load the Head (XGBoost)
        classifier_path = hf_hub_download(repo_id=repo_id, filename="financial_classifier_final_v1.json")
        self.classifier = xgb.XGBClassifier()
        self.classifier.load_model(classifier_path)
        
        # 3. Download & Load the Label Decoder
        decoder_path = hf_hub_download(repo_id=repo_id, filename="label_decoder_final_v1.pkl")
        self.id2label = joblib.load(decoder_path)
        print("✅ System Ready.")

    def predict(self, text):
        # 1. Feature Extraction (Text + Log-Length)
        # We inject document length to distinguish short Earnings Releases from long Interim Reports.
        embedding = self.encoder.encode([text])[0]
        length_feature = np.log1p(len(text))
        length_norm = length_feature / 12.0 # Normalized scale
        
        # Combine features
        features = np.hstack([embedding, [length_norm]])
        
        # 2. Inference
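        # predict_proba expects a 2-D array of shape (n_samples, n_features)
        probs = self.classifier.predict_proba(features.reshape(1, -1))[0]
        pred_id = int(np.argmax(probs))  # plain int for the label lookup
        confidence = float(probs[pred_id])
        label = self.id2label[pred_id]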
        
        return {
            "label": label,
            "confidence": round(confidence, 4),
            "status": "accept" if confidence > 0.75 else "manual_review"
        }

# Example
clf = FinancialClassifier()
doc = "We are pleased to announce the acquisition of..."
result = clf.predict(doc)
print(result)
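
To make the length feature in `predict` concrete, the sketch below isolates it and shows how it scales. The `log1p(len(text)) / 12.0` formula comes from the script above; the helper name `length_feature` and the 1024-dimensional placeholder embedding are assumptions for illustration (check the actual output size of the encoder you load).

```python
import numpy as np

def length_feature(text):
    """Log-scaled character count, divided by 12 as in the inference script."""
    return np.log1p(len(text)) / 12.0

# How the feature grows with document length (in characters).
for n_chars in (500, 5_000, 50_000, 162_754):
    print(n_chars, round(float(length_feature("x" * n_chars)), 3))

# Appending the feature to an embedding, as predict() does.
# ASSUMPTION: a 1024-dim placeholder stands in for the real encoder output.
embedding = np.zeros(1024, dtype=np.float32)
features = np.hstack([embedding, [length_feature("x" * 5_000)]])
print(features.shape)
```

Because e^12 is roughly 162,755, a document of about 163k characters maps to a value near 1.0, and shorter documents fall between 0 and 1. Longer documents exceed 1.0, since the value is not clipped.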

📂 Taxonomy (29 Classes)

The model assigns each document to exactly one of the 29 categories. The categories include:

  • Financial Reporting: Annual Report, Earnings Release, Interim / Quarterly Report, Periodic Financial Results.
  • Transactions: M&A Activity, Transaction in Own Shares, Share Issue/Capital Change, Capital/Financing Update.
  • Governance: Board/Management Information, Director's Dealing, Remuneration Information, Governance Information, Proxy Solicitation.
  • Shareholder Meetings: AGM Information, Declaration of Voting Results.
  • Funds: Net Asset Value, Fund Information / Factsheet, Notice of Dividend Amount.
  • Legal/Compliance: Regulatory Filings, Legal Proceedings Report, Delisting Announcement.

🔧 Training Details

  • Base Model: jinaai/jina-embeddings-v3
  • Context Window: 8192 Tokens (Truncation Strategy: Tail)
  • Training Data: 27,000+ synthetic and augmented financial filings.
  • Objective: Batch All Triplet Loss (Contrastive Learning).
  • Hardware: Trained on NVIDIA A100 (40GB).
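
As a rough illustration of the training objective named above, here is a NumPy sketch of a batch-all triplet loss. It is not the training code for this model: the Euclidean distance, the margin of 0.5, and the averaging over active triplets are assumptions based on the common formulation of this loss.

```python
import numpy as np

def batch_all_triplet_loss(embeddings, labels, margin=0.5):
    """Mean hinge loss over all valid (anchor, positive, negative) triplets.

    ASSUMPTIONS: Euclidean distance, margin=0.5, and averaging only over
    triplets with non-zero loss.
    """
    emb = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances, shape (N, N).
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # loss[a, p, n] = d(a, p) - d(a, n) + margin
    loss = dist[:, :, None] - dist[:, None, :] + margin

    # A triplet is valid if a and p share a label (a != p) and n has another label.
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    valid = (same & not_self)[:, :, None] & (~same)[:, None, :]

    loss = np.maximum(loss, 0.0) * valid
    n_active = int((loss > 1e-16).sum())
    return float(loss.sum() / n_active) if n_active else 0.0

# Well-separated classes: every triplet already satisfies the margin.
separated = batch_all_triplet_loss(
    [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]], [0, 0, 1, 1]
)
# Overlapping classes: a negative sits on top of an anchor, so the loss is positive.
overlapping = batch_all_triplet_loss(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]], [0, 0, 1]
)
print(separated, overlapping)
```

Minimizing this loss pulls documents of the same class together and pushes different classes at least `margin` apart, which is what makes the resulting embeddings useful as input to a downstream classifier.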

Limitations

  • Management Reports: The model occasionally confuses generic "Management Reports" with "Periodic Financial Results" due to high semantic overlap. Confidence gating is recommended.
  • Hybrid Requirement: This model cannot be loaded with AutoModelForSequenceClassification alone; it requires the accompanying XGBoost artifacts found in this repository.