metadata
license: mit
tags:
  - financial-services
  - evaluation
  - determinism
  - regulatory-compliance
  - sec-filings
  - small-language-models
  - llm

LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows

Authors: Raffi Khatchadourian, Rolando Franco
Venue: AI4F @ ACM ICAIF 2025 (Nov 15 in Singapore)

Paper: https://arxiv.org/abs/2511.07585
Code: https://github.com/ibm-client-engineering/output-drift-financial-llms

This repository is a Hugging Face landing page for the paper and its open-source implementation. It focuses on deterministic test harnesses, cross-provider validation, and risk-tiered deployment for financial LLM workflows (SEC 10-Ks, RAG over filings, JSON/SQL tasks).

🔑 Key finding

Well-engineered 7–8B models achieve 100% output consistency at T=0.0, while a 120B model reaches only 12.5% consistency, regardless of configuration.

Across 480 runs (5 models, 3 tasks, 2 temperatures, 3 concurrency levels), we show an inverse relationship between model size and determinism and map this to regulatory requirements (FSB, BIS, CFTC).
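As a rough illustration of how an output-consistency figure like the ones above could be computed, here is a minimal sketch. It assumes consistency means the share of repeated runs whose whitespace-normalized output matches the most common output for that condition; the paper's exact definition may differ, and the names `normalize` and `consistency` are hypothetical.

```python
from collections import Counter
import hashlib


def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting differences do not count as drift.

    Assumption: this normalization rule is illustrative, not taken from the paper.
    """
    return " ".join(text.split())


def consistency(outputs: list[str]) -> float:
    """Return the fraction of runs that match the most common (modal) output."""
    if not outputs:
        raise ValueError("need at least one output")
    # Hash each normalized output so long responses compare cheaply.
    digests = [
        hashlib.sha256(normalize(o).encode("utf-8")).hexdigest() for o in outputs
    ]
    _, modal_count = Counter(digests).most_common(1)[0]
    return modal_count / len(digests)


if __name__ == "__main__":
    stable = ["SELECT 1;"] * 16
    drifting = ["answer A", "answer A"] + [f"variant {i}" for i in range(14)]
    print(consistency(stable), consistency(drifting))
```

Under this definition, 16 identical runs score 1.0, and a condition where the modal output appears in only 2 of 16 runs scores 0.125.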


📊 Model tier classification

| Tier | Models | Consistency @ T=0.0 | Status | Recommended use |
|------|--------|---------------------|--------|-----------------|
| 1 | Granite-3-8B, Qwen2.5-7B | 100% | ✅ Production-ready | All regulated tasks |
| 2 | Llama-3.3-70B, Mistral-Medium-2505 | 56–100% | ⚠️ Task-specific | SQL / structured only |
| 3 | GPT-OSS-120B | 12.5% | ❌ Non-compliant | Not for compliance |

n = 480 runs (16 per condition), 95% Wilson CIs, p < 0.0001 (Fisher’s exact).
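For readers who want to see what these statistics look like at n = 16, the sketch below computes a 95% Wilson score interval and a two-sided Fisher's exact test using only the standard library. The counts used in the example (16/16 versus 2/16 consistent runs) are an assumption inferred from the 100% and 12.5% figures with 16 runs per condition; which comparison the paper actually tested is not stated here, and the function names are hypothetical.

```python
from math import comb, sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Assumed counts: 16/16 consistent runs versus 2/16 consistent runs.
    print(wilson_interval(16, 16))
    print(wilson_interval(2, 16))
    print(fisher_exact_two_sided(16, 0, 2, 14))
```

With only 16 runs per condition the Wilson interval stays wide even at 16/16, which is one reason to report intervals alongside the point estimates.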


🎯 Why this matters

Financial institutions face a “verification tax”: human review erodes AI productivity gains when outputs are nondeterministic.

This framework shows:

  • Audit-ready determinism is achievable with the right model + decoding setup.
  • Cross-provider consistency: behavior transfers between local (Ollama) and cloud (IBM watsonx.ai) deployments.
  • Task-specific drift: SQL and structured summaries remain stable even at T=0.2; RAG is far more sensitive.
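To make "the right model + decoding setup" more concrete, here is a minimal sketch of a pinned decoding configuration for a local Ollama request body. It only builds the JSON payload and sends nothing. The specific values (seed 42, `top_k` of 1, `top_p` of 1.0) and the model tag are assumptions for illustration, not the settings used in the paper; `build_request` is a hypothetical helper.

```python
import json


def build_request(model: str, prompt: str, seed: int = 42) -> dict:
    """Build a request body for Ollama's /api/generate with pinned decoding options.

    Assumption: these option values are illustrative, not the paper's exact setup.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one complete response instead of a token stream
        "options": {
            "temperature": 0.0,  # greedy-style decoding, matching the T=0.0 condition
            "seed": seed,        # fixed seed so any residual sampling is repeatable
            "top_k": 1,
            "top_p": 1.0,
        },
    }


if __name__ == "__main__":
    # "example-model:8b" is a placeholder tag, not a model name from the paper.
    body = build_request("example-model:8b", "Summarize the risk factors section.")
    # Serializing with sorted keys gives a stable string that can be logged for audit.
    print(json.dumps(body, sort_keys=True))
```

Logging the serialized request next to each response gives reviewers a way to confirm that two runs were issued with identical settings before comparing their outputs.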