---
license: mit
tags:
- financial-services
- evaluation
- determinism
- regulatory-compliance
- sec-filings
- small-language-models
- llm
---
# LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows

**Authors:** Raffi Khatchadourian, Rolando Franco
**Venue:** AI4F @ ACM ICAIF 2025 (November 15, Singapore)

**Paper:** https://arxiv.org/abs/2511.07585
**Code:** https://github.com/ibm-client-engineering/output-drift-financial-llms

This repository is a Hugging Face landing page for the paper and its open-source implementation.
It focuses on deterministic test harnesses, cross-provider validation, and risk-tiered deployment
for financial LLM workflows (SEC 10-Ks, RAG over filings, JSON/SQL tasks).

## 🔑 Key finding

> Well-engineered **7–8B models achieve 100% output consistency at T=0.0**, while a 120B model reaches only **12.5% consistency**, regardless of configuration.

Across **480 runs** (5 models, 3 tasks, 2 temperatures, 3 concurrency levels), we find an
inverse relationship between model size and determinism and map the results to regulatory
requirements (FSB, BIS, CFTC).
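
This page does not restate how the paper computes output consistency. As a rough illustration only, the sketch below assumes consistency is the share of repeated runs whose whitespace-normalized output matches the most common output. The `normalize` and `consistency` helpers and the sample strings are hypothetical and are not taken from the repository.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivial formatting differences are ignored.

    This normalization rule is an assumption made for illustration.
    """
    return " ".join(text.split())


def consistency(outputs: list[str]) -> float:
    """Return the fraction of runs whose normalized output equals the modal output."""
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize(o) for o in outputs)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(outputs)


# Hypothetical outputs from 16 repeated runs of the same prompt.
stable = ["SELECT SUM(revenue) FROM filings;"] * 16
drifting = ["Answer A"] * 2 + [f"Answer variant {i}" for i in range(14)]

print(consistency(stable))    # 1.0
print(consistency(drifting))  # 0.125
```

Under this assumed definition, a model that repeats its most common answer in only 2 of 16 runs scores 12.5%, which is one way a figure like the one quoted above could arise.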

---

## 📊 Model tier classification

| Tier | Models | Consistency @ T=0.0 | Status | Recommended use |
|------|--------|---------------------|--------|-----------------|
| **1** | Granite-3-8B, Qwen2.5-7B | **100%** | ✅ Production-ready | All regulated tasks |
| **2** | Llama-3.3-70B, Mistral-Medium-2505 | 56–100% | ⚠️ Task-specific | SQL / structured only |
| **3** | GPT-OSS-120B | **12.5%** | ❌ Non-compliant | Not for compliance |

*n = 480 runs (16 per condition); 95% Wilson confidence intervals; p < 0.0001 (Fisher’s exact test).*
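
To give a sense of how wide a 95% Wilson interval is at n = 16, here is a minimal sketch of the standard Wilson score formula. The success counts are inferred from the table (100% of 16 runs is 16, and 12.5% of 16 runs is 2); they are assumptions, not values copied from the paper, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed counts: 16 of 16 consistent runs (Tier 1) and 2 of 16 (Tier 3).
for label, k in [("16/16", 16), ("2/16", 2)]:
    low, high = wilson_interval(k, 16)
    print(f"{label}: [{low:.2f}, {high:.2f}]")
# Prints roughly [0.81, 1.00] for 16/16 and [0.03, 0.36] for 2/16.
```

Even with only 16 runs per condition, the two intervals do not overlap, which is consistent with the strong separation between tiers reported above.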

---

## 🎯 Why this matters

Financial institutions face a **“verification tax”**: when outputs are nondeterministic, the human
review needed to check them erodes the productivity gains from AI.

This framework shows:

- **Audit-ready determinism is achievable** with the right combination of model and decoding settings.
- **Cross-provider consistency**: behavior transfers between local (Ollama) and cloud (IBM watsonx.ai) deployments.
- **Task-specific drift**: SQL and structured summaries remain stable even at T=0.2; RAG is far more sensitive.
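
As an illustration of what a deterministic decoding setup might look like for the local provider, the sketch below builds a request body for Ollama's `/api/generate` endpoint with temperature 0.0 and a fixed seed. The seed value, the `top_p` setting, the model tag, and the `build_generate_request` helper are all assumptions for illustration; the paper's actual configuration lives in the linked code repository.

```python
import json


def build_generate_request(model: str, prompt: str, seed: int = 42) -> dict:
    """Build a JSON body for Ollama's /api/generate with greedy-style decoding.

    The seed and top_p values here are illustrative assumptions.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one complete response instead of chunks
        "options": {
            "temperature": 0.0,  # T=0.0, as in the consistency results above
            "seed": seed,        # fixed seed so any residual sampling is repeatable
            "top_p": 1.0,
        },
    }


# "granite3-dense:8b" is a placeholder model tag; substitute whatever tag is installed locally.
payload = build_generate_request(
    "granite3-dense:8b",
    "Summarize the risk factors section of this 10-K excerpt.",
)
print(json.dumps(payload, indent=2))
```

The payload would be sent as an HTTP POST to a running Ollama server. Keeping the request construction in one function makes it easier to log the exact decoding parameters alongside each output for audit purposes.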