Update README.md
README.md
CHANGED
---
license: mit
tags:
- financial-services
- evaluation
- determinism
- regulatory-compliance
- sec-filings
- small-language-models
- llm
---
# LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows

[Lines 14–18 of README.md are unchanged and not shown.]

This repository is a Hugging Face landing page for the paper and its open-source implementation.
It focuses on deterministic test harnesses, cross-provider validation, and risk-tiered deployment
for financial LLM workflows (SEC 10-Ks, RAG over filings, JSON/SQL tasks).


## 🔑 Key finding

> Well-engineered **7–8B models achieve 100% output consistency at T=0.0**, while a 120B model reaches only **12.5% consistency**, regardless of configuration.


Across **480 runs** (5 models, 3 tasks, 2 temperatures, 3 concurrency levels), we show an
inverse relationship between model size and determinism and map this to regulatory
requirements (FSB, BIS, CFTC).
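
To make the consistency percentages concrete, here is a minimal sketch of one way such a score could be computed. It assumes consistency means the share of repeated runs whose output matches the most common output; the README does not define the metric, so `output_consistency` and the sample data are illustrative only.

```python
from collections import Counter


def output_consistency(outputs):
    """Share of runs whose output equals the most common output.

    Assumed definition for illustration; the paper's metric may differ.
    """
    if not outputs:
        raise ValueError("need at least one output")
    # Ignore surrounding whitespace only; any other difference counts as drift.
    normalised = [o.strip() for o in outputs]
    modal_count = Counter(normalised).most_common(1)[0][1]
    return modal_count / len(normalised)


# Hypothetical data: 16 repeated runs of a single condition.
stable_runs = ['{"revenue": 100}'] * 16
# Two identical outputs plus fourteen distinct ones.
drifting_runs = ['{"revenue": 100}'] * 2 + [
    f'{{"revenue": {i}}}' for i in range(14)
]

stable_score = output_consistency(stable_runs)
drifting_score = output_consistency(drifting_runs)
```

Under this assumed definition, 16 identical outputs score 1.0 (100%), and 2 matching outputs out of 16 score 0.125, the same value as the 12.5% quoted above.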

---

## 📊 Model tier classification

| Tier | Models | Consistency @ T=0.0 | Status | Recommended use |
|------|--------|---------------------|--------|-----------------|
| **1** | Granite-3-8B, Qwen2.5-7B | **100%** | ✅ Production-ready | All regulated tasks |
| **2** | Llama-3.3-70B, Mistral-Medium-2505 | 56–100% | ⚠️ Task-specific | SQL / structured only |
| **3** | GPT-OSS-120B | **12.5%** | ❌ Non-compliant | Not for compliance |

*n = 480 runs (16 per condition), 95% Wilson CIs, p < 0.0001 (Fisher’s exact).*
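
For readers who want to reproduce statistics of this kind, the sketch below computes a 95% Wilson score interval and a two-sided Fisher exact p-value using only the standard library. The counts (16 of 16 versus 2 of 16 consistent runs) are an assumption, derived from 16 runs per condition and the 100% and 12.5% figures in the table; the README does not state which comparison its reported p-value refers to.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(total, col1)

    p_obs = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )


# Assumed counts: consistent vs. inconsistent runs out of 16 per condition.
tier1_ci = wilson_interval(16, 16)
tier3_ci = wilson_interval(2, 16)
p_value = fisher_exact_two_sided(16, 0, 2, 14)
```

With only 16 runs per condition the intervals stay wide, even for a perfect 16 of 16, which is worth keeping in mind when reading the tier boundaries.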

---

## 🎯 Why this matters

Financial institutions face a **“verification tax”**: human review erodes AI productivity
gains when outputs are nondeterministic.

This framework shows:

- **Audit-ready determinism is achievable** with the right model + decoding setup.
- **Cross-provider consistency**: behavior transfers between local (Ollama) and cloud (IBM watsonx.ai).
- **Task-specific drift**: SQL and structured summaries remain stable even at T=0.2; RAG is far more sensitive.
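
As a rough illustration of a pinned decoding setup and a cross-provider check, the sketch below builds a request body for Ollama's `POST /api/generate` endpoint with `temperature` and `seed` fixed, and fingerprints outputs so results from two providers can be compared byte for byte. The model tag, the seed value, and the helper names `build_ollama_request` and `fingerprint` are assumptions for illustration; the README does not list the decoding parameters used in the study.

```python
import hashlib
import json


def build_ollama_request(model, prompt, temperature=0.0, seed=42):
    """Build a JSON body for Ollama's /api/generate with pinned decoding options.

    The seed value is an arbitrary assumption, not taken from the paper.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single, complete response object
        "options": {
            "temperature": temperature,
            "seed": seed,
        },
    }


def fingerprint(text):
    """SHA-256 of the whitespace-trimmed output, used to compare runs."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


# Hypothetical model tag and prompt.
payload = build_ollama_request(
    "qwen2.5:7b", "Summarize the risk factors in this 10-K excerpt: ..."
)
body = json.dumps(payload, sort_keys=True)

# Hypothetical outputs from a local and a cloud deployment of the same model.
local_output = "SELECT SUM(revenue) FROM filings WHERE year = 2023;\n"
cloud_output = "SELECT SUM(revenue) FROM filings WHERE year = 2023;"
same_across_providers = fingerprint(local_output) == fingerprint(cloud_output)
```

Comparing fingerprints rather than raw strings keeps audit logs small while still flagging any difference beyond surrounding whitespace.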