raffi-souren committed on
Commit ff2efa2 · verified
1 Parent(s): 6979eb6

Update README.md

Files changed (1)
  1. README.md +43 -1
README.md CHANGED
@@ -1,5 +1,13 @@
  ---
  license: mit
+ tags:
+ - financial-services
+ - evaluation
+ - determinism
+ - regulatory-compliance
+ - sec-filings
+ - small-language-models
+ - llm
  ---
  # LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows

@@ -11,4 +19,38 @@ license: mit

  This repository is a Hugging Face landing page for the paper and its open-source implementation.
  It focuses on deterministic test harnesses, cross-provider validation, and risk-tiered deployment
- for financial LLM workflows (SEC 10-Ks, RAG over filings, JSON/SQL tasks).
+ for financial LLM workflows (SEC 10-Ks, RAG over filings, JSON/SQL tasks).
+
+ ## 🔑 Key finding
+
+ > Well-engineered **7–8B models achieve 100% output consistency at T=0.0**, while a 120B model reaches only **12.5% consistency**, regardless of configuration.
+
+ Across **480 runs** (5 models, 3 tasks, 2 temperatures, 3 concurrency levels), we show an
+ inverse relationship between model size and determinism and map this to regulatory
+ requirements (FSB, BIS, CFTC).
+
+ ---
+
+ ## 📊 Model tier classification
+
+ | Tier | Models | Consistency @ T=0.0 | Status | Recommended use |
+ |------|-----------------------------|---------------------|--------------------|-----------------|
+ | **1** | Granite-3-8B, Qwen2.5-7B | **100%** | ✅ Production-ready | All regulated tasks |
+ | **2** | Llama-3.3-70B, Mistral-Medium-2505 | 56–100% | ⚠️ Task-specific | SQL / structured only |
+ | **3** | GPT-OSS-120B | **12.5%** | ❌ Non-compliant | Not for compliance |
+
+ *n = 480 runs (16 per condition), 95% Wilson CIs, p < 0.0001 (Fisher’s exact).*
+
+ ---
+
+ ## 🎯 Why this matters
+
+ Financial institutions face a **“verification tax”**: human review erodes AI productivity
+ gains when outputs are nondeterministic.
+
+ This framework shows:
+
+ - **Audit-ready determinism is achievable** with the right model + decoding setup.
+ - **Cross-provider consistency**: behavior transfers between local (Ollama) and cloud (IBM watsonx.ai).
+ - **Task-specific drift**: SQL and structured summaries remain stable even at T=0.2; RAG is far more sensitive.
+
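The README added in this commit reports consistency percentages with 95% Wilson confidence intervals over 16 runs per condition. The sketch below shows one way such figures could be computed. It is not taken from the repository: `consistency_rate` and `wilson_interval` are hypothetical names, and the matching rule (exact string equality after whitespace normalization, counted against the most common output) is an assumption, since the diff does not say how outputs are compared.

```python
from collections import Counter
from math import sqrt


def consistency_rate(outputs):
    """Return (matching_runs, total_runs) for a list of model outputs.

    ASSUMPTION: a run "matches" when its whitespace-normalized text equals
    the most common output in the batch. The paper may use another rule.
    """
    if not outputs:
        raise ValueError("need at least one output")
    normalized = [" ".join(text.split()) for text in outputs]
    modal_count = Counter(normalized).most_common(1)[0][1]
    return modal_count, len(normalized)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical batch: 16 runs where only 2 outputs agree with each other.
    runs = ["answer A", "answer  A"] + [f"variant {i}" for i in range(14)]
    matched, total = consistency_rate(runs)
    low, high = wilson_interval(matched, total)
    print(f"consistency = {matched}/{total} = {matched / total:.1%}")
    print(f"95% Wilson CI = [{low:.3f}, {high:.3f}]")
```

With 16 runs, a 12.5% rate would correspond to 2 matching runs under this assumed definition. The Wilson interval is used here instead of the normal approximation because it stays within [0, 1] and behaves sensibly at the 0% and 100% extremes that appear in the table.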