	---
	license: mit
	tags:
	- financial-services
	- evaluation
	- determinism
	- regulatory-compliance
	- sec-filings
	- small-language-models
	- llm
	---
	# LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows

	Authors: Raffi Khatchadourian, Rolando Franco
	Venue: AI4F @ ACM ICAIF 2025 (November 15, 2025, Singapore)

	Paper: https://arxiv.org/abs/2511.07585
	Code: https://github.com/ibm-client-engineering/output-drift-financial-llms

	This repository is a Hugging Face landing page for the paper and its open-source implementation.
	It focuses on deterministic test harnesses, cross-provider validation, and risk-tiered deployment
	for financial LLM workflows (SEC 10-Ks, RAG over filings, JSON/SQL tasks).

	## 🔑 Key finding

	> Well-engineered 7–8B models achieve 100% output consistency at T=0.0, while a 120B model reaches only 12.5% consistency, regardless of configuration.

	Across 480 runs (5 models, 3 tasks, 2 temperatures, 3 concurrency levels), we show an
	inverse relationship between model size and determinism and map this to regulatory
	requirements (FSB, BIS, CFTC).
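
One way to make "output consistency" concrete is to measure how many repeated runs of the same prompt agree with the most common output. The sketch below assumes that modal-agreement definition and a whitespace-only normalization step; the paper's exact metric may differ, and the helper names (`normalize`, `consistency`) and the sample outputs are illustrative, not taken from the repository.

```python
from collections import Counter
import hashlib


def normalize(output: str) -> str:
    """Collapse whitespace so trivial formatting differences do not count as drift."""
    return " ".join(output.split())


def consistency(outputs: list[str]) -> float:
    """Share of runs whose normalized output equals the most common output.

    Assumption: "consistency" means agreement with the modal output.
    """
    if not outputs:
        raise ValueError("need at least one output")
    digests = [
        hashlib.sha256(normalize(o).encode("utf-8")).hexdigest() for o in outputs
    ]
    _, modal_count = Counter(digests).most_common(1)[0]
    return modal_count / len(digests)


# Made-up outputs for 16 repeated runs of one prompt (illustrative only).
stable_runs = ["SELECT SUM(revenue) FROM filings;"] * 16
drifting_runs = ["answer A"] * 2 + [f"answer variant {i}" for i in range(14)]

print(consistency(stable_runs))    # 16 of 16 runs agree
print(consistency(drifting_runs))  # only 2 of 16 runs agree
```

Hashing the normalized text keeps the comparison cheap and gives an auditable fingerprint per run; a stricter harness could skip normalization and compare raw bytes.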

	---

	## 📊 Model tier classification

	| Tier | Models | Consistency @ T=0.0 | Status | Recommended use |
	|------|--------|---------------------|--------|-----------------|
	| 1 | Granite-3-8B, Qwen2.5-7B | 100% | ✅ Production-ready | All regulated tasks |
	| 2 | Llama-3.3-70B, Mistral-Medium-2505 | 56–100% | ⚠️ Task-specific | SQL / structured only |
	| 3 | GPT-OSS-120B | 12.5% | ❌ Non-compliant | Not for compliance |

	n = 480 runs (16 per condition), 95% Wilson CIs, p < 0.0001 (Fisher’s exact).
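
For readers unfamiliar with Wilson intervals, the following is a minimal sketch of the standard Wilson score interval for a binomial proportion, using only the Python standard library. The counts used in the example (16 of 16 and 2 of 16 consistent runs) are assumptions chosen to match 100% and 12.5% at 16 runs per condition; they are not figures quoted from the paper, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed counts: 16 runs per condition, all consistent vs. 2 consistent.
for label, k in [("16/16 consistent", 16), ("2/16 consistent", 2)]:
    low, high = wilson_interval(k, 16)
    print(f"{label}: [{low:.3f}, {high:.3f}]")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays informative at proportions of 0% or 100%, which is why it suits small samples with perfect agreement.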

	---

	## 🎯 Why this matters

	Financial institutions face a “verification tax”: human review erodes AI productivity
	gains when outputs are nondeterministic.

	The framework demonstrates three things:

	- Audit-ready determinism: achievable with the right combination of model and decoding setup.
	- Cross-provider consistency: behavior transfers between local (Ollama) and cloud (IBM watsonx.ai) deployments.
	- Task-specific drift: SQL and structured summaries remain stable even at T=0.2, whereas RAG is far more sensitive.
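
As an illustration of what a deterministic decoding setup can look like for the local (Ollama) path, the sketch below builds a request body for Ollama's `/api/generate` endpoint with the temperature pinned to 0.0 and a fixed seed. The seed value, the model tag, and the helper name `build_ollama_request` are assumptions for illustration; the settings actually used in the paper are documented in the linked code repository.

```python
import json


def build_ollama_request(model: str, prompt: str, seed: int = 42) -> dict:
    """Build a request body for Ollama's /api/generate with pinned decoding options.

    Assumption: temperature 0.0 plus a fixed seed is the intended
    "deterministic" configuration; adjust to match your own harness.
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one complete response instead of chunks
        "options": {
            "temperature": 0.0,
            "seed": seed,
        },
    }


# Hypothetical model tag and prompt, for illustration only.
payload = build_ollama_request(
    model="qwen2.5:7b-instruct",
    prompt="Summarize the risk factors section of the attached 10-K.",
)
print(json.dumps(payload, indent=2))
```

Building the payload separately from sending it makes it easy to log the exact request alongside each output, which is useful when the run history must be audited.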