# P2 Bug: 7B Model Produces Garbage Streaming Output
**Date**: 2025-12-02
**Status**: OPEN - Investigating
**Severity**: P2 (Major - Degrades User Experience)
**Component**: Free Tier / HuggingFace + Multi-Agent Orchestration
---
## Symptoms
When running a research query on Free Tier (Qwen2.5-7B-Instruct), the streaming output shows **garbage tokens** and **malformed tool calls** instead of coherent agent reasoning:
### Symptom A: Random Garbage Tokens
```text
📡 **STREAMING**: yarg
📡 **STREAMING**: PostalCodes
📡 **STREAMING**: FunctionFlags
📡 **STREAMING**: system
📡 **STREAMING**: Transferred to searcher, adopt the persona immediately.
```
### Symptom B: Raw Tool Call JSON in Text (NEW - 2025-12-03)
```text
📡 **STREAMING**:
oleon
{"name": "search_preprints", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
</tool_call>
system
UrlParser
{"name": "search_clinical_trials", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
```
The model is outputting:
1. **Garbage tokens**: "oleon", "UrlParser" - meaningless fragments
2. **Raw JSON tool calls**: `{"name": "search_preprints", ...}` - intended tool calls output as TEXT
3. **XML-style tags**: `</tool_call>` - model trying to use wrong tool calling format
4. **"system" keyword**: Model confusing role markers with content
**Root Cause of Symptom B**: The 7B model is attempting to make tool calls but outputting them as **text content** instead of using the HuggingFace API's native `tool_calls` structure. The model may have been trained on a different tool calling format (XML-style like Claude's `<tool_call>` tags) and doesn't properly use the OpenAI-compatible JSON format.
Across both symptoms the net effect is the same: instead of actual research reasoning, the stream shows random tokens such as "yarg", "PostalCodes", and "FunctionFlags".
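One mitigation for Symptom B would be to detect these text-embedded tool calls in the chat client adapter and re-emit them as structured calls, or at least keep them out of the stream. Below is a minimal sketch, assuming the streamed chunks arrive as plain strings; the helper name and regex are hypothetical, and nested objects inside `arguments` are not handled.

```python
import json
import re

# Hypothetical helper: detect a tool call the model emitted as plain text
# instead of a native tool_calls entry. Minimal sketch only; a real fix would
# live in the chat client adapter (src/clients/huggingface.py).
# Note: does not handle nested objects inside "arguments".
_INLINE_TOOL_CALL = re.compile(
    r'\{\s*"name"\s*:\s*"[^"]+"\s*,\s*"arguments"\s*:\s*\{.*?\}\s*\}', re.DOTALL
)

def extract_inline_tool_call(text: str) -> dict | None:
    """Return {"name": ..., "arguments": ...} if text contains a raw tool call."""
    match = _INLINE_TOOL_CALL.search(text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if "name" in call and "arguments" in call else None

# Using the captured chunk from Symptom B:
chunk = '{"name": "search_preprints", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}'
print(extract_inline_tool_call(chunk))  # -> {'name': 'search_preprints', 'arguments': {...}}
```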
---
## Reproduction Steps
1. Go to HuggingFace Spaces: https://huggingface.co/spaces/vcms/deepboner
2. Leave API key empty (Free Tier)
3. Click any example query or type a question
4. Click submit
5. Observe streaming output - garbage tokens appear
**Expected**: Coherent agent reasoning like "Searching PubMed for female libido treatments..."
**Actual**: Random tokens like "yarg", "PostalCodes"
---
## Root Cause Analysis
### Primary Cause: 7B Model Too Small for Multi-Agent Prompts
The Qwen2.5-7B-Instruct model has **insufficient reasoning capacity** for the complex multi-agent framework. The system requires the model to:
1. **Adopt agent personas** with specialized instructions
2. **Follow structured workflows** (Search → Judge → Hypothesis → Report)
3. **Make tool calls** (search_pubmed, search_clinical_trials, etc.)
4. **Generate JSON-formatted progress ledgers** for workflow control
5. **Understand manager instructions** and delegate appropriately
A 7B parameter model simply does not have the reasoning depth to handle this. Larger models (70B+) were originally intended, but those are routed to unreliable third-party providers (see `HF_FREE_TIER_ANALYSIS.md`).
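To make item 4 above concrete, the orchestration expects the manager model to return structured JSON on every turn, roughly of the following shape. This is an illustrative approximation only; the framework's actual schema may differ.

```python
# Illustrative approximation of a Magentic-style progress ledger; field names
# are examples, not the framework's exact schema.
example_progress_ledger = {
    "is_request_satisfied": {"answer": False, "reason": "No sources gathered yet"},
    "is_progress_being_made": {"answer": True, "reason": "Search query formulated"},
    "next_speaker": {"answer": "searcher", "reason": "Literature search comes first"},
    "instruction_or_question": {
        "answer": "Search PubMed and ClinicalTrials.gov for post-menopausal libido treatments"
    },
}
# The 7B model must produce JSON like this *and* follow persona instructions
# *and* emit well-formed tool calls - the point where its output degrades.
```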
### Technical Flow (Where Garbage Appears)
```
User Query
↓
AdvancedOrchestrator.run() [advanced.py:247]
↓
workflow.run_stream(task) [builds Magentic workflow]
↓
MagenticAgentDeltaEvent emitted with event.text
↓
Yields AgentEvent(type="streaming", message=event.text) [advanced.py:314-319]
↓
Gradio displays: "📡 **STREAMING**: {garbage}"
```
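In code, the passthrough this diagram describes looks roughly like the following. This is a simplified sketch of the handler around `advanced.py:314-319`, not the actual source; the import paths are assumptions.

```python
# Simplified sketch of the passthrough around src/orchestrators/advanced.py:314-319.
# The import paths below are assumptions; the real module layout may differ.
from agent_framework import MagenticAgentDeltaEvent  # assumed import path
from src.models import AgentEvent                    # assumed import path

async def stream_agent_events(workflow, task):
    """Forward model deltas to the UI exactly as the framework emits them."""
    async for event in workflow.run_stream(task):
        if isinstance(event, MagenticAgentDeltaEvent) and event.text:
            # event.text is passed through unfiltered, so garbage tokens and
            # raw tool-call JSON from the 7B model reach Gradio verbatim.
            yield AgentEvent(type="streaming", message=event.text)
```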
The garbage tokens are **raw model output**. The 7B model is:
- Not following the system prompt
- Outputting partial/incomplete token sequences
- Possibly attempting tool calls but formatting incorrectly
- Hallucinating random words
### Evidence from Microsoft Reference Framework
The Microsoft Agent Framework's `_magentic.py` (lines 1717-1741) shows how agent invocation works:
```python
async for update in agent.run_stream(messages=self._chat_history):
updates.append(update)
await self._emit_agent_delta_event(ctx, update)
```
The framework passes through whatever the underlying chat client produces. If the model produces garbage, the framework streams it directly.
### Why Clicking an Example vs. Clicking Submit Shows a Different Initial State
Both code paths go through the same `research_agent()` function in `app.py`. The difference:
- **Example click**: Immediately submits query, so you see garbage quickly
- **Submit button click**: Shows "Starting research (Advanced mode)" banner first, then garbage
Both ultimately produce the same garbage output from the 7B model.
---
## Impact Assessment
| Aspect | Impact |
|--------|--------|
| Free Tier Users | Cannot get usable research results |
| Demo Quality | Appears broken/unprofessional |
| Trust | Users may think the entire system is broken |
| Differentiation | Undermines "free tier works!" messaging |
---
## Potential Solutions
### Option 1: Switch to Better Small Model (Recommended - Quick Fix)
Find a small model that better handles complex instructions. Candidates:
| Model | Size | Tool Calling | Instruction Following |
|-------|------|--------------|----------------------|
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | Yes | Better |
| `microsoft/Phi-3-mini-4k-instruct` | 3.8B | Limited | Good |
| `google/gemma-2-9b-it` | 9B | Yes | Good |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | Yes | Better |
**Risk**: 14B model might still be routed to third-party providers. Need to test each.
### Option 2: Simplify Free Tier Architecture
Create a **simpler single-agent mode** for Free Tier:
- Remove multi-agent coordination (Manager, multiple ChatAgents)
- Use a single direct query → search → synthesize flow (see the sketch below)
- Reduce prompt complexity significantly
**Pros**: More reliable with smaller models
**Cons**: Loses sophisticated multi-agent research capability
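A minimal sketch of what Option 2's single-agent flow could look like, assuming the existing search tools can be called directly. The function names and signatures below are illustrative placeholders, not the project's actual API.

```python
# Illustrative sketch of a Free Tier single-agent flow; `chat`, `search_pubmed`,
# and `search_clinical_trials` are placeholders for whatever src/ actually exposes.
async def simple_research(query: str, chat, search_pubmed, search_clinical_trials) -> str:
    # 1. Search directly - no manager, no persona handoffs, no progress ledger.
    papers = await search_pubmed(query, max_results=10)
    trials = await search_clinical_trials(query, max_results=10)

    # 2. One short, concrete synthesis prompt instead of a multi-agent workflow.
    prompt = (
        "Summarize the evidence below for the question.\n"
        f"Question: {query}\n"
        f"Papers: {papers}\n"
        f"Trials: {trials}\n"
        "Answer in 3 short paragraphs with citations."
    )
    return await chat(prompt)
```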
### Option 3: Output Filtering/Validation
Add validation layer to detect and filter garbage output:
```python
def is_valid_streaming_token(text: str) -> bool:
"""Check if streaming token appears valid."""
# Garbage patterns we've seen
garbage_patterns = ["yarg", "PostalCodes", "FunctionFlags"]
if any(g in text for g in garbage_patterns):
return False
    # Require non-empty, non-whitespace content
    return len(text.strip()) > 0
```
**Pros**: Quick to implement
**Cons**: A band-aid - doesn't fix the root cause and will miss new garbage patterns
### Option 4: Graceful Degradation
Detect when model output is incoherent and fall back to one of the following (see the sketch after this list):
- Returning an error message
- Suggesting user provide an API key
- Using a cached/templated response
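A minimal sketch of Option 4, assuming a coherence check like Option 3's `is_valid_streaming_token` is available. The threshold and fallback wording are placeholders.

```python
# Illustrative fallback wrapper for Option 4; `is_valid_streaming_token` is the
# Option 3 helper, and the 0.3 threshold is an arbitrary placeholder.
FALLBACK_MESSAGE = (
    "The free-tier model could not produce a reliable answer for this query. "
    "Please add an API key (OpenAI/Anthropic) for full multi-agent research."
)

def degrade_gracefully(tokens: list[str], max_garbage_ratio: float = 0.3) -> str | None:
    """Return a fallback message if too many streamed tokens look like garbage."""
    if not tokens:
        return FALLBACK_MESSAGE
    garbage = sum(1 for t in tokens if not is_valid_streaming_token(t))
    if garbage / len(tokens) > max_garbage_ratio:
        return FALLBACK_MESSAGE
    return None  # output looks coherent enough; keep streaming normally
```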
### Option 5: Prompt Engineering for 7B Models
Significantly simplify the agent prompts for 7B compatibility (example below):
- Shorter system prompts
- More explicit step-by-step instructions
- Remove abstract concepts
- Use few-shot examples
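For illustration, a 7B-friendly searcher prompt might look something like this. The wording is a hypothetical example, not the current prompt in `magentic_agents.py`.

```python
# Hypothetical example of a shortened, explicit prompt for a 7B model;
# not the project's actual agent prompt.
SIMPLE_SEARCHER_PROMPT = """You are a medical research search assistant.

Follow these steps exactly:
1. Read the user's question.
2. Call the search_pubmed tool with a short keyword query.
3. Call the search_clinical_trials tool with the same query.
4. Reply with one sentence per result: title, year, and main finding.

Do not write anything else. Do not invent results.

Example:
Question: "Does drug X improve post-menopausal libido?"
Your reply: "1. Smith 2021 (RCT): drug X improved desire scores vs placebo. 2. ..."
"""
```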
### Option 6: Streaming Content Filter (For Symptom B)
Filter raw tool call JSON from streaming output:
```python
def should_stream_content(text: str) -> bool:
"""Filter garbage and raw tool calls from streaming."""
# Don't stream raw JSON tool calls
if text.strip().startswith('{"name":'):
return False
# Don't stream XML-style tool tags
if '</tool_call>' in text or '<tool_call>' in text:
return False
# Don't stream garbage tokens (extend as needed)
garbage = ["oleon", "UrlParser", "yarg", "PostalCodes", "FunctionFlags"]
if any(g in text for g in garbage):
return False
return True
```
**Location**: `src/orchestrators/advanced.py` lines 315-322
This would prevent the raw tool call JSON from being shown to users, even if the model produces it.
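If this option is adopted, the natural hook point is the same delta passthrough sketched earlier. Roughly, under the same assumed imports (a sketch, not an actual diff):

```python
# Sketch of wiring should_stream_content into the delta passthrough from
# advanced.py; reuses the assumed imports from the stream_agent_events sketch.
async def stream_agent_events_filtered(workflow, task):
    async for event in workflow.run_stream(task):
        if isinstance(event, MagenticAgentDeltaEvent) and event.text:
            if should_stream_content(event.text):
                yield AgentEvent(type="streaming", message=event.text)
            # else: drop raw tool-call JSON and known garbage tokens, or count
            # the dropped chunks to trigger the Option 4 fallback.
```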
---
## Recommended Action Plan
### Phase 1: Quick Fix (P2)
1. Test `mistralai/Mistral-7B-Instruct-v0.3` or `Qwen/Qwen2.5-14B-Instruct`
2. Verify they stay on HuggingFace native infrastructure (no third-party routing)
3. Evaluate output quality on sample queries (a quick smoke-test sketch follows)
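A quick way to smoke-test the candidates, assuming a recent `huggingface_hub` where `InferenceClient` accepts `provider="hf-inference"` (which pins requests to HuggingFace's own infrastructure rather than third-party providers):

```python
# Quick smoke test for Phase 1 candidates; assumes a recent huggingface_hub
# where InferenceClient accepts provider="hf-inference" (HF-native infra only).
from huggingface_hub import InferenceClient

CANDIDATES = [
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Qwen/Qwen2.5-14B-Instruct",
]

def smoke_test(model_id: str, token: str | None = None) -> None:
    client = InferenceClient(model=model_id, provider="hf-inference", token=token)
    out = client.chat_completion(
        messages=[{
            "role": "user",
            "content": "List two approved treatments for low libido in post-menopausal women.",
        }],
        max_tokens=200,
    )
    print(f"=== {model_id} ===")
    print(out.choices[0].message.content)

for model_id in CANDIDATES:
    smoke_test(model_id)
```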
### Phase 2: Architecture Review (P3)
1. Consider simplified single-agent mode for Free Tier
2. Design graceful degradation when model output is invalid
3. Add output validation layer
### Phase 3: Long-term (P4)
1. Consider hybrid approach: simple mode for free tier, advanced for paid
2. Explore fine-tuning a small model specifically for research agent tasks
---
## Files Involved
| File | Relevance |
|------|-----------|
| `src/orchestrators/advanced.py` | Main orchestrator, streaming event handling |
| `src/clients/huggingface.py` | HuggingFace chat client adapter |
| `src/agents/magentic_agents.py` | Agent definitions and prompts |
| `src/app.py` | Gradio UI, event display |
| `src/utils/config.py` | Model configuration |
---
## Relation to Previous Bugs
- **P0 Repr Bug (RESOLVED)**: Fixed in PR #117 - Was about `<generator object>` appearing due to async generator mishandling
- **P1 HuggingFace Novita Error (RESOLVED)**: Fixed in PR #118 - Was about 72B models being routed to failing third-party providers
This P2 bug is **downstream** of the P1 fix - we fixed the 500 errors by switching to 7B, but now the 7B model doesn't produce quality output.
---
## Questions to Investigate
1. What models in the 7-20B range stay on HuggingFace native infrastructure?
2. Can we detect third-party routing before making the full request?
3. Is the chat template correct for Qwen2.5-7B? (Some models need specific formatting)
4. Are there HuggingFace serverless models specifically optimized for tool calling?
---
## References
- `HF_FREE_TIER_ANALYSIS.md` - Analysis of HuggingFace provider routing
- `CLAUDE.md` - Critical HuggingFace Free Tier section
- Microsoft Agent Framework `_magentic.py` - Reference implementation