# P2 Bug: 7B Model Produces Garbage Streaming Output
**Date**: 2025-12-02
**Status**: OPEN - Investigating
**Severity**: P2 (Major - Degrades User Experience)
**Component**: Free Tier / HuggingFace + Multi-Agent Orchestration
---
## Symptoms
When running a research query on Free Tier (Qwen2.5-7B-Instruct), the streaming output shows **garbage tokens** and **malformed tool calls** instead of coherent agent reasoning:
### Symptom A: Random Garbage Tokens
```text
📡 **STREAMING**: yarg
📡 **STREAMING**: PostalCodes
📡 **STREAMING**: FunctionFlags
📡 **STREAMING**: system
📡 **STREAMING**: Transferred to searcher, adopt the persona immediately.
```
### Symptom B: Raw Tool Call JSON in Text (NEW - 2025-12-03)
```text
📡 **STREAMING**:
oleon
{"name": "search_preprints", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
</tool_call>
system
UrlParser
{"name": "search_clinical_trials", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
```
The model is outputting:
1. **Garbage tokens**: "oleon", "UrlParser" - meaningless fragments
2. **Raw JSON tool calls**: `{"name": "search_preprints", ...}` - intended tool calls output as TEXT
3. **XML-style tags**: `</tool_call>` - model trying to use wrong tool calling format
4. **"system" keyword**: Model confusing role markers with content
**Root Cause of Symptom B**: The 7B model is attempting to make tool calls but outputting them as **text content** instead of using the HuggingFace API's native `tool_calls` structure. The model may have been trained on a different tool calling format (XML-style like Claude's `<tool_call>` tags) and doesn't properly use the OpenAI-compatible JSON format.
Across both symptoms the net effect is the same: instead of actual research reasoning, the stream shows random tokens such as "yarg", "PostalCodes", and "FunctionFlags".
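One mitigation for Symptom B would be to detect these text-embedded tool calls in the chat client adapter and re-emit them as structured calls, or at least keep them out of the stream. Below is a minimal sketch, assuming the streamed chunks arrive as plain strings; the helper name and regex are hypothetical, and nested objects inside `arguments` are not handled.

```python
import json
import re

# Hypothetical helper: detect a tool call the model emitted as plain text
# instead of a native tool_calls entry. Minimal sketch only; a real fix would
# live in the chat client adapter (src/clients/huggingface.py).
# Note: does not handle nested objects inside "arguments".
_INLINE_TOOL_CALL = re.compile(
    r'\{\s*"name"\s*:\s*"[^"]+"\s*,\s*"arguments"\s*:\s*\{.*?\}\s*\}', re.DOTALL
)

def extract_inline_tool_call(text: str) -> dict | None:
    """Return {"name": ..., "arguments": ...} if text contains a raw tool call."""
    match = _INLINE_TOOL_CALL.search(text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return call if "name" in call and "arguments" in call else None

# Using the captured chunk from Symptom B:
chunk = '{"name": "search_preprints", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}'
print(extract_inline_tool_call(chunk))  # -> {'name': 'search_preprints', 'arguments': {...}}
```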
---
## Reproduction Steps
1. Go to HuggingFace Spaces: https://huggingface.co/spaces/vcms/deepboner
2. Leave API key empty (Free Tier)
3. Click any example query or type a question
4. Click submit
5. Observe streaming output - garbage tokens appear
**Expected**: Coherent agent reasoning like "Searching PubMed for female libido treatments..."
**Actual**: Random tokens like "yarg", "PostalCodes"
---
## Root Cause Analysis
### Primary Cause: 7B Model Too Small for Multi-Agent Prompts
The Qwen2.5-7B-Instruct model has **insufficient reasoning capacity** for the complex multi-agent framework. The system requires the model to:
1. **Adopt agent personas** with specialized instructions
2. **Follow structured workflows** (Search → Judge → Hypothesis → Report)
3. **Make tool calls** (search_pubmed, search_clinical_trials, etc.)
4. **Generate JSON-formatted progress ledgers** for workflow control
5. **Understand manager instructions** and delegate appropriately
A 7B parameter model simply does not have the reasoning depth to handle this. Larger models (70B+) were originally intended, but those are routed to unreliable third-party providers (see `HF_FREE_TIER_ANALYSIS.md`).
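To make item 4 above concrete, the orchestration expects the manager model to return structured JSON on every turn, roughly of the following shape. This is an illustrative approximation only; the framework's actual schema may differ.

```python
# Illustrative approximation of a Magentic-style progress ledger; field names
# are examples, not the framework's exact schema.
example_progress_ledger = {
    "is_request_satisfied": {"answer": False, "reason": "No sources gathered yet"},
    "is_progress_being_made": {"answer": True, "reason": "Search query formulated"},
    "next_speaker": {"answer": "searcher", "reason": "Literature search comes first"},
    "instruction_or_question": {
        "answer": "Search PubMed and ClinicalTrials.gov for post-menopausal libido treatments"
    },
}
# The 7B model must produce JSON like this *and* follow persona instructions
# *and* emit well-formed tool calls - the point where its output degrades.
```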
### Technical Flow (Where Garbage Appears)
```
User Query
↓
AdvancedOrchestrator.run() [advanced.py:247]
↓
workflow.run_stream(task) [builds Magentic workflow]
↓
MagenticAgentDeltaEvent emitted with event.text
↓
Yields AgentEvent(type="streaming", message=event.text) [advanced.py:314-319]
↓
Gradio displays: "📡 **STREAMING**: {garbage}"
```
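In code, the passthrough this diagram describes looks roughly like the following. This is a simplified sketch of the handler around `advanced.py:314-319`, not the actual source; the import paths are assumptions.

```python
# Simplified sketch of the passthrough around src/orchestrators/advanced.py:314-319.
# The import paths below are assumptions; the real module layout may differ.
from agent_framework import MagenticAgentDeltaEvent  # assumed import path
from src.models import AgentEvent                    # assumed import path

async def stream_agent_events(workflow, task):
    """Forward model deltas to the UI exactly as the framework emits them."""
    async for event in workflow.run_stream(task):
        if isinstance(event, MagenticAgentDeltaEvent) and event.text:
            # event.text is passed through unfiltered, so garbage tokens and
            # raw tool-call JSON from the 7B model reach Gradio verbatim.
            yield AgentEvent(type="streaming", message=event.text)
```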
The garbage tokens are **raw model output**. The 7B model is:
- Not following the system prompt
- Outputting partial/incomplete token sequences
- Possibly attempting tool calls but formatting incorrectly
- Hallucinating random words
### Evidence from Microsoft Reference Framework
The Microsoft Agent Framework's `_magentic.py` (lines 1717-1741) shows how agent invocation works:
```python
async for update in agent.run_stream(messages=self._chat_history):
updates.append(update)
await self._emit_agent_delta_event(ctx, update)
```
The framework passes through whatever the underlying chat client produces. If the model produces garbage, the framework streams it directly.
### Why Clicking an Example vs. Clicking Submit Shows a Different Initial State
Both code paths go through the same `research_agent()` function in `app.py`. The difference:
- **Example click**: Immediately submits query, so you see garbage quickly
- **Submit button click**: Shows "Starting research (Advanced mode)" banner first, then garbage
Both ultimately produce the same garbage output from the 7B model.
---
## Impact Assessment
| Aspect | Impact |
|--------|--------|
| Free Tier Users | Cannot get usable research results |
| Demo Quality | Appears broken/unprofessional |
| Trust | Users may think the entire system is broken |
| Differentiation | Undermines "free tier works!" messaging |
---
## Potential Solutions
### Option 1: Switch to Better Small Model (Recommended - Quick Fix)
Find a small model that better handles complex instructions. Candidates:
| Model | Size | Tool Calling | Instruction Following |
|-------|------|--------------|----------------------|
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | Yes | Better |
| `microsoft/Phi-3-mini-4k-instruct` | 3.8B | Limited | Good |
| `google/gemma-2-9b-it` | 9B | Yes | Good |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | Yes | Better |
**Risk**: 14B model might still be routed to third-party providers. Need to test each.
### Option 2: Simplify Free Tier Architecture
Create a **simpler single-agent mode** for Free Tier:
- Remove multi-agent coordination (Manager, multiple ChatAgents)
- Use a single direct query → search → synthesize flow (see the sketch below)
- Reduce prompt complexity significantly
**Pros**: More reliable with smaller models
**Cons**: Loses sophisticated multi-agent research capability
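A minimal sketch of what Option 2's single-agent flow could look like, assuming the existing search tools can be called directly. The function names and signatures below are illustrative placeholders, not the project's actual API.

```python
# Illustrative sketch of a Free Tier single-agent flow; `chat`, `search_pubmed`,
# and `search_clinical_trials` are placeholders for whatever src/ actually exposes.
async def simple_research(query: str, chat, search_pubmed, search_clinical_trials) -> str:
    # 1. Search directly - no manager, no persona handoffs, no progress ledger.
    papers = await search_pubmed(query, max_results=10)
    trials = await search_clinical_trials(query, max_results=10)

    # 2. One short, concrete synthesis prompt instead of a multi-agent workflow.
    prompt = (
        "Summarize the evidence below for the question.\n"
        f"Question: {query}\n"
        f"Papers: {papers}\n"
        f"Trials: {trials}\n"
        "Answer in 3 short paragraphs with citations."
    )
    return await chat(prompt)
```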
### Option 3: Output Filtering/Validation
Add validation layer to detect and filter garbage output:
```python
def is_valid_streaming_token(text: str) -> bool:
"""Check if streaming token appears valid."""
# Garbage patterns we've seen
garbage_patterns = ["yarg", "PostalCodes", "FunctionFlags"]
if any(g in text for g in garbage_patterns):
return False
    # Require non-empty, non-whitespace content
    return len(text.strip()) > 0
```
**Pros**: Quick to implement
**Cons**: A band-aid - doesn't fix the root cause and will miss new garbage patterns
### Option 4: Graceful Degradation
Detect when model output is incoherent and fall back to one of the following (see the sketch after this list):
- Returning an error message
- Suggesting user provide an API key
- Using a cached/templated response
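A minimal sketch of Option 4, assuming a coherence check like Option 3's `is_valid_streaming_token` is available. The threshold and fallback wording are placeholders.

```python
# Illustrative fallback wrapper for Option 4; `is_valid_streaming_token` is the
# Option 3 helper, and the 0.3 threshold is an arbitrary placeholder.
FALLBACK_MESSAGE = (
    "The free-tier model could not produce a reliable answer for this query. "
    "Please add an API key (OpenAI/Anthropic) for full multi-agent research."
)

def degrade_gracefully(tokens: list[str], max_garbage_ratio: float = 0.3) -> str | None:
    """Return a fallback message if too many streamed tokens look like garbage."""
    if not tokens:
        return FALLBACK_MESSAGE
    garbage = sum(1 for t in tokens if not is_valid_streaming_token(t))
    if garbage / len(tokens) > max_garbage_ratio:
        return FALLBACK_MESSAGE
    return None  # output looks coherent enough; keep streaming normally
```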
### Option 5: Prompt Engineering for 7B Models
Significantly simplify the agent prompts for 7B compatibility (example below):
- Shorter system prompts
- More explicit step-by-step instructions
- Remove abstract concepts
- Use few-shot examples
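For illustration, a 7B-friendly searcher prompt might look something like this. The wording is a hypothetical example, not the current prompt in `magentic_agents.py`.

```python
# Hypothetical example of a shortened, explicit prompt for a 7B model;
# not the project's actual agent prompt.
SIMPLE_SEARCHER_PROMPT = """You are a medical research search assistant.

Follow these steps exactly:
1. Read the user's question.
2. Call the search_pubmed tool with a short keyword query.
3. Call the search_clinical_trials tool with the same query.
4. Reply with one sentence per result: title, year, and main finding.

Do not write anything else. Do not invent results.

Example:
Question: "Does drug X improve post-menopausal libido?"
Your reply: "1. Smith 2021 (RCT): drug X improved desire scores vs placebo. 2. ..."
"""
```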
### Option 6: Streaming Content Filter (For Symptom B)
Filter raw tool call JSON from streaming output:
```python
def should_stream_content(text: str) -> bool:
"""Filter garbage and raw tool calls from streaming."""
# Don't stream raw JSON tool calls
if text.strip().startswith('{"name":'):
return False
# Don't stream XML-style tool tags
if '</tool_call>' in text or '<tool_call>' in text:
return False
# Don't stream garbage tokens (extend as needed)
garbage = ["oleon", "UrlParser", "yarg", "PostalCodes", "FunctionFlags"]
if any(g in text for g in garbage):
return False
return True
```
**Location**: `src/orchestrators/advanced.py` lines 315-322
This would prevent the raw tool call JSON from being shown to users, even if the model produces it.
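If this option is adopted, the natural hook point is the same delta passthrough sketched earlier. Roughly, under the same assumed imports (a sketch, not an actual diff):

```python
# Sketch of wiring should_stream_content into the delta passthrough from
# advanced.py; reuses the assumed imports from the stream_agent_events sketch.
async def stream_agent_events_filtered(workflow, task):
    async for event in workflow.run_stream(task):
        if isinstance(event, MagenticAgentDeltaEvent) and event.text:
            if should_stream_content(event.text):
                yield AgentEvent(type="streaming", message=event.text)
            # else: drop raw tool-call JSON and known garbage tokens, or count
            # the dropped chunks to trigger the Option 4 fallback.
```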
---
## Recommended Action Plan
### Phase 1: Quick Fix (P2)
1. Test `mistralai/Mistral-7B-Instruct-v0.3` or `Qwen/Qwen2.5-14B-Instruct`
2. Verify they stay on HuggingFace native infrastructure (no third-party routing)
3. Evaluate output quality on sample queries (a quick smoke-test sketch follows)
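A quick way to smoke-test the candidates, assuming a recent `huggingface_hub` where `InferenceClient` accepts `provider="hf-inference"` (which pins requests to HuggingFace's own infrastructure rather than third-party providers):

```python
# Quick smoke test for Phase 1 candidates; assumes a recent huggingface_hub
# where InferenceClient accepts provider="hf-inference" (HF-native infra only).
from huggingface_hub import InferenceClient

CANDIDATES = [
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Qwen/Qwen2.5-14B-Instruct",
]

def smoke_test(model_id: str, token: str | None = None) -> None:
    client = InferenceClient(model=model_id, provider="hf-inference", token=token)
    out = client.chat_completion(
        messages=[{
            "role": "user",
            "content": "List two approved treatments for low libido in post-menopausal women.",
        }],
        max_tokens=200,
    )
    print(f"=== {model_id} ===")
    print(out.choices[0].message.content)

for model_id in CANDIDATES:
    smoke_test(model_id)
```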
### Phase 2: Architecture Review (P3)
1. Consider simplified single-agent mode for Free Tier
2. Design graceful degradation when model output is invalid
3. Add output validation layer
### Phase 3: Long-term (P4)
1. Consider hybrid approach: simple mode for free tier, advanced for paid
2. Explore fine-tuning a small model specifically for research agent tasks
---
## Files Involved
| File | Relevance |
|------|-----------|
| `src/orchestrators/advanced.py` | Main orchestrator, streaming event handling |
| `src/clients/huggingface.py` | HuggingFace chat client adapter |
| `src/agents/magentic_agents.py` | Agent definitions and prompts |
| `src/app.py` | Gradio UI, event display |
| `src/utils/config.py` | Model configuration |
---
## Relation to Previous Bugs
- **P0 Repr Bug (RESOLVED)**: Fixed in PR #117 - Was about `<generator object>` appearing due to async generator mishandling
- **P1 HuggingFace Novita Error (RESOLVED)**: Fixed in PR #118 - Was about 72B models being routed to failing third-party providers
This P2 bug is **downstream** of the P1 fix - we fixed the 500 errors by switching to 7B, but now the 7B model doesn't produce quality output.
---
## Questions to Investigate
1. What models in the 7-20B range stay on HuggingFace native infrastructure?
2. Can we detect third-party routing before making the full request?
3. Is the chat template correct for Qwen2.5-7B? (Some models need specific formatting)
4. Are there HuggingFace serverless models specifically optimized for tool calling?
---
## References
- `HF_FREE_TIER_ANALYSIS.md` - Analysis of HuggingFace provider routing
- `CLAUDE.md` - Critical HuggingFace Free Tier section
- Microsoft Agent Framework `_magentic.py` - Reference implementation