# P2 Bug: 7B Model Produces Garbage Streaming Output

**Date**: 2025-12-02
**Status**: OPEN - Investigating
**Severity**: P2 (Major - Degrades User Experience)
**Component**: Free Tier / HuggingFace + Multi-Agent Orchestration

---

## Symptoms

When running a research query on Free Tier (Qwen2.5-7B-Instruct), the streaming output shows **garbage tokens** and **malformed tool calls** instead of coherent agent reasoning.

### Symptom A: Random Garbage Tokens

```text
📡 **STREAMING**: yarg
📡 **STREAMING**: PostalCodes
📡 **STREAMING**: FunctionFlags
📡 **STREAMING**: system
📡 **STREAMING**: Transferred to searcher, adopt the persona immediately.
```

### Symptom B: Raw Tool Call JSON in Text (NEW - 2025-12-03)

```text
📡 **STREAMING**: oleon
{"name": "search_preprints", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
system
UrlParser
{"name": "search_clinical_trials", "arguments": {"query": "female libido post-menopause drug", "max_results": 10}}
```

The model is outputting:

1. **Garbage tokens**: "oleon", "UrlParser" - meaningless fragments
2. **Raw JSON tool calls**: `{"name": "search_preprints", ...}` - intended tool calls emitted as plain text
3. **XML-style tags**: `<tool_call>`-style wrappers (per Qwen's chat template) - the model trying to use the wrong tool calling format
4. **"system" keyword**: the model confusing role markers with content

**Root Cause of Symptom B**: The 7B model is attempting to make tool calls but emits them as **text content** instead of using the HuggingFace API's native `tool_calls` structure. The model may have been trained on a different tool calling format (XML-style tag-wrapped calls, like Claude's) and does not reliably produce the OpenAI-compatible JSON format.

For Symptom A, the model outputs random tokens like "yarg", "PostalCodes", "FunctionFlags" instead of actual research reasoning.

---

## Reproduction Steps

1. Go to HuggingFace Spaces: https://huggingface.co/spaces/vcms/deepboner
2. Leave the API key empty (Free Tier)
3. Click any example query or type a question
4. Click submit
5. Observe the streaming output - garbage tokens appear

**Expected**: Coherent agent reasoning like "Searching PubMed for female libido treatments..."
**Actual**: Random tokens like "yarg", "PostalCodes"

---

## Root Cause Analysis

### Primary Cause: 7B Model Too Small for Multi-Agent Prompts

The Qwen2.5-7B-Instruct model has **insufficient reasoning capacity** for the complex multi-agent framework. The system requires the model to:

1. **Adopt agent personas** with specialized instructions
2. **Follow structured workflows** (Search → Judge → Hypothesis → Report)
3. **Make tool calls** (search_pubmed, search_clinical_trials, etc.)
4. **Generate JSON-formatted progress ledgers** for workflow control
5. **Understand manager instructions** and delegate appropriately

A 7B-parameter model simply does not have the reasoning depth to handle all of this at once. Larger models (70B+) were originally intended, but those are routed to unreliable third-party providers (see `HF_FREE_TIER_ANALYSIS.md`).

### Technical Flow (Where Garbage Appears)

```
User Query
  ↓
AdvancedOrchestrator.run()                                [advanced.py:247]
  ↓
workflow.run_stream(task)                                 [builds Magentic workflow]
  ↓
MagenticAgentDeltaEvent emitted with event.text
  ↓
Yields AgentEvent(type="streaming", message=event.text)   [advanced.py:314-319]
  ↓
Gradio displays: "📡 **STREAMING**: {garbage}"
```

The garbage tokens are **raw model output**, passed through to the UI unfiltered (see the sketch below).
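For concreteness, a simplified sketch of the pass-through at `advanced.py:314-319`. The event types come from the flow above; the function scaffolding and import paths are assumptions for illustration, not the actual implementation:

```python
# Simplified sketch of the delta pass-through (scaffolding and import
# paths assumed; event names are from the actual flow above).
from agent_framework import MagenticAgentDeltaEvent  # import path assumed

from src.models import AgentEvent  # project event type, path assumed


async def run_stream_events(workflow, task: str):
    async for event in workflow.run_stream(task):
        if isinstance(event, MagenticAgentDeltaEvent):
            # event.text is the raw model delta. Nothing here validates
            # coherence, so 7B garbage reaches Gradio verbatim.
            yield AgentEvent(type="streaming", message=event.text)
```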
The 7B model is:

- Not following the system prompt
- Outputting partial/incomplete token sequences
- Possibly attempting tool calls but formatting them incorrectly
- Hallucinating random words

### Evidence from Microsoft Reference Framework

The Microsoft Agent Framework's `_magentic.py` (lines 1717-1741) shows how agent invocation works:

```python
async for update in agent.run_stream(messages=self._chat_history):
    updates.append(update)
    await self._emit_agent_delta_event(ctx, update)
```

The framework passes through whatever the underlying chat client produces. If the model emits garbage, the framework streams it directly.

### Why Example Click vs Submit Shows a Different Initial State

Both code paths go through the same `research_agent()` function in `app.py`. The difference:

- **Example click**: Immediately submits the query, so you see garbage quickly
- **Submit button click**: Shows the "Starting research (Advanced mode)" banner first, then the garbage

Both ultimately produce the same garbage output from the 7B model.

---

## Impact Assessment

| Aspect | Impact |
|--------|--------|
| Free Tier Users | Cannot get usable research results |
| Demo Quality | Appears broken/unprofessional |
| Trust | Users may think the entire system is broken |
| Differentiation | Undermines the "free tier works!" messaging |

---

## Potential Solutions

### Option 1: Switch to Better Small Model (Recommended - Quick Fix)

Find a small model that handles complex instructions better. Candidates:

| Model | Size | Tool Calling | Instruction Following |
|-------|------|--------------|----------------------|
| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | Yes | Better |
| `microsoft/Phi-3-mini-4k-instruct` | 3.8B | Limited | Good |
| `google/gemma-2-9b-it` | 9B | Yes | Good |
| `Qwen/Qwen2.5-14B-Instruct` | 14B | Yes | Better |

**Risk**: Even the 14B model might still be routed to third-party providers. Each candidate needs to be tested, e.g. with a pre-flight provider check like the sketch below.
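A hedged sketch of such a pre-flight check, querying the Hub for a model's provider mapping before switching Free Tier to it. The `expand[]=inferenceProviderMapping` parameter and the shape of the response are assumptions about the current Hub API and should be verified against the docs:

```python
# Hedged sketch: list which inference providers serve a model before
# committing to it. The `expand[]=inferenceProviderMapping` parameter
# and the dict-shaped response are assumed -- verify against HF docs.
import requests


def list_inference_providers(model_id: str) -> list[str]:
    """Return the provider names that currently serve `model_id`."""
    resp = requests.get(
        f"https://huggingface.co/api/models/{model_id}",
        params={"expand[]": "inferenceProviderMapping"},
        timeout=10,
    )
    resp.raise_for_status()
    mapping = resp.json().get("inferenceProviderMapping") or {}
    return sorted(mapping)  # assumed: dict keyed by provider name


# A candidate whose only providers are third parties is a routing risk.
for candidate in ("mistralai/Mistral-7B-Instruct-v0.3", "Qwen/Qwen2.5-14B-Instruct"):
    print(candidate, "->", list_inference_providers(candidate))
```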
### Option 2: Simplify Free Tier Architecture

Create a **simpler single-agent mode** for Free Tier:

- Remove multi-agent coordination (Manager, multiple ChatAgents)
- Use a single direct query → search → synthesize flow
- Reduce prompt complexity significantly

**Pros**: More reliable with smaller models
**Cons**: Loses the sophisticated multi-agent research capability

### Option 3: Output Filtering/Validation

Add a validation layer to detect and filter garbage output:

```python
def is_valid_streaming_token(text: str) -> bool:
    """Check whether a streaming token appears valid."""
    # Garbage patterns we've seen so far
    garbage_patterns = ["yarg", "PostalCodes", "FunctionFlags"]
    if any(g in text for g in garbage_patterns):
        return False
    # Minimal coherence check: non-empty after stripping whitespace
    return bool(text.strip())
```

**Pros**: Band-aid fix, quick to implement
**Cons**: Doesn't fix the root cause and will miss new garbage patterns

### Option 4: Graceful Degradation

Detect when model output is incoherent and fall back to:

- Returning an error message
- Suggesting the user provide an API key
- Using a cached/templated response

(See the sketch after Option 6 for how this composes with streaming filtering.)

### Option 5: Prompt Engineering for 7B Models

Significantly simplify the agent prompts for 7B compatibility:

- Shorter system prompts
- More explicit step-by-step instructions
- Remove abstract concepts
- Use few-shot examples

### Option 6: Streaming Content Filter (For Symptom B)

Filter raw tool call JSON from the streaming output:

```python
def should_stream_content(text: str) -> bool:
    """Filter garbage and raw tool calls from streaming output."""
    # Don't stream raw JSON tool calls
    if text.strip().startswith('{"name":'):
        return False
    # Don't stream XML-style tool tags
    if "<tool_call>" in text or "</tool_call>" in text:
        return False
    # Don't stream known garbage tokens (extend as needed)
    garbage = ["oleon", "UrlParser", "yarg", "PostalCodes", "FunctionFlags"]
    if any(g in text for g in garbage):
        return False
    return True
```

**Location**: `src/orchestrators/advanced.py` lines 315-322

This would prevent the raw tool call JSON from being shown to users, even if the model produces it.
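Options 4 and 6 compose naturally: the same filter that suppresses garbage deltas can drive the fallback decision. A hedged sketch, reusing `should_stream_content` from Option 6; the wrapper name, threshold, and fallback text are hypothetical, not existing project APIs:

```python
# Hedged sketch combining Options 4 and 6: suppress garbage deltas and,
# when most of the stream was rejected, degrade to a fallback message.
# All names are hypothetical; should_stream_content is the Option 6 filter.
from collections.abc import AsyncIterator

FALLBACK_MESSAGE = (
    "The Free Tier model could not produce a coherent answer. "
    "Please provide an API key to use a larger model."
)


async def degrade_gracefully(
    deltas: AsyncIterator[str], max_garbage_ratio: float = 0.5
) -> AsyncIterator[str]:
    total = dropped = 0
    async for text in deltas:
        total += 1
        if not should_stream_content(text):
            dropped += 1
            continue
        yield text
    # Mostly garbage? Surface a fallback rather than a silent, empty stream.
    if total and dropped / total > max_garbage_ratio:
        yield FALLBACK_MESSAGE
```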
## Recommended Action Plan

### Phase 1: Quick Fix (P2)

1. Test `mistralai/Mistral-7B-Instruct-v0.3` or `Qwen/Qwen2.5-14B-Instruct`
2. Verify they stay on HuggingFace native infrastructure (no third-party routing)
3. Evaluate output quality on sample queries

### Phase 2: Architecture Review (P3)

1. Consider a simplified single-agent mode for Free Tier
2. Design graceful degradation for when model output is invalid
3. Add an output validation layer

### Phase 3: Long-term (P4)

1. Consider a hybrid approach: simple mode for Free Tier, advanced mode for paid
2. Explore fine-tuning a small model specifically for research agent tasks

---

## Files Involved

| File | Relevance |
|------|-----------|
| `src/orchestrators/advanced.py` | Main orchestrator, streaming event handling |
| `src/clients/huggingface.py` | HuggingFace chat client adapter |
| `src/agents/magentic_agents.py` | Agent definitions and prompts |
| `src/app.py` | Gradio UI, event display |
| `src/utils/config.py` | Model configuration |

---

## Relation to Previous Bugs

- **P0 Repr Bug (RESOLVED)**: Fixed in PR #117 - raw `<async_generator object ...>` reprs appeared in output due to async generator mishandling
- **P1 HuggingFace Novita Error (RESOLVED)**: Fixed in PR #118 - 72B models were being routed to failing third-party providers

This P2 bug is **downstream** of the P1 fix: we fixed the 500 errors by switching to 7B, but the 7B model doesn't produce quality output.

---

## Questions to Investigate

1. Which models in the 7-20B range stay on HuggingFace native infrastructure?
2. Can we detect third-party routing before making the full request?
3. Is the chat template correct for Qwen2.5-7B? (Some models need specific formatting)
4. Are there HuggingFace serverless models specifically optimized for tool calling?

---

## References

- `HF_FREE_TIER_ANALYSIS.md` - Analysis of HuggingFace provider routing
- `CLAUDE.md` - Critical HuggingFace Free Tier section
- Microsoft Agent Framework `_magentic.py` - Reference implementation