VibecoderMcSwaggins committed on
Commit
21dd8fe
·
unverified ·
1 Parent(s): 4d7d84f

fix(p2): Resolve duplicate report content and first-turn timeout bugs (#123)


## Summary

Fixes two P2 bugs + addresses CodeRabbit review feedback:

**P2 Bug Fixes:**
- **Duplicate Report Content**: Stateful deduplication in `run()` loop via `_handle_final_event()`
- **First Turn Timeout**: Reduced `max_results_per_tool` 10→5, increased `advanced_timeout` 300→600s

**CodeRabbit Fixes:**
- Added missing `@patch` decorator + return type to timeout test
- Removed unreachable dead code from `_process_event()`
- Created P3 tech debt doc for Modal integration removal

All 318 tests pass.

docs/ARCHITECTURE.md CHANGED
@@ -1,6 +1,6 @@
1
  # DeepBoner Architecture
2
 
3
- > **Last Updated**: 2025-12-01
4
 
5
  ---
6
 
@@ -15,7 +15,7 @@
15
  │ NO (Free Tier) YES (Paid Tier) │
16
  │ ────────────── ─────────────── │
17
  │ HuggingFace backend OpenAI backend │
18
- │ Qwen 2.5 72B (free) GPT-5 (paid) │
19
  │ │
20
  │ SAME orchestration logic for both │
21
  │ ONE codebase, different LLM backends │
@@ -28,12 +28,21 @@
28
 
29
  ## Current Status
30
 
31
- **Free Tier is BLOCKED** by upstream bug #2562.
32
 
33
- Once [PR #2566](https://github.com/microsoft/agent-framework/pull/2566) merges:
34
- 1. Update `agent-framework` dependency
35
- 2. Free tier works
36
- 3. Done
37
 
38
  ---
39
 
@@ -65,7 +74,7 @@ def get_chat_client():
65
  | Condition | Backend | Model |
66
  |-----------|---------|-------|
67
  | User provides OpenAI key | OpenAI | GPT-5 |
68
- | No API key provided | HuggingFace | Qwen 2.5 72B (free) |
69
 
70
  ---
71
 
@@ -87,16 +96,6 @@ def get_chat_client():
87
 
88
  ---
89
 
90
- ## Upstream Blocker
91
-
92
- **Bug:** Microsoft Agent Framework produces `repr()` garbage for tool-call-only messages.
93
-
94
- **Fix:** [PR #2566](https://github.com/microsoft/agent-framework/pull/2566) - waiting to merge.
95
-
96
- **Tracking:** [Issue #2562](https://github.com/microsoft/agent-framework/issues/2562)
97
-
98
- ---
99
-
100
  ## References
101
 
102
  - [Pydantic AI](https://ai.pydantic.dev/) - Structured outputs framework
 
1
  # DeepBoner Architecture
2
 
3
+ > **Last Updated**: 2025-12-03
4
 
5
  ---
6
 
 
15
  │ NO (Free Tier) YES (Paid Tier) │
16
  │ ────────────── ─────────────── │
17
  │ HuggingFace backend OpenAI backend │
18
+ │ Qwen 2.5 7B (free) GPT-5 (paid) │
19
  │ │
20
  │ SAME orchestration logic for both │
21
  │ ONE codebase, different LLM backends │
 
28
 
29
  ## Current Status
30
 
31
+ **Both tiers are WORKING** as of December 2025.
32
 
33
+ - **Free Tier**: Uses Accumulator Pattern to bypass upstream repr bug
34
+ - **Paid Tier**: Full OpenAI GPT-5 integration
35
+
36
+ ---
37
+
38
+ ## Key Fixes Applied
39
+
40
+ | Issue | Fix | PR |
41
+ |-------|-----|-----|
42
+ | Tool execution failure | Removed premature `__function_invoking_chat_client__` marker | fix/P1-free-tier |
43
+ | Repr garbage in output | Accumulator Pattern bypasses upstream bug | PR #117 |
44
+ | 72B model routing failures | Switched to 7B (native HF infra) | PR #118 |
45
+ | Evidence deduplication | Cross-source dedup by PMID/DOI | PR #122 |
46
 
47
  ---
48
 
 
74
  | Condition | Backend | Model |
75
  |-----------|---------|-------|
76
  | User provides OpenAI key | OpenAI | GPT-5 |
77
+ | No API key provided | HuggingFace | Qwen 2.5 7B (free) |
78
 
79
  ---
80
 
 
96
 
97
  ---
98
 
99
  ## References
100
 
101
  - [Pydantic AI](https://ai.pydantic.dev/) - Structured outputs framework
docs/architecture/HF_FREE_TIER_ANALYSIS.md CHANGED
@@ -64,5 +64,50 @@ For the Unified Chat Client architecture:
64
  1. **Tier 0 (Free):** Hardcoded to Native Models (Qwen 7B, Mistral Nemo).
65
  2. **Tier 1 (BYO Key):** Allow user to select any model (70B+), assuming they provide a key that grants access to premium providers or PRO tier.
66
 
67
  ---
68
  *Analysis performed by Gemini CLI Agent, Dec 2, 2025*
 
 
64
  1. **Tier 0 (Free):** Hardcoded to Native Models (Qwen 7B, Mistral Nemo).
65
  2. **Tier 1 (BYO Key):** Allow user to select any model (70B+), assuming they provide a key that grants access to premium providers or PRO tier.
66
 
67
+ ---
68
+
69
+ ## 5. Known Content Quality Limitations (7B Models)
70
+
71
+ **Status**: As of December 2025, the Free Tier (Qwen 2.5 7B) produces **working multi-agent orchestration** but with notable content quality limitations.
72
+
73
+ ### What Works Well
74
+ - Multi-agent coordination (Manager → Search → Hypothesis → Report)
75
+ - Clean streaming output (no garbage tokens, no raw JSON)
76
+ - Proper agent handoffs and progress tracking
77
+ - Coherent narrative structure
78
+
79
+ ### Known Limitations
80
+
81
+ | Issue | Description | Severity |
82
+ |-------|-------------|----------|
83
+ | **Hallucinated Citations** | Model generates plausible-sounding but fake paper titles/authors instead of using actual search results | Medium |
84
+ | **Anatomical Confusion** | May apply male anatomy (e.g., "penile rigidity") to female health queries | High |
85
+ | **Nonsensical Medical Claims** | May generate claims like "prostate cancer risk" in context of female patients | High |
86
+ | **Duplicate Content** | Final reports sometimes contain repeated sections | Low |
87
+
88
+ ### Why This Happens
89
+
90
+ 7B parameter models have limited:
91
+ - **World knowledge**: Can't reliably recall specific paper titles/authors
92
+ - **Context grounding**: May ignore search results and hallucinate instead
93
+ - **Domain reasoning**: Complex medical topics exceed reasoning capacity
94
+
95
+ ### User Guidance
96
+
97
+ **Free Tier is best for:**
98
+ - Understanding the research workflow
99
+ - Getting general topic overviews
100
+ - Testing the system before committing to paid tier
101
+
102
+ **For accurate medical research:**
103
+ - Use Paid Tier (GPT-5) for citation accuracy
104
+ - Always verify citations against actual databases
105
+ - Treat Free Tier output as "draft quality"
106
+
107
+ ### Not a Stack Bug
108
+
109
+ These are **model capability limitations**, not bugs in the DeepBoner architecture. The orchestration, streaming, and agent coordination are working correctly.
110
+
111
  ---
112
  *Analysis performed by Gemini CLI Agent, Dec 2, 2025*
113
+ *Content quality section added Dec 3, 2025*
docs/bugs/ACTIVE_BUGS.md CHANGED
@@ -1,6 +1,6 @@
1
  # Active Bugs
2
 
3
- > Last updated: 2025-12-03
4
  >
5
  > **Note:** Completed bug docs archived to `docs/bugs/archive/`
6
  > **See also:** [ARCHITECTURE.md](../ARCHITECTURE.md) for unified architecture plan
@@ -33,6 +33,17 @@
33
 
34
  ---
35
 
36
  ## Resolved Bugs (December 2025)
37
 
38
  All resolved bugs have been moved to `docs/bugs/archive/`. Summary:
@@ -55,6 +66,8 @@ All resolved bugs have been moved to `docs/bugs/archive/`. Summary:
55
  - **P1 Simple Mode Removed Breaks Free Tier UX** - FIXED via Accumulator Pattern (PR #117)
56
 
57
  ### P2 Bugs (All FIXED)
 
 
58
  - **P2 7B Model Garbage Output** - SUPERSEDED by P1 Free Tier fix (root cause was premature marker, not model capacity)
59
  - **P2 Advanced Mode Cold Start No Feedback** - FIXED, all phases complete
60
  - **P2 Architectural BYOK Gaps** - FIXED, end-to-end BYOK support in PR #119
 
1
  # Active Bugs
2
 
3
+ > Last updated: 2025-12-04
4
  >
5
  > **Note:** Completed bug docs archived to `docs/bugs/archive/`
6
  > **See also:** [ARCHITECTURE.md](../ARCHITECTURE.md) for unified architecture plan
 
33
 
34
  ---
35
 
36
+ ### P3 - Remove Modal Integration
37
+
38
+ **File:** `docs/future-roadmap/P3_MODAL_INTEGRATION_REMOVAL.md`
39
+ **Status:** OPEN - Tech Debt
40
+
41
+ **Problem:** Modal (cloud functions) is integrated in 9 files but was decided against for this project. Creates dead code paths and confusion.
42
+
43
+ **Fix:** Remove all Modal references from codebase (config, services, agents, tools).
44
+
45
+ ---
46
+
47
  ## Resolved Bugs (December 2025)
48
 
49
  All resolved bugs have been moved to `docs/bugs/archive/`. Summary:
 
66
  - **P1 Simple Mode Removed Breaks Free Tier UX** - FIXED via Accumulator Pattern (PR #117)
67
 
68
  ### P2 Bugs (All FIXED)
69
+ - **P2 Duplicate Report Content** - FIXED in PR fix/p2-double-bug-squash, stateful deduplication in `run()` loop
70
+ - **P2 First Turn Timeout** - FIXED in PR fix/p2-double-bug-squash, reduced results per tool (10→5), increased timeout (5→10 min)
71
  - **P2 7B Model Garbage Output** - SUPERSEDED by P1 Free Tier fix (root cause was premature marker, not model capacity)
72
  - **P2 Advanced Mode Cold Start No Feedback** - FIXED, all phases complete
73
  - **P2 Architectural BYOK Gaps** - FIXED, end-to-end BYOK support in PR #119
docs/bugs/P2_DUPLICATE_REPORT_CONTENT.md ADDED
@@ -0,0 +1,151 @@
1
+ # P2 Bug: Duplicate Report Content in Output
2
+
3
+ **Date**: 2025-12-03
4
+ **Status**: FIXED (PR fix/p2-double-bug-squash)
5
+ **Severity**: P2 (UX - Duplicate content confuses users)
6
+ **Component**: `src/orchestrators/advanced.py`
7
+ **Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
8
+
9
+ ---
10
+
11
+ ## Executive Summary
12
+
13
+ This is a **confirmed stack bug**, NOT a model limitation. The duplicate report appears because:
14
+
15
+ 1. Streaming events yield the full report content character-by-character
16
+ 2. Final events (`MagenticFinalResultEvent`/`WorkflowOutputEvent`) contain the SAME content
17
+ 3. No deduplication exists between streamed content and final event content
18
+ 4. Both are appended to the output
19
+
20
+ ---
21
+
22
+ ## Symptom
23
+
24
+ The final research report appears **twice** in the UI output:
25
+ 1. First as streaming content (with `📡 **STREAMING**:` prefix)
26
+ 2. Then again as a complete event (without prefix)
27
+
28
+ ---
29
+
30
+ ## Root Cause
31
+
32
+ The `_process_event()` method handles final events but has **no access to buffer state**. The buffer was already cleared at line 337 before these events arrive.
33
+
34
+ ```python
35
+ # Line 337: Buffer cleared
36
+ current_message_buffer = ""
37
+ continue
38
+
39
+ # Line 341: Final events processed WITHOUT buffer context
40
+ agent_event = self._process_event(event, iteration) # No buffer info!
41
+ ```
42
+
43
+ ---
44
+
45
+ ## The Fix (Consensus: Stateful Orchestrator Logic)
46
+
47
+ **Location**: `src/orchestrators/advanced.py` `run()` method
48
+
49
+ **Strategy**: Handle final events **inline in the run() loop** where buffer state exists. Track streaming volume to decide whether to re-emit content.
50
+
51
+ ### Why This Is Correct
52
+
53
+ | Rejected Approach | Why Wrong |
54
+ |-------------------|-----------|
55
+ | UI-side string comparison | Wrong layer, fragile, treats symptom |
56
+ | Stateless `_process_event` fix | No state = can't know if streaming occurred |
57
+ | **Stateful run() loop** | ✅ Only place with full lifecycle visibility |
58
+
59
+ The `run()` loop is the **single source of truth** for the request lifecycle. It "saw" the content stream out. It must decide whether to re-emit.
60
+
61
+ ### Implementation
62
+
63
+ ```python
64
+ # In run() method, add tracking variable after line 302:
65
+ last_streamed_length: int = 0
66
+
67
+ # Before clearing buffer at line 337, save its length:
68
+ last_streamed_length = len(current_message_buffer)
69
+ current_message_buffer = ""
70
+ continue
71
+
72
+ # Replace lines 340-345 with inline handling of final events:
73
+ if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
74
+ final_event_received = True
75
+
76
+ # DECISION: Did we stream substantial content?
77
+ if last_streamed_length > 100:
78
+ # YES: Final event is a SIGNAL, not a payload
79
+ yield AgentEvent(
80
+ type="complete",
81
+ message="Research complete.",
82
+ data={"iterations": iteration, "streamed_chars": last_streamed_length},
83
+ iteration=iteration,
84
+ )
85
+ else:
86
+ # NO: Final event must carry the payload (tool-only turn, cache hit)
87
+ if isinstance(event, MagenticFinalResultEvent):
88
+ text = self._extract_text(event.message) if event.message else "No result"
89
+ else: # WorkflowOutputEvent
90
+ text = self._extract_text(event.data) if event.data else "Research complete"
91
+ yield AgentEvent(
92
+ type="complete",
93
+ message=text,
94
+ data={"iterations": iteration},
95
+ iteration=iteration,
96
+ )
97
+ continue
98
+
99
+ # Keep existing fallback for other events:
100
+ agent_event = self._process_event(event, iteration)
101
+ ```
102
+
103
+ ### Why Threshold of 100 Chars?
104
+
105
+ - `> 0` is too aggressive (might catch single-word streams)
106
+ - `> 500` is too conservative (might miss short but complete responses)
107
+ - `> 100` distinguishes "real content was streamed" from "just status messages"
108
+
109
+ ---
110
+
111
+ ## Edge Cases Handled
112
+
113
+ | Scenario | `last_streamed_length` | Action |
114
+ |----------|------------------------|--------|
115
+ | Normal streaming report | 5000+ | Emit "Research complete." |
116
+ | Tool call, no text | 0 | Emit full content from final event |
117
+ | Very short response | 50 | Emit full content (fallback) |
118
+ | Agent switch mid-stream | Reset on switch | Tracks only final agent |
119
+
120
+ ---
121
+
122
+ ## Files to Modify
123
+
124
+ | File | Lines | Change |
125
+ |------|-------|--------|
126
+ | `src/orchestrators/advanced.py` | 296-345 | Add `last_streamed_length`, handle final events inline |
127
+ | `src/orchestrators/advanced.py` | 532-552 | Optional: remove dead code from `_process_event()` |
128
+
129
+ ---
130
+
131
+ ## Test Plan
132
+
133
+ 1. **Happy Path**: Run query, verify report appears ONCE
134
+ 2. **Fallback**: Mock tool-only turn (no streaming), verify full content emitted
135
+ 3. **Both Tiers**: Test Free Tier and Paid Tier
136
+
137
+ ---
138
+
139
+ ## Validation
140
+
141
+ This fix was independently validated by two AI agents (Claude and Gemini) analyzing the architecture. Both concluded:
142
+
143
+ > "The Stateful Orchestrator Fix is the correct engineering solution. The 'Source of Truth' is the Orchestrator's runtime state."
144
+
145
+ ---
146
+
147
+ ## Related
148
+
149
+ - **Not related to model quality** - This is a stack bug
150
+ - P1 Free Tier fix enabled streaming, exposing this bug
151
+ - SPEC-17 Accumulator Pattern addressed repr bug but created this side effect
docs/bugs/P2_FIRST_TURN_TIMEOUT.md ADDED
@@ -0,0 +1,160 @@
1
+ # P2 Bug: First Agent Turn Exceeds Workflow Timeout
2
+
3
+ **Date**: 2025-12-03
4
+ **Status**: FIXED (PR fix/p2-double-bug-squash)
5
+ **Severity**: P2 (UX - Workflow always times out on complex queries)
6
+ **Component**: `src/orchestrators/advanced.py` + `src/agents/search_agent.py`
7
+ **Affects**: Both Free Tier (HuggingFace) AND Paid Tier (OpenAI)
8
+
9
+ ---
10
+
11
+ ## Executive Summary
12
+
13
+ The search agent's first turn can exceed the 5-minute workflow timeout, causing:
14
+ 1. `iterations=0` at timeout (no agent completed a turn)
15
+ 2. `_handle_timeout()` synthesizes from partial evidence
16
+ 3. Users get incomplete research results
17
+
18
+ This is a **performance/architecture bug**, not a model issue.
19
+
20
+ ---
21
+
22
+ ## Symptom
23
+
24
+ ```
25
+ [warning] Workflow timed out iterations=0
26
+ ```
27
+
28
+ The workflow times out with `iterations=0` - meaning the first agent (search agent) never completed its turn before the 5-minute timeout.
29
+
30
+ ---
31
+
32
+ ## Root Cause
33
+
34
+ The search agent's first turn is **extremely expensive**:
35
+
36
+ ```
37
+ Search Agent First Turn:
38
+ ├── Manager assigns task
39
+ ├── Search agent starts
40
+ │   ├── Calls PubMed search tool (10 results)
41
+ │   ├── Calls ClinicalTrials search tool (10 results)
42
+ │   ├── Calls EuropePMC search tool (10 results)
43
+ │   └── For EACH result (30 total):
44
+ │       ├── Generate embedding (OpenAI API call)
45
+ │       ├── Check for duplicates (ChromaDB query)
46
+ │       └── Store in ChromaDB
47
+ │
48
+ │   TOTAL: 30 results × (embedding + dedup + store) = 90+ API/DB operations
49
+ │
50
+ └── Agent turn completes (if timeout hasn't fired)
51
+ ```
52
+
53
+ **The timeout is on the WORKFLOW, not individual agent turns.** A single greedy agent can consume the entire timeout budget.
54
+
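The per-turn cost is plain arithmetic: tools × results-per-tool × operations-per-result. A quick sketch (the helper name is hypothetical; counts come from the breakdown above):

```python
def first_turn_ops(tools: int, results_per_tool: int, ops_per_result: int = 3) -> int:
    """Operations in the search agent's first turn.

    ops_per_result = 3: embedding call + ChromaDB dedup query + ChromaDB store.
    """
    return tools * results_per_tool * ops_per_result

print(first_turn_ops(tools=3, results_per_tool=10))  # 90 ops before the fix
print(first_turn_ops(tools=3, results_per_tool=5))   # 45 ops after 10 -> 5
```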
55
+ ---
56
+
57
+ ## Impact
58
+
59
+ | Aspect | Impact |
60
+ |--------|--------|
61
+ | UX | Queries always timeout on first turn |
62
+ | Research quality | Synthesis happens on partial evidence |
63
+ | Confusion | `iterations=0` looks like nothing happened |
64
+
65
+ ---
66
+
67
+ ## The Fix (Consensus)
68
+
69
+ **Reduce work per turn + increase timeout budget.**
70
+
71
+ ### Implementation
72
+
73
+ **1. Reduce results per tool (immediate)**
74
+
75
+ `src/agents/search_agent.py` line 70:
76
+ ```python
77
+ # Change from 10 to 5
78
+ result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
79
+ ```
80
+
81
+ **2. Increase workflow timeout (immediate)**
82
+
83
+ `src/utils/config.py`:
84
+ ```python
85
+ advanced_timeout: float = Field(
86
+ default=600.0, # Was 300.0 (5 min), now 10 min
87
+ ge=60.0,
88
+ le=900.0,
89
+ description="Timeout for Advanced mode in seconds",
90
+ )
91
+ ```
92
+
93
+ ### Why NOT Per-Turn Timeout
94
+
95
+ **DANGER**: The SearchHandler uses `asyncio.gather()`:
96
+
97
+ ```python
98
+ # src/tools/search_handler.py line 163-164
99
+ results = await asyncio.gather(*tasks, return_exceptions=True)
100
+ ```
101
+
102
+ This is an **all-or-nothing** operation. If you wrap it with `asyncio.timeout()` and the timeout fires, you get **zero results**, not partial results.
103
+
104
+ ```python
105
+ # DON'T DO THIS - yields nothing on timeout
106
+ async with asyncio.timeout(60):
107
+ result = await self._handler.execute(query) # Cancelled = zero results
108
+ ```
109
+
110
+ Per-turn timeout requires `SearchHandler` to support cancellation with partial results. That's a separate architectural change (see Future Work).
111
+
112
+ ---
113
+
114
+ ## Future Work (Streaming Evidence Ingestion)
115
+
116
+ For proper fix, `SearchHandler.execute()` should:
117
+ 1. Yield results as they arrive (async generator)
118
+ 2. Support cancellation with partial results
119
+ 3. Allow agent to return "what we have so far" on timeout
120
+
121
+ ```python
122
+ # Future architecture
123
+ async def execute_streaming(self, query: str) -> AsyncIterator[Evidence]:
124
+ for tool in self.tools:
125
+ async for evidence in tool.search_streaming(query):
126
+ yield evidence # Can be cancelled at any point
127
+ ```
128
+
129
+ This is out of scope for the immediate fix.
130
+
131
+ ---
132
+
133
+ ## Test Plan
134
+
135
+ 1. Run query with 10-minute timeout
136
+ 2. Verify first agent turn completes before timeout
137
+ 3. Verify `iterations >= 1` at workflow end
138
+
139
+ ---
140
+
141
+ ## Verification Data
142
+
143
+ From diagnostic run:
144
+ ```
145
+ === RAW FRAMEWORK EVENTS ===
146
+ MagenticAgentDeltaEvent: 284
147
+ MagenticOrchestratorMessageEvent: 3
148
+ ...
149
+ NO MagenticAgentMessageEvent ← Agent never completed a turn!
150
+
151
+ [warning] Workflow timed out iterations=0
152
+ ```
153
+
154
+ ---
155
+
156
+ ## Related
157
+
158
+ - P2 Duplicate Report Bug (separate issue, happens after successful completion)
159
+ - `_handle_timeout()` correctly synthesizes, but with partial evidence
160
+ - Not related to model quality - this is infrastructure/performance
docs/bugs/{P1_FREE_TIER_TOOL_EXECUTION_FAILURE.md → archive/P1_FREE_TIER_TOOL_EXECUTION_FAILURE.md} RENAMED
File without changes
docs/bugs/{P2_7B_MODEL_GARBAGE_OUTPUT.md → archive/P2_7B_MODEL_GARBAGE_OUTPUT.md} RENAMED
File without changes
docs/future-roadmap/P3_MODAL_INTEGRATION_REMOVAL.md ADDED
@@ -0,0 +1,78 @@
1
+ # P3 Tech Debt: Modal Integration Removal
2
+
3
+ **Date**: 2025-12-04
4
+ **Status**: OPEN - Tech Debt (Future Roadmap)
5
+ **Severity**: P3 (Tech Debt - Not blocking functionality)
6
+ **Component**: Multiple files
7
+
8
+ ---
9
+
10
+ ## Executive Summary
11
+
12
+ Modal (cloud function execution platform) is integrated throughout the codebase but was decided against for this project. This creates potential confusion and dead code paths that should be cleaned up when time permits.
13
+
14
+ ---
15
+
16
+ ## Affected Files
17
+
18
+ The following files contain Modal references:
19
+
20
+ | File | Usage |
21
+ |------|-------|
22
+ | `src/utils/config.py` | `MODAL_TOKEN_ID`, `MODAL_TOKEN_SECRET` settings |
23
+ | `src/utils/service_loader.py` | Modal service initialization |
24
+ | `src/services/llamaindex_rag.py` | Modal integration for RAG |
25
+ | `src/agents/code_executor_agent.py` | Modal sandbox execution |
26
+ | `src/utils/exceptions.py` | Modal-related exceptions |
27
+ | `src/tools/code_execution.py` | Modal code execution tool |
28
+ | `src/services/statistical_analyzer.py` | Modal statistical analysis |
29
+ | `src/mcp_tools.py` | Modal MCP tool wrappers |
30
+ | `src/agents/analysis_agent.py` | Modal analysis agent |
31
+
32
+ ---
33
+
34
+ ## Context
35
+
36
+ Modal was originally integrated for:
37
+ 1. **Code Execution Sandbox**: Running untrusted code in isolated containers
38
+ 2. **Statistical Analysis**: Offloading heavy statistical computations
39
+ 3. **LlamaIndex RAG**: Premium embeddings with persistent storage
40
+
41
+ However, the project decided against Modal because:
42
+ - Added infrastructure complexity
43
+ - Free Tier doesn't need cloud functions
44
+ - Paid Tier uses OpenAI directly
45
+
46
+ ---
47
+
48
+ ## Recommended Fix
49
+
50
+ 1. Remove Modal-related code from all affected files
51
+ 2. Remove `MODAL_TOKEN_ID` and `MODAL_TOKEN_SECRET` from config
52
+ 3. Remove Modal from dependencies in `pyproject.toml`
53
+ 4. Update any documentation referencing Modal
54
+
55
+ ---
56
+
57
+ ## Impact If Not Fixed
58
+
59
+ - Confusion for new contributors
60
+ - Dead code in production
61
+ - Unnecessary dependencies
62
+ - Config settings that do nothing
63
+
64
+ ---
65
+
66
+ ## Test Plan
67
+
68
+ 1. Remove Modal code
69
+ 2. Run `make check` to ensure no breakage
70
+ 3. Verify Free Tier and Paid Tier still work
71
+ 4. Search codebase for any remaining Modal references
72
+
73
+ ---
74
+
75
+ ## Related
76
+
77
+ - `P3_REMOVE_ANTHROPIC_PARTIAL_WIRING.md` - Similar tech debt for Anthropic
78
+ - ARCHITECTURE.md - Current architecture excludes Modal
docs/index.md CHANGED
@@ -103,5 +103,5 @@ User Question → Research Agent (Orchestrator)
103
  |-------|--------|
104
  | Phases 1-14 | ✅ COMPLETE |
105
 
106
- **Tests**: 142+ passing, 0 warnings
107
  **Known Issues**: See [Active Bugs](bugs/ACTIVE_BUGS.md)
 
103
  |-------|--------|
104
  | Phases 1-14 | ✅ COMPLETE |
105
 
106
+ **Tests**: 318 passing, 0 warnings
107
  **Known Issues**: See [Active Bugs](bugs/ACTIVE_BUGS.md)
src/agents/search_agent.py CHANGED
@@ -67,7 +67,7 @@ class SearchAgent(BaseAgent): # type: ignore[misc]
67
  )
68
 
69
  # Execute search
70
- result: SearchResult = await self._handler.execute(query, max_results_per_tool=10)
71
 
72
  # Track what to show in response (initialized to search results as default)
73
  evidence_to_show: list[Evidence] = result.evidence
 
67
  )
68
 
69
  # Execute search
70
+ result: SearchResult = await self._handler.execute(query, max_results_per_tool=5)
71
 
72
  # Track what to show in response (initialized to search results as default)
73
  evidence_to_show: list[Evidence] = result.evidence
src/orchestrators/advanced.py CHANGED
@@ -301,6 +301,7 @@ The final output should be a structured research report."""
301
  # to repr(message) if text is empty. We must reconstruct text from Deltas.
302
  current_message_buffer: str = ""
303
  current_agent_id: str | None = None
 
304
 
305
  try:
306
  async with asyncio.timeout(self._timeout_seconds):
@@ -333,15 +334,21 @@ The final output should be a structured research report."""
333
  yield comp_event
334
  yield prog_event
335
 
 
 
336
  # Clear buffer after consuming
337
  current_message_buffer = ""
338
  continue
339
 
340
- # 3. Handle other events normally
341
  agent_event = self._process_event(event, iteration)
342
  if agent_event:
343
- if agent_event.type == "complete":
344
- final_event_received = True
345
  yield agent_event
346
 
347
  # GUARANTEE: Always emit termination event if stream ends without one
@@ -413,6 +420,40 @@ The final output should be a structured research report."""
413
 
414
  return completion_event, progress_event
415
 
416
  def _extract_text(self, message: Any) -> str:
417
  """
418
  Defensively extract text from a message object.
@@ -526,30 +567,12 @@ The final output should be a structured research report."""
526
  iteration=iteration,
527
  )
528
 
529
- # NOTE: MagenticAgentMessageEvent is handled in run() loop with Accumulator Pattern
530
- # (see lines 322-335) and never reaches this method due to `continue` statement.
531
-
532
- elif isinstance(event, MagenticFinalResultEvent):
533
- text = self._extract_text(event.message) if event.message else "No result"
534
- return AgentEvent(
535
- type="complete",
536
- message=text,
537
- data={"iterations": iteration},
538
- iteration=iteration,
539
- )
540
-
541
- # NOTE: MagenticAgentDeltaEvent is handled in run() loop with Accumulator Pattern
542
- # (see lines 306-320) and never reaches this method due to `continue` statement.
543
-
544
- elif isinstance(event, WorkflowOutputEvent):
545
- if event.data:
546
- # Use _extract_text to properly handle ChatMessage objects
547
- text = self._extract_text(event.data)
548
- return AgentEvent(
549
- type="complete",
550
- message=text if text else "Research complete (no synthesis)",
551
- iteration=iteration,
552
- )
553
 
554
  return None
555
 
 
301
  # to repr(message) if text is empty. We must reconstruct text from Deltas.
302
  current_message_buffer: str = ""
303
  current_agent_id: str | None = None
304
+ last_streamed_length: int = 0 # Track for P2 Duplicate Report Bug Fix
305
 
306
  try:
307
  async with asyncio.timeout(self._timeout_seconds):
 
334
  yield comp_event
335
  yield prog_event
336
 
337
+ # P2 BUG FIX: Save length before clearing
338
+ last_streamed_length = len(current_message_buffer)
339
  # Clear buffer after consuming
340
  current_message_buffer = ""
341
  continue
342
 
343
+ # 3. Handle Final Events Inline (P2 Duplicate Report Fix)
344
+ if isinstance(event, (MagenticFinalResultEvent, WorkflowOutputEvent)):
345
+ final_event_received = True
346
+ yield self._handle_final_event(event, iteration, last_streamed_length)
347
+ continue
348
+
349
+ # 4. Handle other events normally
350
  agent_event = self._process_event(event, iteration)
351
  if agent_event:
 
 
352
  yield agent_event
353
 
354
  # GUARANTEE: Always emit termination event if stream ends without one
 
420
 
421
  return completion_event, progress_event
422
 
423
+ def _handle_final_event(
424
+ self,
425
+ event: MagenticFinalResultEvent | WorkflowOutputEvent,
426
+ iteration: int,
427
+ last_streamed_length: int,
428
+ ) -> AgentEvent:
429
+ """Handle final workflow events with duplicate content suppression (P2 Bug Fix)."""
430
+ # DECISION: Did we stream substantial content?
431
+ if last_streamed_length > 100:
432
+ # YES: Final event is a SIGNAL, not a payload
433
+ return AgentEvent(
434
+ type="complete",
435
+ message="Research complete.",
436
+ data={
437
+ "iterations": iteration,
438
+ "streamed_chars": last_streamed_length,
439
+ },
440
+ iteration=iteration,
441
+ )
442
+
443
+ # NO: Final event must carry the payload (tool-only turn, cache hit)
444
+ text = ""
445
+ if isinstance(event, MagenticFinalResultEvent):
446
+ text = self._extract_text(event.message) if event.message else "No result"
447
+ elif isinstance(event, WorkflowOutputEvent):
448
+ text = self._extract_text(event.data) if event.data else "Research complete"
449
+
450
+ return AgentEvent(
451
+ type="complete",
452
+ message=text,
453
+ data={"iterations": iteration},
454
+ iteration=iteration,
455
+ )
456
+
457
  def _extract_text(self, message: Any) -> str:
458
  """
459
  Defensively extract text from a message object.
 
567
  iteration=iteration,
568
  )
569
 
570
+ # NOTE: The following event types are handled inline in run() loop and never reach
571
+ # this method due to `continue` statements:
572
+ # - MagenticAgentMessageEvent: Accumulator Pattern (lines 322-335)
573
+ # - MagenticAgentDeltaEvent: Accumulator Pattern (lines 306-320)
574
+ # - MagenticFinalResultEvent: P2 Duplicate Fix via _handle_final_event() (lines 343-347)
575
+ # - WorkflowOutputEvent: P2 Duplicate Fix via _handle_final_event() (lines 343-347)
 
576
 
577
  return None
578
 
src/utils/config.py CHANGED
@@ -71,10 +71,10 @@ class Settings(BaseSettings):
71
  description="Max coordination rounds for Advanced mode (default 5 for faster demos)",
72
  )
73
  advanced_timeout: float = Field(
74
- default=300.0,
75
  ge=60.0,
76
  le=900.0,
77
- description="Timeout for Advanced mode in seconds (default 5 min)",
78
  )
79
  search_timeout: int = Field(default=30, description="Seconds to wait for search")
80
  magentic_timeout: int = Field(
 
71
  description="Max coordination rounds for Advanced mode (default 5 for faster demos)",
72
  )
73
  advanced_timeout: float = Field(
74
+ default=600.0,
75
  ge=60.0,
76
  le=900.0,
77
+ description="Timeout for Advanced mode in seconds (default 10 min)",
78
  )
79
  search_timeout: int = Field(default=30, description="Seconds to wait for search")
80
  magentic_timeout: int = Field(
tests/unit/agents/test_search_agent.py CHANGED
@@ -47,7 +47,7 @@ async def test_run_executes_search(mock_handler: AsyncMock) -> None:
47
  response = await agent.run("test query")
48
 
49
  # Check handler called
50
- mock_handler.execute.assert_awaited_once_with("test query", max_results_per_tool=10)
51
 
52
  # Check store updated
53
  assert len(store["current"]) == 1
@@ -67,7 +67,7 @@ async def test_run_handles_chat_message_input(mock_handler: AsyncMock) -> None:
67
  message = ChatMessage(role=Role.USER, text="test query")
68
  await agent.run(message)
69
 
70
- mock_handler.execute.assert_awaited_once_with("test query", max_results_per_tool=10)
71
 
72
 
73
  @pytest.mark.asyncio
@@ -81,7 +81,7 @@ async def test_run_handles_list_input(mock_handler: AsyncMock) -> None:
81
  ChatMessage(role=Role.USER, text="test query"),
82
  ]
83
  await agent.run(messages)
84
- mock_handler.execute.assert_awaited_once_with("test query", max_results_per_tool=10)
85
 
86
 
87
  @pytest.mark.asyncio
 
47
  response = await agent.run("test query")
48
 
49
  # Check handler called
50
+ mock_handler.execute.assert_awaited_once_with("test query", max_results_per_tool=5)
51
 
52
  # Check store updated
53
  assert len(store["current"]) == 1
 
67
  message = ChatMessage(role=Role.USER, text="test query")
68
  await agent.run(message)
69
 
70
+ mock_handler.execute.assert_awaited_once_with("test query", max_results_per_tool=5)
71
 
72
 
73
  @pytest.mark.asyncio
 
81
  ChatMessage(role=Role.USER, text="test query"),
82
  ]
83
  await agent.run(messages)
84
+ mock_handler.execute.assert_awaited_once_with("test query", max_results_per_tool=5)
85
 
86
 
87
  @pytest.mark.asyncio
tests/unit/orchestrators/test_advanced_orchestrator.py CHANGED
@@ -28,11 +28,11 @@ class TestAdvancedOrchestratorConfig:
28
  assert orch._max_rounds == 7
29
 
30
  @patch("src.orchestrators.advanced.get_chat_client")
31
- def test_timeout_default_is_five_minutes(self, mock_get_client) -> None:
32
- """Default timeout should be 300s (5 min) from settings."""
33
  mock_get_client.return_value = MagicMock()
34
  orch = AdvancedOrchestrator()
35
- assert orch._timeout_seconds == 300.0
36
 
37
  @patch("src.orchestrators.advanced.get_chat_client")
38
  def test_explicit_timeout_overrides_settings(self, mock_get_client) -> None:
 
28
  assert orch._max_rounds == 7
29
 
30
  @patch("src.orchestrators.advanced.get_chat_client")
31
+ def test_timeout_default_is_ten_minutes(self, mock_get_client: MagicMock) -> None:
32
+ """Test that default timeout is 10 minutes (P2 fix)."""
33
  mock_get_client.return_value = MagicMock()
34
  orch = AdvancedOrchestrator()
35
+ assert orch._timeout_seconds == 600.0
36
 
37
  @patch("src.orchestrators.advanced.get_chat_client")
38
  def test_explicit_timeout_overrides_settings(self, mock_get_client) -> None: