# Agent-Tool-State Contract Registry
> **Status**: Canonical Source of Truth
> **Last Updated**: 2025-12-06
> **Purpose**: Developer reference for multi-agent coordination
This document defines the exact contracts between agents, tools, and shared state. Use this when:
- Adding new agents or tools
- Modifying agent behavior
- Debugging coordination issues
- Understanding "if I change X, what breaks?"
---
## Table of Contents
1. [System Overview](#system-overview)
2. [Agent Contracts](#agent-contracts)
3. [Judge Decision Criteria](#judge-decision-criteria)
4. [Shared State (ResearchMemory)](#shared-state-researchmemory)
5. [Tool Contracts](#tool-contracts)
6. [Event Flow](#event-flow)
7. [Break Conditions](#break-conditions)
8. [Dependency Matrix](#dependency-matrix)
9. [Critical Thresholds](#critical-thresholds)
10. [Developer Checklist](#developer-checklist)
---
## System Overview
```
┌───────────────────────────────────────────────────────────────────┐
│                ORCHESTRATOR (AdvancedOrchestrator)                │
│                                                                   │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐              │
│  │   Manager   │──▶│   Agents    │──▶│   Memory    │              │
│  │ (Magentic)  │   │ (ChatAgent) │   │(ResearchMem)│              │
│  └─────────────┘   └─────────────┘   └─────────────┘              │
│         │                 │                 │                     │
│         │                 ▼                 ▼                     │
│         │         ┌──────────────┐   ┌─────────────┐              │
│         └────────▶│    Tools     │──▶│ Embeddings  │              │
│                   │(@ai_function)│   │ (ChromaDB)  │              │
│                   └──────────────┘   └─────────────┘              │
└───────────────────────────────────────────────────────────────────┘
```
### Agent Inventory
| Agent | File | Role | Tools |
|-------|------|------|-------|
| **SearchAgent** | `magentic_agents.py` | Evidence gathering | search_pubmed, search_clinical_trials, search_preprints |
| **JudgeAgent** | `magentic_agents.py` | Evidence evaluation | None (LLM only) |
| **HypothesisAgent** | `magentic_agents.py` | Mechanism generation | None (LLM only) |
| **ReportAgent** | `magentic_agents.py` | Report synthesis | get_bibliography |
| **RetrievalAgent** | `retrieval_agent.py` | Web search | search_web |
> **⚠️ Dead Code Warning:** RetrievalAgent is implemented but NOT wired into `magentic_agents.py`.
> The orchestrator uses only SearchAgent (PubMed, ClinicalTrials.gov, Europe PMC), not web search.
> See GitHub issue #134 for the decision on whether to delete it or wire it in.
---
## Agent Contracts
### SearchAgent
**Factory**: `create_search_agent(chat_client, domain, api_key) -> ChatAgent`
#### Input
```python
# Manager instruction (string)
"Search for testosterone and libido mechanisms in peer-reviewed literature"
```
#### Output
```python
# ChatMessage with:
message.text = """
Found 15 sources (12 new added to context):
- [Title 1](url): Abstract excerpt...
- [Title 2](url): Abstract excerpt...
"""
message.additional_properties = {
"evidence": [Evidence.model_dump(), ...]
}
```
#### State Access
| Operation | Key | Type | Description |
|-----------|-----|------|-------------|
| **READ** | `memory.query` | str | Current research question |
| **READ** | `memory.evidence_ids` | list[str] | Existing evidence URLs |
| **WRITE** | `memory._evidence_cache` | dict[str, Evidence] | Caches Evidence objects |
| **WRITE** | `memory.evidence_ids` | list[str] | Appends new URLs |
| **WRITE** | `embedding_service` | VectorDB | Stores embeddings |
#### Side Effects
1. Calls external APIs (PubMed, ClinicalTrials, Europe PMC)
2. Deduplicates via semantic similarity (0.9 threshold)
3. Stores in vector database
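
Side effect 2 can be pictured as a cosine-similarity filter. The sketch below is a minimal illustration and not the project's `EmbeddingService`: `cosine_similarity`, `filter_new`, and the toy two-dimensional vectors are hypothetical names and data, and treating a score of exactly 0.9 as a duplicate (`>=`) is an assumption.

```python
import math

DEDUP_THRESHOLD = 0.9  # value documented above; everything else is illustrative


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_new(
    candidates: dict[str, list[float]],
    stored: dict[str, list[float]],
    threshold: float = DEDUP_THRESHOLD,
) -> list[str]:
    """Return candidate IDs that are not near-duplicates of anything kept so far."""
    kept = dict(stored)  # copy so the caller's mapping is not mutated
    new_ids: list[str] = []
    for item_id, vector in candidates.items():
        is_duplicate = any(
            cosine_similarity(vector, other) >= threshold for other in kept.values()
        )
        if not is_duplicate:
            kept[item_id] = vector
            new_ids.append(item_id)
    return new_ids
```

Candidates are also compared against each other, not only against stored items, so two near-identical results from the same search collapse into one.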
#### Error Behavior
- API failure → Returns "No results found for: {query}"
- Rate limit → Raises `RateLimitError` (caught by orchestrator)
---
### JudgeAgent
**Factory**: `create_judge_agent(chat_client, domain, api_key) -> ChatAgent`
#### Input
```python
# Manager instruction with evidence context
"Evaluate if we have sufficient evidence to answer: {query}"
# + Evidence list in context
```
#### Output
```python
# ChatMessage with:
message.text = """
## Assessment
✅ SUFFICIENT EVIDENCE (confidence: 85%). STOP SEARCHING.
### Scores
- Mechanism: 8/10
- Clinical: 7/10
### Reasoning
Strong evidence for testosterone-AR pathway...
"""
message.additional_properties = {
"assessment": JudgeAssessment.model_dump()
}
```
#### State Access
| Operation | Key | Type | Description |
|-----------|-----|------|-------------|
| **READ** | Evidence from context | list[Evidence] | Passed by Manager |
| **WRITE** | None | - | Read-only evaluation |
#### Side Effects
- None (pure evaluation)
#### Critical Output Signal
- `"✅ SUFFICIENT EVIDENCE"` → Manager delegates to ReportAgent
- `"❌ INSUFFICIENT"` → Manager calls SearchAgent with suggested queries
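
Because the Manager is an LLM, this signal is a text contract and not a function call. The sketch below only models that contract in code, assuming the two marker strings above. `next_agent` and the fallback for unrecognized text are hypothetical.

```python
SUFFICIENT_SIGNAL = "✅ SUFFICIENT EVIDENCE"
INSUFFICIENT_SIGNAL = "❌ INSUFFICIENT"


def next_agent(judge_text: str) -> str:
    """Map the judge's text signal to the agent the Manager would call next."""
    # Check the negative signal first: the phrase "INSUFFICIENT EVIDENCE"
    # contains "SUFFICIENT EVIDENCE" as a substring, so a bare-word test
    # without the emoji prefix would misfire.
    if INSUFFICIENT_SIGNAL in judge_text:
        return "SearchAgent"
    if SUFFICIENT_SIGNAL in judge_text:
        return "ReportAgent"
    # Assumption: an unrecognized reply means keep searching.
    return "SearchAgent"
```

When structured data is available, reading `additional_properties["assessment"]` is less brittle than matching on display text.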
---
### HypothesisAgent
**Factory**: `create_hypothesis_agent(chat_client, domain, api_key) -> ChatAgent`
#### Input
```python
# Manager instruction
"Generate mechanistic hypotheses for: {query}"
```
#### Output
```python
# ChatMessage with:
message.text = """
## Hypothesis 1 (Confidence: 75%)
**Mechanism**: Testosterone → Androgen Receptor → BDNF → Libido
**Suggested searches**: testosterone BDNF, androgen receptor signaling
## Primary Hypothesis
Testosterone → AR → dopamine release → reward pathway
## Knowledge Gaps
- Dose-response relationship unclear
"""
message.additional_properties = {
"assessment": HypothesisAssessment.model_dump()
}
```
#### State Access
| Operation | Key | Type | Description |
|-----------|-----|------|-------------|
| **READ** | `memory.query` | str | Research question |
| **READ** | Evidence from context | list[Evidence] | Current evidence |
| **WRITE** | `evidence_store["hypotheses"]` | list | Appends hypotheses |
---
### ReportAgent
**Factory**: `create_report_agent(chat_client, domain, api_key) -> ChatAgent`
#### Input
```python
# Manager instruction
"Generate final research report for: {query}"
```
#### Output
```python
# ChatMessage with:
message.text = ResearchReport.to_markdown() # Full markdown report
message.additional_properties = {
"report": ResearchReport.model_dump()
}
```
#### State Access
| Operation | Key | Type | Description |
|-----------|-----|------|-------------|
| **READ** | `memory.get_all_evidence()` | list[Evidence] | All collected evidence |
| **READ** | `evidence_store["hypotheses"]` | list | Generated hypotheses |
| **READ** | `evidence_store["last_assessment"]` | JudgeAssessment | Final assessment |
| **WRITE** | `evidence_store["final_report"]` | ResearchReport | Stores report |
#### Tool: get_bibliography()
```python
@ai_function
def get_bibliography() -> str:
    """Returns formatted reference list from all evidence."""
    evidence = state.memory.get_all_evidence()
    return format_as_references(evidence)
```
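
The body of `format_as_references` is not shown in this document. As a rough illustration of a numbered reference list, the sketch below assumes each evidence item exposes a `title` and a `url`. `EvidenceRef`, the Markdown link format, and the empty-list message are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRef:
    """Hypothetical subset of Evidence fields needed for one reference line."""
    title: str
    url: str


def format_as_references(evidence: list[EvidenceRef]) -> str:
    """Render evidence as a numbered reference list, one entry per line."""
    if not evidence:
        return "No references collected."
    return "\n".join(
        f"{i}. [{item.title}]({item.url})"
        for i, item in enumerate(evidence, start=1)
    )
```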
---
## Judge Decision Criteria
### Scoring Dimensions
**Mechanism Score (0-10)**
| Score | Meaning |
|-------|---------|
| 0-3 | Minimal mechanism understanding |
| 4-5 | Partial mechanism (some targets identified) |
| 6-7 | Clear mechanism (targets + pathways) |
| 8-9 | Comprehensive (multiple pathways, regulation) |
| 10 | Complete understanding |
**Clinical Evidence Score (0-10)**
| Score | Meaning |
|-------|---------|
| 0-3 | Preclinical only or weak human evidence |
| 4-5 | Some human evidence (small trials, case reports) |
| 6-7 | Strong human evidence (RCTs) |
| 8-9 | Robust (meta-analysis, large RCTs) |
| 10 | Definitive clinical proof |
### Sufficiency Decision
```python
# SUFFICIENT (recommendation="synthesize")
if (
    confidence >= 0.7  # 70%
    and mechanism_score >= 6
    and clinical_evidence_score >= 6
):
    sufficient = True
    recommendation = "synthesize"
# INSUFFICIENT (recommendation="continue")
else:
    sufficient = False
    recommendation = "continue"
    next_search_queries = ["suggested query 1", "suggested query 2"]
```
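
The same rule can be written as a small function for checking boundary cases. This is a sketch using only the three thresholds stated above. `Scores` and `decide` are hypothetical names and are not part of the `JudgeAssessment` model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for the three inputs to the decision."""
    confidence: float             # 0.0-1.0
    mechanism_score: int          # 0-10
    clinical_evidence_score: int  # 0-10


def decide(scores: Scores) -> tuple[bool, str]:
    """Return (sufficient, recommendation); all three thresholds must hold."""
    sufficient = (
        scores.confidence >= 0.7
        and scores.mechanism_score >= 6
        and scores.clinical_evidence_score >= 6
    )
    return sufficient, ("synthesize" if sufficient else "continue")
```

With the JudgeAgent example output (confidence 85%, mechanism 8, clinical 7), `decide(Scores(0.85, 8, 7))` yields `(True, "synthesize")`. Dropping the clinical score to 5 flips the result to `(False, "continue")` even though the other two inputs are unchanged.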
### JudgeAssessment Model
```python
class JudgeAssessment(BaseModel):
    details: AssessmentDetails
    mechanism_score: int            # 0-10
    mechanism_reasoning: str        # min 10 chars
    clinical_evidence_score: int    # 0-10
    clinical_reasoning: str         # min 10 chars
    drug_candidates: list[str]
    key_findings: list[str]
    sufficient: bool                # Ready for synthesis?
    confidence: float               # 0.0-1.0
    recommendation: Literal["continue", "synthesize"]
    next_search_queries: list[str]  # If continue
    reasoning: str                  # min 20 chars
```
---
## Shared State (ResearchMemory)
### Initialization
```python
# Per-query isolation via ContextVar
state = init_magentic_state(query, embedding_service)
# Returns MagenticState wrapping ResearchMemory
```
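
Per-query isolation with `ContextVar` relies on each asyncio task running in its own copy of the context. The sketch below shows that mechanism with the standard library only. `QueryState`, `init_state`, and `get_state` are hypothetical stand-ins and not the real `MagenticState` or `init_magentic_state`.

```python
import asyncio
from contextvars import ContextVar
from dataclasses import dataclass, field


@dataclass
class QueryState:
    """Hypothetical stand-in for MagenticState."""
    query: str
    evidence_ids: list[str] = field(default_factory=list)


_current_state: ContextVar[QueryState] = ContextVar("current_state")


def init_state(query: str) -> QueryState:
    """Bind a fresh state object to the current context and return it."""
    state = QueryState(query=query)
    _current_state.set(state)
    return state


def get_state() -> QueryState:
    """Fetch the state bound to the current context (LookupError if unset)."""
    return _current_state.get()


async def run_query(query: str) -> tuple[str, list[str]]:
    init_state(query)
    await asyncio.sleep(0)  # yield control so the two tasks interleave
    get_state().evidence_ids.append(f"https://example.org/{query}")
    return get_state().query, get_state().evidence_ids


async def main() -> list[tuple[str, list[str]]]:
    # gather() wraps each coroutine in its own Task; every Task runs in a
    # copy of the context, so the two set() calls do not overwrite each other.
    return await asyncio.gather(run_query("a"), run_query("b"))


results = asyncio.run(main())
```

Each task sees only its own query and evidence list, which is the property that keeps concurrent research sessions from leaking state into each other.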
### Memory Structure
```python
class ResearchMemory:
    query: str                             # Research question
    hypotheses: list[Hypothesis]           # Generated hypotheses
    conflicts: list[Conflict]              # Detected conflicts
    evidence_ids: list[str]                # URLs (unique keys)
    _evidence_cache: dict[str, Evidence]   # URL -> Evidence
    iteration_count: int                   # Current iteration
    _embedding_service: EmbeddingServiceProtocol
```
### Key Methods
| Method | Returns | Description |
|--------|---------|-------------|
| `store_evidence(evidence)` | `list[str]` | Store with dedup, return new IDs |
| `get_all_evidence()` | `list[Evidence]` | All accumulated evidence |
| `get_relevant_evidence(n)` | `list[Evidence]` | Top N by semantic similarity |
| `get_context_summary()` | `str` | Markdown summary for fallback |
| `add_hypothesis(h)` | `None` | Append hypothesis |
| `get_confirmed_hypotheses()` | `list[Hypothesis]` | Confidence > 0.8 |
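
The contract of `store_evidence` (return only the IDs that were new) and of `get_confirmed_hypotheses` (strictly above 0.8) can be sketched in memory. This is an assumption-laden illustration: `MemorySketch` and the minimal `Evidence` and `Hypothesis` classes are hypothetical, and the sketch deduplicates by exact URL only, leaving out the semantic-similarity step.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """Hypothetical minimal Evidence; the real model has more fields."""
    url: str
    title: str


@dataclass
class Hypothesis:
    """Hypothetical minimal Hypothesis."""
    statement: str
    confidence: float  # 0.0-1.0


@dataclass
class MemorySketch:
    evidence_ids: list[str] = field(default_factory=list)
    _evidence_cache: dict[str, Evidence] = field(default_factory=dict)
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def store_evidence(self, evidence: list[Evidence]) -> list[str]:
        """Store unseen items keyed by URL and return only the new IDs."""
        new_ids: list[str] = []
        for item in evidence:
            if item.url in self._evidence_cache:
                continue  # exact-URL duplicate; semantic dedup omitted here
            self._evidence_cache[item.url] = item
            self.evidence_ids.append(item.url)
            new_ids.append(item.url)
        return new_ids

    def get_all_evidence(self) -> list[Evidence]:
        """Return evidence in insertion order."""
        return [self._evidence_cache[i] for i in self.evidence_ids]

    def add_hypothesis(self, h: Hypothesis) -> None:
        self.hypotheses.append(h)

    def get_confirmed_hypotheses(self) -> list[Hypothesis]:
        """Keep hypotheses with confidence strictly above 0.8."""
        return [h for h in self.hypotheses if h.confidence > 0.8]
```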
### State Flow
```
User Query
    │
    ▼
┌────────────────────────────────────────────────────────────┐
│ ResearchMemory initialized (empty)                         │
└────────────────────────────────────────────────────────────┘
    │
    ▼
SearchAgent ──▶ store_evidence([Evidence]) ──▶ evidence_ids grows
    │
    ▼
JudgeAgent ──▶ reads evidence from context ──▶ returns assessment
    │
    ├─── INSUFFICIENT ──▶ SearchAgent (with next_search_queries)
    │
    └─── SUFFICIENT ──▶ ReportAgent
                             │
                             ▼
                    get_all_evidence() ──▶ ResearchReport
```
---
## Tool Contracts
### search_pubmed
**File**: `src/agents/tools.py`
```python
@ai_function
async def search_pubmed(query: str, max_results: int = 10) -> str:
    """Search PubMed for biomedical research papers."""
```
| Aspect | Value |
|--------|-------|
| External API | NCBI E-utilities |
| Rate Limit | 3/sec (10/sec with NCBI_API_KEY) |
| Output | Formatted string with titles/abstracts |
| Side Effect | Stores Evidence in memory |
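
The two NCBI rates (3/sec without a key, 10/sec with `NCBI_API_KEY`) imply some form of request spacing. How the project enforces this is not shown here, so the following is only one possible sketch: `MinIntervalLimiter` and `limiter_for` are hypothetical, and the limiter blocks with `time.sleep`, whereas an async tool would await `asyncio.sleep` instead.

```python
import time
from typing import Callable


class MinIntervalLimiter:
    """Space calls at least 1/rate seconds apart (blocking sketch)."""

    def __init__(
        self,
        rate_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._interval = 1.0 / rate_per_sec
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0.0:
            self._sleep(delay)
        # Schedule the following slot one interval after this call's slot.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay


def limiter_for(has_api_key: bool) -> MinIntervalLimiter:
    """Pick the documented NCBI rate: 10/sec with a key, 3/sec without."""
    return MinIntervalLimiter(10.0 if has_api_key else 3.0)
```

The injectable `clock` and `sleep` parameters exist so the limiter can be tested without real delays.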
### search_clinical_trials
```python
@ai_function
async def search_clinical_trials(query: str, max_results: int = 10) -> str:
    """Search ClinicalTrials.gov for clinical studies."""
```
| Aspect | Value |
|--------|-------|
| External API | ClinicalTrials.gov (uses `requests` not httpx) |
| Rate Limit | Standard HTTP limits |
| Output | Trial status, conditions, interventions |
| Side Effect | Stores Evidence in memory |
### search_preprints
```python
@ai_function
async def search_preprints(query: str, max_results: int = 10) -> str:
    """Search Europe PMC for preprints and papers."""
```
| Aspect | Value |
|--------|-------|
| External API | Europe PMC REST API |
| Output | Papers with PMIDs, DOIs |
| Side Effect | Stores Evidence in memory |
### get_bibliography
```python
@ai_function
def get_bibliography() -> str:
    """Get formatted reference list from all collected evidence."""
```
| Aspect | Value |
|--------|-------|
| External API | None |
| Reads | `memory.get_all_evidence()` |
| Output | Numbered reference list |
### search_web
```python
@ai_function
async def search_web(query: str, max_results: int = 10) -> str:
    """Search web using DuckDuckGo."""
```
| Aspect | Value |
|--------|-------|
| External API | DuckDuckGo |
| Output | Web results with URLs |
| Side Effect | Stores Evidence in memory |
---
## Event Flow
### AgentEvent Types
| Type | When Emitted | Data |
|------|--------------|------|
| `started` | Workflow begins | None |
| `thinking` | Before first agent event | None |
| `searching` | SearchAgent active | agent_id |
| `search_complete` | SearchAgent done | evidence count |
| `judging` | JudgeAgent active | agent_id |
| `judge_complete` | JudgeAgent done | assessment |
| `hypothesizing` | HypothesisAgent active | agent_id |
| `synthesizing` | ReportAgent active | agent_id |
| `streaming` | Real-time text | text, agent_id |
| `complete` | Workflow done | report, iterations |
| `error` | Error occurred | error message |
| `progress` | Status update | status message |
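
The twelve event types in the table lend themselves to a closed set that is validated on construction. The sketch below is illustrative only: `AgentEventSketch` and its `message` and `data` fields are hypothetical and may not match the real `AgentEvent` class.

```python
from dataclasses import dataclass, field
from typing import Any, Literal, get_args

EventType = Literal[
    "started", "thinking", "searching", "search_complete", "judging",
    "judge_complete", "hypothesizing", "synthesizing", "streaming",
    "complete", "error", "progress",
]


@dataclass(frozen=True)
class AgentEventSketch:
    """Hypothetical event record; field names are illustrative."""
    type: EventType
    message: str = ""
    data: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Literal is a static-typing hint and is not enforced at runtime,
        # so reject unknown event types explicitly.
        if self.type not in get_args(EventType):
            raise ValueError(f"unknown event type: {self.type!r}")
```

A UI consumer can then switch on `event.type` knowing that a misspelled type fails at the producer and not silently in the renderer.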
### Typical Sequence
```
1. started              → "Starting research..."
2. progress             → "Loading embedding service..."
3. thinking             → "Multi-agent reasoning..."
4. streaming (searcher) → "Found 15 sources..."
5. streaming (judge)    → "✅ SUFFICIENT..."
6. streaming (reporter) → "## Research Report..."
7. complete             → Final report
```
---
## Break Conditions
The orchestrator exits when ANY of these occur:
### 1. Judge Approval ✅
```python
if "SUFFICIENT EVIDENCE" in judge_response:
    # Manager delegates to ReportAgent
    # ReportAgent completes → Workflow ends
    pass  # placeholder: the steps above are carried out by the Manager
```
### 2. Max Rounds Reached 🔄
```python
# MagenticBuilder config
max_round_count = 5  # Default

# After 5 manager rounds:
if not reporter_ran:
    # Force fallback synthesis
    async for event in _synthesize_fallback(iteration, "max_rounds"):
        yield event
```
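
The round cap and its fallback can be modeled as a bounded loop. This is a simplified, synchronous sketch and not the orchestrator's code: `run_rounds` and the `run_round` callback are hypothetical, and the yielded strings stand in for real events.

```python
from typing import Callable, Iterator

MAX_ROUND_COUNT = 5  # default documented above


def run_rounds(
    run_round: Callable[[int], bool],
    max_rounds: int = MAX_ROUND_COUNT,
) -> Iterator[str]:
    """Yield one status string per round, then a final outcome.

    run_round(i) is a hypothetical callback that returns True once the
    reporter has produced a report in round i.
    """
    for round_no in range(1, max_rounds + 1):
        reporter_ran = run_round(round_no)
        yield f"round {round_no}"
        if reporter_ran:
            yield "complete"
            return
    # No report after the last allowed round: force fallback synthesis.
    yield "fallback:max_rounds"
```

The fallback branch is reached only when the loop runs out of rounds without a report, so a run always ends with either a report or a fallback synthesis.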
### 3. Timeout ⏱️
```python
try:
    async with asyncio.timeout(settings.advanced_timeout):  # 600s default
        async for event in workflow.run_stream(task):
            yield event
except TimeoutError:
    async for event in _synthesize_fallback(iteration, "timeout"):
        yield event
```
### 4. Token Budget 💾
```python
# Implicit via PydanticAI/LLM client
# ~50K tokens per query (from settings)
# Individual agent calls handle retries
```
---
## Dependency Matrix
### "If I change X, what breaks?"
| Changed Component | Affected Components | Impact |
|-------------------|---------------------|--------|
| **Evidence model** | All agents, Memory, Tools | HIGH - Core data type |
| **JudgeAssessment** | Judge, Orchestrator | HIGH - Decision flow |
| **ResearchMemory** | All agents | HIGH - Shared state |
| **search_pubmed** | SearchAgent | MEDIUM - One tool |
| **get_bibliography** | ReportAgent | MEDIUM - References |
| **AgentEvent** | Orchestrator, UI | MEDIUM - Streaming |
| **EmbeddingService** | Memory, Dedup | MEDIUM - Similarity |
| **Judge thresholds** | Workflow loop count | LOW - Tuning |
| **System prompts** | Agent behavior | LOW - Prompt eng |
### Agent Dependencies
```
SearchAgent
├── REQUIRES: MagenticState, EmbeddingService
├── WRITES TO: ResearchMemory (evidence)
└── NO DEPS ON: Other agents

JudgeAgent
├── REQUIRES: Evidence context (from Manager)
├── WRITES TO: Nothing
└── CONTROLS: SearchAgent (continue) or ReportAgent (synthesize)

HypothesisAgent
├── REQUIRES: Evidence context
├── WRITES TO: evidence_store["hypotheses"]
└── NO DEPS ON: Other agents

ReportAgent
├── REQUIRES: ResearchMemory, hypotheses, assessment
├── READS FROM: All prior state
└── WRITES TO: evidence_store["final_report"]
```
---
## Critical Thresholds
| Threshold | Value | Location | Impact |
|-----------|-------|----------|--------|
| Confidence threshold | 0.7 (70%) | JudgeAssessment | Sufficiency decision |
| Mechanism score threshold | 6 | Judge criteria | Sufficiency decision |
| Clinical score threshold | 6 | Judge criteria | Sufficiency decision |
| Max manager rounds | 5 | AdvancedOrchestrator | Loop termination |
| Max stall count | 3 | MagenticBuilder | Stall detection |
| Dedup similarity | 0.9 | EmbeddingService | Evidence dedup |
| Max evidence for judge | 30 | prompts/judge.py | Context limit |
| Confirmed hypothesis | 0.8 | ResearchMemory | High-confidence filter |
| Timeout | 600s | settings.advanced_timeout | Workflow timeout |
---
## Developer Checklist
When modifying agents:
- [ ] Update this document if contracts change
- [ ] Verify state access (read/write) is correct
- [ ] Check tool side effects
- [ ] Test with `make check`
- [ ] Verify event emission
When adding new agents:
- [ ] Create factory function in `magentic_agents.py`
- [ ] Define input/output contract
- [ ] Document state access
- [ ] Add to Agent Inventory table
- [ ] Update Dependency Matrix
When changing Judge criteria:
- [ ] Update JudgeAssessment model
- [ ] Update Critical Thresholds table
- [ ] Test workflow loop behavior
- [ ] Verify fallback synthesis triggers correctly
---
*This document is the source of truth for multi-agent coordination.*