Spaces:

VibecoderMcSwaggins
/

DeepBoner

Paused

App Files Files Community

DeepBoner / docs /architecture /agent-tool-state-contracts.md

VibecoderMcSwaggins

docs: Audit and fix architecture documentation for accuracy

c7a2e77 9 days ago

preview code

raw

history blame contribute delete

18.2 kB

	# Agent-Tool-State Contract Registry

	> Status: Canonical Source of Truth
	> Last Updated: 2025-12-06
	> Purpose: Developer reference for multi-agent coordination

	This document defines the exact contracts between agents, tools, and shared state. Use this when:
	- Adding new agents or tools
	- Modifying agent behavior
	- Debugging coordination issues
	- Understanding "if I change X, what breaks?"

	---

	## Table of Contents

	1. [System Overview](#system-overview)
	2. [Agent Contracts](#agent-contracts)
	3. [Judge Decision Criteria](#judge-decision-criteria)
	4. [Shared State (ResearchMemory)](#shared-state-researchmemory)
	5. [Tool Contracts](#tool-contracts)
	6. [Event Flow](#event-flow)
	7. [Break Conditions](#break-conditions)
	8. [Dependency Matrix](#dependency-matrix)

	---

	## System Overview

	```
	┌─────────────────────────────────────────────────────────────────────┐
	│ ORCHESTRATOR (AdvancedOrchestrator) │
	│ │
	│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
	│ │ Manager │──▶│ Agents │──▶│ Memory │ │
	│ │ (Magentic) │ │ (ChatAgent) │ │(ResearchMem)│ │
	│ └─────────────┘ └─────────────┘ └─────────────┘ │
	│ │ │ │ │
	│ │ ▼ ▼ │
	│ │ ┌─────────────┐ ┌─────────────┐ │
	│ └────────▶│ Tools │──▶│ Embeddings │ │
	│ │(@ai_function)│ │ (ChromaDB) │ │
	│ └─────────────┘ └─────────────┘ │
	└─────────────────────────────────────────────────────────────────────┘
	```

	### Agent Inventory

	\| Agent \| File \| Role \| Tools \|
	\|-------\|------\|------\|-------\|
	\| SearchAgent \| `magentic_agents.py` \| Evidence gathering \| search_pubmed, search_clinical_trials, search_preprints \|
	\| JudgeAgent \| `magentic_agents.py` \| Evidence evaluation \| None (LLM only) \|
	\| HypothesisAgent \| `magentic_agents.py` \| Mechanism generation \| None (LLM only) \|
	\| ReportAgent \| `magentic_agents.py` \| Report synthesis \| get_bibliography \|
	\| RetrievalAgent \| `retrieval_agent.py` \| Web search \| search_web \|

	> ⚠️ Dead Code Warning: RetrievalAgent is implemented but NOT wired into `magentic_agents.py`.
	> The orchestrator only uses SearchAgent (PubMed, ClinicalTrials, EuropePMC), not web search.
	> See GitHub issue #134 for decision to delete or wire in.

	---

	## Agent Contracts

	### SearchAgent

	Factory: `create_search_agent(chat_client, domain, api_key) -> ChatAgent`

	#### Input
	```python
	# Manager instruction (string)
	"Search for testosterone and libido mechanisms in peer-reviewed literature"
	```

	#### Output
	```python
	# ChatMessage with:
	message.text = """
	Found 15 sources (12 new added to context):
	- [Title 1](url): Abstract excerpt...
	- [Title 2](url): Abstract excerpt...
	"""
	message.additional_properties = {
	"evidence": [Evidence.model_dump(), ...]
	}
	```

	#### State Access

	\| Operation \| Key \| Type \| Description \|
	\|-----------\|-----\|------\|-------------\|
	\| READ \| `memory.query` \| str \| Current research question \|
	\| READ \| `memory.evidence_ids` \| list[str] \| Existing evidence URLs \|
	\| WRITE \| `memory._evidence_cache` \| dict[str, Evidence] \| Caches Evidence objects \|
	\| WRITE \| `memory.evidence_ids` \| list[str] \| Appends new URLs \|
	\| WRITE \| `embedding_service` \| VectorDB \| Stores embeddings \|

	#### Side Effects
	1. Calls external APIs (PubMed, ClinicalTrials, Europe PMC)
	2. Deduplicates via semantic similarity (0.9 threshold)
	3. Stores in vector database

	#### Error Behavior
	- API failure → Returns "No results found for: {query}"
	- Rate limit → Raises `RateLimitError` (caught by orchestrator)

	---

	### JudgeAgent

	Factory: `create_judge_agent(chat_client, domain, api_key) -> ChatAgent`

	#### Input
	```python
	# Manager instruction with evidence context
	"Evaluate if we have sufficient evidence to answer: {query}"
	# + Evidence list in context
	```

	#### Output
	```python
	# ChatMessage with:
	message.text = """
	## Assessment
	✅ SUFFICIENT EVIDENCE (confidence: 85%). STOP SEARCHING.

	### Scores
	- Mechanism: 8/10
	- Clinical: 7/10

	### Reasoning
	Strong evidence for testosterone-AR pathway...
	"""
	message.additional_properties = {
	"assessment": JudgeAssessment.model_dump()
	}
	```

	#### State Access

	\| Operation \| Key \| Type \| Description \|
	\|-----------\|-----\|------\|-------------\|
	\| READ \| Evidence from context \| list[Evidence] \| Passed by Manager \|
	\| WRITE \| None \| - \| Read-only evaluation \|

	#### Side Effects
	- None (pure evaluation)

	#### Critical Output Signal
	- `"✅ SUFFICIENT EVIDENCE"` → Manager delegates to ReportAgent
	- `"❌ INSUFFICIENT"` → Manager calls SearchAgent with suggested queries

	---

	### HypothesisAgent

	Factory: `create_hypothesis_agent(chat_client, domain, api_key) -> ChatAgent`

	#### Input
	```python
	# Manager instruction
	"Generate mechanistic hypotheses for: {query}"
	```

	#### Output
	```python
	# ChatMessage with:
	message.text = """
	## Hypothesis 1 (Confidence: 75%)
	Mechanism: Testosterone → Androgen Receptor → BDNF → Libido
	Suggested searches: testosterone BDNF, androgen receptor signaling

	## Primary Hypothesis
	Testosterone → AR → dopamine release → reward pathway

	## Knowledge Gaps
	- Dose-response relationship unclear
	"""
	message.additional_properties = {
	"assessment": HypothesisAssessment.model_dump()
	}
	```

	#### State Access

	\| Operation \| Key \| Type \| Description \|
	\|-----------\|-----\|------\|-------------\|
	\| READ \| `memory.query` \| str \| Research question \|
	\| READ \| Evidence from context \| list[Evidence] \| Current evidence \|
	\| WRITE \| `evidence_store["hypotheses"]` \| list \| Appends hypotheses \|

	---

	### ReportAgent

	Factory: `create_report_agent(chat_client, domain, api_key) -> ChatAgent`

	#### Input
	```python
	# Manager instruction
	"Generate final research report for: {query}"
	```

	#### Output
	```python
	# ChatMessage with:
	message.text = ResearchReport.to_markdown() # Full markdown report
	message.additional_properties = {
	"report": ResearchReport.model_dump()
	}
	```

	#### State Access

	\| Operation \| Key \| Type \| Description \|
	\|-----------\|-----\|------\|-------------\|
	\| READ \| `memory.get_all_evidence()` \| list[Evidence] \| All collected evidence \|
	\| READ \| `evidence_store["hypotheses"]` \| list \| Generated hypotheses \|
	\| READ \| `evidence_store["last_assessment"]` \| JudgeAssessment \| Final assessment \|
	\| WRITE \| `evidence_store["final_report"]` \| ResearchReport \| Stores report \|

	#### Tool: get_bibliography()
	```python
	@ai_function
	def get_bibliography() -> str:
	"""Returns formatted reference list from all evidence."""
	evidence = state.memory.get_all_evidence()
	return format_as_references(evidence)
	```

	---

	## Judge Decision Criteria

	### Scoring Dimensions

	Mechanism Score (0-10)

	\| Score \| Meaning \|
	\|-------\|---------\|
	\| 0-3 \| Minimal mechanism understanding \|
	\| 4-5 \| Partial mechanism (some targets identified) \|
	\| 6-7 \| Clear mechanism (targets + pathways) \|
	\| 8-9 \| Comprehensive (multiple pathways, regulation) \|
	\| 10 \| Complete understanding \|

	Clinical Evidence Score (0-10)

	\| Score \| Meaning \|
	\|-------\|---------\|
	\| 0-3 \| Preclinical only or weak human evidence \|
	\| 4-5 \| Some human evidence (small trials, case reports) \|
	\| 6-7 \| Strong human evidence (RCTs) \|
	\| 8-9 \| Robust (meta-analysis, large RCTs) \|
	\| 10 \| Definitive clinical proof \|

	### Sufficiency Decision

	```python
	# SUFFICIENT (recommendation="synthesize")
	if (
	confidence >= 0.7 # 70%
	and mechanism_score >= 6
	and clinical_evidence_score >= 6
	):
	sufficient = True
	recommendation = "synthesize"

	# INSUFFICIENT (recommendation="continue")
	else:
	sufficient = False
	recommendation = "continue"
	next_search_queries = ["suggested query 1", "suggested query 2"]
	```

	### JudgeAssessment Model

	```python
	class JudgeAssessment(BaseModel):
	details: AssessmentDetails
	mechanism_score: int # 0-10
	mechanism_reasoning: str # min 10 chars
	clinical_evidence_score: int # 0-10
	clinical_reasoning: str # min 10 chars
	drug_candidates: list[str]
	key_findings: list[str]

	sufficient: bool # Ready for synthesis?
	confidence: float # 0.0-1.0
	recommendation: Literal["continue", "synthesize"]
	next_search_queries: list[str] # If continue
	reasoning: str # min 20 chars
	```

	---

	## Shared State (ResearchMemory)

	### Initialization

	```python
	# Per-query isolation via ContextVar
	state = init_magentic_state(query, embedding_service)
	# Returns MagenticState wrapping ResearchMemory
	```

	### Memory Structure

	```python
	class ResearchMemory:
	query: str # Research question
	hypotheses: list[Hypothesis] # Generated hypotheses
	conflicts: list[Conflict] # Detected conflicts
	evidence_ids: list[str] # URLs (unique keys)
	_evidence_cache: dict[str, Evidence] # URL -> Evidence
	iteration_count: int # Current iteration
	_embedding_service: EmbeddingServiceProtocol
	```

	### Key Methods

	\| Method \| Returns \| Description \|
	\|--------\|---------\|-------------\|
	\| `store_evidence(evidence)` \| `list[str]` \| Store with dedup, return new IDs \|
	\| `get_all_evidence()` \| `list[Evidence]` \| All accumulated evidence \|
	\| `get_relevant_evidence(n)` \| `list[Evidence]` \| Top N by semantic similarity \|
	\| `get_context_summary()` \| `str` \| Markdown summary for fallback \|
	\| `add_hypothesis(h)` \| `None` \| Append hypothesis \|
	\| `get_confirmed_hypotheses()` \| `list[Hypothesis]` \| Confidence > 0.8 \|

	### State Flow

	```
	User Query
	│
	▼
	┌─────────────────────────────────────────────────────────────┐
	│ ResearchMemory initialized (empty) │
	└─────────────────────────────────────────────────────────────┘
	│
	▼
	SearchAgent ──▶ store_evidence([Evidence]) ──▶ evidence_ids grows
	│
	▼
	JudgeAgent ──▶ reads evidence from context ──▶ returns assessment
	│
	├─── INSUFFICIENT ──▶ SearchAgent (with next_search_queries)
	│
	└─── SUFFICIENT ──▶ ReportAgent
	│
	▼
	get_all_evidence() ──▶ ResearchReport
	```

	---

	## Tool Contracts

	### search_pubmed

	File: `src/agents/tools.py`

	```python
	@ai_function
	async def search_pubmed(query: str, max_results: int = 10) -> str:
	"""Search PubMed for biomedical research papers."""
	```

	\| Aspect \| Value \|
	\|--------\|-------\|
	\| External API \| NCBI E-utilities \|
	\| Rate Limit \| 3/sec (10/sec with NCBI_API_KEY) \|
	\| Output \| Formatted string with titles/abstracts \|
	\| Side Effect \| Stores Evidence in memory \|

	### search_clinical_trials

	```python
	@ai_function
	async def search_clinical_trials(query: str, max_results: int = 10) -> str:
	"""Search ClinicalTrials.gov for clinical studies."""
	```

	\| Aspect \| Value \|
	\|--------\|-------\|
	\| External API \| ClinicalTrials.gov (uses `requests` not httpx) \|
	\| Rate Limit \| Standard HTTP limits \|
	\| Output \| Trial status, conditions, interventions \|
	\| Side Effect \| Stores Evidence in memory \|

	### search_preprints

	```python
	@ai_function
	async def search_preprints(query: str, max_results: int = 10) -> str:
	"""Search Europe PMC for preprints and papers."""
	```

	\| Aspect \| Value \|
	\|--------\|-------\|
	\| External API \| Europe PMC REST API \|
	\| Output \| Papers with PMIDs, DOIs \|
	\| Side Effect \| Stores Evidence in memory \|

	### get_bibliography

	```python
	@ai_function
	def get_bibliography() -> str:
	"""Get formatted reference list from all collected evidence."""
	```

	\| Aspect \| Value \|
	\|--------\|-------\|
	\| External API \| None \|
	\| Reads \| `memory.get_all_evidence()` \|
	\| Output \| Numbered reference list \|

	### search_web

	```python
	@ai_function
	async def search_web(query: str, max_results: int = 10) -> str:
	"""Search web using DuckDuckGo."""
	```

	\| Aspect \| Value \|
	\|--------\|-------\|
	\| External API \| DuckDuckGo \|
	\| Output \| Web results with URLs \|
	\| Side Effect \| Stores Evidence in memory \|

	---

	## Event Flow

	### AgentEvent Types

	\| Type \| When Emitted \| Data \|
	\|------\|--------------\|------\|
	\| `started` \| Workflow begins \| None \|
	\| `thinking` \| Before first agent event \| None \|
	\| `searching` \| SearchAgent active \| agent_id \|
	\| `search_complete` \| SearchAgent done \| evidence count \|
	\| `judging` \| JudgeAgent active \| agent_id \|
	\| `judge_complete` \| JudgeAgent done \| assessment \|
	\| `hypothesizing` \| HypothesisAgent active \| agent_id \|
	\| `synthesizing` \| ReportAgent active \| agent_id \|
	\| `streaming` \| Real-time text \| text, agent_id \|
	\| `complete` \| Workflow done \| report, iterations \|
	\| `error` \| Error occurred \| error message \|
	\| `progress` \| Status update \| status message \|

	### Typical Sequence

	```
	1. started → "Starting research..."
	2. progress → "Loading embedding service..."
	3. thinking → "Multi-agent reasoning..."
	4. streaming (searcher) → "Found 15 sources..."
	5. streaming (judge) → "✅ SUFFICIENT..."
	6. streaming (reporter) → "## Research Report..."
	7. complete → Final report
	```

	---

	## Break Conditions

	The orchestrator exits when ANY of these occur:

	### 1. Judge Approval ✅

	```python
	if "SUFFICIENT EVIDENCE" in judge_response:
	# Manager delegates to ReportAgent
	# ReportAgent completes → Workflow ends
	```

	### 2. Max Rounds Reached 🔄

	```python
	# MagenticBuilder config
	max_round_count = 5 # Default

	# After 5 manager rounds:
	if not reporter_ran:
	# Force fallback synthesis
	async for event in _synthesize_fallback(iteration, "max_rounds"):
	yield event
	```

	### 3. Timeout ⏱️

	```python
	try:
	async with asyncio.timeout(settings.advanced_timeout): # 600s default
	async for event in workflow.run_stream(task):
	yield event
	except TimeoutError:
	async for event in _synthesize_fallback(iteration, "timeout"):
	yield event
	```

	### 4. Token Budget 💾

	```python
	# Implicit via PydanticAI/LLM client
	# ~50K tokens per query (from settings)
	# Individual agent calls handle retries
	```

	---

	## Dependency Matrix

	### "If I change X, what breaks?"

	\| Changed Component \| Affected Components \| Impact \|
	\|-------------------\|---------------------\|--------\|
	\| Evidence model \| All agents, Memory, Tools \| HIGH - Core data type \|
	\| JudgeAssessment \| Judge, Orchestrator \| HIGH - Decision flow \|
	\| ResearchMemory \| All agents \| HIGH - Shared state \|
	\| search_pubmed \| SearchAgent \| MEDIUM - One tool \|
	\| get_bibliography \| ReportAgent \| MEDIUM - References \|
	\| AgentEvent \| Orchestrator, UI \| MEDIUM - Streaming \|
	\| EmbeddingService \| Memory, Dedup \| MEDIUM - Similarity \|
	\| Judge thresholds \| Workflow loop count \| LOW - Tuning \|
	\| System prompts \| Agent behavior \| LOW - Prompt eng \|

	### Agent Dependencies

	```
	SearchAgent
	├── REQUIRES: MagenticState, EmbeddingService
	├── WRITES TO: ResearchMemory (evidence)
	└── NO DEPS ON: Other agents

	JudgeAgent
	├── REQUIRES: Evidence context (from Manager)
	├── WRITES TO: Nothing
	└── CONTROLS: SearchAgent (continue) or ReportAgent (synthesize)

	HypothesisAgent
	├── REQUIRES: Evidence context
	├── WRITES TO: evidence_store["hypotheses"]
	└── NO DEPS ON: Other agents

	ReportAgent
	├── REQUIRES: ResearchMemory, hypotheses, assessment
	├── READS FROM: All prior state
	└── WRITES TO: evidence_store["final_report"]
	```

	---

	## Critical Thresholds

	\| Threshold \| Value \| Location \| Impact \|
	\|-----------\|-------\|----------\|--------\|
	\| Confidence threshold \| 0.7 (70%) \| JudgeAssessment \| Sufficiency decision \|
	\| Mechanism score threshold \| 6 \| Judge criteria \| Sufficiency decision \|
	\| Clinical score threshold \| 6 \| Judge criteria \| Sufficiency decision \|
	\| Max manager rounds \| 5 \| AdvancedOrchestrator \| Loop termination \|
	\| Max stall count \| 3 \| MagenticBuilder \| Stall detection \|
	\| Dedup similarity \| 0.9 \| EmbeddingService \| Evidence dedup \|
	\| Max evidence for judge \| 30 \| prompts/judge.py \| Context limit \|
	\| Confirmed hypothesis \| 0.8 \| ResearchMemory \| High-confidence filter \|
	\| Timeout \| 600s \| settings.advanced_timeout \| Workflow timeout \|

	---

	## Developer Checklist

	When modifying agents:

	- [ ] Update this document if contracts change
	- [ ] Verify state access (read/write) is correct
	- [ ] Check tool side effects
	- [ ] Test with `make check`
	- [ ] Verify event emission

	When adding new agents:

	- [ ] Create factory function in `magentic_agents.py`
	- [ ] Define input/output contract
	- [ ] Document state access
	- [ ] Add to Agent Inventory table
	- [ ] Update Dependency Matrix

	When changing Judge criteria:

	- [ ] Update JudgeAssessment model
	- [ ] Update Critical Thresholds table
	- [ ] Test workflow loop behavior
	- [ ] Verify fallback synthesis triggers correctly

	---

	This document is the source of truth for multi-agent coordination.