A newer version of the Gradio SDK is available:
6.1.0
Agent-Tool-State Contract Registry
Status: Canonical Source of Truth Last Updated: 2025-12-06 Purpose: Developer reference for multi-agent coordination
This document defines the exact contracts between agents, tools, and shared state. Use this when:
- Adding new agents or tools
- Modifying agent behavior
- Debugging coordination issues
- Understanding "if I change X, what breaks?"
Table of Contents
- System Overview
- Agent Contracts
- Judge Decision Criteria
- Shared State (ResearchMemory)
- Tool Contracts
- Event Flow
- Break Conditions
- Dependency Matrix
System Overview
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ORCHESTRATOR (AdvancedOrchestrator) β
β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Manager ββββΆβ Agents ββββΆβ Memory β β
β β (Magentic) β β (ChatAgent) β β(ResearchMem)β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β β β β
β β βΌ βΌ β
β β βββββββββββββββ βββββββββββββββ β
β ββββββββββΆβ Tools ββββΆβ Embeddings β β
β β(@ai_function)β β (ChromaDB) β β
β βββββββββββββββ βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Agent Inventory
| Agent | File | Role | Tools |
|---|---|---|---|
| SearchAgent | magentic_agents.py |
Evidence gathering | search_pubmed, search_clinical_trials, search_preprints |
| JudgeAgent | magentic_agents.py |
Evidence evaluation | None (LLM only) |
| HypothesisAgent | magentic_agents.py |
Mechanism generation | None (LLM only) |
| ReportAgent | magentic_agents.py |
Report synthesis | get_bibliography |
| RetrievalAgent | retrieval_agent.py |
Web search | search_web |
β οΈ Dead Code Warning: RetrievalAgent is implemented but NOT wired into
magentic_agents.py. The orchestrator only uses SearchAgent (PubMed, ClinicalTrials, EuropePMC), not web search. See GitHub issue #134 for decision to delete or wire in.
Agent Contracts
SearchAgent
Factory: create_search_agent(chat_client, domain, api_key) -> ChatAgent
Input
# Manager instruction (string)
"Search for testosterone and libido mechanisms in peer-reviewed literature"
Output
# ChatMessage with:
message.text = """
Found 15 sources (12 new added to context):
- [Title 1](url): Abstract excerpt...
- [Title 2](url): Abstract excerpt...
"""
message.additional_properties = {
"evidence": [Evidence.model_dump(), ...]
}
State Access
| Operation | Key | Type | Description |
|---|---|---|---|
| READ | memory.query |
str | Current research question |
| READ | memory.evidence_ids |
list[str] | Existing evidence URLs |
| WRITE | memory._evidence_cache |
dict[str, Evidence] | Caches Evidence objects |
| WRITE | memory.evidence_ids |
list[str] | Appends new URLs |
| WRITE | embedding_service |
VectorDB | Stores embeddings |
Side Effects
- Calls external APIs (PubMed, ClinicalTrials, Europe PMC)
- Deduplicates via semantic similarity (0.9 threshold)
- Stores in vector database
Error Behavior
- API failure β Returns "No results found for: {query}"
- Rate limit β Raises
RateLimitError(caught by orchestrator)
JudgeAgent
Factory: create_judge_agent(chat_client, domain, api_key) -> ChatAgent
Input
# Manager instruction with evidence context
"Evaluate if we have sufficient evidence to answer: {query}"
# + Evidence list in context
Output
# ChatMessage with:
message.text = """
## Assessment
β
SUFFICIENT EVIDENCE (confidence: 85%). STOP SEARCHING.
### Scores
- Mechanism: 8/10
- Clinical: 7/10
### Reasoning
Strong evidence for testosterone-AR pathway...
"""
message.additional_properties = {
"assessment": JudgeAssessment.model_dump()
}
State Access
| Operation | Key | Type | Description |
|---|---|---|---|
| READ | Evidence from context | list[Evidence] | Passed by Manager |
| WRITE | None | - | Read-only evaluation |
Side Effects
- None (pure evaluation)
Critical Output Signal
"β SUFFICIENT EVIDENCE"β Manager delegates to ReportAgent"β INSUFFICIENT"β Manager calls SearchAgent with suggested queries
HypothesisAgent
Factory: create_hypothesis_agent(chat_client, domain, api_key) -> ChatAgent
Input
# Manager instruction
"Generate mechanistic hypotheses for: {query}"
Output
# ChatMessage with:
message.text = """
## Hypothesis 1 (Confidence: 75%)
**Mechanism**: Testosterone β Androgen Receptor β BDNF β Libido
**Suggested searches**: testosterone BDNF, androgen receptor signaling
## Primary Hypothesis
Testosterone β AR β dopamine release β reward pathway
## Knowledge Gaps
- Dose-response relationship unclear
"""
message.additional_properties = {
"assessment": HypothesisAssessment.model_dump()
}
State Access
| Operation | Key | Type | Description |
|---|---|---|---|
| READ | memory.query |
str | Research question |
| READ | Evidence from context | list[Evidence] | Current evidence |
| WRITE | evidence_store["hypotheses"] |
list | Appends hypotheses |
ReportAgent
Factory: create_report_agent(chat_client, domain, api_key) -> ChatAgent
Input
# Manager instruction
"Generate final research report for: {query}"
Output
# ChatMessage with:
message.text = ResearchReport.to_markdown() # Full markdown report
message.additional_properties = {
"report": ResearchReport.model_dump()
}
State Access
| Operation | Key | Type | Description |
|---|---|---|---|
| READ | memory.get_all_evidence() |
list[Evidence] | All collected evidence |
| READ | evidence_store["hypotheses"] |
list | Generated hypotheses |
| READ | evidence_store["last_assessment"] |
JudgeAssessment | Final assessment |
| WRITE | evidence_store["final_report"] |
ResearchReport | Stores report |
Tool: get_bibliography()
@ai_function
def get_bibliography() -> str:
"""Returns formatted reference list from all evidence."""
evidence = state.memory.get_all_evidence()
return format_as_references(evidence)
Judge Decision Criteria
Scoring Dimensions
Mechanism Score (0-10)
| Score | Meaning |
|---|---|
| 0-3 | Minimal mechanism understanding |
| 4-5 | Partial mechanism (some targets identified) |
| 6-7 | Clear mechanism (targets + pathways) |
| 8-9 | Comprehensive (multiple pathways, regulation) |
| 10 | Complete understanding |
Clinical Evidence Score (0-10)
| Score | Meaning |
|---|---|
| 0-3 | Preclinical only or weak human evidence |
| 4-5 | Some human evidence (small trials, case reports) |
| 6-7 | Strong human evidence (RCTs) |
| 8-9 | Robust (meta-analysis, large RCTs) |
| 10 | Definitive clinical proof |
Sufficiency Decision
# SUFFICIENT (recommendation="synthesize")
if (
confidence >= 0.7 # 70%
and mechanism_score >= 6
and clinical_evidence_score >= 6
):
sufficient = True
recommendation = "synthesize"
# INSUFFICIENT (recommendation="continue")
else:
sufficient = False
recommendation = "continue"
next_search_queries = ["suggested query 1", "suggested query 2"]
JudgeAssessment Model
class JudgeAssessment(BaseModel):
details: AssessmentDetails
mechanism_score: int # 0-10
mechanism_reasoning: str # min 10 chars
clinical_evidence_score: int # 0-10
clinical_reasoning: str # min 10 chars
drug_candidates: list[str]
key_findings: list[str]
sufficient: bool # Ready for synthesis?
confidence: float # 0.0-1.0
recommendation: Literal["continue", "synthesize"]
next_search_queries: list[str] # If continue
reasoning: str # min 20 chars
Shared State (ResearchMemory)
Initialization
# Per-query isolation via ContextVar
state = init_magentic_state(query, embedding_service)
# Returns MagenticState wrapping ResearchMemory
Memory Structure
class ResearchMemory:
query: str # Research question
hypotheses: list[Hypothesis] # Generated hypotheses
conflicts: list[Conflict] # Detected conflicts
evidence_ids: list[str] # URLs (unique keys)
_evidence_cache: dict[str, Evidence] # URL -> Evidence
iteration_count: int # Current iteration
_embedding_service: EmbeddingServiceProtocol
Key Methods
| Method | Returns | Description |
|---|---|---|
store_evidence(evidence) |
list[str] |
Store with dedup, return new IDs |
get_all_evidence() |
list[Evidence] |
All accumulated evidence |
get_relevant_evidence(n) |
list[Evidence] |
Top N by semantic similarity |
get_context_summary() |
str |
Markdown summary for fallback |
add_hypothesis(h) |
None |
Append hypothesis |
get_confirmed_hypotheses() |
list[Hypothesis] |
Confidence > 0.8 |
State Flow
User Query
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ResearchMemory initialized (empty) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
SearchAgent βββΆ store_evidence([Evidence]) βββΆ evidence_ids grows
β
βΌ
JudgeAgent βββΆ reads evidence from context βββΆ returns assessment
β
ββββ INSUFFICIENT βββΆ SearchAgent (with next_search_queries)
β
ββββ SUFFICIENT βββΆ ReportAgent
β
βΌ
get_all_evidence() βββΆ ResearchReport
Tool Contracts
search_pubmed
File: src/agents/tools.py
@ai_function
async def search_pubmed(query: str, max_results: int = 10) -> str:
"""Search PubMed for biomedical research papers."""
| Aspect | Value |
|---|---|
| External API | NCBI E-utilities |
| Rate Limit | 3/sec (10/sec with NCBI_API_KEY) |
| Output | Formatted string with titles/abstracts |
| Side Effect | Stores Evidence in memory |
search_clinical_trials
@ai_function
async def search_clinical_trials(query: str, max_results: int = 10) -> str:
"""Search ClinicalTrials.gov for clinical studies."""
| Aspect | Value |
|---|---|
| External API | ClinicalTrials.gov (uses requests not httpx) |
| Rate Limit | Standard HTTP limits |
| Output | Trial status, conditions, interventions |
| Side Effect | Stores Evidence in memory |
search_preprints
@ai_function
async def search_preprints(query: str, max_results: int = 10) -> str:
"""Search Europe PMC for preprints and papers."""
| Aspect | Value |
|---|---|
| External API | Europe PMC REST API |
| Output | Papers with PMIDs, DOIs |
| Side Effect | Stores Evidence in memory |
get_bibliography
@ai_function
def get_bibliography() -> str:
"""Get formatted reference list from all collected evidence."""
| Aspect | Value |
|---|---|
| External API | None |
| Reads | memory.get_all_evidence() |
| Output | Numbered reference list |
search_web
@ai_function
async def search_web(query: str, max_results: int = 10) -> str:
"""Search web using DuckDuckGo."""
| Aspect | Value |
|---|---|
| External API | DuckDuckGo |
| Output | Web results with URLs |
| Side Effect | Stores Evidence in memory |
Event Flow
AgentEvent Types
| Type | When Emitted | Data |
|---|---|---|
started |
Workflow begins | None |
thinking |
Before first agent event | None |
searching |
SearchAgent active | agent_id |
search_complete |
SearchAgent done | evidence count |
judging |
JudgeAgent active | agent_id |
judge_complete |
JudgeAgent done | assessment |
hypothesizing |
HypothesisAgent active | agent_id |
synthesizing |
ReportAgent active | agent_id |
streaming |
Real-time text | text, agent_id |
complete |
Workflow done | report, iterations |
error |
Error occurred | error message |
progress |
Status update | status message |
Typical Sequence
1. started β "Starting research..."
2. progress β "Loading embedding service..."
3. thinking β "Multi-agent reasoning..."
4. streaming (searcher) β "Found 15 sources..."
5. streaming (judge) β "β
SUFFICIENT..."
6. streaming (reporter) β "## Research Report..."
7. complete β Final report
Break Conditions
The orchestrator exits when ANY of these occur:
1. Judge Approval β
if "SUFFICIENT EVIDENCE" in judge_response:
# Manager delegates to ReportAgent
# ReportAgent completes β Workflow ends
2. Max Rounds Reached π
# MagenticBuilder config
max_round_count = 5 # Default
# After 5 manager rounds:
if not reporter_ran:
# Force fallback synthesis
async for event in _synthesize_fallback(iteration, "max_rounds"):
yield event
3. Timeout β±οΈ
try:
async with asyncio.timeout(settings.advanced_timeout): # 600s default
async for event in workflow.run_stream(task):
yield event
except TimeoutError:
async for event in _synthesize_fallback(iteration, "timeout"):
yield event
4. Token Budget πΎ
# Implicit via PydanticAI/LLM client
# ~50K tokens per query (from settings)
# Individual agent calls handle retries
Dependency Matrix
"If I change X, what breaks?"
| Changed Component | Affected Components | Impact |
|---|---|---|
| Evidence model | All agents, Memory, Tools | HIGH - Core data type |
| JudgeAssessment | Judge, Orchestrator | HIGH - Decision flow |
| ResearchMemory | All agents | HIGH - Shared state |
| search_pubmed | SearchAgent | MEDIUM - One tool |
| get_bibliography | ReportAgent | MEDIUM - References |
| AgentEvent | Orchestrator, UI | MEDIUM - Streaming |
| EmbeddingService | Memory, Dedup | MEDIUM - Similarity |
| Judge thresholds | Workflow loop count | LOW - Tuning |
| System prompts | Agent behavior | LOW - Prompt eng |
Agent Dependencies
SearchAgent
βββ REQUIRES: MagenticState, EmbeddingService
βββ WRITES TO: ResearchMemory (evidence)
βββ NO DEPS ON: Other agents
JudgeAgent
βββ REQUIRES: Evidence context (from Manager)
βββ WRITES TO: Nothing
βββ CONTROLS: SearchAgent (continue) or ReportAgent (synthesize)
HypothesisAgent
βββ REQUIRES: Evidence context
βββ WRITES TO: evidence_store["hypotheses"]
βββ NO DEPS ON: Other agents
ReportAgent
βββ REQUIRES: ResearchMemory, hypotheses, assessment
βββ READS FROM: All prior state
βββ WRITES TO: evidence_store["final_report"]
Critical Thresholds
| Threshold | Value | Location | Impact |
|---|---|---|---|
| Confidence threshold | 0.7 (70%) | JudgeAssessment | Sufficiency decision |
| Mechanism score threshold | 6 | Judge criteria | Sufficiency decision |
| Clinical score threshold | 6 | Judge criteria | Sufficiency decision |
| Max manager rounds | 5 | AdvancedOrchestrator | Loop termination |
| Max stall count | 3 | MagenticBuilder | Stall detection |
| Dedup similarity | 0.9 | EmbeddingService | Evidence dedup |
| Max evidence for judge | 30 | prompts/judge.py | Context limit |
| Confirmed hypothesis | 0.8 | ResearchMemory | High-confidence filter |
| Timeout | 600s | settings.advanced_timeout | Workflow timeout |
Developer Checklist
When modifying agents:
- Update this document if contracts change
- Verify state access (read/write) is correct
- Check tool side effects
- Test with
make check - Verify event emission
When adding new agents:
- Create factory function in
magentic_agents.py - Define input/output contract
- Document state access
- Add to Agent Inventory table
- Update Dependency Matrix
When changing Judge criteria:
- Update JudgeAssessment model
- Update Critical Thresholds table
- Test workflow loop behavior
- Verify fallback synthesis triggers correctly
This document is the source of truth for multi-agent coordination.