	# Critical Analysis: Search Tools - Limitations, Gaps, and Improvements

	Date: November 2025
	Purpose: Honest assessment of all search tools to identify what works, what is broken, and what needs improvement, without adding more sources (horizontal sprawl).

	---

	## Executive Summary

	DeepBoner currently has 4 search tools:
	1. PubMed (NCBI E-utilities)
	2. ClinicalTrials.gov (API v2)
	3. Europe PMC (includes preprints)
	4. OpenAlex (citation-aware)

	Overall Assessment: Tools are functional but have significant gaps in:
	- Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
	- Full-text retrieval (only abstracts currently)
	- Citation graph traversal (OpenAlex has data but we don't use it)
	- Query optimization (basic synonym expansion, no MeSH term mapping)

	---

	## Tool 1: PubMed (NCBI E-utilities)

	File: `src/tools/pubmed.py`

	### What It Does Well
	| Feature | Status | Notes |
	|---------|--------|-------|
	| Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
	| Retry logic | ✅ | tenacity with exponential backoff |
	| Query preprocessing | ✅ | Strips question words, expands synonyms |
	| Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |

	### Limitations (API-Level)
	| Limitation | Severity | Workaround Possible? |
	|------------|----------|---------------------|
	| 10,000 result cap per query | Medium | Yes - use date ranges to paginate |
	| Abstracts only (no full text) | High | No - full text requires PMC or publisher |
	| No citation counts | Medium | Yes - cross-reference with OpenAlex |
	| Rate limit (10/sec max) | Low | Already handled |

	### Current Implementation Gaps
	```python
	# GAP 1: No MeSH term expansion
	# Current: expand_synonyms() uses hardcoded dict
	# Better: Use NCBI's E-utilities to get MeSH terms for query

	# GAP 2: No date filtering
	# Current: Gets whatever PubMed returns (biased toward recent)
	# Better: Add date range parameter for historical research

	# GAP 3: No publication type filtering
	# Current: Returns all types (reviews, case reports, RCTs)
	# Better: Filter for RCTs and systematic reviews when appropriate
	```

	### Priority Improvements
	1. HIGH: Add publication type filter (Reviews, RCTs, Meta-analyses)
	2. MEDIUM: Add date range parameter
	3. LOW: MeSH term expansion via E-utilities
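
The first two improvements can be expressed purely as query construction. The sketch below is one possible shape, not the code in `src/tools/pubmed.py`: the function names and the `HIGH_EVIDENCE_TYPES` tuple are hypothetical, while `[pt]`, `datetype`, `mindate`, and `maxdate` are standard E-utilities `esearch` syntax.

```python
from typing import Optional

# PubMed publication-type labels, matched with the [pt] field tag.
HIGH_EVIDENCE_TYPES = (
    "Randomized Controlled Trial",
    "Meta-Analysis",
    "Systematic Review",
)


def build_pubmed_term(query: str, pub_types: tuple[str, ...] = ()) -> str:
    """AND the user query with an OR-group of publication types."""
    if not pub_types:
        return query
    type_clause = " OR ".join(f'"{pt}"[pt]' for pt in pub_types)
    return f"({query}) AND ({type_clause})"


def build_esearch_params(
    query: str,
    pub_types: tuple[str, ...] = (),
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
    retmax: int = 20,
) -> dict[str, str]:
    """Build esearch parameters; the caller adds api_key and sends the request."""
    params = {
        "db": "pubmed",
        "term": build_pubmed_term(query, pub_types),
        "retmode": "json",
        "retmax": str(retmax),
    }
    if min_year is not None or max_year is not None:
        # esearch expects mindate and maxdate together, so fill the open end.
        params["datetype"] = "pdat"  # filter on publication date
        params["mindate"] = str(min_year if min_year is not None else 1800)
        params["maxdate"] = str(max_year if max_year is not None else 3000)
    return params
```

Keeping the filter in the `term` string means the existing rate limiter and retry logic need no changes; only the parameters passed to `esearch` differ.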

	---

	## Tool 2: ClinicalTrials.gov

	File: `src/tools/clinicaltrials.py`

	### What It Does Well
	| Feature | Status | Notes |
	|---------|--------|-------|
	| API v2 usage | ✅ | Modern API, not deprecated v1 |
	| Interventional filter | ✅ | Only gets drug/treatment studies |
	| Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
	| httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |

	### Limitations (API-Level)
	| Limitation | Severity | Workaround Possible? |
	|------------|----------|---------------------|
	| No results data | High | Yes - available via different endpoint |
	| No outcome measures | High | Yes - add to FIELDS list |
	| No adverse events | Medium | Yes - separate API call |
	| Sparse drug mechanism data | Medium | No - not in API |

	### Current Implementation Gaps
	```python
	# GAP 1: Missing critical fields
	FIELDS: ClassVar[list[str]] = [
	    "NCTId",
	    "BriefTitle",
	    "Phase",
	    "OverallStatus",
	    "Condition",
	    "InterventionName",
	    "StartDate",
	    "BriefSummary",
	    # MISSING:
	    # "PrimaryOutcome",
	    # "SecondaryOutcome",
	    # "ResultsFirstSubmitDate",
	    # "StudyResults",  # Whether results are posted
	]

	# GAP 2: No results retrieval
	# Many completed trials have posted results
	# We could get actual efficacy data, not just trial existence

	# GAP 3: No linked publications
	# Trials often link to PubMed articles with results
	# We could follow these links for richer evidence
	```

	### Priority Improvements
	1. HIGH: Add outcome measures to FIELDS
	2. HIGH: Check for and retrieve posted results
	3. MEDIUM: Follow linked publications (NCT → PMID)
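
Once the extra fields are requested, the response still has to be flattened into something the evidence model can use. The sketch below assumes the nested JSON layout that API v2 returns as I understand it (`protocolSection.outcomesModule`, `protocolSection.referencesModule`, top-level `hasResults`); those key names should be checked against a live response. `summarize_outcomes` and the sample study are hypothetical.

```python
from typing import Any


def summarize_outcomes(study: dict[str, Any]) -> dict[str, Any]:
    """Pull outcome measures, results flag, and linked PMIDs from one study record."""
    protocol = study.get("protocolSection", {})
    outcomes = protocol.get("outcomesModule", {})
    references = protocol.get("referencesModule", {}).get("references", [])
    return {
        "nct_id": protocol.get("identificationModule", {}).get("nctId"),
        # True when the sponsor has posted results for the trial.
        "has_results": bool(study.get("hasResults", False)),
        "primary_outcomes": [
            o.get("measure", "") for o in outcomes.get("primaryOutcomes", [])
        ],
        "secondary_outcomes": [
            o.get("measure", "") for o in outcomes.get("secondaryOutcomes", [])
        ],
        # NCT -> PMID links, when the record lists publications.
        "linked_pmids": [r["pmid"] for r in references if r.get("pmid")],
    }


# Made-up record for illustration only; not real trial data.
SAMPLE_STUDY = {
    "hasResults": True,
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000"},
        "outcomesModule": {
            "primaryOutcomes": [{"measure": "Change in desire score at week 24"}],
        },
        "referencesModule": {
            "references": [{"pmid": "12345678"}, {"citation": "No PMID here"}],
        },
    },
}
```

Every lookup uses `.get` with a default because many trial records omit whole modules; a missing outcomes section yields empty lists rather than an exception.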

	---

	## Tool 3: Europe PMC

	File: `src/tools/europepmc.py`

	### What It Does Well
	| Feature | Status | Notes |
	|---------|--------|-------|
	| Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
	| Preprint labeling | ✅ | `[PREPRINT - Not peer-reviewed]` marker |
	| DOI/PMID fallback URLs | ✅ | Smart URL construction |
	| Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |

	### Limitations (API-Level)
	| Limitation | Severity | Workaround Possible? |
	|------------|----------|---------------------|
	| No full text for most articles | High | Partial - CC-licensed available after 14 days |
	| Citation data limited | Medium | Only journal articles, not preprints |
	| Preprint-publication linking gaps | Medium | ~50% of links missing per Crossref |
	| License info sometimes missing | Low | Manual review required |

	### Current Implementation Gaps
	```python
	# GAP 1: No full-text retrieval
	# Europe PMC has full text for many CC-licensed articles
	# Could retrieve full text XML via separate endpoint

	# GAP 2: Massive overlap with PubMed
	# Europe PMC indexes all of PubMed/MEDLINE
	# We're getting duplicates with no deduplication

	# GAP 3: No citation network
	# Europe PMC has "citedByCount" but we don't use it
	# Could prioritize highly-cited preprints
	```

	### Priority Improvements
	1. HIGH: Add deduplication with PubMed (by PMID)
	2. MEDIUM: Retrieve citation counts for ranking
	3. LOW: Full-text retrieval for CC-licensed articles
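
Improvements 1 and 2 can be combined in one pass over the Europe PMC results. This is a sketch under assumptions: it relies on the `pmid`, `source`, and `citedByCount` keys of the Europe PMC JSON search response (with `source == "PPR"` marking preprints), reuses the 0.9 / 0.75 weights from the table above, and uses a simple "weight first, citations second" ordering that is only one possible policy. `rank_europepmc` is a hypothetical name.

```python
from typing import Any, Iterable

PUBLISHED_WEIGHT = 0.9   # relevance weight for peer-reviewed records
PREPRINT_WEIGHT = 0.75   # preprints are weighted lower


def rank_europepmc(
    records: Iterable[dict[str, Any]], seen_pmids: set[str]
) -> list[dict[str, Any]]:
    """Drop records PubMed already returned, then rank what is left."""
    fresh = [
        r for r in records
        if not (r.get("pmid") and str(r["pmid"]) in seen_pmids)
    ]

    def sort_key(record: dict[str, Any]) -> tuple[float, int]:
        is_preprint = record.get("source") == "PPR"
        weight = PREPRINT_WEIGHT if is_preprint else PUBLISHED_WEIGHT
        cited = int(record.get("citedByCount") or 0)
        return (weight, cited)

    # Highest weight first; citation count breaks ties within each group.
    return sorted(fresh, key=sort_key, reverse=True)
```

Preprints have no PMID, so they always survive the PMID filter; that matches the reason Europe PMC is queried at all.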

	---

	## Tool 4: OpenAlex

	File: `src/tools/openalex.py`

	### What It Does Well
	| Feature | Status | Notes |
	|---------|--------|-------|
	| Citation counts | ✅ | Sorted by `cited_by_count:desc` |
	| Abstract reconstruction | ✅ | Handles inverted index format |
	| Concept extraction | ✅ | Hierarchical classification |
	| Open access detection | ✅ | `is_oa` and `pdf_url` |
	| Polite pool | ✅ | mailto for 100k/day limit |
	| Rich metadata | ✅ | Best metadata of all tools |
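
For readers unfamiliar with the inverted index format: OpenAlex returns abstracts as `abstract_inverted_index`, a mapping from each word to the positions where it occurs. The sketch below shows the general technique for rebuilding the text; it is an illustration, not the implementation in `src/tools/openalex.py`, and `reconstruct_abstract` is a hypothetical name.

```python
from typing import Optional


def reconstruct_abstract(inverted_index: Optional[dict[str, list[int]]]) -> str:
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        # Many works have no abstract; OpenAlex returns null in that case.
        return ""
    positions = [
        (pos, word)
        for word, indexes in inverted_index.items()
        for pos in indexes
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positions))
```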

	### Limitations (API-Level)
	| Limitation | Severity | Workaround Possible? |
	|------------|----------|---------------------|
	| Author truncation at 100 | Low | Only affects mega-author papers |
	| No full text | High | No - OpenAlex is metadata only |
	| Stale data (1-2 day lag) | Low | Acceptable for research |

	### Current Implementation Gaps
	```python
	# GAP 1: No citation graph traversal
	# OpenAlex exposes citing works (`cited_by_api_url`, or the `cites:` filter) and references (`referenced_works`)
	# We could find seminal papers by following citation chains

	# GAP 2: No related works
	# OpenAlex has ML-powered "related_works" field
	# Could expand search to similar papers

	# GAP 3: No concept filtering
	# OpenAlex has hierarchical concepts
	# Could filter for specific domains (e.g., "Sexual health" concept)

	# GAP 4: Overlap with PubMed
	# OpenAlex indexes most of PubMed
	# More duplicates without deduplication
	```

	### Priority Improvements
	1. HIGH: Add citation graph traversal (find seminal papers)
	2. HIGH: Add deduplication with PubMed/Europe PMC
	3. MEDIUM: Use `related_works` for query expansion
	4. LOW: Concept-based filtering

	---

	## Cross-Tool Issues

	### Issue 1: MASSIVE DUPLICATION

	```
	PubMed: 36M+ articles
	Europe PMC: Indexes ALL of PubMed + preprints
	OpenAlex: 250M+ works (includes PubMed)

	Current behavior: All 3 can return the same papers, with no cross-source deduplication
	Result: Duplicate evidence, wasted tokens, inflated counts
	```

	Solution: Deduplication by PMID/DOI
	```python
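	# Proposed: Add to SearchHandler
	def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
	    """Keep the first occurrence of each paper, keyed by PMID or DOI."""
	    seen_ids: set[str] = set()
	    unique: list[Evidence] = []
	    for e in evidence_list:
	        # Extract PMID or DOI from URL (extract_paper_id is a helper still to be written)
	        paper_id = extract_paper_id(e.citation.url)
	        if paper_id is None:
	            # No recognizable ID: keep the item rather than risk dropping distinct papers
	            unique.append(e)
	        elif paper_id not in seen_ids:
	            seen_ids.add(paper_id)
	            unique.append(e)
	    return unique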
	```
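
The proposal depends on `extract_paper_id`, which is not defined anywhere in this document. One hypothetical shape is sketched below, assuming evidence URLs point at `pubmed.ncbi.nlm.nih.gov` or `doi.org`; other URL patterns would need their own branches. It returns `None` when no identifier is recognized, so callers should keep such items rather than treat `None` as a shared key.

```python
import re
from typing import Optional
from urllib.parse import unquote, urlparse

_PMID_PATH = re.compile(r"^/(\d+)/?$")
_DOI_PATH = re.compile(r"^/(10\.\d{4,9}/\S+)$")


def extract_paper_id(url: str) -> Optional[str]:
    """Return 'pmid:<n>' or 'doi:<doi>' for known URL shapes, else None."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = unquote(parsed.path)
    if host.endswith("pubmed.ncbi.nlm.nih.gov"):
        match = _PMID_PATH.match(path)
        return f"pmid:{match.group(1)}" if match else None
    if host in ("doi.org", "dx.doi.org"):
        match = _DOI_PATH.match(path)
        # DOIs are case-insensitive, so normalize before comparing.
        return f"doi:{match.group(1).lower()}" if match else None
    return None
```

A URL-derived key only catches duplicates when both sources expose the same identifier type. If PubMed yields a PMID URL and OpenAlex yields a DOI URL for the same paper, the two will not match; closing that gap would require carrying both PMID and DOI on each evidence item, which the current proposal does not do.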

	### Issue 2: NO FULL-TEXT RETRIEVAL

	All tools return abstracts only. For deep research, this is limiting.

	What's Actually Possible:
	| Source | Full Text Access | How |
	|--------|------------------|-----|
	| PubMed Central (PMC) | Yes, for OA articles | Separate API: `efetch` with `db=pmc` |
	| Europe PMC | Yes, CC-licensed after 14 days | `/{id}/fullTextXML` endpoint |
	| OpenAlex | No | Metadata only |
	| Unpaywall | Yes, OA link discovery | Separate API |

	Recommendation: Add PMC full-text retrieval for open access articles.
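
As a starting point, the two full-text routes differ only in URL shape. The sketch below builds both, assuming the Europe PMC REST path `{PMCID}/fullTextXML` and the E-utilities `efetch` parameters `db=pmc`, `id`, and `retmode` as I understand the public documentation. The function names are hypothetical, and neither call guarantees that a given article actually has open full text.

```python
from typing import Optional
from urllib.parse import urlencode

EUROPE_PMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def _normalize_pmcid(pmcid: str) -> str:
    """Accept '3257301', 'pmc3257301', or 'PMC3257301' and return 'PMC3257301'."""
    value = pmcid.strip().upper()
    return value if value.startswith("PMC") else f"PMC{value}"


def europepmc_fulltext_url(pmcid: str) -> str:
    """Full-text XML from Europe PMC for an open access article."""
    return f"{EUROPE_PMC_REST}/{_normalize_pmcid(pmcid)}/fullTextXML"


def pmc_efetch_url(pmcid: str, api_key: Optional[str] = None) -> str:
    """Full-text XML from PubMed Central via efetch (numeric ID, no 'PMC' prefix)."""
    params = {"db": "pmc", "id": _normalize_pmcid(pmcid)[3:], "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EFETCH_URL}?{urlencode(params)}"
```

Both requests should go through the existing shared rate limiter, since `efetch` counts against the same NCBI quota as the search calls.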

	### Issue 3: NO CITATION GRAPH

	OpenAlex has rich citation data but we only use `cited_by_count` for sorting.

	Untapped Capabilities:
	- Citing works (`cited_by_api_url`, or the `cites:` filter): Find papers that cite a key paper
	- `referenced_works` (or the `cited_by:` filter): Find sources a paper cites
	- `related_works`: ML-powered similar papers

	Use Case: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:
	- Papers that cite it (newer evidence)
	- Papers it cites (foundational research)
	- Related papers (similar topics)
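
Automatic expansion like this needs hard limits, or one seminal paper pulls in thousands of neighbors. The sketch below is a bounded breadth-first walk that is independent of any API: `fetch_neighbors` is a hypothetical callback that would wrap the OpenAlex lookups, and `max_depth` / `max_nodes` are illustrative defaults rather than tuned values.

```python
from collections import deque
from typing import Callable, Iterable


def traverse_citations(
    seed_id: str,
    fetch_neighbors: Callable[[str], Iterable[str]],
    max_depth: int = 1,
    max_nodes: int = 50,
) -> list[str]:
    """Return work IDs reachable from seed_id, nearest first, within the limits."""
    seen = {seed_id}
    found: list[str] = []
    queue = deque([(seed_id, 0)])
    while queue and len(found) < max_nodes:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in fetch_neighbors(node):
            if neighbor in seen:
                continue  # already visited via another path
            seen.add(neighbor)
            found.append(neighbor)
            queue.append((neighbor, depth + 1))
            if len(found) >= max_nodes:
                break
    return found


# Toy graph standing in for API responses; IDs are made up.
TOY_GRAPH = {"W1": ["W2", "W3"], "W2": ["W4"], "W3": ["W2"], "W4": []}
one_hop = traverse_citations("W1", lambda w: TOY_GRAPH.get(w, []))
```

With `max_depth=1` the walk returns only direct neighbors of the seed, which is enough for the use case above and costs one API call per direction.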

	---

	## What's NOT Possible (API Constraints)

	| Feature | Why Not Possible |
	|---------|------------------|
	| bioRxiv direct search | No keyword search API, only RSS feed of latest |
	| arXiv search | API exists but irrelevant for sexual health |
	| PubMed full text | Requires publisher access or PMC |
	| Real-time trial results | ClinicalTrials.gov results are static snapshots |
	| Drug mechanism data | Not in any API - would need ChEMBL or DrugBank |

	---

	## Recommended Improvements (Priority Order)

	### Phase 1: Fix Fundamentals (High ROI)
	1. Deduplication - Stop returning the same paper 3 times
	2. Outcome measures in ClinicalTrials - Get actual efficacy data
	3. Citation counts from all sources - Rank by influence, not recency

	### Phase 2: Depth Improvements (Medium ROI)
	4. PMC full-text retrieval - Get full papers for OA articles
	5. Citation graph traversal - Find seminal papers automatically
	6. Publication type filtering - Prioritize RCTs and meta-analyses

	### Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)
	7. MeSH term expansion - Better PubMed queries
	8. Related works expansion - Use OpenAlex ML similarity
	9. Date range filtering - Historical vs recent research

	---

	## Neo4j Integration (Future Consideration)

	Question: Should we add Neo4j for citation graph storage?

	Answer: Not yet. Here's why:

	| Approach | Complexity | Value |
	|----------|------------|-------|
	| OpenAlex API for citation traversal | Low | High |
	| Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
	| Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |

	Recommendation: Use OpenAlex API for citation traversal first. Only add Neo4j if:
	1. We need to do complex graph queries (PageRank on citations, community detection)
	2. We need offline access to citation data
	3. We're hitting OpenAlex rate limits

	---

	## Summary: What's Broken vs What's Working

	### Working Well
	- Basic search across all 4 sources
	- Rate limiting and retry logic
	- Query preprocessing
	- Evidence model with citations

	### Needs Fixing (Current Scope)
	- Deduplication (critical)
	- Outcome measures in ClinicalTrials (critical)
	- Citation-based ranking (important)

	### Future Enhancements (Out of Current Scope)
	- Full-text retrieval
	- Citation graph traversal
	- Neo4j integration
	- Drug mechanism data (would need new data sources)

	---

	## Sources

	- [NCBI E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
	- [NCBI Rate Limits](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
	- [OpenAlex API Docs](https://docs.openalex.org/)
	- [OpenAlex Limitations](https://docs.openalex.org/api-entities/authors/limitations)
	- [Europe PMC RESTful API](https://europepmc.org/RestfulWebService)
	- [Europe PMC Preprints](https://pmc.ncbi.nlm.nih.gov/articles/PMC11426508/)
	- [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api)