# Critical Analysis: Search Tools - Limitations, Gaps, and Improvements
**Date**: November 2025
**Purpose**: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl.
---
## Executive Summary
DeepBoner currently has **4 search tools**:
1. PubMed (NCBI E-utilities)
2. ClinicalTrials.gov (API v2)
3. Europe PMC (includes preprints)
4. OpenAlex (citation-aware)
**Overall Assessment**: Tools are functional but have significant gaps in:
- Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
- Full-text retrieval (only abstracts currently)
- Citation graph traversal (OpenAlex has data but we don't use it)
- Query optimization (basic synonym expansion, no MeSH term mapping)
---
## Tool 1: PubMed (NCBI E-utilities)
**File**: `src/tools/pubmed.py`
### What It Does Well
| Feature | Status | Notes |
|---------|--------|-------|
| Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
| Retry logic | ✅ | tenacity with exponential backoff |
| Query preprocessing | ✅ | Strips question words, expands synonyms |
| Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |
### Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **10,000 result cap per query** | Medium | Yes - use date ranges to paginate |
| **Abstracts only** (no full text) | High | No - full text requires PMC or publisher |
| **No citation counts** | Medium | Yes - cross-reference with OpenAlex |
| **Rate limit (10/sec max)** | Low | Already handled |
### Current Implementation Gaps
```python
# GAP 1: No MeSH term expansion
# Current: expand_synonyms() uses hardcoded dict
# Better: Use NCBI's E-utilities to get MeSH terms for query
# GAP 2: No date filtering
# Current: Gets whatever PubMed returns (biased toward recent)
# Better: Add date range parameter for historical research
# GAP 3: No publication type filtering
# Current: Returns all types (reviews, case reports, RCTs)
# Better: Filter for RCTs and systematic reviews when appropriate
```
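Gaps 2 and 3 could both be closed at the query-building stage. The following is a minimal sketch, not the current implementation: `build_esearch_params` and `PUB_TYPE_TERMS` are hypothetical names that do not exist in `src/tools/pubmed.py`. The `datetype`, `mindate`, and `maxdate` parameters and the `[pt]` field tag are standard ESearch/PubMed syntax.

```python
from typing import Optional

# Hypothetical mapping from short names to PubMed publication-type terms.
PUB_TYPE_TERMS = {
    "rct": "randomized controlled trial",
    "systematic_review": "systematic review",
    "meta_analysis": "meta-analysis",
    "review": "review",
}


def build_esearch_params(
    query: str,
    pub_types: Optional[list[str]] = None,
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
    max_results: int = 10,
) -> dict[str, str]:
    """Build ESearch parameters with optional publication-type and date filters."""
    term = query
    if pub_types:
        # OR the requested types together, then AND them onto the query.
        clause = " OR ".join(f'"{PUB_TYPE_TERMS[p]}"[pt]' for p in pub_types)
        term = f"({query}) AND ({clause})"

    params = {
        "db": "pubmed",
        "term": term,
        "retmax": str(max_results),
        "retmode": "json",
    }
    if min_year is not None or max_year is not None:
        # ESearch expects mindate and maxdate together; pdat = publication date.
        # 1800 and 3000 are arbitrary open-ended bounds chosen for this sketch.
        params["datetype"] = "pdat"
        params["mindate"] = str(min_year if min_year is not None else 1800)
        params["maxdate"] = str(max_year if max_year is not None else 3000)
    return params
```

Keeping the filters in a pure function means they can be unit-tested without hitting NCBI, and the existing rate limiter and retry logic stay untouched.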
### Priority Improvements
1. **HIGH**: Add publication type filter (Reviews, RCTs, Meta-analyses)
2. **MEDIUM**: Add date range parameter
3. **LOW**: MeSH term expansion via E-utilities
---
## Tool 2: ClinicalTrials.gov
**File**: `src/tools/clinicaltrials.py`
### What It Does Well
| Feature | Status | Notes |
|---------|--------|-------|
| API v2 usage | ✅ | Modern API, not deprecated v1 |
| Interventional filter | ✅ | Only gets drug/treatment studies |
| Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
| httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |
### Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **No results data** | High | Yes - available via different endpoint |
| **No outcome measures** | High | Yes - add to FIELDS list |
| **No adverse events** | Medium | Yes - separate API call |
| **Sparse drug mechanism data** | Medium | No - not in API |
### Current Implementation Gaps
```python
# GAP 1: Missing critical fields
FIELDS: ClassVar[list[str]] = [
    "NCTId",
    "BriefTitle",
    "Phase",
    "OverallStatus",
    "Condition",
    "InterventionName",
    "StartDate",
    "BriefSummary",
    # MISSING:
    # "PrimaryOutcome",
    # "SecondaryOutcome",
    # "ResultsFirstSubmitDate",
    # "StudyResults",  # Whether results are posted
]
# GAP 2: No results retrieval
# Many completed trials have posted results
# We could get actual efficacy data, not just trial existence
# GAP 3: No linked publications
# Trials often link to PubMed articles with results
# We could follow these links for richer evidence
```
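Once outcome fields are requested, they still have to be pulled out of the nested study record. Below is a minimal sketch assuming the API v2 JSON layout (`protocolSection.outcomesModule` plus a top-level `hasResults` flag). `summarize_outcomes` is a hypothetical helper, and the sample record is made up for illustration.

```python
from typing import Any


def summarize_outcomes(study: dict[str, Any]) -> dict[str, Any]:
    """Pull outcome measures and the results flag out of one API v2 study record."""
    protocol = study.get("protocolSection", {})
    outcomes = protocol.get("outcomesModule", {})

    def measures(key: str) -> list[str]:
        # Each outcome entry carries a "measure" and usually a "timeFrame".
        items = []
        for outcome in outcomes.get(key, []):
            measure = outcome.get("measure", "").strip()
            if not measure:
                continue
            time_frame = outcome.get("timeFrame")
            items.append(f"{measure} ({time_frame})" if time_frame else measure)
        return items

    return {
        "nct_id": protocol.get("identificationModule", {}).get("nctId"),
        "primary_outcomes": measures("primaryOutcomes"),
        "secondary_outcomes": measures("secondaryOutcomes"),
        "has_results": bool(study.get("hasResults", False)),
    }


# Trimmed, made-up record in the v2 shape (placeholder NCT ID, not a real trial).
sample_study = {
    "hasResults": True,
    "protocolSection": {
        "identificationModule": {"nctId": "NCT00000000"},
        "outcomesModule": {
            "primaryOutcomes": [
                {"measure": "Change in satisfying sexual events", "timeFrame": "24 weeks"}
            ],
            "secondaryOutcomes": [{"measure": "Adverse events"}],
        },
    },
}

summary = summarize_outcomes(sample_study)
```

The `has_results` flag is what would gate a follow-up call for posted results, so trials without results cost no extra requests.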
### Priority Improvements
1. **HIGH**: Add outcome measures to FIELDS
2. **HIGH**: Check for and retrieve posted results
3. **MEDIUM**: Follow linked publications (NCT → PMID)
---
## Tool 3: Europe PMC
**File**: `src/tools/europepmc.py`
### What It Does Well
| Feature | Status | Notes |
|---------|--------|-------|
| Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
| Preprint labeling | ✅ | `[PREPRINT - Not peer-reviewed]` marker |
| DOI/PMID fallback URLs | ✅ | Smart URL construction |
| Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |
### Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **No full text for most articles** | High | Partial - CC-licensed available after 14 days |
| **Citation data limited** | Medium | Only journal articles, not preprints |
| **Preprint-publication linking gaps** | Medium | ~50% of links missing per Crossref |
| **License info sometimes missing** | Low | Manual review required |
### Current Implementation Gaps
```python
# GAP 1: No full-text retrieval
# Europe PMC has full text for many CC-licensed articles
# Could retrieve full text XML via separate endpoint
# GAP 2: Massive overlap with PubMed
# Europe PMC indexes all of PubMed/MEDLINE
# We're getting duplicates with no deduplication
# GAP 3: No citation network
# Europe PMC has "citedByCount" but we don't use it
# Could prioritize highly-cited preprints
```
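For Gap 3, citation-based ranking could be a small post-processing step on the search results. This sketch assumes each result is a dict from the Europe PMC search response, with `citedByCount` and `source` fields (`"PPR"` marking preprints). `rank_by_citations` and the default `preprint_weight` are hypothetical. The 0.75 only echoes the preprint relevance weight mentioned above, and how the two scores should combine is an open question.

```python
from typing import Any


def rank_by_citations(
    results: list[dict[str, Any]], preprint_weight: float = 0.75
) -> list[dict[str, Any]]:
    """Sort results by citation count, discounting preprints by a fixed weight."""

    def score(result: dict[str, Any]) -> float:
        # Missing or null counts are treated as zero citations.
        cited = int(result.get("citedByCount") or 0)
        weight = preprint_weight if result.get("source") == "PPR" else 1.0
        return cited * weight

    # sorted() is stable, so ties keep the API's original relevance order.
    return sorted(results, key=score, reverse=True)


# Made-up results for illustration only.
sample_results = [
    {"id": "A", "source": "MED", "citedByCount": 10},
    {"id": "B", "source": "PPR", "citedByCount": 12},
    {"id": "C", "source": "MED"},
]
ranked_ids = [r["id"] for r in rank_by_citations(sample_results)]
```

Here the preprint `B` scores 12 × 0.75 = 9 and lands behind the journal article `A` at 10, which is the intended effect of the discount.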
### Priority Improvements
1. **HIGH**: Add deduplication with PubMed (by PMID)
2. **MEDIUM**: Retrieve citation counts for ranking
3. **LOW**: Full-text retrieval for CC-licensed articles
---
## Tool 4: OpenAlex
**File**: `src/tools/openalex.py`
### What It Does Well
| Feature | Status | Notes |
|---------|--------|-------|
| Citation counts | ✅ | Sorted by `cited_by_count:desc` |
| Abstract reconstruction | ✅ | Handles inverted index format |
| Concept extraction | ✅ | Hierarchical classification |
| Open access detection | ✅ | `is_oa` and `pdf_url` |
| Polite pool | ✅ | mailto for 100k/day limit |
| Rich metadata | ✅ | Best metadata of all tools |
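
For readers unfamiliar with the inverted index format: OpenAlex returns abstracts as `abstract_inverted_index`, a mapping from each word to the positions where it occurs. The sketch below shows the general reconstruction technique. It is not necessarily how `src/tools/openalex.py` does it, and `reconstruct_abstract` is a name chosen for illustration.

```python
from typing import Optional


def reconstruct_abstract(inverted_index: Optional[dict[str, list[int]]]) -> str:
    """Rebuild abstract text from a word -> positions mapping."""
    if not inverted_index:
        return ""
    # Flatten to (position, word) pairs, then order by position.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)


# Made-up index for illustration only.
example_index = {"Testosterone": [0], "improves": [1], "desire": [2, 4], "and": [3]}
abstract = reconstruct_abstract(example_index)
```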
### Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **Author truncation at 100** | Low | Only affects mega-author papers |
| **No full text** | High | No - OpenAlex is metadata only |
| **Stale data (1-2 day lag)** | Low | Acceptable for research |
### Current Implementation Gaps
```python
# GAP 1: No citation graph traversal
# OpenAlex exposes `cites:` / `cited_by:` filters and a `referenced_works` field on each work
# We could find seminal papers by following citation chains
# GAP 2: No related works
# OpenAlex has ML-powered "related_works" field
# Could expand search to similar papers
# GAP 3: No concept filtering
# OpenAlex has hierarchical concepts
# Could filter for specific domains (e.g., "Sexual health" concept)
# GAP 4: Overlap with PubMed
# OpenAlex indexes most of PubMed
# More duplicates without deduplication
```
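Gap 1 is mostly a matter of issuing the right filtered queries. In the OpenAlex API, as far as we understand it, traversal goes through the `/works` endpoint: `filter=cites:<id>` returns incoming citations and `filter=cited_by:<id>` returns the works a paper references. The sketch below only builds the URLs. The function names and the `W123` ID are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def _works_url(filter_expr: str, per_page: int, mailto: Optional[str]) -> str:
    params = {
        "filter": filter_expr,
        "sort": "cited_by_count:desc",  # most-cited first, as in the current tool
        "per-page": str(per_page),
    }
    if mailto:
        params["mailto"] = mailto  # polite pool
    # Keep ":" unescaped so the filter stays readable in logs.
    return f"{OPENALEX_WORKS_URL}?{urlencode(params, safe=':')}"


def citing_works_url(work_id: str, per_page: int = 25, mailto: Optional[str] = None) -> str:
    """URL for works that cite `work_id` (newer evidence)."""
    return _works_url(f"cites:{work_id}", per_page, mailto)


def referenced_works_url(work_id: str, per_page: int = 25, mailto: Optional[str] = None) -> str:
    """URL for works that `work_id` cites (foundational research)."""
    return _works_url(f"cited_by:{work_id}", per_page, mailto)


forward_url = citing_works_url("W123")
backward_url = referenced_works_url("W123", per_page=10)
```

One hop in each direction from a seed paper is probably enough to start with. Deeper traversal multiplies request counts quickly.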
### Priority Improvements
1. **HIGH**: Add citation graph traversal (find seminal papers)
2. **HIGH**: Add deduplication with PubMed/Europe PMC
3. **MEDIUM**: Use `related_works` for query expansion
4. **LOW**: Concept-based filtering
---
## Cross-Tool Issues
### Issue 1: MASSIVE DUPLICATION
```
PubMed: 36M+ articles
Europe PMC: Indexes ALL of PubMed + preprints
OpenAlex: 250M+ works (includes PubMed)
Current behavior: All 3 return the same papers
Result: Duplicate evidence, wasted tokens, inflated counts
```
**Solution**: Deduplication by PMID/DOI
```python
# Proposed: Add to SearchHandler
def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
    """Drop items whose PMID/DOI was already seen, keeping the first occurrence."""
    seen_ids: set[str] = set()
    unique: list[Evidence] = []
    for e in evidence_list:
        # Extract PMID or DOI from URL (extract_paper_id is a helper still to be written)
        paper_id = extract_paper_id(e.citation.url)
        if paper_id not in seen_ids:
            seen_ids.add(paper_id)
            unique.append(e)
    return unique
```
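The proposal depends on `extract_paper_id`, which does not exist yet. Here is one possible sketch. The URL patterns and the `pmid:`/`doi:`/`url:` key prefixes are assumptions about what the tools emit, not a description of current behavior.

```python
import re

# Assumed URL shapes: PubMed and Europe PMC "MED" links carry a PMID.
_PMID_RE = re.compile(
    r"(?:pubmed\.ncbi\.nlm\.nih\.gov/|europepmc\.org/(?:article|abstract)/MED/)(\d+)",
    re.IGNORECASE,
)
# DOIs start with "10.", a registrant code, then a slash and a suffix.
_DOI_RE = re.compile(r"(10\.\d{4,9}/[^\s?#]+)", re.IGNORECASE)


def extract_paper_id(url: str) -> str:
    """Return a normalized key: 'pmid:<n>', 'doi:<doi>', or 'url:<url>' as fallback."""
    pmid = _PMID_RE.search(url)
    if pmid:
        return f"pmid:{pmid.group(1)}"
    doi = _DOI_RE.search(url)
    if doi:
        # DOIs are case-insensitive, so lowercase them before comparing.
        return f"doi:{doi.group(1).lower().rstrip('/')}"
    # Fall back to the URL itself so unidentified items are never merged together.
    return f"url:{url.strip().rstrip('/').lower()}"
```

A known limitation of URL-only matching: a paper keyed by PMID from PubMed and by DOI from OpenAlex will not be recognized as the same record. Closing that gap would need both identifiers on the evidence object, or a PMID-to-DOI lookup.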
### Issue 2: NO FULL-TEXT RETRIEVAL
All tools return **abstracts only**. For deep research, this is limiting.
**What's Actually Possible**:
| Source | Full Text Access | How |
|--------|------------------|-----|
| PubMed Central (PMC) | Yes, for OA articles | Separate API: `efetch` with `db=pmc` |
| Europe PMC | Yes, CC-licensed after 14 days | `/{id}/fullTextXML` endpoint |
| OpenAlex | No | Metadata only |
| Unpaywall | Yes, OA link discovery | Separate API |
**Recommendation**: Add PMC full-text retrieval for open access articles.
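
A minimal sketch of what the request URLs for the first two rows could look like. The helper names are hypothetical. The sketch assumes that EFetch with `db=pmc` takes the numeric part of the PMCID and that the Europe PMC REST path is `{PMCID}/fullTextXML`. Nothing here checks whether the article is actually open access.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
EUROPE_PMC_REST_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def normalize_pmcid(pmcid: str) -> str:
    """Accept 'PMC1234567', 'pmc1234567', or '1234567'; return 'PMC1234567'."""
    raw = pmcid.strip().upper()
    digits = raw[3:] if raw.startswith("PMC") else raw
    if not digits.isdigit():
        raise ValueError(f"Not a PMCID: {pmcid!r}")
    return f"PMC{digits}"


def pmc_efetch_url(pmcid: str) -> str:
    """EFetch URL for the full-text XML of one PMC article."""
    numeric_id = normalize_pmcid(pmcid)[3:]
    return f"{EFETCH_URL}?{urlencode({'db': 'pmc', 'id': numeric_id, 'retmode': 'xml'})}"


def europe_pmc_fulltext_url(pmcid: str) -> str:
    """Europe PMC REST URL for the full-text XML of one article."""
    return f"{EUROPE_PMC_REST_URL}/{normalize_pmcid(pmcid)}/fullTextXML"
```

Both calls should go through the existing rate limiter. Full-text XML is also far larger than an abstract, so it likely needs section selection or truncation before it reaches the model.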
### Issue 3: NO CITATION GRAPH
OpenAlex has rich citation data but we only use `cited_by_count` for sorting.
**Untapped Capabilities**:
- `cites` filter: Find papers that cite a key paper (incoming citations)
- `cited_by` filter / `referenced_works` field: Find sources a paper cites (outgoing citations)
- `related_works`: ML-powered similar papers
**Use Case**: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:
- Papers that cite it (newer evidence)
- Papers it cites (foundational research)
- Related papers (similar topics)
---
## What's NOT Possible (API Constraints)
| Feature | Why Not Possible |
|---------|------------------|
| **bioRxiv direct search** | No keyword search API; the public API and RSS feeds only list preprints by date or DOI |
| **arXiv search** | API exists but irrelevant for sexual health |
| **PubMed full text** | Requires publisher access or PMC |
| **Real-time trial results** | ClinicalTrials.gov results are static snapshots |
| **Drug mechanism data** | Not in any API - would need ChEMBL or DrugBank |
---
## Recommended Improvements (Priority Order)
### Phase 1: Fix Fundamentals (High ROI)
1. **Deduplication** - Stop returning the same paper 3 times
2. **Outcome measures in ClinicalTrials.gov** - Get actual efficacy data
3. **Citation counts from all sources** - Rank by influence, not recency
### Phase 2: Depth Improvements (Medium ROI)
4. **PMC full-text retrieval** - Get full papers for OA articles
5. **Citation graph traversal** - Find seminal papers automatically
6. **Publication type filtering** - Prioritize RCTs and meta-analyses
### Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)
7. **MeSH term expansion** - Better PubMed queries
8. **Related works expansion** - Use OpenAlex ML similarity
9. **Date range filtering** - Historical vs recent research
---
## Neo4j Integration (Future Consideration)
**Question**: Should we add Neo4j for citation graph storage?
**Answer**: Not yet. Here's why:
| Approach | Complexity | Value |
|----------|------------|-------|
| OpenAlex API for citation traversal | Low | High |
| Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
| Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |
**Recommendation**: Use OpenAlex API for citation traversal first. Only add Neo4j if:
1. We need to do complex graph queries (PageRank on citations, community detection)
2. We need offline access to citation data
3. We're hitting OpenAlex rate limits
---
## Summary: What's Broken vs What's Working
### Working Well
- Basic search across all 4 sources
- Rate limiting and retry logic
- Query preprocessing
- Evidence model with citations
### Needs Fixing (Current Scope)
- Deduplication (critical)
- Outcome measures in ClinicalTrials.gov (critical)
- Citation-based ranking (important)
### Future Enhancements (Out of Current Scope)
- Full-text retrieval
- Citation graph traversal
- Neo4j integration
- Drug mechanism data (would need new data sources)
---
## Sources
- [NCBI E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
- [NCBI Rate Limits](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
- [OpenAlex API Docs](https://docs.openalex.org/)
- [OpenAlex Limitations](https://docs.openalex.org/api-entities/authors/limitations)
- [Europe PMC RESTful API](https://europepmc.org/RestfulWebService)
- [Europe PMC Preprints](https://pmc.ncbi.nlm.nih.gov/articles/PMC11426508/)
- [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api)