# Critical Analysis: Search Tools - Limitations, Gaps, and Improvements

**Date**: November 2025
**Purpose**: Honest assessment of all search tools to identify what's working, what's broken, and what needs improvement WITHOUT horizontal sprawl.

---

## Executive Summary

DeepBoner currently has **4 search tools**:
1. PubMed (NCBI E-utilities)
2. ClinicalTrials.gov (API v2)
3. Europe PMC (includes preprints)
4. OpenAlex (citation-aware)

**Overall Assessment**: Tools are functional but have significant gaps in:
- Deduplication (PubMed ∩ Europe PMC ∩ OpenAlex = massive overlap)
- Full-text retrieval (only abstracts currently)
- Citation graph traversal (OpenAlex has data but we don't use it)
- Query optimization (basic synonym expansion, no MeSH term mapping)

---

## Tool 1: PubMed (NCBI E-utilities)

**File**: `src/tools/pubmed.py`

### What It Does Well
| Feature | Status | Notes |
|---------|--------|-------|
| Rate limiting | ✅ | Shared limiter, respects 3/sec (no key) or 10/sec (with key) |
| Retry logic | ✅ | tenacity with exponential backoff |
| Query preprocessing | ✅ | Strips question words, expands synonyms |
| Abstract parsing | ✅ | Handles XML edge cases (dict vs list) |

### Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **10,000 result cap per query** | Medium | Yes - use date ranges to paginate |
| **Abstracts only** (no full text) | High | No - full text requires PMC or publisher |
| **No citation counts** | Medium | Yes - cross-reference with OpenAlex |
| **Rate limit (10/sec max)** | Low | Already handled |

### Current Implementation Gaps
```python
# GAP 1: No MeSH term expansion
# Current: expand_synonyms() uses hardcoded dict
# Better: Use NCBI's E-utilities to get MeSH terms for query

# GAP 2: No date filtering
# Current: Gets whatever PubMed returns (biased toward recent)
# Better: Add date range parameter for historical research

# GAP 3: No publication type filtering
# Current: Returns all types (reviews, case reports, RCTs)
# Better: Filter for RCTs and systematic reviews when appropriate
```
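GAP 2 and GAP 3 can both be closed at the query-building step. The following is a minimal sketch, not the current `pubmed.py` code: `build_esearch_params` and `PUB_TYPE_FILTERS` are hypothetical names, while `term`, `datetype`, `mindate`, `maxdate`, and the `[pt]` publication-type tag are standard ESearch query features. The open-ended year defaults (1800 and 3000) are an assumption made here because ESearch expects `mindate` and `maxdate` together.

```python
from typing import Optional, Sequence

# Hypothetical mapping from short keys to PubMed publication-type clauses.
PUB_TYPE_FILTERS = {
    "rct": '"Randomized Controlled Trial"[pt]',
    "systematic_review": '"Systematic Review"[pt]',
    "meta_analysis": '"Meta-Analysis"[pt]',
}


def build_esearch_params(
    query: str,
    pub_types: Sequence[str] = (),
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
    retmax: int = 20,
) -> dict:
    """Build ESearch query parameters with optional type and date filters."""
    term = f"({query})"
    if pub_types:
        # OR the requested publication types together, then AND with the query.
        type_clause = " OR ".join(PUB_TYPE_FILTERS[t] for t in pub_types)
        term = f"{term} AND ({type_clause})"

    params = {"db": "pubmed", "term": term, "retmax": str(retmax), "retmode": "json"}

    if min_year is not None or max_year is not None:
        # Filter on publication date; fill the missing bound with a wide default.
        params["datetype"] = "pdat"
        params["mindate"] = str(min_year if min_year is not None else 1800)
        params["maxdate"] = str(max_year if max_year is not None else 3000)
    return params
```

Keeping this as a pure function means the filters can be unit-tested without touching the rate limiter or the network.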

### Priority Improvements
1. **HIGH**: Add publication type filter (Reviews, RCTs, Meta-analyses)
2. **MEDIUM**: Add date range parameter
3. **LOW**: MeSH term expansion via E-utilities

---

## Tool 2: ClinicalTrials.gov

**File**: `src/tools/clinicaltrials.py`

### What It Does Well
| Feature | Status | Notes |
|---------|--------|-------|
| API v2 usage | ✅ | Modern API, not deprecated v1 |
| Interventional filter | ✅ | Only gets drug/treatment studies |
| Status filter | ✅ | COMPLETED, ACTIVE, RECRUITING |
| httpx → requests workaround | ✅ | Bypasses WAF TLS fingerprint block |

### Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **No results data** | High | Yes - available via different endpoint |
| **No outcome measures** | High | Yes - add to FIELDS list |
| **No adverse events** | Medium | Yes - separate API call |
| **Sparse drug mechanism data** | Medium | No - not in API |

### Current Implementation Gaps
```python
# GAP 1: Missing critical fields
FIELDS: ClassVar[list[str]] = [
    "NCTId",
    "BriefTitle",
    "Phase",
    "OverallStatus",
    "Condition",
    "InterventionName",
    "StartDate",
    "BriefSummary",
    # MISSING:
    # "PrimaryOutcome",
    # "SecondaryOutcome",
    # "ResultsFirstSubmitDate",
    # "StudyResults",  # Whether results are posted
]

# GAP 2: No results retrieval
# Many completed trials have posted results
# We could get actual efficacy data, not just trial existence

# GAP 3: No linked publications
# Trials often link to PubMed articles with results
# We could follow these links for richer evidence
```
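Once the extra fields are requested, the three gaps above reduce to reading a few more keys from each study record. The sketch below is illustrative only: `summarize_study` is a hypothetical helper, and the nested key names (`protocolSection`, `outcomesModule`, `referencesModule`, `hasResults`) are assumptions about the API v2 JSON shape that should be checked against the ClinicalTrials.gov data structure documentation before use.

```python
from typing import Any


def summarize_study(study: dict[str, Any]) -> dict[str, Any]:
    """Pull outcome measures, results flag, and linked PMIDs from one study record.

    Every lookup uses .get() with a default so a sparse record yields empty
    values instead of raising KeyError.
    """
    protocol = study.get("protocolSection", {})
    outcomes = protocol.get("outcomesModule", {})
    references = protocol.get("referencesModule", {}).get("references", [])

    return {
        "nct_id": protocol.get("identificationModule", {}).get("nctId"),
        # GAP 2: whether results have been posted for this trial.
        "has_results": bool(study.get("hasResults", False)),
        # GAP 1: outcome measure titles.
        "primary_outcomes": [
            o.get("measure", "") for o in outcomes.get("primaryOutcomes", [])
        ],
        "secondary_outcomes": [
            o.get("measure", "") for o in outcomes.get("secondaryOutcomes", [])
        ],
        # GAP 3: PMIDs of linked publications, skipping references without one.
        "linked_pmids": [r["pmid"] for r in references if r.get("pmid")],
    }
```

The `linked_pmids` list could then be handed to the PubMed tool to fetch the result papers directly.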

### Priority Improvements
1. **HIGH**: Add outcome measures to FIELDS
2. **HIGH**: Check for and retrieve posted results
3. **MEDIUM**: Follow linked publications (NCT → PMID)

---

## Tool 3: Europe PMC

**File**: `src/tools/europepmc.py`

### What It Does Well
| Feature | Status | Notes |
|---------|--------|-------|
| Preprint coverage | ✅ | bioRxiv, medRxiv, ChemRxiv indexed |
| Preprint labeling | ✅ | `[PREPRINT - Not peer-reviewed]` marker |
| DOI/PMID fallback URLs | ✅ | Smart URL construction |
| Relevance scoring | ✅ | Preprints weighted lower (0.75 vs 0.9) |

### Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **No full text for most articles** | High | Partial - CC-licensed available after 14 days |
| **Citation data limited** | Medium | Only journal articles, not preprints |
| **Preprint-publication linking gaps** | Medium | ~50% of links missing per Crossref |
| **License info sometimes missing** | Low | Manual review required |

### Current Implementation Gaps
```python
# GAP 1: No full-text retrieval
# Europe PMC has full text for many CC-licensed articles
# Could retrieve full text XML via separate endpoint

# GAP 2: Massive overlap with PubMed
# Europe PMC indexes all of PubMed/MEDLINE
# We're getting duplicates with no deduplication

# GAP 3: No citation network
# Europe PMC has "citedByCount" but we don't use it
# Could prioritize highly-cited preprints
```

### Priority Improvements
1. **HIGH**: Add deduplication with PubMed (by PMID)
2. **MEDIUM**: Retrieve citation counts for ranking
3. **LOW**: Full-text retrieval for CC-licensed articles

---

## Tool 4: OpenAlex

**File**: `src/tools/openalex.py`

### What It Does Well
| Feature | Status | Notes |
|---------|--------|-------|
| Citation counts | ✅ | Sorted by `cited_by_count:desc` |
| Abstract reconstruction | ✅ | Handles inverted index format |
| Concept extraction | ✅ | Hierarchical classification |
| Open access detection | ✅ | `is_oa` and `pdf_url` |
| Polite pool | ✅ | mailto for 100k/day limit |
| Rich metadata | ✅ | Best metadata of all tools |

### Limitations (API-Level)
| Limitation | Severity | Workaround Possible? |
|------------|----------|---------------------|
| **Author truncation at 100** | Low | Only affects mega-author papers |
| **No full text** | High | No - OpenAlex is metadata only |
| **Stale data (1-2 day lag)** | Low | Acceptable for research |

### Current Implementation Gaps
```python
# GAP 1: No citation graph traversal
# OpenAlex exposes citing works (`cites:` filter, `cited_by_api_url`) and references (`referenced_works`)
# We could find seminal papers by following citation chains

# GAP 2: No related works
# OpenAlex has ML-powered "related_works" field
# Could expand search to similar papers

# GAP 3: No concept filtering
# OpenAlex has hierarchical concepts
# Could filter for specific domains (e.g., "Sexual health" concept)

# GAP 4: Overlap with PubMed
# OpenAlex indexes most of PubMed
# More duplicates without deduplication
```
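GAP 1 and GAP 2 need no new data source, only different filters on the `/works` endpoint. As a rough sketch (the function name `citation_query_url` and the `direction` labels are invented here; `cites`, `cited_by`, and `related_to` are OpenAlex work filters, but their exact behavior should be confirmed against the current docs):

```python
from typing import Optional
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"

# Hypothetical direction labels mapped to OpenAlex filter names.
# Note that `cited_by:W1` means "works that W1 cites", i.e. W1's references.
_FILTERS = {
    "citing": "cites",         # newer papers that cite the work
    "referenced": "cited_by",  # older papers the work cites
    "related": "related_to",   # works OpenAlex lists as related
}


def citation_query_url(
    work_id: str,
    direction: str,
    per_page: int = 25,
    mailto: Optional[str] = None,
) -> str:
    """Build a /works URL for one hop in the citation graph, most-cited first."""
    params = {
        "filter": f"{_FILTERS[direction]}:{work_id}",
        "sort": "cited_by_count:desc",
        "per-page": str(per_page),
    }
    if mailto:
        # Identifies the caller for the polite pool.
        params["mailto"] = mailto
    # Keep ':' unescaped so the filter stays readable in logs.
    return f"{OPENALEX_WORKS}?{urlencode(params, safe=':')}"
```

Sorting each hop by `cited_by_count:desc` keeps the traversal cheap: one request per direction surfaces the most influential neighbors first.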

### Priority Improvements
1. **HIGH**: Add citation graph traversal (find seminal papers)
2. **HIGH**: Add deduplication with PubMed/Europe PMC
3. **MEDIUM**: Use `related_works` for query expansion
4. **LOW**: Concept-based filtering

---

## Cross-Tool Issues

### Issue 1: MASSIVE DUPLICATION

```
PubMed: 36M+ articles
Europe PMC: Indexes ALL of PubMed + preprints
OpenAlex: 250M+ works (includes PubMed)

Current behavior: All 3 return the same papers
Result: Duplicate evidence, wasted tokens, inflated counts
```

**Solution**: Deduplication by PMID/DOI
```python
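# Proposed: Add to SearchHandler
def deduplicate_evidence(evidence_list: list[Evidence]) -> list[Evidence]:
    """Keep the first Evidence seen for each paper ID, preserving input order."""
    seen_ids: set[str] = set()
    unique: list[Evidence] = []
    for e in evidence_list:
        # Extract PMID or DOI from URL. extract_paper_id is a helper still to be
        # written; it is assumed to return None when no ID can be found.
        paper_id = extract_paper_id(e.citation.url)
        if paper_id is None:
            # No recognizable ID: keep the item instead of merging all unknowns.
            unique.append(e)
            continue
        if paper_id not in seen_ids:
            seen_ids.add(paper_id)
            unique.append(e)
    return unique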
```
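The `extract_paper_id` call above is not defined anywhere yet. One possible shape for it is sketched below; the URL patterns are assumptions about what each tool puts in `citation.url` and would need to be checked against real output.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Assumed URL shapes: PubMed article pages and Europe PMC MED records carry a PMID.
_PMID_RE = re.compile(
    r"(?:pubmed\.ncbi\.nlm\.nih\.gov/|europepmc\.org/article/MED/)(\d+)",
    re.IGNORECASE,
)
# A DOI starts with "10.", a registrant code, a slash, then a suffix.
_DOI_RE = re.compile(r"(10\.\d{4,9}/\S+)")


def extract_paper_id(url: str) -> Optional[str]:
    """Return 'pmid:<n>' or 'doi:<doi>' for a citation URL, or None if neither is found."""
    pmid = _PMID_RE.search(url)
    if pmid:
        return f"pmid:{pmid.group(1)}"
    doi = _DOI_RE.search(unquote(url))
    if doi:
        # DOIs are case-insensitive, so normalize before comparing.
        return f"doi:{doi.group(1).rstrip('/').lower()}"
    return None
```

This version still treats a PMID-keyed record and a DOI-keyed record of the same paper as different items. Closing that gap needs both identifiers on the `Evidence` object, or a PMID to DOI lookup, before comparing.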

### Issue 2: NO FULL-TEXT RETRIEVAL

All tools return **abstracts only**. For deep research, this is limiting.

**What's Actually Possible**:
| Source | Full Text Access | How |
|--------|------------------|-----|
| PubMed Central (PMC) | Yes, for OA articles | Separate API: `efetch` with `db=pmc` |
| Europe PMC | Yes, CC-licensed after 14 days | `/{id}/fullTextXML` endpoint |
| OpenAlex | No | Metadata only |
| Unpaywall | Yes, OA link discovery | Separate API |

**Recommendation**: Add PMC full-text retrieval for open access articles.
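A first step toward that recommendation could be a pair of URL builders, one per full-text source in the table. This is a sketch under assumptions: the helper names are hypothetical, the EFetch call uses the numeric part of the PMCID as `id`, and the Europe PMC path is assumed to take the form `{PMCID}/fullTextXML` under the REST base URL. Neither function checks whether the article is actually open access.

```python
from typing import Optional
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
EUROPE_PMC_REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def _pmc_number(pmcid: str) -> str:
    """Strip an optional 'PMC' prefix and validate that digits remain."""
    number = pmcid.strip().upper().removeprefix("PMC")
    if not number.isdigit():
        raise ValueError(f"not a PMCID: {pmcid!r}")
    return number


def pmc_efetch_url(pmcid: str, api_key: Optional[str] = None) -> str:
    """Build an EFetch URL that requests the article XML from the pmc database."""
    params = {"db": "pmc", "id": _pmc_number(pmcid), "retmode": "xml"}
    if api_key:
        params["api_key"] = api_key
    return f"{EFETCH_URL}?{urlencode(params)}"


def europe_pmc_fulltext_url(pmcid: str) -> str:
    """Build the Europe PMC full-text XML URL for the same article."""
    return f"{EUROPE_PMC_REST}/PMC{_pmc_number(pmcid)}/fullTextXML"
```

`str.removeprefix` requires Python 3.9 or newer. Both URLs return XML, so a single parser could serve either source once the response format is confirmed.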

### Issue 3: NO CITATION GRAPH

OpenAlex has rich citation data but we only use `cited_by_count` for sorting.

**Untapped Capabilities**:
- `cites:<work_id>` filter (or the work's `cited_by_api_url`): Find papers that cite a key paper
- `referenced_works`: Find sources a paper cites
- `related_works`: ML-powered similar papers

**Use Case**: User asks about "testosterone therapy for HSDD". We find a seminal 2019 RCT. We could automatically find:
- Papers that cite it (newer evidence)
- Papers it cites (foundational research)
- Related papers (similar topics)

---

## What's NOT Possible (API Constraints)

| Feature | Why Not Possible |
|---------|------------------|
| **bioRxiv direct search** | No keyword search API, only RSS feed of latest |
| **arXiv search** | API exists but irrelevant for sexual health |
| **PubMed full text** | Requires publisher access or PMC |
| **Real-time trial results** | ClinicalTrials.gov results are static snapshots |
| **Drug mechanism data** | Not in any API - would need ChEMBL or DrugBank |

---

## Recommended Improvements (Priority Order)

### Phase 1: Fix Fundamentals (High ROI)
1. **Deduplication** - Stop returning the same paper 3 times
2. **Outcome measures in ClinicalTrials** - Get actual efficacy data
3. **Citation counts from all sources** - Rank by influence, not recency

### Phase 2: Depth Improvements (Medium ROI)
4. **PMC full-text retrieval** - Get full papers for OA articles
5. **Citation graph traversal** - Find seminal papers automatically
6. **Publication type filtering** - Prioritize RCTs and meta-analyses

### Phase 3: Quality Improvements (Lower ROI, Nice-to-Have)
7. **MeSH term expansion** - Better PubMed queries
8. **Related works expansion** - Use OpenAlex ML similarity
9. **Date range filtering** - Historical vs recent research

---

## Neo4j Integration (Future Consideration)

**Question**: Should we add Neo4j for citation graph storage?

**Answer**: Not yet. Here's why:

| Approach | Complexity | Value |
|----------|------------|-------|
| OpenAlex API for citation traversal | Low | High |
| Neo4j for local citation graph | High | Medium (unless doing graph analytics) |
| Cron job to sync OpenAlex → Neo4j | Medium | Only if we need offline access |

**Recommendation**: Use OpenAlex API for citation traversal first. Only add Neo4j if:
1. We need to do complex graph queries (PageRank on citations, community detection)
2. We need offline access to citation data
3. We're hitting OpenAlex rate limits

---

## Summary: What's Broken vs What's Working

### Working Well
- Basic search across all 4 sources
- Rate limiting and retry logic
- Query preprocessing
- Evidence model with citations

### Needs Fixing (Current Scope)
- Deduplication (critical)
- Outcome measures in ClinicalTrials (critical)
- Citation-based ranking (important)

### Future Enhancements (Out of Current Scope)
- Full-text retrieval
- Citation graph traversal
- Neo4j integration
- Drug mechanism data (would need new data sources)

---

## Sources

- [NCBI E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
- [NCBI Rate Limits](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
- [OpenAlex API Docs](https://docs.openalex.org/)
- [OpenAlex Limitations](https://docs.openalex.org/api-entities/authors/limitations)
- [Europe PMC RESTful API](https://europepmc.org/RestfulWebService)
- [Europe PMC Preprints](https://pmc.ncbi.nlm.nih.gov/articles/PMC11426508/)
- [ClinicalTrials.gov API](https://clinicaltrials.gov/data-api/api)