# Europe PMC Tool: Current State & Future Improvements
**Status**: Currently Implemented (Replaced bioRxiv)
**Priority**: High (Preprint + Open Access Source)
---
## Why Europe PMC Over bioRxiv?
### bioRxiv API Limitations (Why We Abandoned It)
1. **No Search API**: Only returns papers by date range or DOI
2. **No Query Capability**: Cannot search for "metformin cancer"
3. **Workaround Required**: Would need to download ALL preprints and build local search
4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation
### Europe PMC Advantages
1. **Full Search API**: Boolean queries, filters, facets
2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway
3. **Includes PubMed**: Also has MEDLINE content
4. **34 Preprint Servers**: Not just bioRxiv
5. **Open Access Focus**: Full-text when available
---
## Current Implementation
### What We Have (`src/tools/europepmc.py`)
- REST API search via `https://www.ebi.ac.uk/europepmc/webservices/rest/search`
- Preprint flagging via `firstPublicationDate` heuristics
- Returns: title, abstract, authors, DOI, source
- Marks preprints for transparency
### Current Limitations
1. **No Full-Text Retrieval**: Only metadata/abstracts
2. **No Citation Network**: Missing references/citations
3. **No Supplementary Files**: Not fetching figures/data
4. **Basic Preprint Detection**: Heuristic, not explicit flag
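
One way to address the heuristic preprint detection is to read the source code from each search hit instead of inferring it from dates. The sketch below is a minimal illustration, not the current implementation: `is_preprint` is a hypothetical name, and it assumes that hits are plain dicts from the JSON response, that preprints carry `source == "PPR"` (the same code the `SRC:PPR` filter uses), and that richer results may also list "Preprint" under `pubTypeList.pubType`.

```python
from typing import Any


def is_preprint(result: dict[str, Any]) -> bool:
    """Return True when a Europe PMC search hit looks like a preprint."""
    # Assumption: preprint records are indexed under the PPR source code.
    if result.get("source") == "PPR":
        return True
    # Assumption: richer result types may also list publication types.
    pub_types = result.get("pubTypeList", {}).get("pubType", [])
    return any(str(p).lower() == "preprint" for p in pub_types)
```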
---
## Europe PMC API Capabilities
### Endpoints We Could Use
| Endpoint | Purpose | Currently Using |
|----------|---------|-----------------|
| `/search` | Query papers | Yes |
| `/{PMCID}/fullTextXML` | Full text (XML) | No |
| `/{PMCID}/supplementaryFiles` | Figures, data | No |
| `/{source}/{ID}/citations` | Who cited this | No |
| `/{source}/{ID}/references` | What this cites | No |
| `annotations_api/annotationsByArticleIds` (separate Annotations API) | Text-mined entities | No |
### Rich Query Syntax
```python
# Current simple query
query = "metformin cancer"
# Could use advanced syntax
query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
query += " AND (SRC:PPR)" # Only preprints
query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])" # Date range
query += " AND (OPEN_ACCESS:y)" # Only open access
```
### Source Filters
```python
# Filter by source
"SRC:MED" # MEDLINE
"SRC:PMC" # PubMed Central
"SRC:PPR" # Preprints (bioRxiv, medRxiv, etc.)
"SRC:AGR" # Agricola
"SRC:CBA" # Chinese Biological Abstracts
```
---
## Recommended Improvements
### Phase 1: Rich Metadata
```python
# Add to search results
additional_fields = [
    "citedByCount",        # Impact indicator
    "source",              # Explicit source (MED, PMC, PPR)
    "isOpenAccess",        # Boolean flag
    "fullTextUrlList",     # URLs for full text
    "authorAffiliations",  # Institution info
    "grantsList",          # Funding info
]
```
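
A hedged sketch of how a few of these fields might be pulled out of one search hit. `extract_metadata` and the output key names are hypothetical. The response shape is an assumption to verify against a live response: `isOpenAccess` is taken to be the string "Y" or "N", and `fullTextUrlList` is taken to nest its entries under `fullTextUrl`. Some of these fields may only be returned for richer result types.

```python
from typing import Any


def extract_metadata(result: dict[str, Any]) -> dict[str, Any]:
    """Pull selected Phase 1 fields out of one search hit, with safe defaults."""
    # Assumption: fullTextUrlList nests its entries under "fullTextUrl".
    url_entries = result.get("fullTextUrlList", {}).get("fullTextUrl", [])
    return {
        "cited_by_count": int(result.get("citedByCount", 0)),
        "source": result.get("source"),
        # Assumption: open access is reported as the string "Y" or "N".
        "is_open_access": result.get("isOpenAccess") == "Y",
        "full_text_urls": [e["url"] for e in url_entries if "url" in e],
    }
```

Missing fields fall back to neutral defaults, so a sparse hit does not raise.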
### Phase 2: Full-Text Retrieval
```python
async def get_fulltext(pmcid: str) -> str | None:
    """Get full text for open access papers."""
    # XML format (pmcid includes the prefix, e.g. "PMC1234567")
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
    # A JSON variant (.../fullTextJSON) is unverified; check the REST docs before relying on it
    ...  # fetch and return the body, not yet implemented
```
### Phase 3: Citation Network
```python
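REST_BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"

async def get_citations(source: str, ext_id: str) -> list[str]:
    """Get papers that cite this one."""
    # The path takes a source code plus that source's ID, e.g. MED/<PMID>
    url = f"{REST_BASE}/{source}/{ext_id}/citations"
    ...  # fetch and parse, not yet implemented

async def get_references(source: str, ext_id: str) -> list[str]:
    """Get papers this one cites."""
    url = f"{REST_BASE}/{source}/{ext_id}/references"
    ...  # fetch and parse, not yet implemented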
```
### Phase 4: Text-Mined Annotations
Europe PMC extracts entities automatically:
```python
async def get_annotations(pmcid: str) -> dict:
    """Get text-mined entities (genes, diseases, drugs)."""
    url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
    params = {
        "articleIds": f"PMC:{pmcid}",
        "type": "Gene_Proteins,Diseases,Chemicals",
        "format": "JSON",
    }
    # Returns structured entity mentions with positions
    ...  # send the request with params, not yet implemented
```
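
Once the annotations are fetched, they would probably need to be grouped by entity type before an agent can use them. The sketch below is an illustration under assumptions, not a confirmed response schema: it assumes the payload is a list of article objects, each with an `annotations` list whose items carry the matched text in `exact` and the entity type in `type`. `group_annotations` is a hypothetical name.

```python
from collections import defaultdict
from typing import Any


def group_annotations(payload: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Group unique annotated strings by entity type, in first-seen order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for article in payload:
        for annotation in article.get("annotations", []):
            # Assumption: "type" holds the entity type, "exact" the matched text.
            entity_type = annotation.get("type")
            text = annotation.get("exact")
            if entity_type and text and text not in grouped[entity_type]:
                grouped[entity_type].append(text)
    return dict(grouped)
```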
---
## Supplementary File Retrieval
From reference repo (`bioinformatics_tools.py` lines 123-149):
```python
def get_figures(pmcid: str) -> dict[str, str]:
    """Download figures and supplementary files."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
    # The endpoint returns a ZIP of images; each file is returned base64-encoded
    ...  # download and unpack, not yet implemented
```
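
The unpacking step can be sketched with the standard library alone. This is not the reference repo's code: `zip_to_base64` is a hypothetical helper that assumes the ZIP bytes have already been downloaded, and it maps each file name in the archive to its base64-encoded content.

```python
import base64
import io
import zipfile


def zip_to_base64(archive: bytes) -> dict[str, str]:
    """Map each file name in a ZIP archive to its base64-encoded content."""
    encoded: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # skip directory entries, keep files only
            encoded[info.filename] = base64.b64encode(zf.read(info)).decode("ascii")
    return encoded
```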
---
## Preprint-Specific Features
### Identify Preprint Servers
```python
PREPRINT_SOURCES = {
    "PPR": "General preprints",
    "bioRxiv": "Biology preprints",
    "medRxiv": "Medical preprints",
    "chemRxiv": "Chemistry preprints",
    "Research Square": "Multi-disciplinary",
    "Preprints.org": "MDPI preprints",
}

# Check if published version exists
async def check_published_version(preprint_doi: str) -> str | None:
    """Check if preprint has been peer-reviewed and published."""
    # Europe PMC links preprints to final versions
    ...  # lookup not yet implemented
```
---
## Rate Limiting
Europe PMC is more generous than NCBI:
```python
# No documented hard limit, but be respectful
# Recommend: 10-20 requests/second max
# Include a contact email in User-Agent so the service can reach us if needed
headers = {
    "User-Agent": "DeepBoner/1.0 (mailto:[email protected])",
}
```
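
To stay within the recommended request rate, calls could be spaced out client-side. The following is a minimal sketch assuming an asyncio-based client: `RateLimiter` is a hypothetical class, not part of the existing tool, and it enforces a minimum interval between request starts rather than a true sliding window. A caller would `await limiter.wait()` before each request.

```python
import asyncio
import time


class RateLimiter:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._next_slot = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        async with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_slot - now)
            # Reserve the next slot before sleeping so concurrent callers queue up.
            self._next_slot = max(now, self._next_slot) + self._interval
        if delay:
            await asyncio.sleep(delay)
```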
---
## vs. The Lens & OpenAlex
| Feature | Europe PMC | The Lens | OpenAlex |
|---------|------------|----------|----------|
| Biomedical Focus | Yes | Partial | Partial |
| Preprints | Yes (34 servers) | Yes | Yes |
| Full Text | PMC papers | Links | No |
| Citations | Yes | Yes | Yes |
| Annotations | Yes (text-mined) | No | No |
| Rate Limits | Generous | Moderate | Very generous |
| API Key | Optional | Required | Optional |
---
## Sources
- [Europe PMC REST API](https://europepmc.org/RestfulWebService)
- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
- [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)