# Europe PMC Tool: Current State & Future Improvements
**Status**: Currently Implemented (Replaced bioRxiv)
**Priority**: High (Preprint + Open Access Source)
---
## Why Europe PMC Over bioRxiv?
### bioRxiv API Limitations (Why We Abandoned It)
1. **No Search API**: Only returns papers by date range or DOI
2. **No Query Capability**: Cannot search for "metformin cancer"
3. **Workaround Required**: Would need to download ALL preprints and build local search
4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation
### Europe PMC Advantages
1. **Full Search API**: Boolean queries, filters, facets
2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway
3. **Includes PubMed**: Also has MEDLINE content
4. **34 Preprint Servers**: Not just bioRxiv
5. **Open Access Focus**: Full-text when available
---
## Current Implementation
### What We Have (`src/tools/europepmc.py`)
- REST API search via `europepmc.org/webservices/rest/search`
- Preprint flagging via `firstPublicationDate` heuristics
- Returns: title, abstract, authors, DOI, source
- Marks preprints for transparency
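As a rough illustration of the search call listed above, the sketch below builds the request parameters and shows one way to send them. The helper names, the `httpx` dependency, the default page size, and the `www.ebi.ac.uk` base URL (taken from the later sections of this document) are assumptions for illustration; they are not copied from `src/tools/europepmc.py`.

```python
from typing import Any

# Base URL as used in the later sections of this document
SEARCH_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_params(query: str, page_size: int = 25) -> dict[str, Any]:
    """Build query parameters for a JSON search request (hypothetical helper)."""
    cleaned = query.strip()
    if not cleaned:
        raise ValueError("query must not be empty")
    return {
        "query": cleaned,
        "format": "json",
        "resultType": "core",  # "core" includes abstracts; "lite" omits them
        "pageSize": page_size,
    }


async def search(query: str, page_size: int = 25) -> list[dict[str, Any]]:
    """Run the search and return the raw result records."""
    import httpx  # assumed HTTP client, imported lazily so the helper above has no dependency

    params = build_search_params(query, page_size)
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get(SEARCH_URL, params=params)
    response.raise_for_status()
    # Records are assumed to sit under resultList.result in the JSON body
    return response.json().get("resultList", {}).get("result", [])
```

Keeping parameter construction separate from the network call makes the request shape easy to unit test without hitting the API.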
### Current Limitations
1. **No Full-Text Retrieval**: Only metadata/abstracts
2. **No Citation Network**: Missing references/citations
3. **No Supplementary Files**: Not fetching figures/data
4. **Basic Preprint Detection**: Heuristic, not explicit flag
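One possible way to move from the date heuristic to an explicit check is to read the `source` code on each result record, since preprints are filed under `PPR`. The sketch below assumes records are plain dictionaries; the `pubTypeList` fallback and its shape are assumptions that should be confirmed against a real response before being relied on.

```python
from typing import Any


def is_preprint(record: dict[str, Any]) -> bool:
    """Return True when a search result record is explicitly marked as a preprint."""
    # Search results carry a "source" code; "PPR" is the preprint source
    if record.get("source") == "PPR":
        return True
    # Assumed fallback: publication types listed under pubTypeList.pubType
    pub_types = (record.get("pubTypeList") or {}).get("pubType", [])
    if isinstance(pub_types, str):
        pub_types = [pub_types]
    return any(str(t).lower() == "preprint" for t in pub_types)
```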
---
## Europe PMC API Capabilities
### Endpoints We Could Use
| Endpoint | Purpose | Currently Using |
|----------|---------|-----------------|
| `/search` | Query papers | Yes |
| `/{PMCID}/fullTextXML` | Full text (XML) | No |
| `/{PMCID}/supplementaryFiles` | Figures, data | No |
| `/{source}/{ID}/citations` | Who cited this | No |
| `/{source}/{ID}/references` | What this cites | No |
| `annotationsByArticleIds` (Annotations API) | Text-mined entities | No |
### Rich Query Syntax
```python
# Current simple query
query = "metformin cancer"
# Could use advanced syntax
query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
query += " AND (SRC:PPR)" # Only preprints
query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])" # Date range
query += " AND (OPEN_ACCESS:y)" # Only open access
```
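The string concatenation above could be wrapped in a small helper so that filters are composed consistently. This is a minimal sketch: `build_query` and its parameters are hypothetical names, and it only uses the field prefixes already shown in this section (`SRC`, `FIRST_PDATE`, `OPEN_ACCESS`).

```python
from datetime import date
from typing import Optional


def build_query(
    terms: list[str],
    *,
    preprints_only: bool = False,
    open_access_only: bool = False,
    date_range: Optional[tuple[date, date]] = None,
) -> str:
    """Compose a Europe PMC query from free-text terms plus optional filters."""
    if not terms:
        raise ValueError("at least one search term is required")
    # Quote multi-word terms so they are treated as phrases
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    clauses = ["(" + " AND ".join(quoted) + ")"]
    if preprints_only:
        clauses.append("(SRC:PPR)")
    if date_range is not None:
        start, end = date_range
        if start > end:
            raise ValueError("date_range start must not be after end")
        clauses.append(f"(FIRST_PDATE:[{start.isoformat()} TO {end.isoformat()}])")
    if open_access_only:
        clauses.append("(OPEN_ACCESS:y)")
    return " AND ".join(clauses)
```

For example, `build_query(["metformin", "cancer"], preprints_only=True)` produces `(metformin AND cancer) AND (SRC:PPR)`.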
### Source Filters
```python
# Filter by source
"SRC:MED" # MEDLINE
"SRC:PMC" # PubMed Central
"SRC:PPR" # Preprints (bioRxiv, medRxiv, etc.)
"SRC:AGR" # Agricola
"SRC:CBA" # Chinese Biological Abstracts
```
---
## Recommended Improvements
### Phase 1: Rich Metadata
```python
# Add to search results
additional_fields = [
    "citedByCount",        # Impact indicator
    "source",              # Explicit source (MED, PMC, PPR)
    "isOpenAccess",        # Boolean flag
    "fullTextUrlList",     # URLs for full text
    "authorAffiliations",  # Institution info
    "grantsList",          # Funding info
]
```
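A sketch of how some of these fields might be pulled out of a single result record follows. The nested shapes (`fullTextUrlList.fullTextUrl`, `grantsList.grant`, the `agency` key) and the `"Y"`/`"N"` encoding of `isOpenAccess` are assumptions about the JSON response, and `extract_metadata` is a hypothetical helper name; verify each against a live `resultType=core` response before adopting it.

```python
from typing import Any


def extract_metadata(record: dict[str, Any]) -> dict[str, Any]:
    """Pull a subset of the Phase 1 fields out of one search result record."""
    # Assumed shape: fullTextUrlList = {"fullTextUrl": [{"url": ...}, ...]}
    url_entries = (record.get("fullTextUrlList") or {}).get("fullTextUrl", [])
    # Assumed shape: grantsList = {"grant": [{"agency": ...}, ...]}
    grants = (record.get("grantsList") or {}).get("grant", [])
    return {
        "cited_by_count": int(record.get("citedByCount", 0)),
        "source": record.get("source"),
        # The flag is assumed to arrive as "Y"/"N" rather than a JSON boolean
        "is_open_access": record.get("isOpenAccess") == "Y",
        "full_text_urls": [e["url"] for e in url_entries if "url" in e],
        "funders": sorted({g["agency"] for g in grants if "agency" in g}),
    }
```

Missing fields fall back to empty or falsy values, so records from `lite` responses would not raise.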
### Phase 2: Full-Text Retrieval
```python
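import httpx  # assumed async HTTP client


async def get_fulltext(pmcid: str) -> str | None:
    """Get full-text XML for an open access paper, or None if unavailable."""
    # pmcid includes the "PMC" prefix, e.g. "PMC1234567" (placeholder ID)
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get(url)
    # Only the XML endpoint is used here; the fullTextJSON variant is unverified
    return response.text if response.status_code == 200 else None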
```
### Phase 3: Citation Network
```python
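async def get_citations(source: str, article_id: str) -> list[str]:
    """Get papers that cite this one (source is a code such as "MED" or "PPR")."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{article_id}/citations"
    raise NotImplementedError(f"sketch only: GET {url} and collect the citing IDs")


async def get_references(source: str, article_id: str) -> list[str]:
    """Get papers this one cites."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{article_id}/references"
    raise NotImplementedError(f"sketch only: GET {url} and collect the cited IDs")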
```
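Once a citations or references response has been fetched, the IDs still need to be pulled out of the JSON body. The sketch below assumes the body nests records as `citationList.citation` and `referenceList.reference`, each with an optional `id`; both the shape and the helper name `extract_linked_ids` are assumptions to check against a real response.

```python
from typing import Any


def extract_linked_ids(payload: dict[str, Any], list_key: str, item_key: str) -> list[str]:
    """Collect article IDs from a citations or references response body."""
    items = (payload.get(list_key) or {}).get(item_key, [])
    # Entries without an "id" (for example, unmatched references) are skipped
    return [str(item["id"]) for item in items if item.get("id")]


def citation_ids(payload: dict[str, Any]) -> list[str]:
    """IDs of citing papers, under the assumed citationList.citation shape."""
    return extract_linked_ids(payload, "citationList", "citation")


def reference_ids(payload: dict[str, Any]) -> list[str]:
    """IDs of cited papers, under the assumed referenceList.reference shape."""
    return extract_linked_ids(payload, "referenceList", "reference")
```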
### Phase 4: Text-Mined Annotations
Europe PMC extracts entities automatically:
```python
async def get_annotations(pmcid: str) -> dict:
    """Get text-mined entities (genes, diseases, drugs)."""
    url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
    params = {
        "articleIds": f"PMC:{pmcid}",
        "type": "Gene_Proteins,Diseases,Chemicals",
        "format": "JSON",
    }
    # Returns structured entity mentions with positions
    raise NotImplementedError(f"sketch only: GET {url} with {params}")
```
---
## Supplementary File Retrieval
From reference repo (`bioinformatics_tools.py` lines 123-149):
```python
def get_figures(pmcid: str) -> dict[str, str]:
    """Download figures and supplementary files."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
    # The endpoint returns a ZIP archive; each file is then returned base64-encoded
    raise NotImplementedError(f"sketch only: GET {url}, unzip, and base64-encode each file")
```
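The unzip-and-encode step can be handled with the standard library alone. The sketch below takes the raw ZIP bytes (however they were downloaded) and returns a filename-to-base64 mapping; `encode_zip_members` is a hypothetical helper name, not the function from the reference repo.

```python
import base64
import io
import zipfile


def encode_zip_members(zip_bytes: bytes) -> dict[str, str]:
    """Map each file in a ZIP archive to its base64-encoded content."""
    encoded: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries have no content to encode
            encoded[info.filename] = base64.b64encode(archive.read(info)).decode("ascii")
    return encoded
```

Working on bytes rather than a URL keeps this step independent of the HTTP client and easy to test with an in-memory archive.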
---
## Preprint-Specific Features
### Identify Preprint Servers
```python
PREPRINT_SOURCES = {
    "PPR": "General preprints",
    "bioRxiv": "Biology preprints",
    "medRxiv": "Medical preprints",
    "chemRxiv": "Chemistry preprints",
    "Research Square": "Multi-disciplinary",
    "Preprints.org": "MDPI preprints",
}


# Check if published version exists
async def check_published_version(preprint_doi: str) -> str | None:
    """Check if preprint has been peer-reviewed and published."""
    # Europe PMC links preprints to final versions
    raise NotImplementedError("sketch only: look up the linked final version")
```
---
## Rate Limiting
Europe PMC is more generous than NCBI:
```python
# No documented hard limit, but be respectful
# Recommend: 10-20 requests/second max
# Include a contact email in the User-Agent so the service can reach you
headers = {
    "User-Agent": "DeepBoner/1.0 (mailto:[email protected])"
}
```
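To stay under a self-imposed ceiling such as the one suggested above, requests can be spaced out on the client side. The following is a minimal sketch of an asyncio limiter; the `RateLimiter` class is hypothetical and the rate passed to it is a project choice, not a limit published by Europe PMC.

```python
import asyncio
import time


class RateLimiter:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate: float) -> None:
        if rate <= 0:
            raise ValueError("rate must be positive")
        self._interval = 1.0 / rate
        self._next_slot = 0.0

    async def wait(self) -> None:
        """Reserve the next free slot, then sleep until it arrives."""
        # No await occurs between reading and updating _next_slot, so
        # concurrent tasks on one event loop cannot reserve the same slot.
        now = time.monotonic()
        delay = max(0.0, self._next_slot - now)
        self._next_slot = max(now, self._next_slot) + self._interval
        if delay > 0:
            await asyncio.sleep(delay)
```

A caller would create one limiter, for example `RateLimiter(rate=10)`, and `await limiter.wait()` before each request.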
---
## Europe PMC vs. The Lens & OpenAlex
| Feature | Europe PMC | The Lens | OpenAlex |
|---------|------------|----------|----------|
| Biomedical Focus | Yes | Partial | Partial |
| Preprints | Yes (34 servers) | Yes | Yes |
| Full Text | PMC papers | Links | No |
| Citations | Yes | Yes | Yes |
| Annotations | Yes (text-mined) | No | No |
| Rate Limits | Generous | Moderate | Very generous |
| API Key | Optional | Required | Optional |
---
## Sources
- [Europe PMC REST API](https://europepmc.org/RestfulWebService)
- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
- [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)