# Europe PMC Tool: Current State & Future Improvements

**Status**: Currently Implemented (Replaced bioRxiv)
**Priority**: High (Preprint + Open Access Source)

---

## Why Europe PMC Over bioRxiv?

### bioRxiv API Limitations (Why We Abandoned It)

1. **No Search API**: Only returns papers by date range or DOI
2. **No Query Capability**: Cannot search for "metformin cancer"
3. **Workaround Required**: Would need to download ALL preprints and build local search
4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation

### Europe PMC Advantages

1. **Full Search API**: Boolean queries, filters, facets
2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway
3. **Includes PubMed**: Also has MEDLINE content
4. **34 Preprint Servers**: Not just bioRxiv
5. **Open Access Focus**: Full-text when available

---

## Current Implementation

### What We Have (`src/tools/europepmc.py`)

- REST API search via `www.ebi.ac.uk/europepmc/webservices/rest/search`
- Preprint flagging via `firstPublicationDate` heuristics
- Returns: title, abstract, authors, DOI, source
- Marks preprints for transparency

### Current Limitations

1. **No Full-Text Retrieval**: Only metadata/abstracts
2. **No Citation Network**: Missing references/citations
3. **No Supplementary Files**: Not fetching figures/data
4. **Basic Preprint Detection**: Heuristic, not explicit flag

---

## Europe PMC API Capabilities

### Endpoints We Could Use

| Endpoint | Purpose | Currently Using |
|----------|---------|-----------------|
| `/search` | Query papers | Yes |
| `/{PMCID}/fullTextXML` | Full text (XML) | No |
| `/{PMCID}/supplementaryFiles` | Figures, data | No |
| `/{source}/{ID}/citations` | Who cited this | No |
| `/{source}/{ID}/references` | What this cites | No |
| `annotations_api/annotationsByArticleIds` (separate Annotations API) | Text-mined entities | No |

### Rich Query Syntax

```python
# Current simple query
query = "metformin cancer"

# Could use advanced syntax
query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
query += " AND (SRC:PPR)"  # Only preprints
query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])"  # Date range
query += " AND (OPEN_ACCESS:y)"  # Only open access
```
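The same clauses can be assembled programmatically rather than by string concatenation. The following is a minimal sketch: `build_query` and its parameter names are hypothetical and not part of `src/tools/europepmc.py`; only the field names (`SRC`, `FIRST_PDATE`, `OPEN_ACCESS`) come from the example above.

```python
from datetime import date
from typing import Optional


def build_query(
    terms: str,
    preprints_only: bool = False,
    open_access_only: bool = False,
    date_range: Optional[tuple[date, date]] = None,
) -> str:
    """Combine free-text terms with optional Europe PMC field filters."""
    clauses = [f"({terms})"]
    if preprints_only:
        clauses.append("(SRC:PPR)")
    if date_range is not None:
        start, end = date_range
        clauses.append(f"(FIRST_PDATE:[{start.isoformat()} TO {end.isoformat()}])")
    if open_access_only:
        clauses.append("(OPEN_ACCESS:y)")
    # Every filter narrows the result set, so the clauses are joined with AND.
    return " AND ".join(clauses)
```

Wrapping the user's terms in parentheses keeps any `OR` inside them from binding to the appended filters.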

### Source Filters

```python
# Filter by source
"SRC:MED"     # MEDLINE
"SRC:PMC"     # PubMed Central
"SRC:PPR"     # Preprints (bioRxiv, medRxiv, etc.)
"SRC:AGR"     # Agricola
"SRC:CBA"     # Chinese Biological Abstracts
```

---

## Recommended Improvements

### Phase 1: Rich Metadata

```python
# Add to search results
additional_fields = [
    "citedByCount",           # Impact indicator
    "source",                 # Explicit source (MED, PMC, PPR)
    "isOpenAccess",           # "Y" / "N" flag
    "fullTextUrlList",        # URLs for full text
    "authorAffiliations",     # Institution info
    "grantsList",             # Funding info
]
```
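As a rough illustration of how these fields might be read from a single search result, here is a sketch. `extract_metadata` is a hypothetical helper, and the response layout is an assumption to verify against a live response: `isOpenAccess` is taken to be the string `"Y"` or `"N"`, and `fullTextUrlList` is taken to nest its entries under a `fullTextUrl` key.

```python
from typing import Any


def extract_metadata(result: dict[str, Any]) -> dict[str, Any]:
    """Pull impact, source, and access fields from one search result."""
    # Assumed layout: {"fullTextUrlList": {"fullTextUrl": [{"url": ...}, ...]}}
    url_entries = result.get("fullTextUrlList", {}).get("fullTextUrl", [])
    return {
        "cited_by_count": int(result.get("citedByCount", 0)),
        "source": result.get("source"),
        # Assumed to arrive as the strings "Y" / "N", not a JSON boolean.
        "is_open_access": result.get("isOpenAccess") == "Y",
        "full_text_urls": [entry["url"] for entry in url_entries if "url" in entry],
    }
```

Missing fields fall back to neutral defaults, so a sparse record does not raise.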

### Phase 2: Full-Text Retrieval

```python
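import httpx  # assumed async HTTP client; swap in the project's own

async def get_fulltext(pmcid: str) -> str | None:
    """Get full-text XML for an open access paper, or None if unavailable."""
    # pmcid includes the prefix, e.g. "PMC1234567"; only the XML endpoint is used
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
    async with httpx.AsyncClient(timeout=30.0) as client:
        response = await client.get(url)
    if response.status_code == 404:  # no open access full text for this ID
        return None
    response.raise_for_status()
    return response.text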

### Phase 3: Citation Network

```python
async def get_citations(source: str, ext_id: str) -> list[str]:
    """Get papers that cite this one."""
    # Keyed by source + ID (e.g. "MED" + PMID), not by a bare PMCID
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{ext_id}/citations"

async def get_references(source: str, ext_id: str) -> list[str]:
    """Get papers this one cites."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{source}/{ext_id}/references"
```
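Once a citations or references response is in hand, the linked article IDs still have to be pulled out of it. The sketch below covers only that parsing step, with no network call. `parse_linked_ids` is a hypothetical helper, and the payload layout (`citationList.citation[]` and `referenceList.reference[]`, each entry carrying `id` and `source`) is an assumption that should be checked against a real response.

```python
from typing import Any


def parse_linked_ids(payload: dict[str, Any], kind: str) -> list[str]:
    """Extract "SOURCE:ID" strings from a citations or references payload.

    `kind` is "citation" or "reference"; the entries are assumed to live
    under payload[f"{kind}List"][kind].
    """
    entries = payload.get(f"{kind}List", {}).get(kind, [])
    # Entries without an ID or source cannot be looked up again, so skip them.
    return [
        f'{entry["source"]}:{entry["id"]}'
        for entry in entries
        if "id" in entry and "source" in entry
    ]
```

Keeping the source alongside the ID means each result can be fed straight back into a source-keyed lookup.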

### Phase 4: Text-Mined Annotations

Europe PMC extracts entities automatically:

```python
async def get_annotations(pmcid: str) -> dict:
    """Get text-mined entities (genes, diseases, drugs)."""
    url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
    params = {
        "articleIds": f"PMC:{pmcid}",
        "type": "Gene_Proteins,Diseases,Chemicals",
        "format": "JSON",
    }
    # Returns structured entity mentions with positions
```
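To make the returned mentions usable downstream, they can be grouped by entity type. This is a sketch under assumptions: `group_mentions` is a hypothetical helper, and the response is assumed to be a list of article objects, each with an `annotations` list whose items carry `type` and `exact` (the matched text). Confirm the field names against the Annotations API before relying on them.

```python
from collections import defaultdict
from typing import Any


def group_mentions(articles: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Group unique annotated mentions by entity type, in first-seen order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for article in articles:
        for annotation in article.get("annotations", []):
            entity_type = annotation.get("type")
            mention = annotation.get("exact")
            # Skip incomplete annotations and repeated mentions of the same text.
            if entity_type and mention and mention not in grouped[entity_type]:
                grouped[entity_type].append(mention)
    return dict(grouped)
```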

---

## Supplementary File Retrieval

From reference repo (`bioinformatics_tools.py` lines 123-149):

```python
def get_figures(pmcid: str) -> dict[str, str]:
    """Download figures and supplementary files."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
    # The endpoint returns a ZIP archive; the reference code returns each file base64-encoded
```
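The unpacking step can be sketched with the standard library alone. This is not the reference repo's code: `zip_to_base64` is a hypothetical helper that takes the raw ZIP bytes, however they were downloaded, and returns a mapping of file name to base64 text.

```python
import base64
import io
import zipfile


def zip_to_base64(zip_bytes: bytes) -> dict[str, str]:
    """Map each file in a ZIP archive to its base64-encoded contents."""
    files: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            encoded = base64.b64encode(archive.read(info))
            files[info.filename] = encoded.decode("ascii")
    return files
```

Reading from an in-memory buffer avoids writing temporary files, which suits small figure archives; large supplementary bundles may be better streamed to disk.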

---

## Preprint-Specific Features

### Identify Preprint Servers

```python
PREPRINT_SOURCES = {
    "PPR": "General preprints",
    "bioRxiv": "Biology preprints",
    "medRxiv": "Medical preprints",
    "chemRxiv": "Chemistry preprints",
    "Research Square": "Multi-disciplinary",
    "Preprints.org": "MDPI preprints",
}

# Check if published version exists
async def check_published_version(preprint_doi: str) -> str | None:
    """Check if preprint has been peer-reviewed and published."""
    # Europe PMC links preprints to final versions
```
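With the `source` field available on each result, preprints can be flagged explicitly instead of inferred from dates. A minimal sketch, assuming preprint records carry `source == "PPR"` as in the source filters listed earlier; `is_preprint` and `label_result` are hypothetical names.

```python
from typing import Any


def is_preprint(result: dict[str, Any]) -> bool:
    """True when the record comes from Europe PMC's preprint source (PPR)."""
    return result.get("source") == "PPR"


def label_result(result: dict[str, Any]) -> str:
    """Prefix preprint titles so readers can see they are not peer-reviewed."""
    title = result.get("title", "(untitled)")
    return f"[PREPRINT] {title}" if is_preprint(result) else title
```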

---

## Rate Limiting

Europe PMC is more generous than NCBI:

```python
# No documented hard limit, but be respectful
# Recommend: 10-20 requests/second max
# Use email in User-Agent for polite pool
headers = {
    "User-Agent": "DeepBoner/1.0 (mailto:[email protected])"
}
```
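One way to stay under a self-imposed request rate is to space calls evenly. The sketch below is an assumption about how that could be done, not existing project code: `MinIntervalLimiter` is a hypothetical class, and the rate passed to it would be whatever ceiling the project settles on.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Allow at most `rate` acquisitions per second by spacing them evenly."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._next_slot = 0.0  # monotonic time at which the next call may start
        self._lock = asyncio.Lock()

    async def acquire(self) -> None:
        """Wait until the next free slot, then reserve the one after it."""
        async with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            if wait > 0:
                await asyncio.sleep(wait)
                now = time.monotonic()
            self._next_slot = max(now, self._next_slot) + self._interval
```

Each request would call `await limiter.acquire()` before sending. Creating the limiter inside the running event loop avoids loop-binding problems on older Python versions.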

---

## vs. The Lens & OpenAlex

| Feature | Europe PMC | The Lens | OpenAlex |
|---------|------------|----------|----------|
| Biomedical Focus | Yes | Partial | Partial |
| Preprints | Yes (34 servers) | Yes | Yes |
| Full Text | PMC papers | Links | No |
| Citations | Yes | Yes | Yes |
| Annotations | Yes (text-mined) | No | No |
| Rate Limits | Generous | Moderate | Very generous |
| API Key | Optional | Required | Optional |

---

## Sources

- [Europe PMC REST API](https://europepmc.org/RestfulWebService)
- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
- [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)