# Europe PMC Tool: Current State & Future Improvements

**Status**: Currently Implemented (Replaced bioRxiv)

**Priority**: High (Preprint + Open Access Source)

---
## Why Europe PMC Over bioRxiv?

### bioRxiv API Limitations (Why We Abandoned It)

1. **No Search API**: Only returns papers by date range or DOI
2. **No Query Capability**: Cannot search by keyword, e.g. "metformin cancer"
3. **Workaround Required**: We would need to download ALL preprints and build a local search index
4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation

### Europe PMC Advantages

1. **Full Search API**: Boolean queries, filters, facets
2. **Aggregates bioRxiv**: Already includes bioRxiv and medRxiv content
3. **Includes PubMed**: Also indexes MEDLINE content
4. **34 Preprint Servers**: Not just bioRxiv
5. **Open Access Focus**: Full text when available

---
## Current Implementation

### What We Have (`src/tools/europepmc.py`)

- REST API search via `europepmc.org/webservices/rest/search`
- Preprint flagging via `firstPublicationDate` heuristics
- Returns: title, abstract, authors, DOI, source
- Marks preprints for transparency

### Current Limitations

1. **No Full-Text Retrieval**: Only metadata and abstracts are returned
2. **No Citation Network**: References and citing papers are not fetched
3. **No Supplementary Files**: Figures and data files are not fetched
4. **Basic Preprint Detection**: Relies on a heuristic rather than an explicit source flag

---
## Europe PMC API Capabilities

### Endpoints We Could Use

Paths are relative to `https://www.ebi.ac.uk/europepmc/webservices/rest`, except the Annotations API, which lives under `https://www.ebi.ac.uk/europepmc/annotations_api`.

| Endpoint | Purpose | Currently Using |
|----------|---------|-----------------|
| `/search` | Query papers | Yes |
| `/{PMCID}/fullTextXML` | Full text (XML) | No |
| `/{PMCID}/supplementaryFiles` | Figures, data | No |
| `/{source}/{ID}/citations` | Who cited this | No |
| `/{source}/{ID}/references` | What this cites | No |
| `/annotationsByArticleIds` (Annotations API) | Text-mined entities | No |
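
As a rough illustration of how the `/search` endpoint is addressed, the sketch below builds a request URL with the standard library only. The helper name `search_url` and the default values are hypothetical; `resultType`, `pageSize`, and `cursorMark` are the query parameters as I understand them from the Europe PMC REST documentation, so treat them as assumptions to verify before relying on them.

```python
from urllib.parse import urlencode

# Base URL as used in the code sketches elsewhere in this document.
REST_BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest"


def search_url(
    query: str,
    result_type: str = "lite",
    page_size: int = 25,
    cursor_mark: str = "*",
) -> str:
    """Build a /search URL that requests one page of JSON results."""
    params = {
        "query": query,
        "format": "json",
        "resultType": result_type,  # assumed values: "idlist", "lite", "core"
        "pageSize": page_size,
        "cursorMark": cursor_mark,  # "*" asks for the first page
    }
    # urlencode percent-escapes the values, so raw query syntax is safe here.
    return f"{REST_BASE}/search?{urlencode(params)}"
```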
### Rich Query Syntax

```python
# Current simple query
query = "metformin cancer"

# Could use advanced field syntax instead
query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
query += " AND (SRC:PPR)"  # Only preprints
query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])"  # Date range
query += " AND (OPEN_ACCESS:y)"  # Only open access
```
### Source Filters

```python
# Filter by source (append to a query with AND)
"SRC:MED"  # MEDLINE
"SRC:PMC"  # PubMed Central
"SRC:PPR"  # Preprints (bioRxiv, medRxiv, etc.)
"SRC:AGR"  # Agricola
"SRC:CBA"  # Chinese Biological Abstracts
```

---
## Recommended Improvements

### Phase 1: Rich Metadata

```python
# Add to search results (several of these need resultType=core)
additional_fields = [
    "citedByCount",  # Impact indicator
    "source",  # Explicit source (MED, PMC, PPR)
    "isOpenAccess",  # Open access flag ("Y" or "N")
    "fullTextUrlList",  # URLs for full text
    "authorAffiliations",  # Institution info
    "grantsList",  # Funding info
]
```
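One way these fields might be consumed is sketched below: a single search result record is reduced to a small dataclass, and the explicit `source` value stands in for the date-based preprint heuristic. `PaperSummary` and `summarize_result` are hypothetical names, and the `"Y"`/`"N"` encoding of `isOpenAccess` is an assumption about the response format.

```python
from dataclasses import dataclass


@dataclass
class PaperSummary:
    title: str
    source: str
    cited_by_count: int
    is_open_access: bool
    is_preprint: bool


def summarize_result(record: dict) -> PaperSummary:
    """Extract a few Phase 1 fields from one search result record."""
    source = record.get("source", "")
    return PaperSummary(
        title=record.get("title", ""),
        source=source,
        cited_by_count=int(record.get("citedByCount", 0)),
        # Assumed: open access is reported as the string "Y" or "N".
        is_open_access=record.get("isOpenAccess") == "Y",
        # Explicit source check instead of a publication-date heuristic.
        is_preprint=source == "PPR",
    )
```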
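### Phase 2: Full-Text Retrieval

```python
import httpx  # assumed HTTP client; swap in whatever the tool already uses

BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


async def get_fulltext(pmcid: str) -> str | None:
    """Get full-text XML for an open access paper, or None if unavailable."""
    # fullTextXML is the documented endpoint; a fullTextJSON variant is unverified
    url = f"{BASE_URL}/{pmcid}/fullTextXML"
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
    return response.text if response.status_code == 200 else None
```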
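Once the XML is retrieved it still has to be turned into text the agent can read. The sketch below assumes the payload is JATS-style XML with an un-namespaced `<body>` containing `<p>` elements, which should be checked against a real response; `extract_body_text` is a hypothetical helper that uses only the standard library.

```python
import xml.etree.ElementTree as ET


def extract_body_text(fulltext_xml: str) -> str:
    """Return the paragraphs of the article <body> as plain text."""
    root = ET.fromstring(fulltext_xml)
    body = root.find(".//body")
    if body is None:
        return ""  # no body section, e.g. metadata-only record
    paragraphs = []
    for p in body.iter("p"):
        # itertext() flattens inline markup such as <italic> or <xref>;
        # split/join collapses runs of whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```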
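### Phase 3: Citation Network

```python
import httpx  # assumed HTTP client; swap in whatever the tool already uses

BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest"


async def get_citations(source: str, ext_id: str) -> list[str]:
    """Get IDs of papers that cite this one (e.g. source="MED", ext_id=PMID)."""
    url = f"{BASE_URL}/{source}/{ext_id}/citations"
    async with httpx.AsyncClient() as client:
        data = (await client.get(url, params={"format": "json"})).json()
    # Response keys are assumed from the API's JSON layout; verify before use
    return [c["id"] for c in data.get("citationList", {}).get("citation", [])]


async def get_references(source: str, ext_id: str) -> list[str]:
    """Get IDs of papers this one cites."""
    url = f"{BASE_URL}/{source}/{ext_id}/references"
    async with httpx.AsyncClient() as client:
        data = (await client.get(url, params={"format": "json"})).json()
    return [r["id"] for r in data.get("referenceList", {}).get("reference", []) if "id" in r]
```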
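Keeping the response parsing separate from the HTTP call makes it testable without network access. The sketch below is built on an assumed JSON layout (`citationList.citation` and `referenceList.reference`, each item carrying an `id`); those key names and the helper `extract_linked_ids` are assumptions to confirm against a live response.

```python
def extract_linked_ids(payload: dict, kind: str) -> list[str]:
    """Pull record IDs out of a citations or references response."""
    # Assumed layout: {"citationList": {"citation": [...]}} and
    # {"referenceList": {"reference": [...]}}.
    keys = {
        "citations": ("citationList", "citation"),
        "references": ("referenceList", "reference"),
    }
    if kind not in keys:
        raise ValueError(f"unsupported kind: {kind}")
    list_key, item_key = keys[kind]
    items = payload.get(list_key, {}).get(item_key, [])
    # Entries without an "id" (e.g. works not indexed) are skipped.
    return [str(item["id"]) for item in items if "id" in item]
```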
### Phase 4: Text-Mined Annotations

Europe PMC extracts entities automatically:

```python
import httpx  # assumed HTTP client; swap in whatever the tool already uses

ANNOTATIONS_URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


async def get_annotations(pmcid: str) -> list[dict]:
    """Get text-mined entities (genes, diseases, drugs)."""
    params = {
        "articleIds": f"PMC:{pmcid}",
        "type": "Gene_Proteins,Diseases,Chemicals",
        "format": "JSON",
    }
    async with httpx.AsyncClient() as client:
        response = await client.get(ANNOTATIONS_URL, params=params)
    # Returns structured entity mentions with positions (assumed: one entry per article)
    return response.json()
```
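For downstream use, the raw mentions are probably more useful grouped by entity type. The sketch below assumes the response is a list of article objects, each with an `annotations` list whose items carry `type` and `exact` (the matched text); that shape and the helper name `group_entities` are assumptions, not confirmed API details.

```python
from collections import defaultdict


def group_entities(articles: list[dict]) -> dict[str, set[str]]:
    """Group annotated terms by entity type across all returned articles."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for article in articles:
        for annotation in article.get("annotations", []):
            entity_type = annotation.get("type")
            term = annotation.get("exact")
            # Skip malformed entries that lack either field.
            if entity_type and term:
                grouped[entity_type].add(term)
    return dict(grouped)
```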
---

## Supplementary File Retrieval

From reference repo (`bioinformatics_tools.py` lines 123-149):

```python
def get_figures(pmcid: str) -> dict[str, str]:
    """Download figures and supplementary files."""
    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
    params = {"includeInlineImage": "true"}
    # The endpoint returns a ZIP archive of images and files;
    # the reference implementation returns each file base64-encoded.
```
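The unpacking step can be sketched independently of the download. Assuming the endpoint's response body is a ZIP archive as described above, the hypothetical helper `zip_to_base64` maps each archived file name to its base64-encoded contents using only the standard library.

```python
import base64
import io
import zipfile


def zip_to_base64(payload: bytes) -> dict[str, str]:
    """Map each file in a ZIP payload to its base64-encoded contents."""
    files: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            encoded = base64.b64encode(archive.read(info))
            files[info.filename] = encoded.decode("ascii")
    return files
```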
---

## Preprint-Specific Features

### Identify Preprint Servers

```python
PREPRINT_SOURCES = {
    "PPR": "General preprints",
    "bioRxiv": "Biology preprints",
    "medRxiv": "Medical preprints",
    "chemRxiv": "Chemistry preprints",
    "Research Square": "Multi-disciplinary",
    "Preprints.org": "MDPI preprints",
}


# Check if published version exists
async def check_published_version(preprint_doi: str) -> str | None:
    """Check if preprint has been peer-reviewed and published."""
    # Europe PMC links preprints to final versions
```

---
## Rate Limiting

Europe PMC is more generous than NCBI, whose E-utilities allow 3 requests/second without an API key and 10 with one:

```python
# No documented hard limit, but be respectful
# Recommend: 10-20 requests/second max
# Include a contact email in the User-Agent as a courtesy
headers = {
    "User-Agent": "DeepBoner/1.0 (mailto:[email protected])"  # use the project's real contact address
}
```
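To stay under a self-imposed ceiling such as the 10 requests/second suggested above, calls can be spaced with a simple minimum-interval limiter. This is a sketch, not part of the current tool: `MinIntervalLimiter` is a hypothetical class, and the injectable `clock` and `sleep` arguments exist only to make it easy to test. Call `await limiter.wait()` before each request.

```python
import asyncio
import time


class MinIntervalLimiter:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate: float, clock=time.monotonic, sleep=asyncio.sleep):
        self._interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._next_slot = 0.0  # earliest time the next call may start

    async def wait(self) -> None:
        """Reserve the next free slot and sleep until it arrives."""
        now = self._clock()
        start = max(now, self._next_slot)
        # Reserve the following slot before sleeping, so concurrent
        # callers queue behind each other instead of colliding.
        self._next_slot = start + self._interval
        if start > now:
            await self._sleep(start - now)
```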
---

## vs. The Lens & OpenAlex

| Feature | Europe PMC | The Lens | OpenAlex |
|---------|------------|----------|----------|
| Biomedical Focus | Yes | Partial | Partial |
| Preprints | Yes (34 servers) | Yes | Yes |
| Full Text | PMC papers | Links | No |
| Citations | Yes | Yes | Yes |
| Annotations | Yes (text-mined) | No | No |
| Rate Limits | Generous | Moderate | Very generous |
| API Key | Optional | Required | Optional |

---
## Sources

- [Europe PMC REST API](https://europepmc.org/RestfulWebService)
- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
- [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)