VibecoderMcSwaggins

rebrand: DeepCritical → DeepBoner (sexual health research agent)

5d12635 20 days ago

	# Europe PMC Tool: Current State & Future Improvements

	Status: Currently Implemented (Replaced bioRxiv)
	Priority: High (Preprint + Open Access Source)

	---

	## Why Europe PMC Over bioRxiv?

	### bioRxiv API Limitations (Why We Abandoned It)

	1. No Search API: Only returns papers by date range or DOI
	2. No Query Capability: Cannot search for "metformin cancer"
	3. Workaround Required: Would need to download ALL preprints and build local search
	4. Known Issue: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation

	### Europe PMC Advantages

	1. Full Search API: Boolean queries, filters, facets
	2. Aggregates bioRxiv: Includes bioRxiv, medRxiv content anyway
	3. Includes PubMed: Also has MEDLINE content
	4. 34 Preprint Servers: Not just bioRxiv
	5. Open Access Focus: Full-text when available

	---

	## Current Implementation

	### What We Have (`src/tools/europepmc.py`)

	- REST API search via `europepmc.org/webservices/rest/search`
	- Preprint flagging via `firstPublicationDate` heuristics
	- Returns: title, abstract, authors, DOI, source
	- Marks preprints for transparency
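For orientation, the flow listed above can be sketched roughly as follows. This is an illustration, not the contents of `src/tools/europepmc.py`: the helper names `build_search_params` and `normalize_result` are hypothetical, and the response field names (`abstractText`, `authorString`, and so on) are assumed to match what the search endpoint returns with `resultType=core`.

```python
from typing import Any


def build_search_params(query: str, page_size: int = 25) -> dict[str, str]:
    """Query-string parameters for the /search endpoint (hypothetical helper)."""
    return {
        "query": query,
        "format": "json",
        "resultType": "core",  # assumption: "core" results include the abstract
        "pageSize": str(page_size),
    }


def normalize_result(raw: dict[str, Any]) -> dict[str, Any]:
    """Reduce one raw search hit to the fields the tool returns."""
    return {
        "title": raw.get("title", ""),
        "abstract": raw.get("abstractText", ""),
        "authors": raw.get("authorString", ""),
        "doi": raw.get("doi"),  # not every record has a DOI
        "source": raw.get("source", ""),
    }
```

Keeping parameter building and result normalization as pure functions makes both easy to unit test without touching the network.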

	### Current Limitations

	1. No Full-Text Retrieval: Only metadata/abstracts
	2. No Citation Network: Missing references/citations
	3. No Supplementary Files: Not fetching figures/data
	4. Basic Preprint Detection: Heuristic, not explicit flag

	---

	## Europe PMC API Capabilities

	### Endpoints We Could Use

	| Endpoint | Purpose | Currently Using |
	|----------|---------|-----------------|
	| `/search` | Query papers | Yes |
	| `/{PMCID}/fullTextXML` | Full text (XML) | No |
	| `/{PMCID}/supplementaryFiles` | Figures, data | No |
	| `/{source}/{ID}/citations` | Who cited this | No |
	| `/{source}/{ID}/references` | What this cites | No |
	| `annotations_api/annotationsByArticleIds` (separate Annotations API) | Text-mined entities | No |

	### Rich Query Syntax

	```python
	# Current simple query
	query = "metformin cancer"

	# Could use advanced syntax
	query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
	query += " AND (SRC:PPR)" # Only preprints
	query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])" # Date range
	query += " AND (OPEN_ACCESS:y)" # Only open access
	```
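The clauses above compose mechanically, so a small builder keeps call sites readable. The following is a minimal sketch; `build_query` and its keyword arguments are hypothetical names, and only the field syntax already shown above (`SRC:PPR`, `FIRST_PDATE`, `OPEN_ACCESS:y`) is used.

```python
from __future__ import annotations


def build_query(
    terms: str,
    *,
    preprints_only: bool = False,
    date_from: str | None = None,
    date_to: str | None = None,
    open_access_only: bool = False,
) -> str:
    """Compose a Europe PMC query string from optional filters."""
    clauses = [f"({terms})"]
    if preprints_only:
        clauses.append("(SRC:PPR)")
    if date_from and date_to:
        # Dates are passed through as-is and are expected as YYYY-MM-DD
        clauses.append(f"(FIRST_PDATE:[{date_from} TO {date_to}])")
    if open_access_only:
        clauses.append("(OPEN_ACCESS:y)")
    return " AND ".join(clauses)
```

For example, `build_query("metformin cancer", preprints_only=True)` yields `(metformin cancer) AND (SRC:PPR)`.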

	### Source Filters

	```python
	# Filter by source
	"SRC:MED" # MEDLINE
	"SRC:PMC" # PubMed Central
	"SRC:PPR" # Preprints (bioRxiv, medRxiv, etc.)
	"SRC:AGR" # Agricola
	"SRC:CBA" # Chinese Biological Abstracts
	```

	---

	## Recommended Improvements

	### Phase 1: Rich Metadata

	```python
	# Add to search results
	additional_fields = [
	    "citedByCount",        # Impact indicator
	    "source",              # Explicit source (MED, PMC, PPR)
	    "isOpenAccess",        # Boolean flag
	    "fullTextUrlList",     # URLs for full text
	    "authorAffiliations",  # Institution info
	    "grantsList",          # Funding info
	]
	```
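A sketch of how some of these fields could be pulled out of a raw search hit is below. `extract_rich_metadata` is a hypothetical helper, and the shapes are assumptions about the `resultType=core` response: `isOpenAccess` as a `"Y"`/`"N"` string and `fullTextUrlList` as `{"fullTextUrl": [{"url": ...}]}`. Affiliations and grants are left out because their nesting is not assumed here.

```python
from typing import Any


def extract_rich_metadata(raw: dict[str, Any]) -> dict[str, Any]:
    """Pull impact, access, and full-text link fields from one raw hit."""
    urls = [
        entry["url"]
        for entry in raw.get("fullTextUrlList", {}).get("fullTextUrl", [])
        if "url" in entry
    ]
    return {
        "cited_by_count": int(raw.get("citedByCount", 0)),
        "source": raw.get("source"),
        # Assumption: the API encodes the flag as "Y" or "N", not a JSON boolean
        "is_open_access": raw.get("isOpenAccess") == "Y",
        "full_text_urls": urls,
    }
```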

	### Phase 2: Full-Text Retrieval

	```python
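	import httpx  # assumed HTTP client; swap in whatever the project uses

	async def get_fulltext(pmcid: str) -> str | None:
	    """Get full-text XML for an open access paper (pmcid like "PMC1234567")."""
	    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
	    # A JSON variant (.../{pmcid}/fullTextJSON) was also considered; verify it exists first
	    async with httpx.AsyncClient(timeout=30.0) as client:
	        response = await client.get(url)
	    # Assumption: papers without open full text come back as a non-200 status
	    return response.text if response.status_code == 200 else None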
	```
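Once the XML is retrieved, it still has to be flattened before it can be handed to an LLM or a search index. The sketch below assumes the payload is JATS-style XML with paragraphs as `<p>` elements under a `<body>` element and no XML namespace on those tags; `extract_body_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def extract_body_text(fulltext_xml: str) -> str:
    """Join the text of every <p> inside <body> into plain paragraphs."""
    root = ET.fromstring(fulltext_xml)
    body = root.find(".//body")
    if body is None:
        return ""
    paragraphs = []
    for p in body.iter("p"):
        # itertext() includes text inside inline tags such as <italic>;
        # split/join collapses runs of whitespace
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)
```

Restricting the walk to `<body>` skips the abstract and front matter, which the search results already provide.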

	### Phase 3: Citation Network

	```python
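	import httpx  # assumed HTTP client

	REST = "https://www.ebi.ac.uk/europepmc/webservices/rest"

	async def get_citations(ext_id: str, source: str = "MED") -> list[str]:
	    """Get IDs of papers that cite this one; articles are addressed as {source}/{id}."""
	    async with httpx.AsyncClient(timeout=30.0) as client:
	        resp = await client.get(f"{REST}/{source}/{ext_id}/citations", params={"format": "json"})
	    # Assumed response shape: {"citationList": {"citation": [{"id": ...}, ...]}}
	    return [c["id"] for c in resp.json().get("citationList", {}).get("citation", [])]

	async def get_references(ext_id: str, source: str = "MED") -> list[str]:
	    """Get IDs of papers this one cites (entries without an ID are skipped)."""
	    async with httpx.AsyncClient(timeout=30.0) as client:
	        resp = await client.get(f"{REST}/{source}/{ext_id}/references", params={"format": "json"})
	    return [r["id"] for r in resp.json().get("referenceList", {}).get("reference", []) if "id" in r]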
	```
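The two lookups combine naturally into a directed edge list around one paper. The sketch below works on already-decoded JSON payloads so it can be tested offline. `citation_edges` is a hypothetical helper, and the payload shapes (`citationList.citation[]` and `referenceList.reference[]`, each entry optionally carrying an `id`) are assumptions about the API response.

```python
from typing import Any


def citation_edges(
    paper_id: str,
    citations_payload: dict[str, Any],
    references_payload: dict[str, Any],
) -> list[tuple[str, str]]:
    """Return directed (citing, cited) pairs around paper_id."""
    edges: list[tuple[str, str]] = []
    for item in citations_payload.get("citationList", {}).get("citation", []):
        if "id" in item:
            edges.append((item["id"], paper_id))  # someone cites us
    for item in references_payload.get("referenceList", {}).get("reference", []):
        if "id" in item:  # unmatched references may carry no ID
            edges.append((paper_id, item["id"]))  # we cite someone
    return edges
```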

	### Phase 4: Text-Mined Annotations

	Europe PMC extracts entities automatically:

	```python
	async def get_annotations(pmcid: str) -> dict:
	    """Get text-mined entities (genes, diseases, drugs)."""
	    url = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
	    params = {
	        "articleIds": f"PMC:{pmcid}",
	        "type": "Gene_Proteins,Diseases,Chemicals",
	        "format": "JSON",
	    }
	    # Returns structured entity mentions with positions
	```
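For downstream use, the raw mentions are more useful when grouped by entity type. The sketch below is offline-only and rests on an assumed payload shape: a list with one entry per article, each holding an `annotations` list whose items carry `type` and `exact` (the matched text). `group_entities` is a hypothetical helper; check the shape against a real response before relying on it.

```python
from collections import defaultdict
from typing import Any


def group_entities(payload: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Collect unique mention strings per entity type, sorted alphabetically."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for article in payload:
        for annotation in article.get("annotations", []):
            entity_type = annotation.get("type")
            text = annotation.get("exact")
            if entity_type and text:
                grouped[entity_type].add(text)
    return {entity_type: sorted(texts) for entity_type, texts in grouped.items()}
```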

	---

	## Supplementary File Retrieval

	From reference repo (`bioinformatics_tools.py` lines 123-149):

	```python
	def get_figures(pmcid: str) -> dict[str, str]:
	    """Download figures and supplementary files."""
	    url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
	    # The endpoint returns a ZIP archive; the reference code returns each image base64-encoded
	```
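The unpacking step can be sketched independently of the download. The helper below (`zip_to_base64`, a hypothetical name) takes the raw ZIP bytes and returns a mapping from member filename to base64 text, using only the standard library. It is not the code from the reference repo.

```python
import base64
import io
import zipfile


def zip_to_base64(archive_bytes: bytes) -> dict[str, str]:
    """Map each file in a ZIP archive to its base64-encoded contents."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return {
            info.filename: base64.b64encode(archive.read(info)).decode("ascii")
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries
        }
```

Base64 text is convenient for passing images to a model or embedding them in JSON, at the cost of roughly a third more size than the raw bytes.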

	---

	## Preprint-Specific Features

	### Identify Preprint Servers

	```python
	PREPRINT_SOURCES = {
	    "PPR": "General preprints",
	    "bioRxiv": "Biology preprints",
	    "medRxiv": "Medical preprints",
	    "chemRxiv": "Chemistry preprints",
	    "Research Square": "Multi-disciplinary",
	    "Preprints.org": "MDPI preprints",
	}

	# Check if published version exists
	async def check_published_version(preprint_doi: str) -> str | None:
	    """Check if preprint has been peer-reviewed and published."""
	    # Europe PMC links preprints to final versions
	```
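An explicit check could replace the current date heuristic. The sketch below treats a record as a preprint when its `source` is `PPR`, and falls back to a publication-type check. The fallback is an assumption: it presumes records expose `pubTypeList.pubType` with a `"Preprint"` entry. `is_preprint` is a hypothetical helper.

```python
from typing import Any


def is_preprint(record: dict[str, Any]) -> bool:
    """Explicit preprint check based on source code, then publication type."""
    if record.get("source") == "PPR":
        return True
    pub_types = record.get("pubTypeList", {}).get("pubType", [])
    if isinstance(pub_types, str):  # tolerate a single value instead of a list
        pub_types = [pub_types]
    return any(pub_type.lower() == "preprint" for pub_type in pub_types)
```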

	---

	## Rate Limiting

	Europe PMC is more generous than NCBI E-utilities, which allow 3 requests/second without an API key and 10 with one:

	```python
	# No documented hard limit, but be respectful
	# Recommend: 10-20 requests/second max
	# Include a contact email in the User-Agent as a courtesy
	headers = {
	    "User-Agent": "DeepBoner/1.0 (mailto:[email protected])"
	}
	```
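One way to stay under a self-imposed request rate is a small client-side throttle. The sketch below is a minimal illustration: the `Throttle` class is hypothetical, the rate is whatever the caller chooses, and it only spaces out request start times within a single process.

```python
import asyncio
import time


class Throttle:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._next_slot = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it
        async with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_slot - now)
            self._next_slot = max(now, self._next_slot) + self._interval
        if delay:
            await asyncio.sleep(delay)
```

A caller would create one `Throttle(rate=10)` inside the running event loop and `await throttle.wait()` before each request.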

	---

	## vs. The Lens & OpenAlex

	| Feature | Europe PMC | The Lens | OpenAlex |
	|---------|------------|----------|----------|
	| Biomedical Focus | Yes | Partial | Partial |
	| Preprints | Yes (34 servers) | Yes | Yes |
	| Full Text | PMC papers | Links | No |
	| Citations | Yes | Yes | Yes |
	| Annotations | Yes (text-mined) | No | No |
	| Rate Limits | Generous | Moderate | Very generous |
	| API Key | Optional | Required | Optional |

	---

	## Sources

	- [Europe PMC REST API](https://europepmc.org/RestfulWebService)
	- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
	- [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
	- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
	- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)