---
license: apache-2.0
base_model:
- meta-llama/Meta-Llama-3-8B-Instruct
language:
- es
tags:
- BEL
- retrieval
- entity-retrieval
- named-entity-disambiguation
- entity-disambiguation
- named-entity-linking
- entity-linking
- text2text-generation
- biomedical
- healthcare
- synthetic-data
- causal-lm
- llm
library_name: transformers
finetuning_task:
- text2text-generation
- entity-linking
---
|
|
|
|
|
|
|
|
# SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking |
|
|
|
|
|
## SynCABEL |
|
|
|
|
|
**SynCABEL** is a framework that addresses data scarcity in biomedical entity linking through **synthetic data generation**. The method is introduced in our [paper].
|
|
|
|
|
## SynCABEL (SPACCC Edition) |
|
|
|
|
|
This is a **fine-tuned version of LLaMA-3-8B**, trained on **SPACCC** together with **[SynthSPACCC](https://huggingface.co/datasets/AnonymousARR42/SynCABEL)**, our synthetic dataset generated via the SynCABEL framework.
|
|
|
|
|
| | |
|--------|---------|
| **Base Model** | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| **Training Data** | SPACCC (real) + [SynthSPACCC](https://huggingface.co/datasets/AnonymousARR42/SynCABEL) (synthetic) |
| **Fine-tuning** | [Supervised Fine-Tuning](https://huggingface.co/docs/trl/en/sft_trainer) |
|
|
|
|
|
## Training Data Composition |
|
|
|
|
|
The model is trained on a mix of **human-annotated** and **synthetic** data: |
|
|
|
|
|
```
SPACCC (human)          :    27,799 examples
SynthSPACCC (synthetic) : 1,813,463 examples
```
|
|
|
|
|
To ensure balanced learning, **human data is upsampled during training** so that each batch contains: |
|
|
|
|
|
```
50% human-annotated data
50% synthetic data
```
|
|
|
|
|
In other words, although SynthSPACCC is far larger, the model always sees a **1:1 ratio of human to synthetic examples**, preventing the synthetic data from overwhelming human supervision.
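As an illustration (not the actual training code), the sketch below shows one way such a 1:1 mix can be produced: the small human-annotated set is drawn with replacement (i.e. upsampled) while the synthetic set is sampled normally. All names (`human_examples`, `synthetic_examples`, `mixed_batches`) are hypothetical.

```python
import random

def mixed_batches(human_examples, synthetic_examples, batch_size, num_batches):
    """Hypothetical sketch: yield batches that are half human, half synthetic.

    The much smaller human-annotated set is drawn with replacement
    (i.e. upsampled), so every batch keeps a 1:1 human-to-synthetic ratio.
    """
    half = batch_size // 2
    for _ in range(num_batches):
        human_part = random.choices(human_examples, k=half)      # with replacement
        synth_part = random.sample(synthetic_examples, k=half)   # without replacement
        batch = human_part + synth_part
        random.shuffle(batch)
        yield batch
```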
|
|
|
|
|
|
|
|
## Usage |
|
|
|
|
|
|
|
|
### Loading |
|
|
```python
import torch
from transformers import AutoModelForCausalLM

# Load the model (requires trust_remote_code for the custom architecture)
model = AutoModelForCausalLM.from_pretrained(
    "AnonymousARR42/SynCABEL_SPACCC",
    trust_remote_code=True,
    device_map="auto",
)
```
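The checkpoint is large (see the memory figures at the end of this card), so on a single GPU it can help to load the weights in half precision. This is a standard `from_pretrained` option rather than something specific to SynCABEL:

```python
# Optional: load in bfloat16 to reduce GPU memory usage
model = AutoModelForCausalLM.from_pretrained(
    "AnonymousARR42/SynCABEL_SPACCC",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```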
|
|
|
|
|
### Unconstrained Generation |
|
|
```python
# Let the model freely generate concept names
sentences = [
    "El paciente con [embolia pulmonar masiva]{ENFERMEDAD} presentó signos de dificultad respiratoria.",
    "El paciente se sometió a una [angioplastia coronaria]{PROCEDIMIENTO} para restaurar el flujo sanguíneo."
]

results = model.sample(
    sentences=sentences,
    constrained=False,
    num_beams=2,
)

for i, beam_results in enumerate(results):
    print(f"Input: {sentences[i]}")

    mention = beam_results[0]["mention"]
    print(f"Mention: {mention}")

    for j, result in enumerate(beam_results):
        print(
            f"Beam {j+1}:\n"
            f"Predicted concept name: {result['pred_concept_name']}\n"
            f"Predicted code: {result['pred_concept_code']}\n"
            f"Beam score: {result['beam_score']:.3f}\n"
        )
```
|
|
|
|
|
**Output:** |
|
|
```
Input: El paciente con [embolia pulmonar masiva]{ENFERMEDAD} presentó signos de dificultad respiratoria.
Mention: embolia pulmonar masiva
Beam 1:
Predicted concept name: tromboembolia pulmonar masiva aguda
Predicted code: NO_CODE
Beam score: 0.818

Beam 2:
Predicted concept name: tromboembolia masiva
Predicted code: 58417008
Beam score: 0.816

Input: El paciente se sometió a una [angioplastia coronaria]{PROCEDIMIENTO} para restaurar el flujo sanguíneo.
Mention: angioplastia coronaria
Beam 1:
Predicted concept name: operaciones transluminales en arteria coronaria
Predicted code: NO_CODE
Beam score: 0.764

Beam 2:
Predicted concept name: procedimiento en arteria coronaria
Predicted code: NO_CODE
Beam score: 0.728
```
|
|
|
|
|
### Constrained Decoding (Recommended for Entity Linking) |
|
|
```python
# Constrain generation to valid biomedical concepts
sentences = [
    "El paciente con [embolia pulmonar masiva]{ENFERMEDAD} presentó signos de dificultad respiratoria.",
    "El paciente se sometió a una [angioplastia coronaria]{PROCEDIMIENTO} para restaurar el flujo sanguíneo."
]

results = model.sample(
    sentences=sentences,
    constrained=True,
    num_beams=2,
)

for i, beam_results in enumerate(results):
    print(f"Input: {sentences[i]}")

    mention = beam_results[0]["mention"]
    print(f"Mention: {mention}")

    for j, result in enumerate(beam_results):
        print(
            f"Beam {j+1}:\n"
            f"Predicted concept name: {result['pred_concept_name']}\n"
            f"Predicted code: {result['pred_concept_code']}\n"
            f"Beam score: {result['beam_score']:.3f}\n"
        )
```
|
|
|
|
|
**Output:** |
|
|
```
Input: El paciente con [embolia pulmonar masiva]{ENFERMEDAD} presentó signos de dificultad respiratoria.
Mention: embolia pulmonar masiva
Beam 1:
Predicted concept name: tromboembolia masiva
Predicted code: 58417008
Beam score: 0.816

Beam 2:
Predicted concept name: tromboembolia pulmonar aguda
Predicted code: 707414004
Beam score: 0.763

Input: El paciente se sometió a una [angioplastia coronaria]{PROCEDIMIENTO} para restaurar el flujo sanguíneo.
Mention: angioplastia coronaria
Beam 1:
Predicted concept name: operaciones transluminales en arteria pulmonar
Predicted code: 175266007
Beam score: 0.238

Beam 2:
Predicted concept name: operaciones transluminales en la arteria femoral o poplítea
Predicted code: 265530008
Beam score: 0.182
```
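For readers curious about what `constrained=True` does conceptually: decoding is restricted so that only strings from the target terminology can be produced. The sketch below illustrates the general idea with a token-level prefix trie; it is not the code inside `model.sample`, and the two concept names are just examples.

```python
# Hypothetical sketch of the idea behind constrained decoding: a token-level
# prefix trie over valid concept names restricts which token may come next,
# so beam search can only produce strings from the terminology.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
concept_names = ["tromboembolia masiva", "tromboembolia pulmonar aguda"]

# Build the trie: each path from the root spells out one valid concept name.
trie = {}
for name in concept_names:
    node = trie
    for tok in tokenizer(name, add_special_tokens=False)["input_ids"]:
        node = node.setdefault(tok, {})

def allowed_next_tokens(generated_token_ids):
    """Return the token ids that may follow the tokens generated so far."""
    node = trie
    for tok in generated_token_ids:
        node = node.get(tok, {})
    return list(node.keys()) or [tokenizer.eos_token_id]

# During beam search, a hook such as transformers' `prefix_allowed_tokens_fn`
# argument of `generate` calls a function like this at every decoding step
# to mask out tokens that would leave the terminology.
```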
|
|
|
|
|
## Scores |
|
|
|
|
|
Entity linking performance (Recall@1) on biomedical benchmarks. The best results are shown in **bold**, the second-best results are <u>underlined</u>, and the "Avg." column reports the mean score across the four benchmarks.
|
|
|
|
|
| Model | MM-ST21PV<br>(English) | QUAERO-MEDLINE<br>(French) | QUAERO-EMEA<br>(French) | SPACCC<br>(Spanish) | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: |
| SciSpacy | 53.8 | 40.5 | 37.1 | 13.2 | 36.2 |
| SapBERT | 51.1 | 50.6 | 49.8 | 33.9 | 46.4 |
| CODER-all | 56.6 | 58.7 | 58.1 | 43.7 | 54.3 |
| SapBERT-all | 64.6 | 74.7 | 67.9 | 47.9 | 63.8 |
| ArboEL | <u>74.5</u> | 70.9 | 62.8 | 49.0 | 64.2 |
| mBART-large | 65.5 | 61.5 | 58.6 | 57.7 | 60.8 |
| + Guided inference | 70.0 | 72.8 | 71.1 | 61.8 | 68.9 |
| **+ SynCABEL (Our method)** | 71.5 | 77.1 | <u>75.3</u> | 64.0 | 72.0 |
| Llama-3-8B | 69.0 | 66.4 | 65.5 | 59.9 | 65.2 |
| + Guided inference | 74.4 | <u>77.5</u> | 72.9 | <u>64.2</u> | <u>72.3</u> |
| **+ SynCABEL (Our method)** | **75.4** | **79.7** | **79.0** | **67.0** | **75.3** |
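Recall@1 counts a mention as correct when the top-ranked beam predicts the gold concept code. Below is a minimal sketch of the metric over the `model.sample` output format shown above; `gold_codes` is a hypothetical parallel list of reference codes.

```python
def recall_at_1(results, gold_codes):
    """Fraction of mentions whose top beam predicts the gold concept code.

    `results` follows the model.sample output format: one list of beam
    dicts per input sentence, best beam first. `gold_codes` is a
    parallel list of reference concept codes.
    """
    hits = sum(
        beam_results[0]["pred_concept_code"] == gold
        for beam_results, gold in zip(results, gold_codes)
    )
    return hits / len(gold_codes)
```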
|
|
|
|
|
Here, we provide the source repositories for the baselines:
- [**SciSpacy**](https://github.com/allenai/scispacy)
- [**SapBERT**](https://hf.co/cambridgeltl/SapBERT-from-PubMedBERT-fulltext)
- [**SapBERT-all**](https://hf.co/cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR)
- [**CODER-all**](https://hf.co/GanjinZero/coder_all)
- [**ArboEL**](https://github.com/dhdhagar/arboEL)
- [**mBART-large**](https://hf.co/facebook/mbart-large-50)
- [**LLaMA-3-8B**](https://hf.co/meta-llama/Meta-Llama-3-8B-Instruct)
|
|
|
|
|
|
|
|
### Speed and Memory |
|
|
|
|
|
| Model | Model (GB) | Cand. (GB) | Speed (/s) |
|--------------|------------|------------|------------|
| SapBERT | 2.1 | 20.1 | **575.5** |
| ArboEL | **1.2** | 7.1 | 38.9 |
| mBART | 2.3 | **5.4** | 51.0 |
| Llama-3-8B | 28.6 | **5.4** | 19.1 |
|
|
|
|
|
*Measured on a single H100 GPU with constrained decoding.*