isy-thl
/

multilingual-e5-base-course-skill-tuned

pascalhuerten committed on Dec 29, 2025

Commit 4d074f6 (verified) · 1 Parent(s): 0692b8c

Update README.md

Files changed (1)

README.md +4 -1

@@ -110,11 +110,14 @@ db.similarity_search_with_relevance_scores(query, 20)
 ### Finetuning Dataset
-  - The model was finetuned on the [German Course Competency Alignment Dataset](https://huggingface.co/datasets/isy-thl/course_competency_alignment_de), which includes alignments of course descriptions to the skill taxonomies of ESCO (European Skills, Competences, Qualifications and Occupations) and GRETA (a competency model for professional teaching competencies in adult education).
   - This dataset was compiled as part of the **WISY@KI** project, with major contributions from the **Institut für Interaktive Systeme** at the **University of Applied Sciences Lübeck**, the **Kursportal Schleswig-Holstein**, and **Weiterbildung Hessen eV**. Special thanks to colleagues from **MyEduLife** and **Trainspot**.
 ### Finetuning Process
 - **Hardware Used:**
   - Single NVIDIA T4 GPU with 15 GB VRAM.
 - **Duration:**

 ### Finetuning Dataset
+  - The model was finetuned with data from the [German Course Competency Alignment Dataset](https://huggingface.co/datasets/isy-thl/course_competency_alignment_de), which aligns course descriptions with the skill taxonomies of ESCO (European Skills, Competences, Qualifications and Occupations) and GRETA (a competency model for professional teaching competencies in adult education). About 100 additional pairs of course descriptions and relevant ESCO skills from the databases of **Kursportal Schleswig-Holstein** and **Weiterbildung Hessen eV** were also used during training but were not cleared for inclusion in the public dataset.
+  - The best training results were achieved when the human-validated pairs of course descriptions and relevant skills were supplemented with only about 1500 random samples from the *ESCO Skill Relations* subset. The dataset used for finetuning therefore comprised about 2000 samples in total.
   - This dataset was compiled as part of the **WISY@KI** project, with major contributions from the **Institut für Interaktive Systeme** at the **University of Applied Sciences Lübeck**, the **Kursportal Schleswig-Holstein**, and **Weiterbildung Hessen eV**. Special thanks to colleagues from **MyEduLife** and **Trainspot**.
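
The data mix described in the added lines (all human-validated pairs plus a fixed-size random draw from the relations subset) could be assembled roughly as follows. This is a minimal sketch, not the authors' script: the function names, the seed, the toy data, and the `query`/`pos` record layout are assumptions made for illustration, and the counts are deliberately tiny placeholders rather than the real dataset sizes.

```python
import random


def build_training_mix(validated_pairs, relation_pairs, n_relation_samples, seed=42):
    """Keep every validated pair and add a random subset of relation pairs.

    All names and the seed are hypothetical; only the idea (validated pairs
    supplemented with a capped random sample) comes from the README text.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    k = min(n_relation_samples, len(relation_pairs))
    sampled = rng.sample(relation_pairs, k)  # sampling without replacement
    mix = list(validated_pairs) + sampled
    rng.shuffle(mix)  # avoid training on the two sources in separate blocks
    return mix


def to_record(query, positive):
    """Assumed record layout: one query with a list of positive passages."""
    return {"query": query, "pos": [positive]}


# Toy stand-ins for the real data (placeholder sizes, not the actual counts).
validated = [(f"course description {i}", f"skill {i}") for i in range(5)]
relations = [(f"skill {i}", f"related skill {i}") for i in range(50)]

mix = build_training_mix(validated, relations, n_relation_samples=15)
records = [to_record(q, p) for q, p in mix]
print(len(records))  # 5 validated + 15 sampled = 20
```

With the figures from the README, `n_relation_samples` would be about 1500, and the result would hold about 2000 samples once the validated pairs are included.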
 ### Finetuning Process
+For finetuning, the scripts included in the [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding/tree/master) repository were used with the following environment details and training parameters:
 - **Hardware Used:**
   - Single NVIDIA T4 GPU with 15 GB VRAM.
 - **Duration:**