ponteineptique committed about 1 month ago

Commit 73d1f43 (verified) · 1 Parent(s): a6112a6

Update README.md

Files changed (1): README.md

@@ -11,6 +11,23 @@ pinned: false
 **CoMMA** is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.
+Original paper:
+```bib
+@unpublished{clerice:hal-05299220,
+  TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
+  AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
+  URL = {https://inria.hal.science/hal-05299220},
+  NOTE = {working paper or preprint},
+  YEAR = {2025},
+  MONTH = Oct,
+  KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
+  PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
+  HAL_ID = {hal-05299220},
+  HAL_VERSION = {v1},
+}
+```
 ### 🏛️ What’s Inside
 * **Multilingual Corpora:** Annotated texts in Old French and Medieval Latin.
 * **Specialized Models:** Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.