Commit 73d1f43 by ponteineptique · verified · 1 Parent(s): a6112a6

Update README.md

Files changed (1)
  1. README.md +17 -0
@@ -11,6 +11,23 @@ pinned: false
 
  **CoMMA** is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.
 
+ Original paper:
+
+ ```bib
+ @unpublished{clerice:hal-05299220,
+ TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
+ AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
+ URL = {https://inria.hal.science/hal-05299220},
+ NOTE = {working paper or preprint},
+ YEAR = {2025},
+ MONTH = Oct,
+ KEYWORDS = {Automatic Text Recognition Medieval manuscripts Latin French Digital humanities Corpus ; Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
+ PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
+ HAL_ID = {hal-05299220},
+ HAL_VERSION = {v1},
+ }
+ ```
+
  ### 🏛️ What’s Inside
  * **Multilingual Corpora:** Annotated texts in Old French and Medieval Latin.
  * **Specialized Models:** Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.