Update README.md
README.md
CHANGED
|
@@ -11,6 +11,23 @@ pinned: false
|
|
| 11 |
|
| 12 |
**CoMMA** is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.
|
| 13 |
| 14 |
### 🏛️ What’s Inside
|
| 15 |
* **Multilingual Corpora:** Annotated texts in Old French and Medieval Latin.
|
| 16 |
* **Specialized Models:** Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.
|
| 11 |
|
| 12 |
**CoMMA** is an initiative dedicated to the computational study of medieval primary sources. We bridge the gap between traditional philology and NLP by providing standardized, machine-readable datasets of archival texts.
|
| 13 |
|
| 14 |
+
Original paper:
|
| 15 |
+
|
| 16 |
+
```bib
|
| 17 |
+
@unpublished{clerice:hal-05299220,
|
| 18 |
+
TITLE = {{CoMMA, a Large-scale Corpus of Multilingual Medieval Archives}},
|
| 19 |
+
AUTHOR = {Cl{\'e}rice, Thibault and Gabay, Simon and Vlachou-Efstathiou, Malamatenia and Pinche, Ariane and Sagot, Beno{\^i}t},
|
| 20 |
+
URL = {https://inria.hal.science/hal-05299220},
|
| 21 |
+
NOTE = {working paper or preprint},
|
| 22 |
+
YEAR = {2025},
|
| 23 |
+
MONTH = Oct,
|
| 24 |
+
KEYWORDS = {Automatic Text Recognition ; Medieval manuscripts ; Latin ; French ; Digital humanities ; Corpus},
|
| 25 |
+
PDF = {https://inria.hal.science/hal-05299220v1/file/Latin_and_Old_French_Manuscripts-8.pdf},
|
| 26 |
+
HAL_ID = {hal-05299220},
|
| 27 |
+
HAL_VERSION = {v1},
|
| 28 |
+
}
|
| 29 |
+
```
|
| 30 |
+
|
| 31 |
### 🏛️ What’s Inside
|
| 32 |
* **Multilingual Corpora:** Annotated texts in Old French and Medieval Latin.
|
| 33 |
* **Specialized Models:** Tokenizers and fine-tuned embedding models optimized for non-standardized orthography and medieval syntax.
|