---
license: mit
language:
- pt
tags:
- gervasio-pt*
- gervasio-ptpt
- gervasio-8b-portuguese-ptpt-decoder
- portulan
- albertina-pt*
- serafim-pt*
- clm
- gpt
- portuguese
- decoder
- foundation model
base_model:
- meta-llama/Llama-3.1-8B-Instruct
base_model_relation: quantized
pipeline_tag: text-generation
library_name: transformers
---
<br>
<br>
<img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png">
<p style="text-align: center;"> This is the model card for <b>Gervásio 8B PTPT</b> decoder quantized at 4 bits.
<br>This model is integrated in the <a href="https://evaristo.ai"><b>Evaristo.ai chatbot</b></a>, where its generative capabilities can be experimented with on the fly through a GUI.
<br>You may be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders), Gervásio (decoders) and Serafim (sentence encoder) families</a>.
</p>
<br>
<br>

<img width="500" src="logo_gervasio_long_color.png">

# Gervásio 8B PTPT

<br/>

**Gervásio 8B PTPT** is an **open** decoder for the **Portuguese language**.

It is a **decoder** of the LLaMA family, based on the Transformer neural architecture and developed from the LLaMA 3.1 8B Instruct model.
It was further improved through additional training on language resources that include data sets of Portuguese prepared for this purpose, namely [extraGLUE-Instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct), as well as other data sets whose release is being prepared (MMLU PT, Natural Instructions PT, a Wikipedia subset, Provérbios PT).

**Gervásio 8B PTPT** is distributed free of charge under an open license, for research and commercial purposes alike, and, given its size, it can be run on consumer-grade hardware.

**Gervásio 8B PTPT** is developed by NLX-Natural Language and Speech Group, at the University of Lisbon, Faculty of Sciences, Department of Informatics, Portugal.

For the record, its full name is **Gervásio Produz Textos em Português**, to which corresponds the natural acronym **GPT PT**,
and which is known more shortly as **Gervásio PT*** or, even more briefly, just as **Gervásio**, among its acquaintances.

**Gervásio 8B PTPT** is developed by a team from the University of Lisbon, Portugal.

The model in this repository is a version of **Gervásio 8B PTPT** quantized at 4 bits (Q4_K_M).
The non-quantized version can be found [here](https://huggingface.co/PORTULAN/gervasio-8b-portuguese-ptpt-decoder).

<br>

# Model Description

The model has 8 billion parameters, over 32 layers, with a hidden size of 4096, an intermediate size of 14336, and 32 attention heads. It uses rotary position embeddings (RoPE) and a tokenizer with a vocabulary of 128256 tokens.

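These figures can be cross-checked against the configuration of the non-quantized model, for instance as sketched below; this assumes that configuration is readable with `transformers`, and the values shown are simply those stated above:

```python3
>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained('PORTULAN/gervasio-8b-portuguese-ptpt-decoder')
>>> config.num_hidden_layers, config.hidden_size, config.intermediate_size
(32, 4096, 14336)
>>> config.num_attention_heads, config.vocab_size
(32, 128256)
```
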
# Training Data

**Gervásio 8B PTPT** was trained on various datasets, either native to European Portuguese or translated into European Portuguese.
For the latter, we selected only those datasets where the outcome of their translation into European Portuguese could preserve, in the target language, the linguistic properties at stake.

The training data comprises:
- [extraGLUE-Instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct) (already publicly released; see the loading sketch after this list).
- MMLU PT (multiple choice question answering).
- A subset of Natural Instructions (mostly multiple choice question answering tasks).
- A manually curated subset of Wikipedia.
- A manually curated list of proverbs.

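As a minimal sketch, and assuming no particular configuration name is required by the dataset repository, the released extraGLUE-Instruct data can be loaded with the `datasets` library:

```python3
>>> from datasets import load_dataset
>>> extraglue_instruct = load_dataset('PORTULAN/extraglue-instruct')
>>> extraglue_instruct
```
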
# Training Details

We applied supervised fine-tuning with a causal language modeling training objective, following a zero-out technique: while the entire prompt and chat template received attention during fine-tuning, only the response tokens were subjected to back-propagation.

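In practice, this zero-out technique amounts to masking the labels of the prompt tokens so that they are ignored by the loss. The sketch below illustrates the idea; `IGNORE_INDEX` and `build_labels` are illustrative names and not part of the released training code:

```python3
# Minimal sketch of the zero-out labeling described above (illustrative only).
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss

def build_labels(prompt_ids, response_ids):
    # The model attends to the full sequence (prompt and chat template plus
    # response), but only the response tokens contribute to back-propagation.
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}
```
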
To accelerate training, the Fully Sharded Data Parallel (FSDP) paradigm was used across 10 L40S GPUs.

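For illustration only, the sketch below shows the basic pattern of sharding a causal language model with PyTorch's FSDP, launched with one process per GPU (e.g. via `torchrun`); the exact sharding and wrapping policies used to train Gervásio are not detailed here:

```python3
# Illustrative FSDP wrapping only; not the actual training configuration.
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")  # one process per GPU, e.g. launched with torchrun
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = FSDP(model.cuda())  # shards parameters, gradients and optimizer state across ranks
```
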
# Performance

For testing, we use translations of the standard benchmarks GPQA Diamond, MMLU and MMLU Pro, as well as the CoPA, MRPC and RTE datasets in [extraGLUE](https://huggingface.co/datasets/PORTULAN/extraglue).

| Model                    | GPQA Diamond PT | MMLU PT   | MMLU Pro PT | CoPA      | MRPC      | RTE       | Average   |
| ------------------------ | --------------: | --------: | ----------: | --------: | --------: | --------: | --------: |
| Gervásio 8B PTPT (4 bit) | 28.79           | 39.25     | 12.62       | 85.00     | 74.20     | **80.14** | 53.33     |
| Gervásio 8B PTPT         | **34.85**       | **62.15** | **36.79**   | **87.00** | **77.45** | 77.62     | **62.64** |
| LLaMA 3.1 8B Instruct    | 32.32           | 61.49     | 36.10       | 83.00     | 75.25     | 79.42     | 61.26     |

# How to use

You can use this model directly with a pipeline for causal language modeling:

```python3
>>> from transformers import pipeline
>>> generator = pipeline(model='PORTULAN/gervasio-8b-portuguese-ptpt-decoder-quantized-4bit')
>>> generator("A comida portuguesa é", max_new_tokens=10)
```

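Since fine-tuning covered a chat template, the model can also be queried in chat style. The sketch below assumes a recent version of `transformers`, in which the text-generation pipeline accepts a list of chat messages directly; the example question is merely illustrative:

```python3
>>> from transformers import pipeline
>>> generator = pipeline('text-generation', model='PORTULAN/gervasio-8b-portuguese-ptpt-decoder-quantized-4bit')
>>> messages = [{"role": "user", "content": "Indica três pratos típicos da cozinha portuguesa."}]
>>> generator(messages, max_new_tokens=100)
```
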
# Please cite

```latex
@misc{gervasio,
  title={Advancing Generative AI for Portuguese with
         Open Decoder Gervásio PT-*},
  author={Rodrigo Santos and João Silva and Luís Gomes and
          João Rodrigues and António Branco},
  year={2024},
  eprint={2402.18766},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Please use the above canonical reference when using or citing this model.

# Acknowledgments

The research reported here was partially supported by:
PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e a Tecnologia under the grant PINFRA/22117/2016;
the innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização;
the research project "Hey, Hal, curb your hallucination! / Enhancing AI chatbots with enhanced RAG solutions", funded by FCT—Fundação para a Ciência e a Tecnologia under the grant 2024.07592.IACDC;
and the project "CLARIN – Infraestrutura de Investigação para a Ciência e Tecnologia da Linguagem", funded by the Lisboa2030 programme under the grant LISBOA2030-FEDER-01316900PORTULAN.