guillermoruiz
/

bilma_UY

Model card Files Files and versions

bilma_UY / README.md

guillermoruiz's picture

Create README.md

b962ee9 verified almost 2 years ago

|

history blame contribute delete

2.14 kB

	---
	license: mit
	language:
	- es
	metrics:
	- accuracy
	tags:
	- code
	- nlp
	- custom
	- bilma
	tokenizer:
	- yes
	---
	# BILMA (Bert In Latin aMericA)

	Bilma is a BERT implementation in tensorflow and trained on the Masked Language Model task under the
	https://sadit.github.io/regional-spanish-models-talk-2022/ datasets. It is a model trained on regionalized
	Spanish short texts from the Twitter (now X) platform.

	We have pretrained models for the countries of Argentina, Chile, Colombia, Spain, Mexico, United States, Uruguay, and Venezuela.

	The accuracy of the models trained on the MLM task for different regions are:

	![bilma-mlm-comp](https://user-images.githubusercontent.com/392873/163045798-89bd45c5-b654-4f16-b3e2-5cf404e12ddd.png)

	# Pre-requisites

	You will need TensorFlow 2.4 or newer.

	# Quick guide

	Install the following version for the transformers library
	```
	!pip install transformers==4.30.2
	```



	Instanciate the tokenizer and the trained model
	```
	from transformers import AutoTokenizer
	from transformers import TFAutoModel

	tok = AutoTokenizer.from_pretrained("guillermoruiz/bilma_mx")
	model = TFAutoModel.from_pretrained("guillermoruiz/bilma_mx", trust_remote_code=True)
	```

	Now,we will need some text and then pass it through the tokenizer:
	```
	text = ["Vamos a comer [MASK].",
	"Hace mucho que no voy al [MASK]."]
	t = tok(text, padding="max_length", return_tensors="tf", max_length=280)
	```

	With this, we are ready to use the model
	```
	p = model(t)
	```

	Now, we get the most likely words with:
	```
	import tensorflow as tf
	tok.batch_decode(tf.argmax(p["logits"], 2)[:,1:], skip_special_tokens=True)
	```

	which produces the output:
	```
	['vamos a comer tacos.', 'hace mucho que no voy al gym.']
	```

	If you find this model useful for your research, please cite the following paper:
	```
	@misc{tellez2022regionalized,
	title={Regionalized models for Spanish language variations based on Twitter},
	author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
	year={2022},
	eprint={2110.06128},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```