---
license: mit
language:
- es
metrics:
- accuracy
tags:
- code
- nlp
- custom
- bilma
tokenizer:
- yes
---

# BILMA (Bert In Latin aMericA)

Bilma is a BERT implementation in TensorFlow, trained on the Masked Language Model (MLM) task using the datasets described at https://sadit.github.io/regional-spanish-models-talk-2022/. The training data consists of regionalized Spanish short texts from the Twitter (now X) platform.

We provide pretrained models for Argentina, Chile, Colombia, Spain, Mexico, the United States, Uruguay, and Venezuela.

The accuracy of the models trained on the MLM task for the different regions is:

![bilma-mlm-comp](https://user-images.githubusercontent.com/392873/163045798-89bd45c5-b654-4f16-b3e2-5cf404e12ddd.png)
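
For readers unfamiliar with the metric, MLM accuracy is commonly computed as the fraction of masked positions where the top prediction matches the original token. The sketch below illustrates that definition on plain Python lists. The function name `masked_accuracy` and the toy token ids are illustrative assumptions; this is not the evaluation code behind the figure.

```python
def masked_accuracy(predicted_ids, original_ids, masked_positions):
    """Fraction of masked positions whose predicted token id equals the original one."""
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    hits = sum(1 for i in masked_positions if predicted_ids[i] == original_ids[i])
    return hits / len(masked_positions)


# Toy example (made-up ids): positions 2 and 4 were masked.
original = [5, 17, 42, 8, 99]
predicted = [5, 17, 40, 8, 99]
masked = [2, 4]

# Position 2 is wrong (40 vs 42), position 4 is right (99 vs 99).
print(masked_accuracy(predicted, original, masked))  # 0.5
```

Only the masked positions count toward the score, so tokens the model could simply copy from its input do not inflate it.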

# Prerequisites

You will need TensorFlow 2.4 or newer.
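
A quick way to confirm this requirement is to compare the installed version string against the minimum. The helper below is a minimal sketch: `meets_minimum` is a hypothetical name, and the parsing only looks at the leading major and minor numbers of the version string.

```python
import re


def meets_minimum(version, minimum=(2, 4)):
    """Return True if the leading major.minor of `version` is >= `minimum`."""
    parts = re.match(r"(\d+)\.(\d+)", version)
    if parts is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(parts.group(1)), int(parts.group(2))) >= minimum


# With TensorFlow installed, pass tf.__version__ instead of a literal string.
print(meets_minimum("2.4.0"))   # True
print(meets_minimum("2.3.1"))   # False
print(meets_minimum("2.15.0"))  # True
```

Comparing integer tuples rather than raw strings avoids ranking "2.15" below "2.4".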

# Quick guide

Install the following version of the transformers library (the leading `!` is for notebook cells; drop it when running in a terminal):

```
!pip install transformers==4.30.2
```

Instantiate the tokenizer and the trained model:

```
from transformers import AutoTokenizer
from transformers import TFAutoModel

tok = AutoTokenizer.from_pretrained("guillermoruiz/bilma_mx")
# trust_remote_code=True lets transformers run the custom model code shipped with the repository
model = TFAutoModel.from_pretrained("guillermoruiz/bilma_mx", trust_remote_code=True)
```

Next, we need some text, which we pass through the tokenizer:

```
text = ["Vamos a comer [MASK].",
        "Hace mucho que no voy al [MASK]."]
# Pad every text to 280 tokens and return TensorFlow tensors
t = tok(text, padding="max_length", return_tensors="tf", max_length=280)
```
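
To see where the `[MASK]` token ended up after tokenization, one can scan the token ids for the mask id. The helper below is a sketch on plain Python lists: `find_mask_positions` is a hypothetical name and the ids in the example are made up. With the real tokenizer, the inputs would presumably come from something like `t["input_ids"][0].numpy().tolist()` and `tok.convert_tokens_to_ids("[MASK]")`; whether this tokenizer maps `[MASK]` to a single id is an assumption.

```python
def find_mask_positions(input_ids, mask_id):
    """Return the indices at which `mask_id` occurs in a sequence of token ids."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_id]


# Toy example (made-up ids): 4 stands for the mask token, 0 for padding.
toy_ids = [2, 45, 7, 301, 4, 9, 3, 0, 0]
print(find_mask_positions(toy_ids, mask_id=4))  # [4]
```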

With the tokenized input `t`, we are ready to use the model:

```
p = model(t)
```

Now, we get the most likely words with:

```
import tensorflow as tf

# Take the highest-scoring token at each position, drop position 0, and decode back to text
tok.batch_decode(tf.argmax(p["logits"], 2)[:,1:], skip_special_tokens=True)
```

This produces the output:

```
['vamos a comer tacos.', 'hace mucho que no voy al gym.']
```
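
The `tf.argmax` call keeps only the single best token per position. To inspect several candidates for one position, the scores for that position can be ranked instead. The sketch below does this in plain Python for one row of scores. `top_k` is a hypothetical helper, the toy vocabulary and scores are made up, and the softmax step assumes the values under `p["logits"]` are unnormalized scores. With the real model, a row would be something like `p["logits"][0, pos].numpy()`, and the indices could be turned into tokens with `tok.convert_ids_to_tokens`.

```python
import math


def top_k(scores, k=3):
    """Return the k highest-scoring (index, probability) pairs, best first."""
    peak = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, exps[i] / total) for i in ranked[:k]]


# Toy example: four made-up vocabulary entries and their scores at one position.
toy_vocab = ["tacos", "pizza", "pan", "gym"]
toy_scores = [2.0, 1.0, 0.5, -1.0]

for idx, prob in top_k(toy_scores, k=2):
    print(toy_vocab[idx], round(prob, 3))
```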

If you find this model useful for your research, please cite the following paper:

```
@misc{tellez2022regionalized,
      title={Regionalized models for Spanish language variations based on Twitter},
      author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
      year={2022},
      eprint={2110.06128},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```