
RuPERTa: the Spanish RoBERTa 🎃 🇪🇸

RuPERTa-base (uncased) is a RoBERTa model trained on an uncased version of a big Spanish corpus. RoBERTa iterates on BERT's pretraining procedure: it trains the model longer, with bigger batches over more data; removes the next-sentence-prediction objective; trains on longer sequences; and dynamically changes the masking pattern applied to the training data. The architecture is the same as roberta-base:

roberta.base: RoBERTa using the BERT-base architecture (125M params)
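
A quick way to sanity-check the base checkpoint is to load it with the standard Auto classes and inspect the contextual embeddings (a minimal sketch; it assumes a recent transformers version where model outputs expose last_hidden_state):

from transformers import AutoTokenizer, AutoModel

# Load the pretrained RuPERTa-base checkpoint (uncased)
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base', do_lower_case=True)
model = AutoModel.from_pretrained('mrm8488/RuPERTa-base')

# Encode a Spanish sentence and get contextual embeddings
inputs = tokenizer("me encanta el procesamiento del lenguaje natural", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768]) for a base-sized model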

Benchmarks 🧾

WIP (I'm still working on it) 🚧

| Task/Dataset | F1 | Precision | Recall | Fine-tuned model | Reproduce it |
| --- | --- | --- | --- | --- | --- |
| POS | 97.39 | 97.47 | 97.32 | RuPERTa-base-finetuned-pos | Open In Colab |
| NER | 77.55 | 75.53 | 79.68 | RuPERTa-base-finetuned-ner | |
| SQUAD-es v1 | to-do | | | RuPERTa-base-finetuned-squadv1 | |
| SQUAD-es v2 | to-do | | | RuPERTa-base-finetuned-squadv2 | |

Model in action 🔨

Usage for POS and NER 🏷

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

id2label = {
    "0": "B-LOC",
    "1": "B-MISC",
    "2": "B-ORG",
    "3": "B-PER",
    "4": "I-LOC",
    "5": "I-MISC",
    "6": "I-ORG",
    "7": "I-PER",
    "8": "O"
}

tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')

text ="Julien, CEO de HF, nació en Francia."

input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)

outputs = model(input_ids)
last_hidden_states = outputs[0]

for m in last_hidden_states:
  for index, n in enumerate(m):
    if(index > 0 and index <= len(text.split(" "))):
      print(text.split(" ")[index-1] + ": " + id2label[str(torch.argmax(n).item())])

# Output:
'''
Julien,: I-PER
CEO: O
de: O
HF,: B-ORG
nació: I-PER
en: I-PER
Francia.: I-LOC
'''

For POS, just change the id2label dictionary and the model path to mrm8488/RuPERTa-base-finetuned-pos, as shown in the sketch below.
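
The same thing can be done with less boilerplate through the token-classification pipeline (a minimal sketch, assuming the fine-tuned POS checkpoint above and a transformers version that accepts the "token-classification" task name — older releases use "ner"; if the checkpoint's config does not carry label names, the labels print as LABEL_N):

from transformers import pipeline

# Token-classification pipeline with the POS fine-tuned checkpoint
nlp_pos = pipeline(
    "token-classification",
    model="mrm8488/RuPERTa-base-finetuned-pos",
    tokenizer="mrm8488/RuPERTa-base-finetuned-pos"
)

for token in nlp_pos("Julien, CEO de HF, nació en Francia."):
    print(token["word"], token["entity"])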

Fast usage for LM with pipelines 🧪

from transformers import AutoModelWithLMHead, AutoTokenizer

# Note: AutoModelWithLMHead is deprecated in recent transformers releases (AutoModelForMaskedLM is the current equivalent)
model = AutoModelWithLMHead.from_pretrained('mrm8488/RuPERTa-base')
tokenizer = AutoTokenizer.from_pretrained("mrm8488/RuPERTa-base", do_lower_case=True)

from transformers import pipeline

pipeline_fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

pipeline_fill_mask("España es un país muy <mask> en la UE")
[
  {
    "score": 0.1814306527376175,
    "sequence": "<s> españa es un país muy importante en la ue</s>",
    "token": 1560
  },
  {
    "score": 0.024842597544193268,
    "sequence": "<s> españa es un país muy fuerte en la ue</s>",
    "token": 2854
  },
  {
    "score": 0.02473250962793827,
    "sequence": "<s> españa es un país muy pequeño en la ue</s>",
    "token": 2948
  },
  {
    "score": 0.023991240188479424,
    "sequence": "<s> españa es un país muy antiguo en la ue</s>",
    "token": 5240
  },
  {
    "score": 0.0215945765376091,
    "sequence": "<s> españa es un país muy popular en la ue</s>",
    "token": 5782
  }
]
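
For newer transformers versions where AutoModelWithLMHead is no longer available, the same result can be obtained with AutoModelForMaskedLM (a minimal sketch; exact scores and token ids may vary slightly across library versions):

from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("mrm8488/RuPERTa-base", do_lower_case=True)
model = AutoModelForMaskedLM.from_pretrained("mrm8488/RuPERTa-base")

pipeline_fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Print only the predicted words and their scores
for prediction in pipeline_fill_mask("España es un país muy <mask> en la UE"):
    print(prediction["token_str"], round(prediction["score"], 4))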

Acknowledgments

I thank the 🤗/transformers team for answering my doubts and Google for helping me with the TensorFlow Research Cloud program.

Created by Manuel Romero/@mrm8488

Made with ♥ in Spain
