---
language:
  - es
metrics:
  - bleu
base_model:
  - vgaraujov/bart-base-spanish
pipeline_tag: text2text-generation
library_name: transformers
tags:
  - gec
  - spanish
  - seq2seq
  - bart
  - cows-l2h
---

This model was trained on 80% of the COWS-L2H dataset plus 80,984 synthetically generated erroneous sentences for grammatical error correction (GEC) of Spanish text. The corpus was split into sentences before training, so the model is fine-tuned for sentence-level correction and will likely perform poorly on a whole paragraph. To correct a paragraph, split it into sentences and run the model on each sentence.
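The paragraph workflow described above could be sketched roughly as follows. This is a minimal sketch, not part of the model's own code: `split_sentences`, `correct_paragraph`, and `make_corrector` are hypothetical helper names, the regex splitter is a naive assumption (it will mis-split abbreviations such as "Sr."), and the `max_new_tokens=128` limit is an arbitrary choice.

```python
import re
from typing import Callable, List


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter (assumption): break after ., !, ? or … followed by whitespace."""
    parts = re.split(r"(?<=[.!?…])\s+", paragraph.strip())
    return [part for part in parts if part]


def correct_paragraph(paragraph: str,
                      correct_batch: Callable[[List[str]], List[str]]) -> str:
    """Split a paragraph, correct each sentence, and join the results with spaces."""
    sentences = split_sentences(paragraph)
    if not sentences:
        return ""
    return " ".join(correct_batch(sentences))


def make_corrector(tokenizer, model, max_new_tokens: int = 128):
    """Wrap an already loaded tokenizer and model as a batch-correction callable."""
    def correct_batch(sentences: List[str]) -> List[str]:
        # Pad so sentences of different lengths fit into one batch tensor
        enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
        outputs = model.generate(
            input_ids=enc["input_ids"],
            attention_mask=enc["attention_mask"],
            max_new_tokens=max_new_tokens,
        )
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return correct_batch
```

With the tokenizer and model loaded as in the usage example, a call would look like `correct_paragraph(text, make_corrector(tokenizer, model))`. For real text, a proper Spanish sentence segmenter is probably a better choice than the regex used here.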

The synthetic data was generated from well-formed Spanish sentences by a rule-based algorithm. The code for synthetic generation is available in the GitHub repo for this project: https://github.com/SkitCon/synth_gec_es

BLEU: 0.794 on COWS-L2H
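To give a feel for what rule-based synthetic error generation can look like, here is a toy sketch. It is not the algorithm from the synth_gec_es repo: the single rule (swapping the gender of articles) and the names `ARTICLE_SWAPS` and `corrupt_articles` are hypothetical, chosen only to show how a well-formed sentence can be turned into an (erroneous, correct) training pair.

```python
import re

# Hypothetical toy rule: swap the grammatical gender of common articles
ARTICLE_SWAPS = {
    "el": "la", "la": "el",
    "un": "una", "una": "un",
    "los": "las", "las": "los",
}

# Match any of the articles above as a whole word, ignoring case
ARTICLE_PATTERN = re.compile(
    r"\b(" + "|".join(ARTICLE_SWAPS) + r")\b", flags=re.IGNORECASE
)


def corrupt_articles(sentence: str) -> str:
    """Return a copy of the sentence with every matched article gender-swapped."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = ARTICLE_SWAPS[word.lower()]
        # Preserve sentence-initial capitalization
        return replacement.capitalize() if word[0].isupper() else replacement
    return ARTICLE_PATTERN.sub(swap, sentence)


clean = "Voy a la tienda."
# (source, target) pair: erroneous input, well-formed reference
training_pair = (corrupt_articles(clean), clean)
```

A realistic generator would presumably combine many such rules and apply them selectively; see the linked repo for the actual implementation.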

Example usage:

from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("SkitCon/gec-spanish-BARTO-SYNTHETIC")
model = BartForConditionalGeneration.from_pretrained("SkitCon/gec-spanish-BARTO-SYNTHETIC")

input_sentences = ["Yo va al tienda.", "Espero que tú ganas."]

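# Pad so that sentences of different lengths can be batched into one tensor
tokenized_text = tokenizer(input_sentences, padding=True, return_tensors="pt")

# Keep the batch dimension: generate() expects tensors of shape (batch, seq_len)
input_ids = tokenized_text["input_ids"]
attention_mask = tokenized_text["attention_mask"]

outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask)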

for sentence in tokenizer.batch_decode(outputs, skip_special_tokens=True):
  print(sentence)