---
language:
- es
metrics:
- bleu
base_model:
- vgaraujov/bart-base-spanish
pipeline_tag: text2text-generation
library_name: transformers
tags:
- gec
- spanish
- seq2seq
- bart
- cows-l2h
---

This model has been trained on 80% of the COWS-L2H dataset plus 80,984 **synthetically generated** errorful sentences for grammatical error correction of Spanish text. The corpus was sentencized, so the model has been fine-tuned for **sentence-level correction**. It will likely not perform well on an entire paragraph; to correct a paragraph, sentencize the text and run the model on each sentence.

The synthetic data was generated from well-formed Spanish sentences using a rule-based algorithm. The code for synthetic generation is available in the GitHub repo for this project: https://github.com/SkitCon/synth_gec_es

BLEU: 0.794 on COWS-L2H
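The paragraph-correction workflow described above can be sketched as follows. This is a minimal example, not part of the released code: `sentencize` is a naive regex splitter (a proper sentencizer such as spaCy's Spanish pipeline is a better choice in practice), and `correct_sentence` is assumed to be any callable that corrects a single sentence, e.g. a wrapper around the model call shown in the usage example below.

```python
import re

def sentencize(paragraph):
    # Naive split after sentence-final punctuation; swap in a real
    # sentencizer (e.g. spaCy es_core_news_sm) for production use.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def correct_paragraph(paragraph, correct_sentence):
    # correct_sentence: callable taking one sentence and returning its
    # corrected form (e.g. tokenize -> model.generate -> decode).
    return " ".join(correct_sentence(s) for s in sentencize(paragraph))
```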

Example usage:

```python
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("SkitCon/gec-spanish-BARTO-SYNTHETIC")
model = BartForConditionalGeneration.from_pretrained("SkitCon/gec-spanish-BARTO-SYNTHETIC")

input_sentences = ["Yo va al tienda.", "Espero que tú ganas."]

# Pad so sentences of different lengths can be batched together
tokenized_text = tokenizer(input_sentences, return_tensors="pt", padding=True)

outputs = model.generate(
    input_ids=tokenized_text["input_ids"],
    attention_mask=tokenized_text["attention_mask"],
)

for sentence in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(sentence)
```