BlackKakapo committed on
Commit 9aff431 · 1 Parent(s): 01486c5

Update README.md

  1. README.md +77 -1
README.md CHANGED
@@ -1,3 +1,79 @@
  ---
- license: apache-2.0
+ annotations_creators: []
+ language:
+ - ro
+ language_creators:
+ - machine-generated
+ license:
+ - apache-2.0
+ multilinguality:
+ - monolingual
+ pretty_name: BlackKakapo/t5-base-paraphrase-ro
+ size_categories:
+ - 10K<n<100K
+ source_datasets:
+ - original
+ tags: []
+ task_categories:
+ - text2text-generation
+ task_ids: []
  ---
+ # Romanian paraphrase
+
+ ![v2.0](https://img.shields.io/badge/V.2-19.08.2022-brightgreen)
+
+ Fine-tuned t5-base-paraphrase-ro model for paraphrasing Romanian text. Because no Romanian paraphrase dataset was available, I created my own [dataset](https://huggingface.co/datasets/BlackKakapo/paraphrase-ro-v2), which contains ~30k examples.
+
+ ### How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ tokenizer = AutoTokenizer.from_pretrained("BlackKakapo/t5-base-paraphrase-ro-v2")
+ model = AutoModelForSeq2SeqLM.from_pretrained("BlackKakapo/t5-base-paraphrase-ro-v2")
+ ```
+
+ ### Or
+
+ ```python
+ from transformers import T5ForConditionalGeneration, T5TokenizerFast
+
+ model = T5ForConditionalGeneration.from_pretrained("BlackKakapo/t5-base-paraphrase-ro-v2")
+ tokenizer = T5TokenizerFast.from_pretrained("BlackKakapo/t5-base-paraphrase-ro-v2")
+ ```
+
+ ### Generate
+
+ ```python
+ text = "Într-un interviu pentru Radio Europa Liberă România, acesta a menționat că Bucureștiul este pregătit oricând și ar dura doar o oră de la solicitare, până când gazele ar ajunge la Chișinău."
+
+ # Tokenize the single input sentence; one example needs no padding.
+ encoding = tokenizer(text, return_tensors="pt")
+ input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
+
+ # Sample five candidate paraphrases with top-k / nucleus sampling.
+ outputs = model.generate(
+     input_ids=input_ids,
+     attention_mask=attention_mask,
+     do_sample=True,
+     max_length=256,
+     top_k=20,
+     top_p=0.9,
+     num_return_sequences=5,
+ )
+
+ final_outputs = []
+
+ for output in outputs:
+     text_para = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
+
+     # Keep a candidate only if it differs from the input and is not already collected.
+     if text_para.lower() != text.lower() and text_para not in final_outputs:
+         final_outputs.append(text_para)
+
+ print(final_outputs)
+ ```
+ ### Output
+
+ ```out
+ ['Într-un interviu cu Radio Europa Liberă România, el a spus că Bucureștiul este pregătit în orice moment și ar dura doar o oră de la cererea până când gazele ar ajunge la Chișinău.']
+ ```
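
The filtering step at the end of the Generate snippet (drop candidates that repeat the input, drop duplicates) can be isolated into a small helper. The sketch below is an illustration, not part of the committed README: `filter_paraphrases` is a hypothetical name, the case-insensitive comparison follows the `.lower()` calls in the original loop, and ignoring surrounding whitespace is an added assumption. The example strings are made up for the demonstration and are not model output.

```python
def filter_paraphrases(source, candidates):
    """Return candidates that differ from the source, without duplicates.

    Comparison is case-insensitive and ignores surrounding whitespace
    (an assumption for this sketch); the original order is preserved.
    """
    seen = {source.strip().lower()}
    kept = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    # Illustrative strings only, not actual model output.
    source = "Bucureștiul este pregătit oricând."
    candidates = [
        "Bucureștiul este pregătit oricând.",          # identical to the input
        "Bucureștiul este pregătit în orice moment.",
        "bucureștiul este pregătit în orice moment.",  # duplicate up to case
        "Capitala este gata în orice moment.",
    ]
    print(filter_paraphrases(source, candidates))
    # ['Bucureștiul este pregătit în orice moment.', 'Capitala este gata în orice moment.']
```

In the Generate snippet, a helper like this would be applied to the list of decoded sampling outputs before printing.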