DeathReaper0965 committed · Commit 15155c0 · 1 Parent(s): bdfd0e9
Add Usage and datasets
README.md CHANGED
@@ -1,17 +1,106 @@
---
datasets:
- google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction
- common-crawl

language:
- en

license: mit

widget:
- text: Do you even know why I always need changed our checking account number
  example_title: Example 1
- text: Ironman and Captain America is going out
  example_title: Example 2
- text: We all eat fish and then made dessert
  example_title: Example 3
- text: We have our Dinner yesterday
  example_title: Example 4

tags:
- context-correction
- error-correction
---

# T5 Context Corrector (base-sized)
`t5-context-corrector` is a [T5 model](https://huggingface.co/t5-base) fine-tuned on the [Synthetic GEC](https://github.com/google-research-datasets/C4_200M-synthetic-dataset-for-grammatical-error-correction) dataset and on filtered English CommonCrawl data.
The base model (T5) is pre-trained on the C4 (Colossal Clean Crawled Corpus) dataset and supports numerous downstream tasks. <br>
This model is fine-tuned for a single downstream task, context correction, using the two datasets mentioned above.

## Model description
This model has the same architecture as its base model: about 220 million parameters, 12 encoder blocks and 12 decoder blocks, and a vocabulary (input embedding table) of 32,128 tokens. Please refer to the [T5 paper](https://arxiv.org/pdf/1910.10683.pdf) for more details on the model.

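As a rough sanity check on the 220 million figure, the parameter count can be reproduced with plain arithmetic. The sketch below assumes the hyperparameters of the public `t5-base` configuration (hidden size 768, feed-forward size 3072, 12 heads of size 64, 32 relative-position buckets). None of these are stated in this model card. It also assumes the output projection shares its weights with the input embedding.

```python
# Assumed t5-base hyperparameters (taken from the public t5-base config,
# not from this model card).
VOCAB_SIZE = 32128
D_MODEL = 768
D_FF = 3072
NUM_HEADS = 12
D_KV = 64
NUM_LAYERS = 12      # per stack: 12 encoder blocks and 12 decoder blocks
REL_BUCKETS = 32     # relative attention bias buckets (first block of each stack)


def attention_params():
    # Query, key, value and output projections, all without bias terms.
    return 4 * D_MODEL * (NUM_HEADS * D_KV)


def feed_forward_params():
    # Two bias-free linear layers: d_model -> d_ff -> d_model.
    return 2 * D_MODEL * D_FF


def layer_norm_params():
    # T5 layer norm has a scale vector only.
    return D_MODEL


def encoder_params():
    per_block = attention_params() + feed_forward_params() + 2 * layer_norm_params()
    return NUM_LAYERS * per_block + REL_BUCKETS * NUM_HEADS + layer_norm_params()


def decoder_params():
    # Each decoder block adds a cross-attention sublayer and its layer norm.
    per_block = 2 * attention_params() + feed_forward_params() + 3 * layer_norm_params()
    return NUM_LAYERS * per_block + REL_BUCKETS * NUM_HEADS + layer_norm_params()


def total_params():
    shared_embedding = VOCAB_SIZE * D_MODEL
    return shared_embedding + encoder_params() + decoder_params()


print(f"{total_params():,}")  # 222,903,552
```

Under these assumptions the total is roughly 223 million, consistent with the "220 million" figure quoted above.
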
## Intended Use & Limitations
The model is intended to correct the context of a given sentence: pass in a contextually incorrect sentence and get the corrected sentence back.<br>
Based on multiple experiments performed during training, we observe that the model works best when the input contains fewer than 256 tokens.<br>
So, if you have a long paragraph that needs context correction, we suggest first splitting it into sentences and running the context corrector on each sentence separately for best results.

Note that the model is trained primarily on general, publicly available corpora, so it may not work well for medical contexts.

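The splitting advice above can be sketched without any model dependencies. This is a minimal illustration, not the card's own method: `split_sentences`, `rough_token_count` and `partition_by_length` are hypothetical helpers, the regex splitter is much cruder than a real sentence tokenizer, and counting whitespace-separated words is only a stand-in for the model's actual subword tokenizer, which usually produces more tokens.

```python
import re

MAX_TOKENS = 256  # limit mentioned in the text above


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def rough_token_count(sentence):
    """Whitespace word count; an approximation, not the model's tokenizer."""
    return len(sentence.split())


def partition_by_length(sentences, count_tokens=rough_token_count, limit=MAX_TOKENS):
    """Separate sentences that fit under the limit from those that do not."""
    ok, too_long = [], []
    for sentence in sentences:
        if count_tokens(sentence) < limit:
            ok.append(sentence)
        else:
            too_long.append(sentence)
    return ok, too_long


paragraph = ("Do you even know why I always need changed our checking "
             "account number. If not let me know")
ok, too_long = partition_by_length(split_sentences(paragraph))
print(len(ok), len(too_long))
```

Passing the real tokenizer's length function as `count_tokens` would make the check exact; sentences that land in `too_long` may need further splitting before correction.
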
## Usage

You can use this model directly with a pipeline for text-to-text generation:

```python
from transformers import pipeline

# T5 is an encoder-decoder model, so it needs the text2text-generation task
ctx_corr = pipeline("text2text-generation", model="DeathReaper0965/t5-context-corrector")
ctx_corr("Do you even know why I always need changed our checking account number")

###########OUTPUT###########

# [{'generated_text': 'Do you even know why I always need to change our checking account number?'}]
```

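The pipeline returns a list of dictionaries, as the output above shows. A small, hypothetical helper such as `generated_texts` (not part of `transformers`) can pull out just the corrected strings. The sketch uses the sample output from above rather than calling the model.

```python
def generated_texts(pipeline_output):
    """Collect the 'generated_text' strings from pipeline-style output."""
    return [item["generated_text"].strip() for item in pipeline_output]


# Sample output copied from the example above; no model call is made here.
sample_output = [
    {"generated_text": "Do you even know why I always need to change our checking account number?"}
]

print(generated_texts(sample_output)[0])
```
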
Alternatively, you can load the model and tokenizer directly and correct a paragraph sentence by sentence:

```python
import nltk
from nltk import sent_tokenize

from transformers import T5ForConditionalGeneration, T5Tokenizer

# sent_tokenize needs the NLTK Punkt data (newer NLTK releases name it "punkt_tab")
nltk.download("punkt")

# Load model and tokenizer
cc_tokenizer = T5Tokenizer.from_pretrained("DeathReaper0965/t5-context-corrector")
cc_model = T5ForConditionalGeneration.from_pretrained("DeathReaper0965/t5-context-corrector")

# Utility function to correct context
def correct_context(input_text, temperature=0.5):
    # tokenize (accepts a single string or a list of sentences)
    batch = cc_tokenizer(input_text,
                         truncation=True,
                         padding='max_length',
                         max_length=256,
                         return_tensors="pt")

    # generate with beam-search sampling; returns a tensor of output token ids
    results = cc_model.generate(**batch,
                                max_length=256,
                                num_beams=3,
                                no_repeat_ngram_size=2,
                                repetition_penalty=2.5,
                                temperature=temperature,
                                do_sample=True)

    return results

# Utility function to split a paragraph into sentences and correct each one
def split_and_correct_context(sent):
    sents = sent_tokenize(sent)

    # decode the generated token ids back into text
    final_sents = cc_tokenizer.batch_decode(correct_context(sents),
                                            clean_up_tokenization_spaces=True,
                                            skip_special_tokens=True)

    final_sents = " ".join(s.strip() for s in final_sents)

    return final_sents


split_and_correct_context("Do you even know why I always need changed our checking account number. If not let me know")

###########OUTPUT###########
# (sampling is enabled, so the exact output may vary between runs)
# 'Do you even know why I always need to change our checking account number? If not, let me know.'
```
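
To see what the corrector changed, the input and output can be compared word by word. The sketch below uses Python's standard `difflib` module. `word_changes` is a hypothetical helper, and the two strings are the example input and output from the pipeline example above, not a fresh model run.

```python
import difflib


def word_changes(original, corrected):
    """Return (operation, original_words, corrected_words) for each changed span."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


original = "Do you even know why I always need changed our checking account number"
corrected = "Do you even know why I always need to change our checking account number?"

for operation, before, after in word_changes(original, corrected):
    print(f"{operation}: {before!r} -> {after!r}")
```

Because the comparison works on whitespace-separated words, a punctuation fix such as an added question mark shows up as a replacement of the whole word it is attached to.
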