Update README.md
README.md CHANGED
@@ -4,12 +4,17 @@ language:
 - bn
 metrics:
 - bleu
+- cer
+- wer
+- meteor
 library_name: transformers
 pipeline_tag: text2text-generation
+tags:
+- text-generation-inference
 ---
-#
+# Bengali Text Correction Overview:
 
-The goal of this project was to develop a software model that could fix grammatical and syntax errors in Bengali text. The approach was similar to how a language translator works, where the incorrect sentence is transformed into a correct one. We fine tune a pertained model, namely [mBart50] with a [dataset] of 1.M samples for 6500 steps and achieve a
+The goal of this project was to develop a model that can fix grammatical and syntax errors in Bengali text. The approach is similar to how a language translator works: the incorrect sentence is transformed into a correct one. We fine-tune a pretrained model, namely [mBart50](https://huggingface.co/facebook/mbart-large-50), with a [dataset](https://github.com/hishab-nlp/BNSECData) of 1.3M samples for 6500 steps and achieve a score of `{BLEU: 0.443, CER: 0.159, WER: 0.406, Meteor: 0.655}` when tested on unseen data. Clone or download this [repo](https://github.com/himisir/Bengali-Sentence-Error-Correction), run the `correction.py` script, type the sentence after the prompt, and you are all set.
 
 ## Initial Testing:
 
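For orientation, a minimal sketch of the correction step described in the overview above (the README's own, fuller example sits between this hunk and the next). The checkpoint path is a placeholder and the `AutoTokenizer`/`AutoModelForSeq2SeqLM` loading is an assumption, not necessarily the exact code in the repo:

```python
# Minimal sketch, not the repo's exact script: load a fine-tuned checkpoint
# (placeholder path) and correct one Bengali sentence.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "path/to/finetuned-mbart50-checkpoint"  # placeholder, not a real model ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, use_safetensors=True)  # per the note in the next hunk

incorrect_sentence = "আপনি কেমন আছো?"  # example input mixing formal pronoun with informal verb
inputs = tokenizer(incorrect_sentence, return_tensors="pt")
outputs = model.generate(**inputs, max_length=64)
correct_bengali_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(correct_bengali_sentence)  # expected output along the lines of the README's example: আপনি কেমন আছেন?
```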
@@ -67,12 +72,13 @@ correct_bengali_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True
 # আপনি কেমন আছেন?
 ```
 
+If you want to test this model from the terminal, run `python correction.py` and type the sentence after the prompt. You'll need the `transformers` library to run this script; install it with `pip install -q transformers[torch] -U`.
 
 #### Important note: Make sure you pass the `use_safetensors=True` parameter when loading the model.
 
 # General issues faced during the entire journey:
 
-- Issue: The system is not printing any evaluation
+- Issue: The system is not printing any evaluation output.
 Solution: The GPU that I am training on doesn't support FP16/BF16 precision. Commenting out `fp16=True` in the Seq2SeqTrainingArguments solved the issue.
 
 - Issue: Training on TPU crashes on both Colab and Kaggle.
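To make the fp16 fix above concrete, a minimal sketch of the relevant `Seq2SeqTrainingArguments`; every value other than the commented-out `fp16` flag is an illustrative assumption, not the configuration used for this model:

```python
# Sketch of the fp16 fix: on GPUs without FP16/BF16 support, leave fp16 disabled
# so evaluation runs and its output gets printed. All other values are illustrative.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="bn-correction-out",   # placeholder
    evaluation_strategy="steps",
    eval_steps=500,
    predict_with_generate=True,
    # fp16=True,  # commented out: the training GPU does not support FP16/BF16
)
```

If the GPU does support mixed precision, `fp16=True` can simply be turned back on.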
@@ -85,5 +91,6 @@ The model is clearly overfitting, and we can reduce that. My best guess is that
 I'm also planning to run a 4-bit quantization on the same model to see how it performs against the base model. It should be a fun experiment.
 
 ## Resources and References:
+
 [Dataset Source](https://github.com/hishab-nlp/BNSECData)
 [Model Documentation and Troubleshooting](https://huggingface.co/docs/transformers/model_doc/mbart)
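Regarding the planned 4-bit quantization experiment mentioned in this hunk, a minimal sketch using `BitsAndBytesConfig`; the tooling choice and the checkpoint path are assumptions, since the README does not say how the quantization will be done:

```python
# Sketch only: load the same checkpoint in 4-bit for comparison against the base model.
# Assumes a CUDA GPU plus the bitsandbytes and accelerate packages.
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = AutoModelForSeq2SeqLM.from_pretrained(
    "path/to/finetuned-mbart50-checkpoint",  # placeholder
    quantization_config=bnb_config,
    device_map="auto",
)
```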