asif00 committed (verified)
Commit 2490f3d · Parent(s): 591b78f

Update README.md

Files changed (1):
  1. README.md +5 -5
README.md CHANGED
@@ -22,13 +22,13 @@ The goal of this project was to develop a software model that could fix grammati
 
 ## Initial Testing:
 
-In the beginning, I experimented with several T5 models (mt5(small/base), Flan T5, and Bengali T5), but the results were not very good. I was limited by computational resources, which allowed me to train these models with only 10% of the data for 1 to 2 epochs. With this limited testing the result wasn't promising enough to invest all the available resources to the limit therefore I explored other models and found a winner model that is well suited for this task.
+In the beginning, I experimented with several T5 models (mt5 (small/base), Flan T5, and Bengali T5), but the results were not very good. I was limited by computational resources, which allowed me to train these models with only 10% of the data for 1 to 2 epochs. With this limited testing the result wasn't promising enough to invest all the available resources to the limit therefore I explored other models and found a winner model that is well suited for this task.
 
 I also tested casual large models like Mistral 7B and Gemma 2B, and even with optimizations like QLoRa, they were too large and costly to run.
 
 During the initial testing, I tried training the same models with different token lengths, a maximum token length of 20 provided much better results than 64. The current model has a maximum token length of 32.
 
-Beyond Seq2Seq models and approach a few other ideas also crossed my mind. Other methods considered included using NER (Named Entity Recognition) to tag words as correct or incorrect, and masked models that focused on correcting one wrong word at a time. Both methods required knowing the errors in advance or making multiple calls to get a final verdict, which was not practical to say. There are other solutions too that don't use ML at all. Approaches like running each word against a reference list and replacing them when there's no hit. Attempts to replace each word based on a reference list worked somewhat like a spell checker which wasn't the goal.
+Beyond Seq2Seq models and approaches, a few other ideas also crossed my mind. Other methods considered included using NER (Named Entity Recognition) to tag words as correct or incorrect, and masked models that focused on correcting one wrong word at a time. Both methods required knowing the errors in advance or making multiple calls to get a final verdict, which was not practical to say. There are other solutions too that don't use ML at all. Approaches like running each word against a reference list and replacing them when there's no hit. Attempts to replace each word based on a reference list worked somewhat like a spell checker which wasn't the goal.
 
 Ultimately, mBART 50 was chosen as the best model because of its flexibility, resource efficiency, and reproducibility.
 
@@ -63,10 +63,10 @@ Here is a simple way to use the fine-tuned model to correct Bengali sentences:
 If you are trying to use it on a script, this is how can do It:
 
 ```python
-from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+from transformers import AutoModelForSeq2SeqLM, MBart50Tokenizer
 
 checkpoint = "model/checkpoint"
-tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="bn_IN", tgt_lang="bn_IN", use_fast=True)
+tokenizer = MBart50Tokenizer.from_pretrained(checkpoint, src_lang="bn_IN", tgt_lang="bn_IN", use_fast=True)
 model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, use_safetensors =True)
 
 incorrect_bengali_sentence = "আপনি কমন আছেন?"
@@ -82,7 +82,7 @@ If you want to test this model from the terminal, run the `python correction.py`
 
 # General issues faced during the entire journey:
 
-- Issue: The system is not printing any evaluation function.
+- Issue: The system is not printing any evaluation functions.
 Solution: The GPU that I am training on doesn't support FP16/BF16 precision. Commenting out `fp16 =True` in the Seq2SeqTrainingArguments solved the issue.
 
 - Issue: Training on TPU crashes on both Colab and Kaggle.
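
Note: the usage snippet in the `@@ -63,10 +63,10 @@` hunk above is cut off at the hunk boundary before the generation step. A minimal, self-contained sketch of how the updated snippet might be completed is shown below; the tokenization, `generate`, and `batch_decode` lines, the `forced_bos_token_id` setting, and the `max_length=32` value (borrowed from the README's note that the current model uses a maximum token length of 32) are assumptions rather than lines taken from the repository.

```python
from transformers import AutoModelForSeq2SeqLM, MBart50Tokenizer

# Checkpoint path as written in the README (a local placeholder, not a Hub model id)
checkpoint = "model/checkpoint"

# mBART-50 uses language codes; Bengali is "bn_IN"
tokenizer = MBart50Tokenizer.from_pretrained(checkpoint, src_lang="bn_IN", tgt_lang="bn_IN")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, use_safetensors=True)

incorrect_bengali_sentence = "আপনি কমন আছেন?"

# Assumed completion of the snippet: tokenize, generate, and decode the correction
inputs = tokenizer(incorrect_bengali_sentence, return_tensors="pt", truncation=True, max_length=32)
outputs = model.generate(
    **inputs,
    max_length=32,  # the README notes the current model uses a maximum token length of 32
    forced_bos_token_id=tokenizer.lang_code_to_id["bn_IN"],  # keep generation in Bengali
)
corrected_sentence = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(corrected_sentence)
```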
 
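Likewise, for the FP16 issue in the `@@ -82,7 +82,7 @@` hunk, a minimal sketch of what the described workaround might look like in the training configuration is shown below; apart from the commented-out `fp16` line, every argument value is an illustrative assumption, not taken from the repository.

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative arguments; only the commented-out fp16 line reflects the workaround above
training_args = Seq2SeqTrainingArguments(
    output_dir="mbart50-bn-gec",      # hypothetical output directory
    per_device_train_batch_size=8,    # assumed batch size
    num_train_epochs=1,               # assumed epoch count
    predict_with_generate=True,       # decode generated tokens during evaluation
    # fp16=True,  # commented out: the training GPU does not support FP16/BF16
)
```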