---
datasets:
- cobrayyxx/FLEURS_ID-EN
language:
- id
- en
metrics:
- bleu
- chrf
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

## Model description

This model is a fine-tuned version of [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) on the Indonesian-English [CoVoST2](https://huggingface.co/datasets/cobrayyxx/COVOST2_ID-EN) dataset.

## Intended uses & limitations

This model predicts the English translation of an Indonesian transcription.

## How to Use

This is how to use the model with CTranslate2 (the same runtime used by Faster-Whisper).

1. Convert the model into the CTranslate2 format with float16 quantization.

```
!ct2-transformers-converter --model cobrayyxx/nllb-indo-en-covost2 --quantization float16 --output_dir ct2/ct2-nllb-indo-en-float16
```

2. Load the converted model using the `ctranslate2` library.

```
import os

import ctranslate2

ct2_model_name = "ct2-nllb-indo-en-float16"
ct_model_path = os.path.join("ct2", ct2_model_name)

device = "cuda"  # or "cpu"
translator = ctranslate2.Translator(ct_model_path, device=device)
```

3. Download the SentencePiece model.

```
!wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
```

4. Load the SentencePiece model.

```
import sentencepiece as spm

directory = "."  # directory where the SentencePiece model was downloaded
sp_model_path = os.path.join(directory, "flores200_sacrebleu_tokenizer_spm.model")
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)
```

5. Now, the loaded model can be used.
```
src_lang = "ind_Latn"
tgt_lang = "eng_Latn"

beam_size = 5

source_sentences = lst_of_sentences
source_sents = [sent.strip() for sent in source_sentences]
target_prefix = [[tgt_lang]] * len(source_sents)

# Tokenize the source sentences into subwords,
# then add the source-language token and the end-of-sentence token
source_sents_subworded = sp.encode_as_pieces(source_sents)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences
translations = translator.translate_batch(source_sents_subworded,
                                          batch_type="tokens",
                                          max_batch_size=2024,
                                          beam_size=beam_size,
                                          target_prefix=target_prefix)
translations = [translation.hypotheses[0] for translation in translations]

# Merge the subwords back into target sentences
# and strip the leading target-language token
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
```

Note: If you face a kernel error every time you run the code above, you have to install `nvidia-cublas` and `nvidia-cudnn`:

```
apt update
apt install libcudnn9-cuda-12
```

then install the libraries with pip and set `LD_LIBRARY_PATH`. [Read the documentation for more.](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#gpu)

```
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```

Special thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for her help in resolving this.
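The final post-processing step can be illustrated without loading the model: because the target-language token is forced as the decoding prefix, every decoded hypothesis starts with it, and slicing by `len(tgt_lang)` removes it. A minimal sketch with hypothetical decoded strings (the sentences are placeholders, not model output):

```python
tgt_lang = "eng_Latn"

# Hypothetical decoded hypotheses: each begins with the target-language
# token that was forced via target_prefix during decoding.
decoded = [
    "eng_Latn The weather is very nice today.",
    "eng_Latn I am going to the market.",
]

# Strip the leading language token and surrounding whitespace,
# mirroring the desubwording step above.
stripped = [sent[len(tgt_lang):].strip() for sent in decoded]
print(stripped)
# → ['The weather is very nice today.', 'I am going to the market.']
```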
## Training procedure

### Training Results

| Epoch | Training Loss | Validation Loss | BLEU |
|-------|---------------|-----------------|-----------|
| 1 | 0.119100 | 0.048539 | 60.267190 |
| 2 | 0.020900 | 0.044844 | 59.821654 |
| 3 | 0.014600 | 0.048637 | 60.185481 |
| 4 | 0.007200 | 0.052005 | 60.150045 |
| 5 | 0.005100 | 0.054909 | 59.938441 |
| 6 | 0.004500 | 0.056668 | 60.032409 |
| 7 | 0.003800 | 0.058903 | 60.176242 |
| 8 | 0.002900 | 0.059880 | 60.168394 |
| 9 | 0.002400 | 0.060914 | 60.280851 |

## Model Evaluation

The performance of the baseline and fine-tuned models was evaluated with the BLEU and chrF++ metrics on the validation dataset. The fine-tuned model shows a clear improvement over the baseline.

| Model | BLEU | chrF++ |
|------------|------:|-------:|
| Baseline | 50.91 | 68.10 |
| Fine-Tuned | 58.30 | 73.62 |

### Evaluation details

- BLEU: measures the overlap between predicted and reference text based on word n-grams.
- chrF++: uses character n-grams (plus word unigrams and bigrams), making it particularly suitable for morphologically rich languages.

# Credits

Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.