---
datasets:
  - cobrayyxx/FLEURS_ID-EN
language:
  - id
  - en
metrics:
  - bleu
  - chrf
base_model:
  - facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

Model description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M on the Indonesian-English portion of the CoVoST2 dataset.

Intended uses & limitations

This model translates Indonesian transcriptions into English.

How to Use

This is how to use the model with CTranslate2, the inference engine that also powers Faster-Whisper.

  1. Convert the model into the CTranslate2 format with float16 quantization.

    !ct2-transformers-converter --model cobrayyxx/nllb-indo-en-covost2 --quantization float16 --output_dir ct2/ct2-nllb-indo-en-float16
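
    The converter ships with the ctranslate2 pip package; transformers (plus torch and sentencepiece) is needed to read the original checkpoint. A minimal setup, assuming none of these are installed yet:

    pip install ctranslate2 transformers torch sentencepiece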
    
  2. Load the converted model using the ctranslate2 library.

     import os
     import ctranslate2

     ct2_model_name = "ct2-nllb-indo-en-float16"
     ct_model_path = os.path.join("ct2", ct2_model_name)

     device = "cuda"  # use "cpu" if no GPU is available
     translator = ctranslate2.Translator(ct_model_path, device=device)
    
  3. Download the SentencePiece model

    !wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
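
    If wget is not available, the same file can be downloaded from Python with the standard library; this is just an equivalent of the wget call above:

    import urllib.request

    url = "https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model"
    urllib.request.urlretrieve(url, "flores200_sacrebleu_tokenizer_spm.model")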
    
  4. Load the SentencePiece model

    import sentencepiece as spm
    
    sp_model_path = "flores200_sacrebleu_tokenizer_spm.model"  # path where the file was downloaded in the previous step
    
    sp = spm.SentencePieceProcessor()
    sp.load(sp_model_path)
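
    As an optional sanity check that the tokenizer loaded correctly, encode a sample sentence (the sentence itself is arbitrary):

    # Should print the list of subword pieces for the sample sentence
    print(sp.encode_as_pieces("Selamat pagi"))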
    
  5. Now, the loaded model can be used.

     src_lang = "ind_Latn"
     tgt_lang = "eng_Latn"
     
     beam_size = 5
     
     source_sentences = ["Halo, apa kabar?"]  # replace with your own list of Indonesian sentences
     
     source_sents = [sent.strip() for sent in source_sentences]
     target_prefix = [[tgt_lang]] * len(source_sents)
     
     # Encode the source sentences into subwords, then add the source language tag and an EOS token
     source_sents_subworded = sp.encode_as_pieces(source_sents)
     source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]
     
     # Translate the source sentences
     translations = translator.translate_batch(source_sents_subworded,
                                               batch_type="tokens",
                                               max_batch_size=2024,
                                               beam_size=beam_size,
                                               target_prefix=target_prefix)
     translations = [translation.hypotheses[0] for translation in translations]
     
     # Decode the subwords back into sentences and strip the leading target language tag
     translations_desubword = sp.decode(translations)
     translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
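
     The decoded translations line up one-to-one with the inputs, so the results can be inspected, for example, like this:

     # Print each source sentence next to its English translation
     for src, hyp in zip(source_sents, translations_desubword):
         print(f"{src} -> {hyp}")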
    

    Note: If you face a kernel error every time you run the code above, you need to install the NVIDIA cuBLAS and cuDNN libraries:

    apt update
    apt install libcudnn9-cuda-12
    

    or install the libraries with pip and point LD_LIBRARY_PATH at them. Read the CTranslate2 documentation for more details.

    pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
    
    export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
    

    Special thanks to Yasmin Moslem for her help in resolving this.

Training procedure

Training Results

| Epoch | Training Loss | Validation Loss | BLEU |
|------:|--------------:|----------------:|-----:|
| 1 | 0.119100 | 0.048539 | 60.267190 |
| 2 | 0.020900 | 0.044844 | 59.821654 |
| 3 | 0.014600 | 0.048637 | 60.185481 |
| 4 | 0.007200 | 0.052005 | 60.150045 |
| 5 | 0.005100 | 0.054909 | 59.938441 |
| 6 | 0.004500 | 0.056668 | 60.032409 |
| 7 | 0.003800 | 0.058903 | 60.176242 |
| 8 | 0.002900 | 0.059880 | 60.168394 |
| 9 | 0.002400 | 0.060914 | 60.280851 |

Model Evaluation

The performance of the baseline and fine-tuned models was evaluated using the BLEU and chrF++ metrics on the validation dataset. The fine-tuned model improves on the baseline by about 7.4 BLEU and 5.5 chrF++ points.

| Model | BLEU | chrF++ |
|-------|-----:|-------:|
| Baseline | 50.91 | 68.1 |
| Fine-Tuned | 58.3 | 73.62 |

Evaluation details

  • BLEU: measures the overlap between predicted and reference text based on word n-grams.
  • chrF++: uses character n-grams (plus word unigrams and bigrams) for evaluation, making it particularly suitable for morphologically rich languages.
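
The card does not say which tool produced these scores; the sketch below shows how BLEU and chrF++ are commonly computed with the sacrebleu library, using placeholder hypothesis and reference lists:

    import sacrebleu

    # Placeholder data: model outputs and the matching reference translations
    hypotheses = ["Good morning, how are you?"]
    references = [["Good morning, how are you?"]]  # one inner list per reference set

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 selects chrF++
    print(f"BLEU: {bleu.score:.2f}  chrF++: {chrf.score:.2f}")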

Credits

Huge thanks to Yasmin Moslem for mentoring me.