---
datasets:
- cobrayyxx/COVOST2_ID-EN
language:
- id
- en
metrics:
- bleu
- chrf
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

## Model description

This model is a fine-tuned version of [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) on the Indonesian-English [CoVoST2](https://huggingface.co/datasets/cobrayyxx/COVOST2_ID-EN) dataset.

## Intended uses & limitations

This model translates Indonesian transcriptions into English.

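For a quick test, the model can also be loaded directly with the `transformers` library. This is a minimal sketch, assuming the standard NLLB generation pattern; the example sentence is illustrative:

```
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "cobrayyxx/nllb-indo-en-covost2"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="ind_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate one Indonesian sentence into English
inputs = tokenizer("Selamat pagi, apa kabar?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=128,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```
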
## How to Use

This is how to use the model with CTranslate2.

1. Convert the model into the CTranslate2 format with float16 quantization.

```
!ct2-transformers-converter --model cobrayyxx/nllb-indo-en-covost2 --quantization float16 --output_dir ct2/ct2-nllb-indo-en-float16
```

2. Load the converted model using the `ctranslate2` library.

```
import os

import ctranslate2

ct2_model_name = "ct2-nllb-indo-en-float16"
ct_model_path = os.path.join("ct2", ct2_model_name)

# Use "cuda" if a GPU is available, otherwise "cpu"
device = "cuda"
translator = ctranslate2.Translator(ct_model_path, device=device)
```

3. Download the SentencePiece model.

```
!wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
```

4. Load the SentencePiece model.

```
import sentencepiece as spm

# Path to the tokenizer downloaded in the previous step
sp_model_path = "flores200_sacrebleu_tokenizer_spm.model"

sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)
```

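To sanity-check the tokenizer, you can subword a sample sentence (the sentence is illustrative):

```
# Subword a sample Indonesian sentence and inspect the pieces
print(sp.encode_as_pieces("Selamat pagi, apa kabar?"))
```
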
5. Now, the loaded model can be used for translation.

```
src_lang = "ind_Latn"
tgt_lang = "eng_Latn"

beam_size = 5

# A list of Indonesian source sentences
# (define `lst_of_sentences` with your own data)
source_sentences = lst_of_sentences

source_sents = [sent.strip() for sent in source_sentences]
target_prefix = [[tgt_lang]] * len(source_sents)

# Tokenize the source sentences into subwords, then add the
# source language token and the end-of-sentence token
source_sents_subworded = sp.encode_as_pieces(source_sents)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences
translations = translator.translate_batch(
    source_sents_subworded,
    batch_type="tokens",
    max_batch_size=2024,
    beam_size=beam_size,
    target_prefix=target_prefix,
)
translations = [translation.hypotheses[0] for translation in translations]

# Detokenize the subwords and strip the target language token
translations_desubword = sp.decode(translations)
translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
```

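For example, assuming a small hypothetical input list, the snippet above can be run end to end like this:

```
# Hypothetical input: replace with your own Indonesian sentences
lst_of_sentences = [
    "Selamat pagi.",
    "Terima kasih banyak.",
]

# ... run the translation snippet above, then inspect the output
for src, tgt in zip(lst_of_sentences, translations_desubword):
    print(f"{src} -> {tgt}")
```
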
Note: If you face a kernel error every time you run the code above, you have to install `nvidia-cublas` and `nvidia-cudnn`:

```
apt update
apt install libcudnn9-cuda-12
```

and install the libraries using pip. [Read the documentation for more.](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#gpu)

```
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```

Special thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for her help in resolving this.

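To check that CTranslate2 can see the GPU after installing these libraries, a quick test is:

```
import ctranslate2

# Should print at least 1 when the CUDA libraries are set up correctly
print(ctranslate2.get_cuda_device_count())
```
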
## Training procedure

### Training Results

| Epoch | Training Loss | Validation Loss | BLEU |
|-------|---------------|-----------------|------|
| 1 | 0.119100 | 0.048539 | 60.267190 |
| 2 | 0.020900 | 0.044844 | 59.821654 |
| 3 | 0.014600 | 0.048637 | 60.185481 |
| 4 | 0.007200 | 0.052005 | 60.150045 |
| 5 | 0.005100 | 0.054909 | 59.938441 |
| 6 | 0.004500 | 0.056668 | 60.032409 |
| 7 | 0.003800 | 0.058903 | 60.176242 |
| 8 | 0.002900 | 0.059880 | 60.168394 |
| 9 | 0.002400 | 0.060914 | 60.280851 |

## Model Evaluation

The performance of the baseline and fine-tuned models was evaluated using the BLEU and chrF++ metrics on the validation dataset.
The fine-tuned model shows a clear improvement over the baseline.

| Model      | BLEU  | chrF++ |
|------------|------:|-------:|
| Baseline   | 50.91 | 68.10  |
| Fine-tuned | 58.30 | 73.62  |

### Evaluation details

- BLEU: Measures the overlap between predicted and reference text based on n-grams.
- chrF++: Uses character n-grams (plus word n-grams) for evaluation, making it particularly suitable for morphologically rich languages.

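Both metrics can be computed with the `sacrebleu` library; the hypotheses and references below are illustrative placeholders:

```
from sacrebleu.metrics import BLEU, CHRF

# Illustrative model outputs and matching references
hypotheses = ["Good morning, how are you?"]
references = [["Good morning, how are you doing?"]]

bleu = BLEU()
chrf = CHRF(word_order=2)  # word_order=2 corresponds to chrF++

print(bleu.corpus_score(hypotheses, references))
print(chrf.corpus_score(hypotheses, references))
```
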
## Credits

Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.