---
datasets:
- cobrayyxx/FLEURS_ID-EN
language:
- id
- en
metrics:
- bleu
- chrf
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---
## Model description
This model is a fine-tuned version of [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) on an Indonesian-English [CoVoST2](https://huggingface.co/datasets/cobrayyxx/COVOST2_ID-EN) dataset.
## Intended uses & limitations
This model translates Indonesian transcriptions into English text.
## How to Use
This is how to use the model with CTranslate2.
1. Convert the model into the CTranslate2 format with float16 quantization.
```bash
ct2-transformers-converter --model cobrayyxx/nllb-indo-en-covost2 --quantization float16 --output_dir ct2/ct2-nllb-indo-en-float16
```
2. Load the converted model using the `ctranslate2` library.
```python
import os

import ctranslate2

# Path to the converted model from step 1
ct2_model_name = "ct2-nllb-indo-en-float16"
ct_model_path = os.path.join("ct2", ct2_model_name)

# Use "cuda" if a GPU is available, otherwise "cpu"
device = "cuda"
translator = ctranslate2.Translator(ct_model_path, device=device)
```
3. Download the SentencePiece model
```bash
wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
```
4. Load the SentencePiece model
```python
import os

import sentencepiece as spm

# The tokenizer model was downloaded to the current directory in step 3
sp_model_path = os.path.join(".", "flores200_sacrebleu_tokenizer_spm.model")
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)
```
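The loaded processor splits raw text into subword pieces, and step 5 below wraps each piece sequence with the NLLB source-language token and an end-of-sentence marker. That wrapping convention can be sketched in isolation (the pieces here are hypothetical placeholders; real ones come from the tokenizer):

```python
def wrap_for_nllb(pieces, src_lang="ind_Latn"):
    # NLLB expects the source-language token first and "</s>" last
    return [src_lang] + pieces + ["</s>"]

# Hypothetical subword pieces, for illustration only
example_pieces = ["▁Selamat", "▁pagi"]
print(wrap_for_nllb(example_pieces))
# ['ind_Latn', '▁Selamat', '▁pagi', '</s>']
```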
5. Now the loaded model can be used for translation.
```python
src_lang = "ind_Latn"
tgt_lang = "eng_Latn"
beam_size = 5

# Sentences to translate (replace with your own)
source_sentences = ["Selamat pagi, apa kabar?"]
source_sents = [sent.strip() for sent in source_sentences]
target_prefix = [[tgt_lang]] * len(source_sents)

# Split the source sentences into subword pieces
source_sents_subworded = sp.encode_as_pieces(source_sents)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences
translations = translator.translate_batch(
    source_sents_subworded,
    batch_type="tokens",
    max_batch_size=2024,
    beam_size=beam_size,
    target_prefix=target_prefix,
)
translations = [translation.hypotheses[0] for translation in translations]

# Merge the subword pieces back into target sentences
translations_desubword = sp.decode(translations)
# Strip the leading target-language token from each sentence
translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
```
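The final post-processing line above strips the leading target-language token from each decoded sentence. Isolated as a small helper (a sketch, independent of the model), with a guard in case the token is absent:

```python
def strip_lang_token(decoded, tgt_lang="eng_Latn"):
    # Decoded NLLB output begins with the target-language token, e.g.
    # "eng_Latn Good morning" -> "Good morning"
    if decoded.startswith(tgt_lang):
        return decoded[len(tgt_lang):].strip()
    return decoded.strip()

print(strip_lang_token("eng_Latn Good morning, how are you?"))
# Good morning, how are you?
```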
Note: If you face a kernel error when running the code above, you may need to install the `nvidia-cublas` and `nvidia-cudnn` libraries.
```bash
apt update
apt install libcudnn9-cuda-12
```
Then install the libraries with pip. [Read the documentation for more details.](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#gpu)
```bash
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```
Special thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for her help in resolving this.
## Training procedure
### Training Results
| Epoch | Training Loss | Validation Loss | BLEU |
|-------|--------------|----------------|------|
| 1 | 0.119100 | 0.048539 | 60.267190 |
| 2 | 0.020900 | 0.044844 | 59.821654 |
| 3 | 0.014600 | 0.048637 | 60.185481 |
| 4 | 0.007200 | 0.052005 | 60.150045 |
| 5 | 0.005100 | 0.054909 | 59.938441 |
| 6 | 0.004500 | 0.056668 | 60.032409 |
| 7 | 0.003800 | 0.058903 | 60.176242 |
| 8 | 0.002900 | 0.059880 | 60.168394 |
| 9 | 0.002400 | 0.060914 | 60.280851 |
## Model Evaluation
The performance of the baseline and fine-tuned models was evaluated on the validation set using the BLEU and chrF++ metrics.
The fine-tuned model shows a clear improvement over the baseline.

| Model      |  BLEU | chrF++ |
|------------|------:|-------:|
| Baseline   | 50.91 |  68.10 |
| Fine-Tuned | 58.30 |  73.62 |
### Evaluation details
- BLEU: measures the n-gram overlap between predicted and reference text.
- chrF++: uses character n-grams (extended with word n-grams), making it particularly suitable for morphologically rich languages.
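To make the character n-gram idea concrete, here is a minimal pure-Python sketch of a chrF-style F-score. This is illustrative only: the scores above were computed with a standard implementation, and real chrF++ also mixes in word unigrams and bigrams.

```python
from collections import Counter

def char_ngrams(text, n):
    # Collect character n-grams, ignoring whitespace
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_fscore(hyp, ref, max_n=6, beta=2.0):
    # Average character n-gram precision and recall, combined into F-beta
    # (beta=2 weights recall twice as much as precision, as in chrF)
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p = sum(precisions) / max_n
    rc = sum(recalls) / max_n
    if p + rc == 0:
        return 0.0
    return (1 + beta**2) * p * rc / (beta**2 * p + rc)

print(chrf_fscore("hello world", "hello world"))
# 1.0
```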
## Credits
Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.