---
datasets:
- cobrayyxx/FLEURS_ID-EN
language:
- id
- en
metrics:
- bleu
- chrf
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: translation
---
## Model description
This model is a fine-tuned version of [facebook/nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) on an Indonesian-English [CoVoST2](https://huggingface.co/datasets/cobrayyxx/COVOST2_ID-EN) dataset.
## Intended uses & limitations
This model translates Indonesian transcriptions into English text.
## How to Use
This is how to use the model with CTranslate2.
1. Convert the model into the CTranslate2 format with float16 quantization.
```bash
ct2-transformers-converter --model cobrayyxx/nllb-indo-en-covost2 --quantization float16 --output_dir ct2/ct2-nllb-indo-en-float16
```
2. Load the converted model using the `ctranslate2` library.
```python
import os

import ctranslate2

# Path to the converted model from step 1
ct2_model_name = "ct2-nllb-indo-en-float16"
ct_model_path = os.path.join("ct2", ct2_model_name)

# Use "cuda" if a GPU is available, otherwise "cpu"
device = "cuda"
translator = ctranslate2.Translator(ct_model_path, device=device)
```
3. Download the SentencePiece model
```bash
wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
```
4. Load the SentencePiece model
```python
import os

import sentencepiece as spm

# The tokenizer model was downloaded to the current directory in step 3
sp_model_path = os.path.join(".", "flores200_sacrebleu_tokenizer_spm.model")
sp = spm.SentencePieceProcessor()
sp.load(sp_model_path)
```
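The loaded processor splits raw text into subword pieces, and step 5 below wraps each piece sequence with the NLLB source-language token and an end-of-sentence marker. That wrapping convention can be sketched in isolation (the pieces here are hypothetical placeholders; real ones come from the tokenizer):

```python
def wrap_for_nllb(pieces, src_lang="ind_Latn"):
    # NLLB expects the source-language token first and "</s>" last
    return [src_lang] + pieces + ["</s>"]

# Hypothetical subword pieces, for illustration only
example_pieces = ["▁Selamat", "▁pagi"]
print(wrap_for_nllb(example_pieces))
# ['ind_Latn', '▁Selamat', '▁pagi', '</s>']
```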
5. Now the loaded model can be used for translation.
```python
src_lang = "ind_Latn"
tgt_lang = "eng_Latn"
beam_size = 5

# Sentences to translate (replace with your own)
source_sentences = ["Selamat pagi, apa kabar?"]
source_sents = [sent.strip() for sent in source_sentences]
target_prefix = [[tgt_lang]] * len(source_sents)

# Split the source sentences into subword pieces
source_sents_subworded = sp.encode_as_pieces(source_sents)
source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]

# Translate the source sentences
translations = translator.translate_batch(
    source_sents_subworded,
    batch_type="tokens",
    max_batch_size=2024,
    beam_size=beam_size,
    target_prefix=target_prefix,
)
translations = [translation.hypotheses[0] for translation in translations]

# Merge the subword pieces back into target sentences
translations_desubword = sp.decode(translations)
# Strip the leading target-language token from each sentence
translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
```
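The final post-processing line above strips the leading target-language token from each decoded sentence. Isolated as a small helper (a sketch, independent of the model), with a guard in case the token is absent:

```python
def strip_lang_token(decoded, tgt_lang="eng_Latn"):
    # Decoded NLLB output begins with the target-language token, e.g.
    # "eng_Latn Good morning" -> "Good morning"
    if decoded.startswith(tgt_lang):
        return decoded[len(tgt_lang):].strip()
    return decoded.strip()

print(strip_lang_token("eng_Latn Good morning, how are you?"))
# Good morning, how are you?
```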
Note: If you face a kernel error when running the code above, you may need to install the `nvidia-cublas` and `nvidia-cudnn` libraries.
```bash
apt update
apt install libcudnn9-cuda-12
```
Then install the libraries with pip. [Read the documentation for more details.](https://github.com/SYSTRAN/faster-whisper?tab=readme-ov-file#gpu)
```bash
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```
Special thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for her help in resolving this.
## Training procedure
### Training Results
| Epoch | Training Loss | Validation Loss | BLEU |
|-------|--------------|----------------|------|
| 1 | 0.119100 | 0.048539 | 60.267190 |
| 2 | 0.020900 | 0.044844 | 59.821654 |
| 3 | 0.014600 | 0.048637 | 60.185481 |
| 4 | 0.007200 | 0.052005 | 60.150045 |
| 5 | 0.005100 | 0.054909 | 59.938441 |
| 6 | 0.004500 | 0.056668 | 60.032409 |
| 7 | 0.003800 | 0.058903 | 60.176242 |
| 8 | 0.002900 | 0.059880 | 60.168394 |
| 9 | 0.002400 | 0.060914 | 60.280851 |
## Model Evaluation
The performance of the baseline and fine-tuned models was evaluated on the validation set using the BLEU and chrF++ metrics.
The fine-tuned model shows a clear improvement over the baseline.

| Model      |  BLEU | chrF++ |
|------------|------:|-------:|
| Baseline   | 50.91 |  68.10 |
| Fine-Tuned | 58.30 |  73.62 |
### Evaluation details
- BLEU: measures the n-gram overlap between predicted and reference text.
- chrF++: uses character n-grams (extended with word n-grams), making it particularly suitable for morphologically rich languages.
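To make the character n-gram idea concrete, here is a minimal pure-Python sketch of a chrF-style F-score. This is illustrative only: the scores above were computed with a standard implementation, and real chrF++ also mixes in word unigrams and bigrams.

```python
from collections import Counter

def char_ngrams(text, n):
    # Collect character n-grams, ignoring whitespace
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf_fscore(hyp, ref, max_n=6, beta=2.0):
    # Average character n-gram precision and recall, combined into F-beta
    # (beta=2 weights recall twice as much as precision, as in chrF)
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        overlap = sum((h & r).values())
        precisions.append(overlap / max(sum(h.values()), 1))
        recalls.append(overlap / max(sum(r.values()), 1))
    p = sum(precisions) / max_n
    rc = sum(recalls) / max_n
    if p + rc == 0:
        return 0.0
    return (1 + beta**2) * p * rc / (beta**2 * p + rc)

print(chrf_fscore("hello world", "hello world"))
# 1.0
```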
## Credits
Huge thanks to [Yasmin Moslem](https://huggingface.co/ymoslem) for mentoring me.