	---
	license: apache-2.0
	datasets:
	- Helsinki-NLP/opus_paracrawl
	- turuta/Multi30k-uk
	language:
	- uk
	- en
	metrics:
	- bleu
	library_name: peft
	pipeline_tag: text-generation
	base_model: mistralai/Mistral-7B-v0.1
	tags:
	- translation
	model-index:
	- name: Dragoman
	  results:
	  - task:
	      type: translation # Required. Example: automatic-speech-recognition
	      name: English-Ukrainian Translation # Optional. Example: Speech Recognition
	    dataset:
	      type: facebook/flores # Required. Example: common_voice. Use dataset id from https://hf.co/datasets
	      name: FLORES-101 # Required. A pretty name for the dataset. Example: Common Voice (French)
	      config: eng_Latn-ukr_Cyrl # Optional. The name of the dataset configuration used in `load_dataset()`. Example: fr in `load_dataset("common_voice", "fr")`. See the `datasets` docs for more info: https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset.name
	      split: devtest # Optional. Example: test
	    metrics:
	    - type: bleu # Required. Example: wer. Use metric id from https://hf.co/metrics
	      value: 32.34 # Required. Example: 20.90
	      name: Test BLEU # Optional. Example: Test WER
	widget:
	- text: "[INST] who holds this neighborhood? [/INST]"
	---

	# Dragoman: English-Ukrainian Machine Translation Model

	## Model Description

	Dragoman is a sentence-level, state-of-the-art (SOTA) English-Ukrainian translation model. It is trained with a two-phase pipeline: pretraining on a cleaned [Paracrawl](https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl) dataset, followed by an unsupervised data selection phase on [turuta/Multi30k-uk](https://huggingface.co/datasets/turuta/Multi30k-uk).

	With this two-phase data cleaning and data selection approach, we achieved SOTA performance on the FLORES-101 English-Ukrainian devtest subset, with a BLEU score of `32.34`.


	## Model Details

	- Developed by: Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov
	- Model type: Translation model
	- Language(s):
	  - Source Language: English
	  - Target Language: Ukrainian
	- License: Apache 2.0

	## Model Use Cases

	We designed this model for sentence-level English -> Ukrainian translation.
	Performance on multi-sentence texts is not guaranteed.
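
Because the model targets single sentences, one way to handle longer text is to split it into sentences and wrap each one in the `[INST] ... [/INST]` prompt format before translating. The following is a minimal sketch, not part of the Dragoman codebase: the helper names `split_sentences`, `build_prompt`, and `build_prompts` are hypothetical, and the regex splitter is a naive assumption that will mishandle abbreviations such as "e.g.".

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# This is a naive heuristic; a dedicated sentence segmenter is more robust.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences using the naive rule above."""
    return [s.strip() for s in _SENTENCE_END.split(text.strip()) if s.strip()]


def build_prompt(sentence: str) -> str:
    """Wrap a single sentence in the prompt format shown in this model card."""
    return f"[INST] {sentence.strip()} [/INST]"


def build_prompts(text: str) -> list[str]:
    """Produce one prompt per sentence, to be translated independently."""
    return [build_prompt(s) for s in split_sentences(text)]


if __name__ == "__main__":
    for prompt in build_prompts("Who holds this neighborhood? I do not know."):
        print(prompt)
```

Each resulting prompt can then be tokenized and passed to `model.generate` separately. Translating sentences in isolation loses cross-sentence context, so this is a workaround rather than a guarantee of quality on longer texts.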


	### Running the model


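	```python
	# pip install bitsandbytes transformers peft torch
	import torch
	from peft import PeftConfig, PeftModel
	from transformers import AutoTokenizer, BitsAndBytesConfig, MistralForCausalLM

	# Adapter configuration (useful for inspection; not required by the calls below)
	config = PeftConfig.from_pretrained("lang-uk/dragoman")
	# Load the base model in 4-bit NF4 precision to reduce GPU memory use
	quant_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_compute_dtype=torch.float16,
	    bnb_4bit_use_double_quant=False,
	)

	model = MistralForCausalLM.from_pretrained(
	    "mistralai/Mistral-7B-v0.1", quantization_config=quant_config
	)
	# Attach the Dragoman PEFT adapter to the quantized base model
	model = PeftModel.from_pretrained(model, "lang-uk/dragoman").to("cuda")
	tokenizer = AutoTokenizer.from_pretrained(
	    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
	)

	input_text = "[INST] who holds this neighborhood? [/INST]"  # model input should adhere to this format
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	# max_new_tokens=128 is an assumed limit, not a value from the authors;
	# without it, the default generation length can cut the translation short.
	outputs = model.generate(**input_ids, max_new_tokens=128)
	print(tokenizer.decode(outputs[0]))
	```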
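
The decoded string from a causal language model normally contains the prompt followed by the generated continuation and an end-of-sequence marker. As a hedged sketch of post-processing, the hypothetical helper `extract_translation` below keeps only the text after the last `[/INST]` marker and drops the `</s>` token used by the Mistral tokenizer. The sample string in the usage example is made up for illustration and is not real model output.

```python
def extract_translation(decoded: str, eos_token: str = "</s>") -> str:
    """Return only the generated continuation from a decoded output string.

    Hypothetical helper: assumes the prompt ends with the '[/INST]' marker
    and that generation ends with `eos_token`.
    """
    # Keep everything after the last '[/INST]'; if the marker is absent,
    # rpartition leaves the whole string in the third element.
    _, _, completion = decoded.rpartition("[/INST]")
    # Discard the end-of-sequence token and anything after it.
    completion = completion.split(eos_token, 1)[0]
    return completion.strip()


if __name__ == "__main__":
    # Made-up decoded string, for illustration only.
    sample = "[INST] who holds this neighborhood? [/INST] хто тримає цей район?</s>"
    print(extract_translation(sample))
```

Passing `skip_special_tokens=True` to `tokenizer.decode` is an alternative way to drop `</s>`; the prompt portion would still need to be removed.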

	### Training Dataset and Resources

	- Training code: [lang-uk/dragoman](https://github.com/lang-uk/dragoman)
	- Cleaned Paracrawl: [lang-uk/paracrawl_3m](https://huggingface.co/datasets/lang-uk/paracrawl_3m)
	- Cleaned Multi30K: [lang-uk/multi30k-extended-17k](https://huggingface.co/datasets/lang-uk/multi30k-extended-17k)



	### Benchmark Results against other models on FLORES-101 devtest


	| Model | BLEU ↑ | spBLEU | chrF | chrF++ |
	|-----------------------------|-------|--------|-------|--------|
	| **Finetuned** | | | | |
	| Dragoman P, 10 beams | 30.38 | 37.93 | 59.49 | 56.41 |
	| Dragoman PT, 10 beams | 32.34 | 39.93 | 60.72 | 57.82 |
	| **Zero shot and few shot** | | | | |
	| LLaMa-2-7B 2-shot | 20.1 | 26.78 | 49.22 | 46.29 |
	| RWKV-5-World-7B 0-shot | 21.06 | 26.20 | 49.46 | 46.46 |
	| gpt-4 10-shot | 29.48 | 37.94 | 58.37 | 55.38 |
	| gpt-4-turbo-preview 0-shot | 30.36 | 36.75 | 59.18 | 56.19 |
	| Google Translate 0-shot | 25.85 | 32.49 | 55.88 | 52.48 |
	| **Pretrained** | | | | |
	| NLLB 3B, 10 beams | 30.46 | 37.22 | 58.11 | 55.32 |
	| OPUS-MT, 10 beams | 32.2 | 39.76 | 60.23 | 57.38 |


	## Citation

	TBD