---
base_model: AI-Sweden-Models/gpt-sw3-6.7b-v2
library_name: peft
datasets:
- barbaroo/Sprotin_parallel
language:
- en
- fo
metrics:
- bleu
- chrf
- bertscore
pipeline_tag: text-generation
---

# Model Card: English–Faroese Translation Adapter

## Model Details

**Model Description**

- **Developed by:** Barbara Scalvini
- **Model type:** Language model adapter for **English → Faroese** translation
- **Language(s):** English, Faroese
- **License:** This adapter inherits the license from the original GPT-SW3 6.7B model.
- **Finetuned from model:** [AI-Sweden-Models/gpt-sw3-6.7b-v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2)
- **Library used:** [PEFT 0.13.0](https://github.com/huggingface/peft)

### Model Sources

- **Paper:** [COMING SOON]

---

## Uses

### Direct Use

This adapter is intended to perform **English→Faroese** translation, leveraging a **parameter-efficient fine-tuning** (PEFT) approach: only a small set of adapter weights is trained on top of the base GPT-SW3 model. A minimal loading sketch follows; a complete generation example is given under "How to Get Started with the Model".

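The snippet below is a minimal sketch of how the adapter attaches to the base model with PEFT. It assumes a GPU is available and uses default settings; prompting and decoding are shown in the full example further down.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "AI-Sweden-Models/gpt-sw3-6.7b-v2"
ADAPTER_REPO = "barbaroo/gptsw3_translate_6.7B"

# Load the base model, then attach the translation adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()
```
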
### Downstream Use

- Can be integrated into broader **multilingual** or **localization** workflows.

### Out-of-Scope Use

- Any uses that rely on languages other than **English or Faroese** will likely yield suboptimal results.
- Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.

---

## Bias, Risks, and Limitations

- **Biases:** The model may reflect **biases** present in the training data, such as historical or societal biases in English or Faroese texts.
- **Recommendation:** Users should **critically evaluate** outputs, especially in sensitive or high-stakes applications.

---

## How to Get Started with the Model

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ADAPTER_REPO = "barbaroo/gptsw3_translate_6.7B"
BASE_MODEL = "AI-Sweden-Models/gpt-sw3-6.7b-v2"

# 1. Load the tokenizer from the base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# 2. Load the base model with the adapter applied
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_REPO,
    load_in_8bit=True,   # Optional: 8-bit quantization for GPU memory efficiency
    device_map="auto",   # Automatically spread layers across available GPUs
)

# Ensure the model is in evaluation mode
model.eval()

# Alpaca-style prompt template
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

# EOS token from the tokenizer
EOS_TOKEN = tokenizer.eos_token

sentences = ["hello world"]

translations = []

for sentence in sentences:
    # Build the prompt for each sentence and tokenize it
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",  # instruction
                sentence,  # input sentence to translate
                "",        # response left blank for generation
            )
        ],
        return_tensors="pt",
    ).to("cuda")

    # Generate the output
    outputs = model.generate(
        **inputs,
        max_new_tokens=2000,
        eos_token_id=tokenizer.eos_token_id,  # stop at the end-of-sequence token
        pad_token_id=tokenizer.pad_token_id,  # padding token for generation
        use_cache=True,
        do_sample=True,
        temperature=0.1,
        top_p=1,
    )

    # Decode the generated tokens into a string
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]

    # Extract the text after "### Response:" and strip the EOS token
    try:
        response = output_string.split("Response:\n", 1)[1]
        translation = response.replace(EOS_TOKEN, "").strip()
    except IndexError:
        translation = ""
    translations.append(translation)

print(translations)
```

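For repeated or large-batch translation, the adapter can optionally be folded into the base weights so generation runs without the PEFT wrapper. This is a minimal sketch, assuming the model is loaded in full or half precision rather than 8-bit; `merge_and_unload()` is the PEFT call that merges the LoRA weights into the base model.

```python
from peft import AutoPeftModelForCausalLM

ADAPTER_REPO = "barbaroo/gptsw3_translate_6.7B"

# Load base model + adapter, then merge the LoRA weights into the base weights.
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_REPO,
    torch_dtype="auto",
    device_map="auto",
)
merged_model = model.merge_and_unload()
merged_model.eval()

# merged_model can now be used like the model in the example above,
# or saved with merged_model.save_pretrained("gptsw3-translate-merged").
```
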
## Training Details

### Training Data

We used the Sprotin parallel corpus for **English–Faroese** translation: [barbaroo/Sprotin_parallel](https://huggingface.co/datasets/barbaroo/Sprotin_parallel).

### Training Procedure

#### Preprocessing

- **Tokenization**: We used the tokenizer from the base model `AI-Sweden-Models/gpt-sw3-6.7b-v2`.
- **Prompt format**: The Alpaca prompt format was used, with Instruction, Input, and Response fields (see the formatting sketch below).

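As an illustration of that preprocessing, the sketch below maps parallel sentence pairs into the Alpaca prompt. It is a sketch only: the split name `train` and the column names `en` and `fo` are assumptions about the Sprotin dataset schema, not guaranteed by this card.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

BASE_MODEL = "AI-Sweden-Models/gpt-sw3-6.7b-v2"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Alpaca-style template, matching the one used at inference time
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

INSTRUCTION = "Translate this sentence from English to Faroese:"

def format_example(example):
    # "en"/"fo" column names are an assumption about the dataset layout
    text = alpaca_prompt.format(INSTRUCTION, example["en"], example["fo"])
    # Append EOS so the model learns where the translation ends
    return {"text": text + tokenizer.eos_token}

dataset = load_dataset("barbaroo/Sprotin_parallel", split="train")
dataset = dataset.map(format_example)
```
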
#### Training Hyperparameters

- **Epochs**: **3** total, with an **early stopping** criterion monitoring validation loss.
- **Batch size**: **2**, with **4** gradient-accumulation steps (effective batch size of 8).
- **Learning rate**: **2e-4**
- **Optimizer**: **AdamW**, with a linear learning-rate scheduler and warm-up (see the configuration sketch below).

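For reference, these hyperparameters roughly correspond to the `transformers.TrainingArguments` sketch below. It is not the exact training script: the warm-up ratio, evaluation cadence, and early-stopping patience are assumptions.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="gptsw3-translate-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    warmup_ratio=0.03,            # warm-up fraction: an assumption, not stated on this card
    eval_strategy="epoch",        # evaluate each epoch so early stopping can track val loss
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# Early stopping on validation loss; the patience value is an assumption
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```
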
---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- The model was evaluated on the **FLORES-200** benchmark (~1,012 English–Faroese sentence pairs).

#### Metrics and Results

- **BLEU**: 0.183
- **chrF**: 50.3
- **BERTScore (F1)**: 0.951

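Scores of this kind can be computed with the Hugging Face `evaluate` library, as sketched below. The sketch assumes `predictions` holds the model's Faroese outputs and `references` the FLORES-200 reference translations; the BERTScore backbone behind the reported figure is not specified on this card, so the library's multilingual default is used here.

```python
import evaluate

# System translations and FLORES-200 reference translations (one string each)
predictions = ["..."]  # model outputs for the English source sentences
references = ["..."]   # Faroese reference translations

bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

bleu_score = bleu.compute(predictions=predictions, references=[[r] for r in references])
chrf_score = chrf.compute(predictions=predictions, references=[[r] for r in references])
bert_score = bertscore.compute(predictions=predictions, references=references, lang="fo")

print(bleu_score["score"])  # note: sacrebleu reports BLEU on a 0-100 scale
print(chrf_score["score"])
print(sum(bert_score["f1"]) / len(bert_score["f1"]))
```
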
Human evaluation was also performed (see paper).

## Citation

[COMING SOON]

---

## Framework versions

- PEFT 0.13.0