---
base_model: AI-Sweden-Models/gpt-sw3-6.7b-v2
library_name: peft
datasets:
- barbaroo/Sprotin_parallel
language:
- en
- fo
metrics:
- bleu
- chrf
- bertscore
pipeline_tag: text-generation
---

# Model Card: English–Faroese Translation Adapter

## Model Details

**Model Description**

- **Developed by:** Barbara Scalvini
- **Model type:** Language model adapter for **English → Faroese** translation
- **Language(s):** English, Faroese
- **License:** This adapter inherits the license from the original GPT-SW3 6.7B model.
- **Finetuned from model:** [AI-Sweden-Models/gpt-sw3-6.7b-v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2)
- **Library used:** [PEFT 0.13.0](https://github.com/huggingface/peft)

### Model Sources

- **Paper:** [COMING SOON]

---

## Uses

### Direct Use

This adapter is intended to perform **English→Faroese** translation, leveraging a **parameter-efficient fine-tuning** (PEFT) approach: only a small set of adapter weights is trained on top of the base GPT-SW3 model. A minimal loading sketch follows; a complete generation example is given under "How to Get Started with the Model".

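The snippet below is a minimal sketch of how the adapter attaches to the base model with PEFT. It assumes a GPU is available and uses default settings; prompting and decoding are shown in the full example further down.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "AI-Sweden-Models/gpt-sw3-6.7b-v2"
ADAPTER_REPO = "barbaroo/gptsw3_translate_6.7B"

# Load the base model, then attach the translation adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_REPO)
model.eval()
```
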
### Downstream Use

- Can be integrated into broader **multilingual** or **localization** workflows.

### Out-of-Scope Use

- Any uses that rely on languages other than **English or Faroese** will likely yield suboptimal results.
- Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.

---

## Bias, Risks, and Limitations

- **Biases:** The model may reflect **biases** present in the training data, such as historical or societal biases in English or Faroese texts.
- **Recommendation:** Users should **critically evaluate** outputs, especially in sensitive or high-stakes applications.

---

## How to Get Started with the Model

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ADAPTER_REPO = "barbaroo/gptsw3_translate_6.7B"
BASE_MODEL = "AI-Sweden-Models/gpt-sw3-6.7b-v2"

# 1. Load the tokenizer from the base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# 2. Load the base model with the adapter applied
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_REPO,
    load_in_8bit=True,   # Optional: 8-bit quantization for GPU memory efficiency
    device_map="auto",   # Automatically spread layers across available GPUs
)

# Ensure the model is in evaluation mode
model.eval()

# Alpaca-style prompt template
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

# EOS token from the tokenizer
EOS_TOKEN = tokenizer.eos_token

sentences = ["hello world"]

translations = []

for sentence in sentences:
    # Build the prompt for each sentence and tokenize it
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",  # instruction
                sentence,  # input sentence to translate
                "",        # response left blank for generation
            )
        ],
        return_tensors="pt",
    ).to("cuda")

    # Generate the output
    outputs = model.generate(
        **inputs,
        max_new_tokens=2000,
        eos_token_id=tokenizer.eos_token_id,  # stop at the end-of-sequence token
        pad_token_id=tokenizer.pad_token_id,  # padding token for generation
        use_cache=True,
        do_sample=True,
        temperature=0.1,
        top_p=1,
    )

    # Decode the generated tokens into a string
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]

    # Extract the text after "### Response:" and strip the EOS token
    try:
        response = output_string.split("Response:\n", 1)[1]
        translation = response.replace(EOS_TOKEN, "").strip()
    except IndexError:
        translation = ""
    translations.append(translation)

print(translations)
```

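For repeated or large-batch translation, the adapter can optionally be folded into the base weights so generation runs without the PEFT wrapper. This is a minimal sketch, assuming the model is loaded in full or half precision rather than 8-bit; `merge_and_unload()` is the PEFT call that merges the LoRA weights into the base model.

```python
from peft import AutoPeftModelForCausalLM

ADAPTER_REPO = "barbaroo/gptsw3_translate_6.7B"

# Load base model + adapter, then merge the LoRA weights into the base weights.
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_REPO,
    torch_dtype="auto",
    device_map="auto",
)
merged_model = model.merge_and_unload()
merged_model.eval()

# merged_model can now be used like the model in the example above,
# or saved with merged_model.save_pretrained("gptsw3-translate-merged").
```
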
## Training Details

### Training Data

We used the Sprotin parallel corpus for **English–Faroese** translation: [barbaroo/Sprotin_parallel](https://huggingface.co/datasets/barbaroo/Sprotin_parallel).

### Training Procedure

#### Preprocessing

- **Tokenization**: We used the tokenizer from the base model `AI-Sweden-Models/gpt-sw3-6.7b-v2`.
- **Prompt format**: The Alpaca prompt format was used, with Instruction, Input, and Response fields (see the formatting sketch below).

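As an illustration of that preprocessing, the sketch below maps parallel sentence pairs into the Alpaca prompt. It is a sketch only: the split name `train` and the column names `en` and `fo` are assumptions about the Sprotin dataset schema, not guaranteed by this card.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

BASE_MODEL = "AI-Sweden-Models/gpt-sw3-6.7b-v2"
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Alpaca-style template, matching the one used at inference time
alpaca_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}
"""

INSTRUCTION = "Translate this sentence from English to Faroese:"

def format_example(example):
    # "en"/"fo" column names are an assumption about the dataset layout
    text = alpaca_prompt.format(INSTRUCTION, example["en"], example["fo"])
    # Append EOS so the model learns where the translation ends
    return {"text": text + tokenizer.eos_token}

dataset = load_dataset("barbaroo/Sprotin_parallel", split="train")
dataset = dataset.map(format_example)
```
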
#### Training Hyperparameters

- **Epochs**: **3** total, with an **early stopping** criterion monitoring validation loss.
- **Batch size**: **2**, with **4** gradient-accumulation steps (effective batch size of 8).
- **Learning rate**: **2e-4**
- **Optimizer**: **AdamW**, with a linear learning-rate scheduler and warm-up (see the configuration sketch below).

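For reference, these hyperparameters roughly correspond to the `transformers.TrainingArguments` sketch below. It is not the exact training script: the warm-up ratio, evaluation cadence, and early-stopping patience are assumptions.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="gptsw3-translate-adapter",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_torch",
    lr_scheduler_type="linear",
    warmup_ratio=0.03,            # warm-up fraction: an assumption, not stated on this card
    eval_strategy="epoch",        # evaluate each epoch so early stopping can track val loss
    save_strategy="epoch",
    load_best_model_at_end=True,  # required for EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# Early stopping on validation loss; the patience value is an assumption
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```
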
---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- The model was evaluated on the **FLORES-200** benchmark (~1,012 English–Faroese sentence pairs).

#### Metrics and Results

- **BLEU**: 0.183
- **chrF**: 50.3
- **BERTScore (F1)**: 0.951

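Scores of this kind can be computed with the Hugging Face `evaluate` library, as sketched below. The sketch assumes `predictions` holds the model's Faroese outputs and `references` the FLORES-200 reference translations; the BERTScore backbone behind the reported figure is not specified on this card, so the library's multilingual default is used here.

```python
import evaluate

# System translations and FLORES-200 reference translations (one string each)
predictions = ["..."]  # model outputs for the English source sentences
references = ["..."]   # Faroese reference translations

bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
bertscore = evaluate.load("bertscore")

bleu_score = bleu.compute(predictions=predictions, references=[[r] for r in references])
chrf_score = chrf.compute(predictions=predictions, references=[[r] for r in references])
bert_score = bertscore.compute(predictions=predictions, references=references, lang="fo")

print(bleu_score["score"])  # note: sacrebleu reports BLEU on a 0-100 scale
print(chrf_score["score"])
print(sum(bert_score["f1"]) / len(bert_score["f1"]))
```
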
Human evaluation was also performed (see paper).

## Citation

[COMING SOON]

---

## Framework versions

- PEFT 0.13.0