---
base_model: AI-Sweden-Models/gpt-sw3-6.7b-v2
library_name: peft
datasets:
- barbaroo/Sprotin_parallel
language:
- en
- fo
metrics:
- bleu
- chrf
- bertscore
pipeline_tag: text-generation
---
# Model Card: English–Faroese Translation Adapter
## Model Details
**Model Description**
- **Developed by:** Barbara Scalvini
- **Model type:** Language model adapter for **English → Faroese** translation
- **Language(s):** English, Faroese
- **License:** This adapter inherits the license from the original GPT-SW3 6.7B model.
- **Finetuned from model:** [AI-Sweden-Models/gpt-sw3-6.7b-v2](https://huggingface.co/AI-Sweden-Models/gpt-sw3-6.7b-v2)
- **Library used:** [PEFT 0.13.0](https://github.com/huggingface/peft)
### Model Sources
- **Paper:** [COMING SOON]
---
## Uses
### Direct Use
This adapter is intended to perform **English→Faroese** translation, leveraging a **parameter-efficient fine-tuning** (PEFT) approach.
### Downstream Use
- Can be integrated into broader **multilingual** or **localization** workflows.
### Out-of-Scope Use
- Translation involving languages other than **English and Faroese** will likely yield suboptimal results.
- Other tasks (e.g., summarization, classification) may be unsupported or require further fine-tuning.
---
## Bias, Risks, and Limitations
- **Biases:** The model could reflect **biases** present in the training data, such as historical or societal biases in English or Faroese texts.
- **Recommendation:** Users should **critically evaluate** outputs, especially in sensitive or high-stakes applications.
---
## How to Get Started with the Model
```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ADAPTER_REPO = "barbaroo/gptsw3_translate_6.7B"
BASE_MODEL = "AI-Sweden-Models/gpt-sw3-6.7b-v2"

# 1. Load the tokenizer from the base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# 2. Load the base model with the adapter applied
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_REPO,
    load_in_8bit=True,  # Optional: 8-bit quantization for GPU memory efficiency (requires bitsandbytes)
    device_map="auto",  # Automatically spread layers across available GPUs
)

# Ensure the model is in evaluation mode
model.eval()

# Alpaca-style prompt template
alpaca_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}
"""

# EOS token from the tokenizer
EOS_TOKEN = tokenizer.eos_token
print(EOS_TOKEN)

sentences = ['hello world']
translations = []

for sentence in sentences:
    # Build the prompt for this sentence and tokenize it
    inputs = tokenizer(
        [
            alpaca_prompt.format(
                "Translate this sentence from English to Faroese:",  # instruction
                sentence,  # input sentence to translate
                "",  # output - leave blank for generation
            )
        ],
        return_tensors="pt",
    ).to("cuda")

    # Generate the output (no gradients are needed at inference time)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=2000,
            eos_token_id=tokenizer.eos_token_id,  # Ensure EOS token is used
            pad_token_id=tokenizer.pad_token_id,  # Ensure padding token is used
            use_cache=True,
            do_sample=True,
            temperature=0.1,
            top_p=1,
        )

    # Decode the generated tokens into a string
    output_string = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]

    # Split on the response marker to extract the generated translation;
    # fall back to an empty string if the marker is missing
    parts = output_string.split('Response:\n', 1)
    if len(parts) == 2:
        translation = parts[1].replace(EOS_TOKEN, '').strip()
    else:
        translation = ''
    translations.append(translation)
    print(translation)
```
## Training Details
### Training Data
We used the Sprotin parallel corpus for **English–Faroese** translation: [barbaroo/Sprotin_parallel](https://huggingface.co/datasets/barbaroo/Sprotin_parallel).
### Training Procedure
#### Preprocessing
- **Tokenization**: We used the tokenizer from the base model `AI-Sweden-Models/gpt-sw3-6.7b-v2`.
- **Prompt format**: Training examples followed the Alpaca prompt format, with Instruction, Input, and Response fields.
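As an illustration of the prompt format, the sketch below fills the template with one English–Faroese sentence pair. The template mirrors the one in the inference snippet above. The helper name `format_example`, the sample sentence pair, the placeholder EOS string, and the choice to append the EOS token after the target are all assumptions made for illustration, not confirmed details of the training pipeline.

```python
# Same template as in the inference snippet
ALPACA_PROMPT = """
### Instruction:
{}
### Input:
{}
### Response:
{}
"""

INSTRUCTION = "Translate this sentence from English to Faroese:"


def format_example(english: str, faroese: str, eos_token: str) -> str:
    """Fill the template with one sentence pair and append the EOS token.

    Appending EOS so the model learns where to stop is an assumption here.
    """
    return ALPACA_PROMPT.format(INSTRUCTION, english, faroese) + eos_token


# Illustrative pair; "<|endoftext|>" is a placeholder for the real EOS string
example = format_example("Good morning.", "Góðan morgun.", "<|endoftext|>")
print(example)
```

In practice the EOS string would come from `tokenizer.eos_token` rather than a hard-coded value.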
#### Training Hyperparameters
- **Epochs**: **3** total, with an **early stopping** criterion monitoring validation loss.
- **Batch Size**: **2**, with **4** gradient accumulation steps
- **Learning Rate**: **2e-4**
- **Optimizer**: **AdamW** with a linear learning-rate scheduler and warm-up.
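To make these settings concrete, here is a minimal sketch of the effective batch size and of a linear schedule with warm-up. The batch size, accumulation steps, and learning rate come from the list above. The warm-up length and total step count are not reported here, so `WARMUP_STEPS` and `TOTAL_STEPS` are hypothetical values, and `linear_warmup_lr` is an illustrative helper rather than the training code itself.

```python
def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Learning rate at `step`: linear ramp-up during warm-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


PER_DEVICE_BATCH_SIZE = 2  # from the hyperparameter list
GRAD_ACCUM_STEPS = 4       # from the hyperparameter list
BASE_LR = 2e-4             # from the hyperparameter list
WARMUP_STEPS = 100         # hypothetical: not stated in this model card
TOTAL_STEPS = 1000         # hypothetical: not stated in this model card

# Sequences contributing to each optimizer update, per device
effective_batch_size = PER_DEVICE_BATCH_SIZE * GRAD_ACCUM_STEPS
print("effective batch size:", effective_batch_size)

# Sample the schedule at a few steps
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS))
```

With gradient accumulation, each optimizer update sees 2 × 4 = 8 sequences per device, which keeps memory use at the level of a batch of 2.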
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- The model was evaluated on the **FLORES-200** benchmark, using ~1012 English–Faroese sentence pairs.
#### Metrics and Results
- **BLEU**: **0.183**
- **chrF**: **50.3**
- **BERTScore F1**: **0.951**

Human evaluation was also performed (see paper).
## Citation
[COMING SOON]
---
## Framework versions
- PEFT 0.13.0