Fine-Tuning a Pre-Trained Model for English and Albanian
This project demonstrates how to fine-tune a pre-trained model for language tasks in both English and Albanian. We use transfer learning with a pre-trained multilingual model (e.g., multilingual BERT) and adapt it to specific tasks in these two languages, such as text classification, named entity recognition (NER), or sentiment analysis.
Requirements
Prerequisites
- Python 3.7+
- TensorFlow or PyTorch
- Hugging Face Transformers library
- CUDA-enabled GPU (recommended for faster training)
Dependencies
Install the following Python libraries using pip:

```bash
pip install torch transformers datasets
pip install tensorflow  # only if using TensorFlow
pip install tqdm
pip install scikit-learn
```
Model Overview
We fine-tune a pre-trained multilingual model (e.g., multilingual BERT (mBERT) or XLM-RoBERTa) to perform NLP tasks in both English and Albanian. These models are pre-trained on text from many languages, including English and Albanian, and are then fine-tuned on a custom dataset tailored to your task.
Example Pre-Trained Models:
- `bert-base-multilingual-cased`
- `xlm-roberta-base`
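Either checkpoint can be loaded interchangeably through the `Auto*` classes, which resolve the correct architecture from the checkpoint name. A minimal sketch (the `num_labels=2` value is just a placeholder for a binary task):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Swap between mBERT and XLM-R without changing any other code
model_name = 'xlm-roberta-base'  # or 'bert-base-multilingual-cased'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```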
Fine-Tuning Process
1. Load the Pre-Trained Model and Tokenizer
```python
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained multilingual model and its tokenizer
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)  # Adjust num_labels based on your task
```
2. Prepare the Dataset
You can fine-tune the model on your own English and Albanian data. Load an existing dataset with Hugging Face's `datasets` library, or prepare your own data in CSV or JSON format.
Example:
```python
from datasets import load_dataset

# Load the dataset (replace with the path to your own data)
dataset = load_dataset('csv', data_files='path_to_your_data.csv')
```
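The training and evaluation steps below assume separate train and validation splits. If you start from a single CSV file, one way to create them is the `train_test_split` method from the `datasets` library (the 10% validation fraction here is an arbitrary choice):

```python
from datasets import DatasetDict

# Split the single 'train' split into train/validation sets (90/10 is an arbitrary choice)
split = dataset['train'].train_test_split(test_size=0.1, seed=42)
dataset = DatasetDict({'train': split['train'], 'validation': split['test']})
```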
3. Preprocess the Data
Use the tokenizer to preprocess the dataset, converting text into token IDs compatible with the pre-trained model.
```python
def preprocess_function(examples):
    # Convert raw text into token IDs, padded/truncated to the model's maximum length
    return tokenizer(examples['text'], padding='max_length', truncation=True)

# Apply preprocessing to every split
tokenized_datasets = dataset.map(preprocess_function, batched=True)
```
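Before the tokenized examples can be batched by a PyTorch `DataLoader` (as in the next step), the label column usually needs to be renamed to `labels` and the dataset format set to PyTorch tensors. A sketch, assuming your CSV's label column is named `label`:

```python
# Assumes the CSV has a column named 'label'; rename it to the name the model expects
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')

# Return PyTorch tensors and keep only the columns the model consumes
tokenized_datasets.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
```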
4. Fine-Tuning the Model
Train the model on your dataset using either PyTorch or TensorFlow. Here's an example using PyTorch:
```python
import torch
from torch.utils.data import DataLoader

# Use a GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Set training parameters
train_dataset = tokenized_datasets['train']
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Set optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Training loop
model.train()
for epoch in range(3):
    for batch in train_dataloader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}, Loss: {loss.item()}")
```
5. Evaluate the Model
After training, evaluate the model’s performance using the validation or test dataset.
```python
import numpy as np
from sklearn.metrics import accuracy_score

# Build a dataloader for the validation split created earlier
eval_dataloader = DataLoader(tokenized_datasets['validation'], batch_size=16)

model.eval()

# Example evaluation loop
predictions = []
labels = []
for batch in eval_dataloader:
    with torch.no_grad():
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels.append(batch['labels'].numpy())
        outputs = model(input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=-1)
        predictions.append(preds.cpu().numpy())

accuracy = accuracy_score(np.concatenate(labels), np.concatenate(predictions))
print(f"Accuracy: {accuracy:.4f}")
```
Languages Supported
- English: The model is fine-tuned on English text for the task at hand (e.g., text classification, sentiment analysis).
- Albanian: The same model can be used for Albanian text, leveraging the multilingual pre-trained weights (see the inference sketch below). Performance may vary depending on the dataset, but mBERT and XLM-R are known to perform well for Albanian.
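As an illustration, the fine-tuned classifier can be applied to English and Albanian sentences in exactly the same way; the sentences and the interpretation of the predicted labels below are hypothetical:

```python
# Hypothetical example: classify one English and one Albanian sentence with the same model
texts = ["This movie was great.", "Ky film ishte i mrekullueshëm."]

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt').to(device)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_classes = logits.argmax(dim=-1).tolist()
print(predicted_classes)  # e.g. [1, 1] for a binary sentiment task
```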
Results
The fine-tuned model provides strong performance on both English and Albanian tasks. Results on the validation/test set should demonstrate good generalization across the two languages.
Example Results:
- Accuracy: 85% on the English dataset
- Accuracy: 80% on the Albanian dataset
Conclusion
By fine-tuning a pre-trained multilingual model, we significantly reduce the time and computational resources required for training a model from scratch. This approach leverages transfer learning, where the model has already learned general linguistic patterns from a wide variety of languages, allowing it to adapt to specific tasks in both English and Albanian.
License
This project is licensed under the MIT License - see the LICENSE file for details.