---
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_name: llama-8b-south-africa
languages:
  - Xhosa
  - Zulu
  - Tswana
  - Northern Sotho
  - Afrikaans
license: apache-2.0
tags:
  - african-languages
  - multilingual
  - instruction-tuning
  - transfer-learning
library_name: peft

model_description: |
  This model is a fine-tuned version of Meta's LLaMA-3.1-8B-Instruct model, specifically adapted for South African languages. The training data consists of the Alpaca Cleaned dataset translated into five South African languages: Xhosa, Zulu, Tswana, Northern Sotho, and Afrikaans using machine translation techniques.
  
  Key Features:
  - Base architecture: LLaMA-3.1-8B-Instruct
  - Training approach: Instruction tuning via translated datasets
  - Target languages: 5 South African languages
  - Cost-efficient: Total cost ~$1,870 ($370/language for translation + $15 for training)

training_details:
  hyperparameters:
    learning_rate: 0.0002
    train_batch_size: 4
    eval_batch_size: 8
    gradient_accumulation_steps: 2
    total_train_batch_size: 8
    optimizer: "Adam with betas=(0.9,0.999) and epsilon=1e-08"
    lr_scheduler_type: cosine
    lr_scheduler_warmup_ratio: 0.1
    num_epochs: 1
    seed: 42
    distributed_type: multi-GPU
  
  results:
    final_loss: 1.0959
    validation_loss: 0.0571
    total_steps: 5596
    completed_epochs: 0.9999

model_evaluation:
  xhosa:
    afrimgsm:
      accuracy: 0.02
    afrimmlu:
      accuracy: 0.29
    afrixnli:
      accuracy: 0.44
  zulu:
    afrimgsm:
      accuracy: 0.045
    afrimmlu:
      accuracy: 0.29
    afrixnli:
      accuracy: 0.43

limitations: |
  - Evaluation metrics are currently limited to Xhosa and Zulu, the only supported languages covered by the available Iroko benchmarks
  - Machine translation was used for training data generation, which may impact quality
  - Low performance on certain tasks (particularly AfriMGSM) suggests room for improvement

framework_versions:
  pytorch: 2.4.1+cu121
  transformers: 4.44.2
  peft: 0.12.0
  datasets: 3.0.0
  tokenizers: 0.19.1

resources:
  benchmark_visualization: assets/Benchmarks_(1).pdf
  training_dataset: https://huggingface.co/datasets/yahma/alpaca-cleaned
---

# LLaMA-3.1-8B South African Languages Model

This model card provides detailed information about the LLaMA-3.1-8B model fine-tuned for South African languages. The model demonstrates cost-effective cross-lingual transfer learning for African language processing.

## Model Overview

The model is based on Meta's LLaMA-3.1-8B-Instruct architecture and has been fine-tuned on translated versions of the Alpaca Cleaned dataset. The training approach leverages machine translation to create instruction-tuning data in five South African languages, making it a cost-effective solution for multilingual AI development.

## Training Methodology

### Dataset Preparation
The training data was created by translating the Alpaca Cleaned dataset into five target languages:
- Xhosa
- Zulu
- Tswana
- Northern Sotho
- Afrikaans

Machine translation was used to generate the training data, with a cost of $370 per language.
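
The exact translation pipeline is not documented on this card, but the preparation step amounts to mapping a translation function over the Alpaca Cleaned columns. The sketch below is illustrative only: `translate` is a hypothetical helper standing in for whichever machine-translation service was used, and the language codes are assumptions.

```python
from datasets import load_dataset

def translate(text: str, target_lang: str) -> str:
    # Hypothetical helper: replace with a call to your machine-translation provider.
    # Returning the input unchanged keeps this sketch runnable as a no-op.
    return text

# Alpaca Cleaned has "instruction", "input", and "output" columns.
alpaca = load_dataset("yahma/alpaca-cleaned", split="train")

def to_language(example, target_lang="xho"):
    return {
        "instruction": translate(example["instruction"], target_lang),
        "input": translate(example["input"], target_lang) if example["input"] else "",
        "output": translate(example["output"], target_lang),
    }

# Repeat for "zul", "tsn", "nso", and "afr", then concatenate the splits.
alpaca_xhosa = alpaca.map(to_language)
```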

### Training Process
The model was trained using the PEFT (Parameter-Efficient Fine-Tuning) library on the Akash Compute Network. Key aspects of the training process, sketched in code after this list, include:
- Single epoch training
- Multi-GPU distributed training setup
- Cosine learning rate schedule with 10% warmup
- Adam optimizer with β1=0.9, β2=0.999, ε=1e-08
- Total training cost: $15
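
The hyperparameters above map onto a standard Transformers + PEFT setup. The sketch below is illustrative rather than the actual training script: the LoRA rank, alpha, and target modules are assumptions (they are not stated on this card), while the optimizer settings come straight from the training details.

```python
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

# LoRA settings are assumptions; the card does not state rank, alpha, or target modules.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, lora)

# These values mirror the training_details section of this card.
args = TrainingArguments(
    output_dir="llama-8b-south-africa",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    seed=42,
)
# A Trainer (or TRL SFTTrainer) over the combined translated dataset would follow here.
```

With a per-device batch size of 4 and 2 gradient-accumulation steps, the effective batch size matches the reported total of 8.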

## Performance Evaluation

### Evaluation Scope
Current evaluation metrics are available for two languages:
1. Xhosa (xho)
2. Zulu (zul)

Evaluation was conducted using three benchmark datasets (a reproduction sketch follows the results below):

### AfriMGSM Results
- Xhosa: 2.0% accuracy
- Zulu: 4.5% accuracy

### AfriMMLU Results
- Xhosa: 29.0% accuracy
- Zulu: 29.0% accuracy

### AfriXNLI Results
- Xhosa: 44.0% accuracy
- Zulu: 43.0% accuracy
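
These numbers can in principle be reproduced with EleutherAI's lm-evaluation-harness. The snippet below is a sketch only: the task ids and the `peft=` model argument are assumptions and should be checked against the task list of the harness version in use.

```python
import lm_eval

# Evaluate the adapter on top of the base model; task ids are placeholders.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct,peft=llama-8b-south-africa",
    tasks=["afrimgsm_xho", "afrimmlu_xho", "afrixnli_xho"],
    batch_size=8,
)
print(results["results"])
```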

## Limitations and Considerations

1. **Evaluation Coverage**
   - Only Xhosa and Zulu could be evaluated due to limitations in available benchmarking tools
   - Performance on other supported languages remains unknown

2. **Training Data Quality**
   - Reliance on machine translation may impact the quality of training data
   - Potential artifacts or errors from the translation process could affect model performance

3. **Performance Gaps**
   - Notably low performance on AfriMGSM tasks indicates room for improvement
   - Further investigation needed to understand performance disparities across tasks

## Technical Requirements

The model requires the following framework versions (a quick environment check is sketched after the list):
- PyTorch: 2.4.1+cu121
- Transformers: 4.44.2
- PEFT: 0.12.0
- Datasets: 3.0.0
- Tokenizers: 0.19.1
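
A quick way to compare a local environment against these versions, using only the standard library:

```python
from importlib.metadata import PackageNotFoundError, version

# Framework versions this model was trained with, per the card.
trained_with = {
    "torch": "2.4.1+cu121",
    "transformers": "4.44.2",
    "peft": "0.12.0",
    "datasets": "3.0.0",
    "tokenizers": "0.19.1",
}

for pkg, expected in trained_with.items():
    try:
        print(f"{pkg}: installed {version(pkg)}, trained with {expected}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed (trained with {expected})")
```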

## Usage Example

Because this model is published as a PEFT adapter (see `library_name: peft` above), the base LLaMA-3.1-8B-Instruct checkpoint is loaded first and the adapter attached on top. The adapter repository id below is a placeholder; substitute the actual Hub id under which the adapter is hosted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Base checkpoint and fine-tuned LoRA adapter
# (replace `adapter_id` with the actual Hub repository id of this adapter).
base_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
adapter_id = "llama-8b-south-africa"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(base_id)
model = PeftModel.from_pretrained(base_model, adapter_id)

# Build a chat-formatted prompt and generate a response
messages = [{"role": "user", "content": "Translate to Xhosa: Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
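
If the LoRA weights are later merged into the base model and published as a full checkpoint, the `PeftModel` step can be dropped and the merged repository loaded directly with `AutoModelForCausalLM.from_pretrained`.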

## License

This model is released under the Apache 2.0 license. The full license text can be found at https://www.apache.org/licenses/LICENSE-2.0.txt

## Acknowledgments

- Meta AI for the base LLaMA-3.1-8B-Instruct model
- Akash Network for providing computing resources
- Contributors to the Alpaca Cleaned dataset
- The African NLP community for benchmark datasets and evaluation tools