Update README.md

This model is a distilled version of LLaMA 2, containing approximately 80 million parameters. It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was employed to transfer knowledge from a larger "teacher" model, Meta's 7B LLaMA 2, so that this smaller student model learns to mimic the teacher's behavior.
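A minimal usage sketch is shown below, assuming the checkpoint is published in the standard `transformers` format; the repository id is a placeholder, not this model's actual id.

```python
# Minimal usage sketch. The repository id below is a placeholder,
# not this model's actual Hugging Face id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your-namespace>/llama2-distilled-80M"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```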
### Model Architecture

Same architecture as LLaMA 2, but with the following changes:

| Parameter                 | Value |
|---------------------------|-------|
| Hidden Dimension          | 512   |
| Intermediate Dimension    | 1536  |
| Max Positional Embeddings | 128   |
| Attention Heads           | 8     |
| Transformer Layers        | 16    |
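
In `transformers` terms, the table above corresponds roughly to the `LlamaConfig` sketched below; fields not listed in the table (for example the vocabulary size) are library defaults or assumptions, not values taken from this card.

```python
# Sketch of the student configuration using transformers' LlamaConfig.
# Only the five values from the table above come from this card;
# every other field (vocab_size, num_key_value_heads, ...) is assumed/default.
from transformers import LlamaConfig

student_config = LlamaConfig(
    hidden_size=512,              # Hidden Dimension
    intermediate_size=1536,       # Intermediate Dimension
    max_position_embeddings=128,  # Max Positional Embeddings
    num_attention_heads=8,        # Attention Heads
    num_hidden_layers=16,         # Transformer Layers
)
```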
### Training Process
During each training step, the input data \( X \) is fed to both the teacher and student models. The student model computes output logits and a task loss against the true labels, while the teacher model only produces logits. The total loss combines the task-specific loss with a distillation loss.
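In the usual soft-target formulation (a sketch; the loss weight \( \alpha \) and softmax temperature \( T \) below are illustrative, not values stated in this section), the combined objective can be written as:

$$
\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{CE}}\big(y, z_{\mathrm{student}}\big) + (1 - \alpha) \, T^{2} \, \mathrm{KL}\!\left( \mathrm{softmax}\!\left(\tfrac{z_{\mathrm{teacher}}}{T}\right) \,\Big\|\, \mathrm{softmax}\!\left(\tfrac{z_{\mathrm{student}}}{T}\right) \right)
$$

where \( z_{\mathrm{student}} \) and \( z_{\mathrm{teacher}} \) are the student and teacher logits for \( X \), and \( y \) are the true labels. A minimal PyTorch sketch of one such step, under the same assumptions, might look like:

```python
# Sketch of a single distillation step (standard soft-target KD).
# alpha and temperature are illustrative hyperparameters, not values from this card;
# label shifting for causal-LM training is omitted for brevity.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, input_ids, labels, alpha=0.5, temperature=2.0):
    student_logits = student(input_ids).logits      # student forward pass
    with torch.no_grad():                           # teacher only produces logits
        teacher_logits = teacher(input_ids).logits

    # Task-specific loss: cross-entropy against the true labels.
    task_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )

    # Distillation loss: KL divergence between temperature-softened distributions.
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # Total loss combines the task loss and the distillation loss.
    return alpha * task_loss + (1 - alpha) * distill_loss
```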