HenryHHHH committed on
Commit 509ba5a · verified · 1 Parent(s): 781b70f

Update README.md

Files changed (1)
  1. README.md +12 -0
README.md CHANGED
@@ -27,6 +27,18 @@ base_model: meta-llama/LLaMA-2-7B
 
 This model is a distilled version of LLaMA 2, containing approximately 80 million parameters. It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was used to transfer knowledge from a larger "teacher" model, Meta's 7B LLaMA 2, so that this smaller model learns to mimic the teacher's behavior.
 
+### Model Architecture
+
+The architecture is the same as LLaMA 2, with the following changes:
+| Parameter                 | Value |
+|---------------------------|-------|
+| Hidden Dimension          | 512   |
+| Intermediate Dimension    | 1536  |
+| Max Positional Embeddings | 128   |
+| Attention Heads           | 8     |
+| Transformer Layers        | 16    |
+
+
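As a rough illustration of the table above, the same shapes can be expressed as a Hugging Face `transformers.LlamaConfig`. This is a sketch, not the repository's actual config file; in particular, the vocabulary size and key/value head count are assumptions taken from LLaMA 2 defaults, since the card does not state them.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Sketch of a student config matching the table above.
# vocab_size and num_key_value_heads are assumptions (LLaMA 2 defaults),
# not values stated in the model card.
student_config = LlamaConfig(
    vocab_size=32000,              # assumed LLaMA 2 tokenizer vocabulary
    hidden_size=512,               # Hidden Dimension
    intermediate_size=1536,        # Intermediate Dimension
    max_position_embeddings=128,   # Max Positional Embeddings
    num_attention_heads=8,         # Attention Heads
    num_key_value_heads=8,         # assumed: standard multi-head attention
    num_hidden_layers=16,          # Transformer Layers
)

model = LlamaForCausalLM(student_config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```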
 ### Training Process
 
 During each training step, the input data \( X \) is fed to both the teacher and student models. The student model computes output logits and a loss against the true labels, while the teacher model only produces logits. The total loss combines the task-specific loss and the distillation loss:
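The loss formula itself falls outside the lines shown in this hunk. As a hedged sketch only, a standard knowledge-distillation objective combines cross-entropy on the true labels with a temperature-scaled KL divergence between teacher and student logits; the `alpha` and `temperature` values below are illustrative and not taken from this model card.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=2.0):
    """Hypothetical combined loss; alpha and temperature are illustrative."""
    # Task-specific loss: cross-entropy between student logits and true labels.
    task_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    # Distillation loss: KL divergence between temperature-softened teacher
    # and student distributions. Teacher logits are detached, since the
    # teacher only generates logits and is not trained.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    distill_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    distill_loss = distill_loss * (temperature ** 2)
    # Total loss combines the task-specific and distillation terms.
    return alpha * task_loss + (1.0 - alpha) * distill_loss
```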