Update README.md

This model is a distilled version of LLaMA 2, containing approximately 80 million parameters. It was trained on a mix of the OpenWebText and WikiText Raw V1 datasets. Knowledge distillation was employed to transfer knowledge from a larger "teacher" model, Meta's 7B LLaMA 2, so that this smaller student model learns to mimic the teacher's behavior.
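A minimal usage sketch is shown below, assuming the checkpoint is published in the standard `transformers` format; the repository id is a placeholder, not this model's actual id.

```python
# Minimal usage sketch. The repository id below is a placeholder,
# not this model's actual Hugging Face id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your-namespace>/llama2-distilled-80M"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```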
### Model Architecture

Same architecture as LLaMA 2, but with the following changes:

| Parameter                 | Value |
|---------------------------|-------|
| Hidden Dimension          | 512   |
| Intermediate Dimension    | 1536  |
| Max Positional Embeddings | 128   |
| Attention Heads           | 8     |
| Transformer Layers        | 16    |
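
In `transformers` terms, the table above corresponds roughly to the `LlamaConfig` sketched below; fields not listed in the table (for example the vocabulary size) are library defaults or assumptions, not values taken from this card.

```python
# Sketch of the student configuration using transformers' LlamaConfig.
# Only the five values from the table above come from this card;
# every other field (vocab_size, num_key_value_heads, ...) is assumed/default.
from transformers import LlamaConfig

student_config = LlamaConfig(
    hidden_size=512,              # Hidden Dimension
    intermediate_size=1536,       # Intermediate Dimension
    max_position_embeddings=128,  # Max Positional Embeddings
    num_attention_heads=8,        # Attention Heads
    num_hidden_layers=16,         # Transformer Layers
)
```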
### Training Process
During each training step, the input data \( X \) is fed to both the teacher and student models. The student model computes output logits and a task loss against the true labels, while the teacher model only produces logits. The total loss combines the task-specific loss with a distillation loss.
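In the usual soft-target formulation (a sketch; the loss weight \( \alpha \) and softmax temperature \( T \) below are illustrative, not values stated in this section), the combined objective can be written as:

$$
\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{CE}}\big(y, z_{\mathrm{student}}\big) + (1 - \alpha) \, T^{2} \, \mathrm{KL}\!\left( \mathrm{softmax}\!\left(\tfrac{z_{\mathrm{teacher}}}{T}\right) \,\Big\|\, \mathrm{softmax}\!\left(\tfrac{z_{\mathrm{student}}}{T}\right) \right)
$$

where \( z_{\mathrm{student}} \) and \( z_{\mathrm{teacher}} \) are the student and teacher logits for \( X \), and \( y \) are the true labels. A minimal PyTorch sketch of one such step, under the same assumptions, might look like:

```python
# Sketch of a single distillation step (standard soft-target KD).
# alpha and temperature are illustrative hyperparameters, not values from this card;
# label shifting for causal-LM training is omitted for brevity.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, input_ids, labels, alpha=0.5, temperature=2.0):
    student_logits = student(input_ids).logits      # student forward pass
    with torch.no_grad():                           # teacher only produces logits
        teacher_logits = teacher(input_ids).logits

    # Task-specific loss: cross-entropy against the true labels.
    task_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )

    # Distillation loss: KL divergence between temperature-softened distributions.
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2

    # Total loss combines the task loss and the distillation loss.
    return alpha * task_loss + (1 - alpha) * distill_loss
```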