Update README.md
README.md
---
license: apache-2.0
language:
- en
tags:
- MoE
---

# LLaMA-MoE-v2-3.8B (1+1/7) SFT

[[💻 Code]](https://github.com/OpenSparseLLMs/LLaMA-MoE-v2) | [[📃 Technical Report]](https://arxiv.org/pdf/2411.15708)

LLaMA-MoE-v2 is a series of open-source Mixture-of-Experts (MoE) models based on [LLaMA3](https://github.com/facebookresearch/llama).
We build LLaMA-MoE-v2 with the following two steps:
1. **Partition** LLaMA's FFN layers or Attention layers into sparse experts and insert a top-K gate for each layer of experts (see the gating sketch below).
2. **Supervised fine-tune** the constructed MoE models on open-source data with two-stage training.
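As a rough illustration of step 1 only, the sketch below shows how a top-K gate can route each token to a small subset of partitioned FFN experts. The class name, dimensions, and expert layout are illustrative assumptions, not the repository's actual modeling code.

```python
# Minimal top-K expert-routing sketch with made-up shapes (hypothetical,
# not the LLaMA-MoE-v2 implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, hidden_size=64, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert stands in for one partition of the original FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size // num_experts),
                nn.SiLU(),
                nn.Linear(4 * hidden_size // num_experts, hidden_size),
            )
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = F.softmax(self.gate(x), dim=-1)        # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep the top-K experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(5, 64))  # 5 tokens, each routed through 2 of 8 experts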

| Model | \#Activated Experts | \#Experts | \#Activated Params | SFT Model |
| :-----------------------: | :-----------------: | :-------: | :----------------: | :------------------------------------------------------------------------: |
| **LLaMA-MLP-MoE (2/8)** | 2 | 8 | 3.8B | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-2_8-sft) |
| **LLaMA-MLP-MoE (1+1/7)** | 2 | 8 | 3.8B | [🤗 SFT](https://huggingface.co/llama-moe/LLaMA-MoE-v2-3_8B-residual-sft) |

## 🚀 QuickStart

```python
# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the MoE checkpoint (the custom modeling code needs trust_remote_code).
model_dir = "llama-moe/LLaMA-MoE-v2-3_8B-residual-sft"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.cuda()

# Wrap the prompt in the Llama-3 chat format expected by the SFT model.
input_text = "Could you recommend me some mystery novels?"
input_text = f"<|start_header_id|>user<|end_header_id|>\n\n{input_text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = tokenizer(input_text, return_tensors="pt")
input_ids = inputs["input_ids"].cuda()

# Sample a response and decode it.
pred = model.generate(input_ids, max_length=200, temperature=1.0, do_sample=True, use_cache=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
"""
I'd be delighted to recommend some mystery novels to you! Here are a few suggestions across various sub-genres:

**Classic Whodunit**

1. "And Then There Were None" by Agatha Christie - A timeless tale of ten strangers who are invited to an isolated island, only to be killed off one by one.
2. "The Murder on the Orient Express" by Agatha Christie - A classic whodunit set on a luxurious train traveling from Istanbul to Paris, where a famous author goes missing.
3. "The Devil in the White City" by Erik Larson - A non-fiction book that combines historical events with a mystery, exploring the 1893 World's Columbian Exposition in Chicago and the serial killer H.H. Holmes.

**Modern Whodunits**

1. "Gone Girl" by Gillian Flynn - A twisty, psychological thriller about a couple whose seemingly perfect ...
"""
```
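If the tokenizer shipped with the checkpoint includes a Llama-3 chat template (an assumption; the snippet above writes the special tokens by hand), the prompt can also be built with `apply_chat_template`:

```python
# Alternative prompt construction, assuming the bundled tokenizer provides a chat template.
messages = [{"role": "user", "content": "Could you recommend me some mystery novels?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).cuda()
pred = model.generate(input_ids, max_length=200, temperature=1.0, do_sample=True, use_cache=True)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```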

## 📊 Performance

| Model | #Training Tokens | MMLU(5) | GSM8k(8) | HumanEval(pass@10) | IFEval | BoolQ(32) | SciQ | PIQA | ARC-c(25) | TruthfulQA | HellaSwag(10) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| [LLaMA3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | 15T | 67.2 | 76.5 | 71.4 | 76.5 | 83.0 | 93.2 | 78.5 | 61.9 | 51.7 | 78.8 |
| [INCITE-3B](https://huggingface.co/togethercomputer/RedPajama-INCITE-Instruct-3B-v1) | 1T | 25.1 | 2.1 | 6.92 | 30.1 | 66.5 | 94.7 | 74.4 | 40.2 | 36.4 | 65.6 |
| [Sheared-LLaMA-2.7B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT) | 50B | 28.2 | 1.9 | 3.2 | 28.8 | 67.6 | 75.8 | 41.1 | 47.6 | 71.2 | 39.0 |
| [Gemma-2-2b](https://huggingface.co/google/gemma-2-2b-it) | 2T | 53.0 | 26.3 | 46.1 | 34.9 | 72.3 | 75.8 | 67.5 | 52.6 | 50.8 | 69.0 |
| [Salamandra-2b](https://huggingface.co/BSC-LT/salamandra-2b-instruct) | 7.8T | 25.1 | 1.90 | 5.82 | 27.7 | 68.0 | 89.8 | 74.7 | 46.3 | 43.4 | 62.3 |
| [SmolLM2-1.7B](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) | 11T | 50.4 | 38.5 | 39.1 | 29.0 | 68.2 | 84.3 | 76.0 | 53.2 | 39.9 | 72.6 |
| [OpenMoE-3B-9B](https://huggingface.co/OrionZheng/openmoe-8b-chat) | 1T | 26.5 | 1.36 | 1.01 | 31.2 | 61.7 | 68.4 | 65.7 | 33.3 | 40.5 | 56.5 |
| [LLaMA-MoE-3B-7B](https://huggingface.co/llama-moe/LLaMA-MoE-v1-3_5B-2_8-sft) | 200B | 28.2 | 4.62 | 12.0 | 28.1 | 68.1 | 88.8 | 77.9 | 44.0 | 33.3 | 73.2 |
| [OLMoE-1B-7B](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT) | 1T | 53.8 | 40.9 | 40.5 | 35.5 | 80.9 | 94.9 | 80.1 | 55.6 | 43.3 | 79.6 |
| **MLP-MoE (8top2)** | **7B** | 40.6 | 53.1 | 53.5 | 32.7 | 74.6 | 90.6 | 69.3 | 42.8 | 45.6 | 59.0 |
| **MLP-MoE (8top2)** | **8.4B** | 41.0 | **59.6** | **57.1** | 31.7 | 74.5 | 90.2 | 69.5 | 43.3 | 46.9 | 58.1 |
| **MLP-MoE (1+7top1)** | **7B** | 42.7 | 55.0 | 51.2 | **36.0** | 76.9 | 88.8 | 67.9 | 40.2 | 46.9 | 53.7 |

*Numbers in parentheses are the few-shot counts used for each benchmark; HumanEval is reported as pass@10.*
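The card does not say which harness produced these numbers. As one possible way to run a comparable evaluation, here is a sketch using the `lm-evaluation-harness` Python API (an assumption about tooling, not necessarily the authors' pipeline; shot counts must be set per task to match the table):

```python
# Hypothetical evaluation sketch with lm-evaluation-harness (pip install lm-eval);
# an assumed setup, not necessarily how the table above was produced.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=llama-moe/LLaMA-MoE-v2-3_8B-residual-sft,"
        "trust_remote_code=True,dtype=bfloat16"
    ),
    tasks=["mmlu"],  # other benchmarks use their own shot counts
    num_fewshot=5,   # matches the MMLU(5) column
    batch_size=8,
)
print(results["results"])
```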

## 📖 Citation

```bibtex
@misc{llama-moe-v2,
  title={LLaMA-MoE v2: Exploring Sparsity of LLaMA from Perspective of Mixture-of-Experts with Post-Training},
  author={Xiaoye Qu and Daize Dong and Xuyang Hu and Tong Zhu and Weigao Sun and Yu Cheng},
  year={2024},
  month={Nov},
  url={https://arxiv.org/abs/2411.15708}
}
```