renillhuang committed
Commit 054e6fd
Parent(s): 4871737

readme: Add model intro and testset rename
Signed-off-by: eric <[email protected]>

Files changed:
- README.md +81 -51
- README_zh.md +81 -55
- assets/imgs/data_src_dist.png +0 -0
- config.json +3 -3
- configuration_orion.py +4 -4
- modeling_orion.py +46 -46
README.md
CHANGED

@@ -48,9 +48,35 @@ tags:

- Orion-MOE8x7B-Base Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts model, trained from scratch by OrionStarAI. The base model is trained on a multilingual corpus including Chinese, English, Japanese, and Korean, and it exhibits superior performance in these languages.

- The Orion-MOE8x7B series models exhibit the following features:
  - The model demonstrates excellent performance in comprehensive evaluations compared to other base models of the same parameter scale.
  - It has strong multilingual capabilities, significantly leading on Japanese and Korean test sets, and also performing comprehensively better on Arabic, German, French, and Spanish test sets.
- Model hyper-parameters
  - The architecture of the OrionMOE 8x7B models closely resembles that of Mixtral 8x7B, with specific details shown in the table below.

|Configuration      |OrionMOE 8x7B|
|-------------------|-------------|
|Hidden Size        | 4096        |
|# Layers           | 32          |
|# Query Heads      | 32          |
|# KV Heads         | 8           |
|Intermediate Size  | 14592       |
|# Experts          | 8           |
|# Activated Experts| 2           |
|Embedding Tying    | False       |
|Position Embedding | RoPE        |
|Sequence Length    | 8192        |
|Vocabulary Size    | 113664      |

- Model pretrain hyper-parameters
  - We use the AdamW optimizer with hyperparameters set to β1 = 0.9, β2 = 0.95, and a weight decay of 0.1.
  - Training begins with a learning-rate warm-up phase over 2000 iterations, during which the learning rate is linearly increased to a peak of 3e-4. Afterward, a cosine schedule gradually reduces the learning rate to 3e-5 over the course of training (a minimal schedule sketch follows this hunk).
  - The model is trained in BF16/FP32 mixed precision with a batch size of 2600, processing approximately 22 million tokens per step.
- Model pretrain data distribution
  - The training data is primarily composed of English, Chinese, and other languages, accounting for 50%, 25%, and 12% of the data, respectively. Additionally, code makes up 9%, while mathematical text accounts for 4%. The distribution by topic is shown in the figure below.

<div align="center">
  <img src="./assets/imgs/data_src_dist.png" alt="pretraining data source distribution" width="80%" />
</div>


<a name="model-download"></a><br>
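As a concrete reading of the optimizer and schedule described in the hunk above, here is a minimal PyTorch sketch. It is an illustration only, not the project's training code; the stand-in model and the total step count are assumed values.

```python
# Sketch of the described setup: AdamW (beta1=0.9, beta2=0.95, weight decay 0.1),
# 2000-step linear warm-up to a peak LR of 3e-4, then cosine decay to 3e-5.
# `model` and `total_steps` are placeholders, not values from the actual run.
import math
import torch

model = torch.nn.Linear(16, 16)            # stand-in for the real network
peak_lr, min_lr = 3e-4, 3e-5
warmup_steps, total_steps = 2000, 100_000  # assumed training horizon

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_lambda(step: int) -> float:
    """Multiplier on peak_lr: linear warm-up, then cosine decay to min_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# per training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```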
@@ -68,69 +94,71 @@ Model release and download links are provided in the table below:


## 3.1. Base Model Orion-MOE8x7B-Base Benchmarks

### 3.1.1. LLM evaluation results on examination and professional knowledge

|TestSet      |Mixtral 8x7B|Qwen1.5-32B|Qwen2.5-32B|Orion 14B|Orion 8x7B|
|-------------|------|------|------|------|------|
|CEval        | 54.09 | 83.50 | 87.74 | 72.80 | 89.74 |
|CMMLU        | 53.21 | 82.30 | 89.01 | 70.57 | 89.16 |
|MMLU         | 70.40 | 73.40 | 82.90 | 69.94 | 85.90 |
|MMLU Pro     | 38.50 | 45.25 | 58.01 | 33.95 | 58.31 |
|ARC_c        | 85.08 | 90.17 | 94.24 | 79.66 | 91.86 |
|HellaSwag    | 81.95 | 81.98 | 82.51 | 78.53 | 89.19 |
|LAMBADA      | 76.79 | 73.74 | 75.37 | 78.83 | 79.74 |
|BBH          | 50.87 | 57.28 | 67.69 | 50.35 | 55.82 |
|MuSR         | 43.21 | 42.65 | 49.78 | 43.61 | 49.93 |
|PIQA         | 83.41 | 82.15 | 80.05 | 79.54 | 87.32 |
|CommonSenseQA| 69.62 | 74.69 | 72.97 | 66.91 | 73.05 |
|IFEval       | 24.15 | 32.97 | 41.59 | 29.08 | 30.06 |
|GPQA         | 30.90 | 33.49 | 49.50 | 28.53 | 52.17 |
|HumanEval    | 33.54 | 35.98 | 46.95 | 20.12 | 44.51 |
|MBPP         | 60.70 | 49.40 | 71.00 | 30.00 | 43.40 |
|MATH Lv5     |  9.00 | 25.00 | 31.72 |  2.54 |  5.07 |
|GSM8K        | 47.50 | 77.40 | 80.36 | 52.01 | 59.82 |
|MATH         | 28.40 | 36.10 | 48.88 |  7.84 | 23.68 |

### 3.1.2. Comparison of LLM performances on Japanese testsets

| Model        | JSQuAD | JCommonSenseQA | JNLI | MARC-ja | JAQKET v2 | PAWS-ja | avg |
|--------------|-------|-------|-------|-------|-------|-------|-------|
|Mixtral-8x7B  | 89.00 | 78.73 | 32.13 | 95.44 | 78.86 | 44.50 | 69.78 |
|Qwen1.5-32B   | 89.86 | 84.54 | 50.99 | 97.08 | 82.14 | 43.80 | 74.74 |
|Qwen2.5-32B   | 89.09 | 93.83 | 72.14 | 97.86 | 89.27 | 42.15 | 80.73 |
|Orion-14B-Base| 74.22 | 88.20 | 72.85 | 94.06 | 66.20 | 49.90 | 74.24 |
|Orion 8x7B    | 91.77 | 90.43 | 90.46 | 96.40 | 81.19 | 47.35 | 82.93 |

### 3.1.3. Comparison of LLM performances on Korean testsets

|Model         | HAE-RAE | KoBEST BoolQ | KoBEST COPA | KoBEST HellaSwag | KoBEST SentiNeg | KoBEST WiC | PAWS-ko | avg |
|--------------|-------|-------|-------|-------|-------|-------|-------|-------|
|Mixtral-8x7B  | 53.16 | 78.56 | 66.20 | 56.60 | 77.08 | 49.37 | 44.05 | 60.72 |
|Qwen1.5-32B   | 46.38 | 76.28 | 60.40 | 53.00 | 78.34 | 52.14 | 43.40 | 58.56 |
|Qwen2.5-32B   | 70.67 | 80.27 | 76.70 | 61.20 | 96.47 | 77.22 | 37.05 | 71.37 |
|Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
|Orion 8x7B    | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |

### 3.1.4. Comparison of LLM performances on Arabic, German, French, and Spanish testsets

| Lang         | ar    |       | de    |       | fr    |       | es    |       |
|--------------|-------|-------|-------|-------|-------|-------|-------|-------|
|**Model**     |**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|
|Mixtral-8x7B  | 53.16 | 78.56 | 66.20 | 56.60 | 77.08 | 49.37 | 44.05 | 60.72 |
|Qwen1.5-32B   | 46.38 | 76.28 | 60.40 | 53.00 | 78.34 | 52.14 | 43.40 | 58.56 |
|Qwen2.5-32B   | 70.67 | 80.27 | 76.70 | 61.20 | 96.47 | 77.22 | 37.05 | 71.37 |
|Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
|Orion 8x7B    | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |

### 3.1.5. Leakage Detection Benchmark

When the pre-training data of a large language model contains content from a specific benchmark, the model's performance on that benchmark may be artificially inflated, leading to inaccurate evaluations. To address this issue, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions proposed a simple and effective method for detecting data leakage. The method leverages the interchangeable nature of multiple-choice options: the options of each question in the original dataset are shuffled to generate a derived dataset, and the model's log-probability distribution over the derived dataset is then examined to detect whether the original dataset was leaked.

We conducted data leakage detection experiments on three benchmark datasets: MMLU, CMMLU, and C-Eval.

More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.

Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection. (An illustrative sketch of the option-shuffling idea is given after this hunk.)

|Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion 8x7B|Orion 14B|Mixtral 8x7B|
|------|------|------|------|------|------|
|MMLU  | 0.30 | 0.27 | 0.22 | 0.28 | 0.25 |
|CEval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
|CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

### 3.1.6. Inference speed [Todo]
Based on 8x Nvidia RTX3090, in units of tokens per second.

|OrionLLM_V2.4.6.1 | 1para_out62 | 1para_out85 | 1para_out125 | 1para_out210 |
|----|----|----|----|----|
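Returning to the leakage-detection method described in section 3.1.5 above, the sketch below illustrates the option-shuffling idea on a single toy question. It is a simplified illustration, not the reference implementation from the linked repository; the small stand-in model and the example question are assumptions.

```python
# Illustrative sketch of leakage detection via option shuffling: score every
# permutation of a question's options under a causal LM and check whether the
# original ordering is an outlier in log-probability.
from itertools import permutations

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; in practice this is the LLM under test
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

question = "What is the capital of France?"      # toy example question
options = ["Paris", "Berlin", "Madrid", "Rome"]  # options in their original order

def sequence_logprob(text: str) -> float:
    """Sum of token log-probabilities of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    return logp.gather(-1, ids[:, 1:].unsqueeze(-1)).sum().item()

def render(q: str, opts: list) -> str:
    return q + "\n" + "\n".join(f"{'ABCD'[i]}. {o}" for i, o in enumerate(opts)) + "\n"

scores = {perm: sequence_logprob(render(question, list(perm)))
          for perm in permutations(options)}
rank = sorted(scores.values(), reverse=True).index(scores[tuple(options)]) + 1
print(f"original ordering ranks {rank} of {len(scores)} permutations by log-prob")
# Aggregated over many questions, a consistently top-ranked original ordering
# suggests the benchmark may have been seen during pre-training.
```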
@@ -197,6 +225,8 @@ device, you can use something like `export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python demo/text_generation_base.py --model OrionStarAI/Orion-MOE8x7B-Base --tokenizer OrionStarAI/Orion-MOE8x7B-Base --prompt hello

```

## 4.3. [Todo] vLLM inference code


<a name="declarations-license"></a><br>
# 5. Declarations, License
README_zh.md
CHANGED

@@ -44,8 +44,32 @@

- Excellent results in comprehensive evaluations among base models of the same parameter scale
- Strong multilingual capability: clearly ahead on Japanese and Korean test sets, and comprehensively ahead on Arabic, German, French, and Spanish test sets as well

- Orion-MOE8x7B-Base model hyper-parameters
  - The Orion-MOE8x7B-Base architecture is close to Mixtral 8x7B; hyper-parameter details are listed in the table below.

|Configuration      |OrionMOE 8x7B|
|-------------------|-------------|
|Hidden Size        | 4096        |
|# Layers           | 32          |
|# Query Heads      | 32          |
|# KV Heads         | 8           |
|Intermediate Size  | 14592       |
|# Experts          | 8           |
|# Activated Experts| 2           |
|Embedding Tying    | False       |
|Position Embedding | RoPE        |
|Sequence Length    | 8192        |
|Vocabulary Size    | 113664      |

- Orion-MOE8x7B-Base training hyper-parameters
  - We use the AdamW optimizer with β1 = 0.9, β2 = 0.95 and a weight decay of 0.1.
  - Training starts with a 2000-iteration warm-up phase in which the learning rate rises linearly to a peak of 3e-4; a cosine schedule then gradually lowers it to 3e-5 over the rest of training.
  - The model is trained in BF16/FP32 mixed precision with a batch size of 2600, processing roughly 22 million tokens per step (a quick tokens-per-step check follows this hunk).
- Orion-MOE8x7B-Base training data composition
  - By language, the pre-training data mainly consists of English, Chinese, and other languages, accounting for roughly 50%, 25%, and 12%. By category, code accounts for 9% and mathematical text for 4%; the distribution is shown in the figure below.

<div align="center">
  <img src="./assets/imgs/data_src_dist.png" alt="pretraining data source distribution" width="80%" />
</div>


<a name="zh_model-download"></a><br>
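As a quick check on the "roughly 22 million tokens per step" figure above, the batch size and sequence length from the table imply the following (this assumes every sequence in a batch is packed to the full 8192-token context, which is our assumption rather than something stated in the README):

```python
# Rough tokens-per-step estimate, assuming fully packed 8192-token sequences.
batch_size = 2600   # sequences per step (from the training hyper-parameters)
seq_len = 8192      # context length (from the architecture table)
print(f"{batch_size * seq_len:,} tokens/step")  # 21,299,200, i.e. about the quoted ~22M
```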
@@ -65,76 +89,78 @@


## 3.1. Base Model Orion-MOE8x7B-Base Evaluation

### 3.1.1. Base model benchmark comparison

|TestSet      |Mixtral 8x7B|Qwen1.5-32B|Qwen2.5-32B|Orion 14B|Orion 8x7B|
|-------------|------|------|------|------|------|
|CEval        | 54.09 | 83.50 | 87.74 | 72.80 | 89.74 |
|CMMLU        | 53.21 | 82.30 | 89.01 | 70.57 | 89.16 |
|MMLU         | 70.40 | 73.40 | 82.90 | 69.94 | 85.90 |
|MMLU Pro     | 38.50 | 45.25 | 58.01 | 33.95 | 58.31 |
|ARC_c        | 85.08 | 90.17 | 94.24 | 79.66 | 91.86 |
|HellaSwag    | 81.95 | 81.98 | 82.51 | 78.53 | 89.19 |
|LAMBADA      | 76.79 | 73.74 | 75.37 | 78.83 | 79.74 |
|BBH          | 50.87 | 57.28 | 67.69 | 50.35 | 55.82 |
|MuSR         | 43.21 | 42.65 | 49.78 | 43.61 | 49.93 |
|PIQA         | 83.41 | 82.15 | 80.05 | 79.54 | 87.32 |
|CommonSenseQA| 69.62 | 74.69 | 72.97 | 66.91 | 73.05 |
|IFEval       | 24.15 | 32.97 | 41.59 | 29.08 | 30.06 |
|GPQA         | 30.90 | 33.49 | 49.50 | 28.53 | 52.17 |
|HumanEval    | 33.54 | 35.98 | 46.95 | 20.12 | 44.51 |
|MBPP         | 60.70 | 49.40 | 71.00 | 30.00 | 43.40 |
|MATH Lv5     |  9.00 | 25.00 | 31.72 |  2.54 |  5.07 |
|GSM8K        | 47.50 | 77.40 | 80.36 | 52.01 | 59.82 |
|MATH         | 28.40 | 36.10 | 48.88 |  7.84 | 23.68 |

### 3.1.2. Minor languages: Japanese

| Model        | JSQuAD | JCommonSenseQA | JNLI | MARC-ja | JAQKET v2 | PAWS-ja | avg |
|--------------|-------|-------|-------|-------|-------|-------|-------|
|Mixtral-8x7B  | 89.00 | 78.73 | 32.13 | 95.44 | 78.86 | 44.50 | 69.78 |
|Qwen1.5-32B   | 89.86 | 84.54 | 50.99 | 97.08 | 82.14 | 43.80 | 74.74 |
|Qwen2.5-32B   | 89.09 | 93.83 | 72.14 | 97.86 | 89.27 | 42.15 | 80.73 |
|Orion-14B-Base| 74.22 | 88.20 | 72.85 | 94.06 | 66.20 | 49.90 | 74.24 |
|Orion 8x7B    | 91.77 | 90.43 | 90.46 | 96.40 | 81.19 | 47.35 | 82.93 |

### 3.1.3. Minor languages: Korean

|Model         | HAE-RAE | KoBEST BoolQ | KoBEST COPA | KoBEST HellaSwag | KoBEST SentiNeg | KoBEST WiC | PAWS-ko | avg |
|--------------|-------|-------|-------|-------|-------|-------|-------|-------|
|Mixtral-8x7B  | 53.16 | 78.56 | 66.20 | 56.60 | 77.08 | 49.37 | 44.05 | 60.72 |
|Qwen1.5-32B   | 46.38 | 76.28 | 60.40 | 53.00 | 78.34 | 52.14 | 43.40 | 58.56 |
|Qwen2.5-32B   | 70.67 | 80.27 | 76.70 | 61.20 | 96.47 | 77.22 | 37.05 | 71.37 |
|Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
|Orion 8x7B    | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |

### 3.1.4. Minor languages: Arabic, German, French, Spanish

| Lang         | ar    |       | de    |       | fr    |       | es    |       |
|--------------|-------|-------|-------|-------|-------|-------|-------|-------|
|**Model**     |**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|**HellaSwag**|**ARC**|
|Mixtral-8x7B  | 53.16 | 78.56 | 66.20 | 56.60 | 77.08 | 49.37 | 44.05 | 60.72 |
|Qwen1.5-32B   | 46.38 | 76.28 | 60.40 | 53.00 | 78.34 | 52.14 | 43.40 | 58.56 |
|Qwen2.5-32B   | 70.67 | 80.27 | 76.70 | 61.20 | 96.47 | 77.22 | 37.05 | 71.37 |
|Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
|Orion 8x7B    | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |

### 3.1.5. Leakage detection results

When the pre-training data of a large language model contains content from a specific benchmark, the model's performance on that benchmark may be artificially inflated, leading to inaccurate evaluations. To address this issue, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions proposed a simple and effective method for detecting data leakage. The method leverages the interchangeable nature of multiple-choice options: the options in the original dataset are shuffled to generate derived data, and the model's log-probability distribution over the derived dataset is then used to detect whether the original dataset was leaked.

We conducted data leakage detection experiments on three benchmark datasets: MMLU, CMMLU, and C-Eval.

More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790

Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection

|Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion 8x7B|Orion 14B|Mixtral 8x7B|
|------|------|------|------|------|------|
|MMLU  | 0.30 | 0.27 | 0.22 | 0.28 | 0.25 |
|CEval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
|CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |

### 3.1.6. Inference speed [Todo: Remove result of 14B, add more description of result]
Based on 8x Nvidia RTX3090, in units of tokens per second.

|OrionLLM_V2.4.6.1 | 1para_out62 | 1para_out85 | 1para_out125 | 1para_out210 |
|----|----|----|----|----|
@@ -198,7 +224,7 @@ print(response)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python demo/text_generation_base.py --model OrionStarAI/Orion-MOE8x7B-Base --tokenizer OrionStarAI/Orion-MOE8x7B-Base --prompt 你好,你叫什么名字

```

## 4.3. [Todo] vLLM inference code


<a name="zh_declarations-license"></a><br>
assets/imgs/data_src_dist.png
ADDED
config.json
CHANGED

@@ -1,13 +1,13 @@

{
  "_name_or_path": "Orion-MoE 8x7b",
  "architectures": [
    "OrionMOECausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_orion.OrionMOEConfig",
    "AutoModelForCausalLM": "modeling_orion.OrionMOEForCausalLM"
  },
  "bos_token_id": 1,
  "eos_token_id": 2,
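The updated `auto_map` above points at classes that live inside this repository (`configuration_orion.OrionMOEConfig` and `modeling_orion.OrionMOEForCausalLM`), so loading the checkpoint through `transformers` requires `trust_remote_code=True`. A minimal loading sketch follows; the dtype, device placement, and generation settings are illustrative choices, not taken from the repository's demo script.

```python
# Minimal loading sketch relying on the auto_map entries above.
# Dtype, device placement, and generation settings are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "OrionStarAI/Orion-MOE8x7B-Base"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,  # assumption: BF16 matches the stated training precision
    device_map="auto",           # shard the layers/experts across visible GPUs
    trust_remote_code=True,      # needed because the model classes ship with the repo
)

inputs = tokenizer("hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```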
configuration_orion.py
CHANGED

@@ -8,12 +8,12 @@ from transformers.utils import logging

logger = logging.get_logger(__name__)


class OrionMOEConfig(PretrainedConfig):
    """
    Args:
        vocab_size (`int`, *optional*, defaults to 113664):
            Vocabulary size of the OrionMOE model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`OrionMOEModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 14592):

@@ -32,7 +32,7 @@ class OrionConfig(PretrainedConfig):

        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to `8192`):
            The maximum sequence length that this model might ever be used with. OrionMOE's sliding window attention
            allows sequence of up to 4096*32 tokens.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
modeling_orion.py
CHANGED

@@ -34,7 +34,7 @@ from transformers.utils import (

    replace_return_docstrings,
)
from transformers.utils.import_utils import is_torch_fx_available
from .configuration_orion import OrionMOEConfig


if is_flash_attn_2_available():

@@ -54,7 +54,7 @@ if is_torch_fx_available():


logger = logging.get_logger(__name__)

_CONFIG_FOR_DOC = "OrionMOEConfig"

def load_balancing_loss_func(
    gate_logits: torch.Tensor, num_experts: torch.Tensor = None, top_k=2, attention_mask: Optional[torch.Tensor] = None

@@ -145,10 +145,10 @@ def _get_unpad_data(attention_mask):

    )


class OrionMOERMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        OrionMOERMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))

@@ -162,7 +162,7 @@ class OrionRMSNorm(nn.Module):

        return self.weight * hidden_states.to(input_dtype)


class OrionMOERotaryEmbedding(nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

@@ -248,13 +248,13 @@ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:

    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


class OrionMOEAttention(nn.Module):
    """
    Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
    and "Generating Long Sequences with Sparse Transformers".
    """

    def __init__(self, config: OrionMOEConfig, layer_idx: Optional[int] = None):
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx

@@ -285,7 +285,7 @@ class OrionAttention(nn.Module):

        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

        self.rotary_emb = OrionMOERotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings,
            base=self.rope_theta,

@@ -376,9 +376,9 @@ class OrionAttention(nn.Module):

        return attn_output, attn_weights, past_key_value


class OrionMOEFlashAttention2(OrionMOEAttention):
    """
    OrionMOE flash attention module. This module inherits from `OrionMOEAttention` as the weights of the module stays
    untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
    flash attention and deal with padding tokens in case the input contains any of them.
    """

@@ -670,14 +670,14 @@ class OrionFlashAttention2(OrionAttention):

        )


class OrionMOESdpaAttention(OrionMOEAttention):
    """
    OrionMOE attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
    `OrionMOEAttention` as the weights of the module stays untouched. The only changes are on the forward pass to adapt to
    SDPA API.
    """

    # Adapted from OrionMOEAttention.forward
    def forward(
        self,
        hidden_states: torch.Tensor,

@@ -690,7 +690,7 @@ class OrionSdpaAttention(OrionAttention):

        if output_attentions:
            # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
            logger.warning_once(
                "OrionMOEModel is using OrionMOESdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
                'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
            )
            return super().forward(

@@ -758,14 +758,14 @@ class OrionSdpaAttention(OrionAttention):


ORION_ATTENTION_CLASSES = {
    "eager": OrionMOEAttention,
    "flash_attention_2": OrionMOEFlashAttention2,
    "sdpa": OrionMOESdpaAttention,
}


class OrionMOEBlockSparseTop2MLP(nn.Module):
    def __init__(self, config: OrionMOEConfig):
        super().__init__()
        self.ffn_dim = config.intermediate_size
        self.hidden_dim = config.hidden_size

@@ -782,7 +782,7 @@ class OrionBlockSparseTop2MLP(nn.Module):

        return current_hidden_states


class OrionMOESparseMoeBlock(nn.Module):
    """
    This implementation is
    strictly equivalent to standard MoE with full capacity (no

@@ -804,7 +804,7 @@ class OrionSparseMoeBlock(nn.Module):

        # gating
        self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)

        self.experts = nn.ModuleList([OrionMOEBlockSparseTop2MLP(config) for _ in range(self.num_experts)])

        # Jitter parameters
        self.jitter_noise = config.router_jitter_noise

@@ -847,16 +847,16 @@ class OrionSparseMoeBlock(nn.Module):

        return final_hidden_states, router_logits


class OrionMOEDecoderLayer(nn.Module):
    def __init__(self, config: OrionMOEConfig, layer_idx: int):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = ORION_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)

        self.block_sparse_moe = OrionMOESparseMoeBlock(config)
        self.input_layernorm = OrionMOERMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = OrionMOERMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,

@@ -935,7 +935,7 @@ ORION_START_DOCSTRING = r"""

    and behavior.

    Parameters:
        config ([`OrionMOEConfig`]):
            Model configuration class with all the parameters of the model. Initializing with a config file does not
            load the weights associated with the model, only the configuration. Check out the
            [`~PreTrainedModel.from_pretrained`] method to load the model weights.

@@ -943,15 +943,15 @@ ORION_START_DOCSTRING = r"""


@add_start_docstrings(
    "The bare OrionMOE Model outputting raw hidden-states without any specific head on top.",
    ORION_START_DOCSTRING,
)

class OrionMOEPreTrainedModel(PreTrainedModel):
    config_class = OrionMOEConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["OrionMOEDecoderLayer"]
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn_2 = True
    _supports_sdpa = True

@@ -1037,28 +1037,28 @@ ORION_INPUTS_DOCSTRING = r"""


@add_start_docstrings(
    "The bare OrionMOE Model outputting raw hidden-states without any specific head on top.",
    ORION_START_DOCSTRING,
)
class OrionMOEModel(OrionMOEPreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`OrionMOEDecoderLayer`]

    Args:
        config: OrionMOEConfig
    """

    def __init__(self, config: OrionMOEConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList(
            [OrionMOEDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self._attn_implementation = config._attn_implementation
        self.norm = OrionMOERMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing

@@ -1138,7 +1138,7 @@ class OrionModel(OrionPreTrainedModel):

        if is_padding_right:
            raise ValueError(
                "You are attempting to perform batched generation with padding_side='right'"
                " this may lead to unexpected behaviour for Flash Attention version of OrionMOE. Make sure to "
                " call `tokenizer.padding_side = 'left'` before tokenizing the input. "
            )

@@ -1235,12 +1235,12 @@ class OrionModel(OrionPreTrainedModel):

        )


class OrionMOEForCausalLM(OrionMOEPreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.model = OrionMOEModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.router_aux_loss_coef = config.router_aux_loss_coef

@@ -1438,9 +1438,9 @@ class OrionForCausalLM(OrionPreTrainedModel):


@add_start_docstrings(
    """
    The OrionMOE Model transformer with a sequence classification head on top (linear layer).

    [`OrionMOEForSequenceClassification`] uses the last token in order to do the classification, as other causal models
    (e.g. GPT-2) do.

    Since it does classification on the last token, it requires to know the position of the last token. If a

@@ -1451,11 +1451,11 @@ class OrionForCausalLM(OrionPreTrainedModel):

    """,
    ORION_START_DOCSTRING,
)
class OrionMOEForSequenceClassification(OrionMOEPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = OrionMOEModel(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
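The renamed OrionMOEBlockSparseTop2MLP and OrionMOESparseMoeBlock classes above implement the Mixtral-style sparse MoE layer (8 experts with 2 activated per token, per the README table). The snippet below is a generic, self-contained illustration of that top-2 routing pattern, offered as a sketch of the general technique rather than a copy of the repository's implementation.

```python
# Generic top-2 expert routing sketch (illustrative only, not the repo's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTop2MoE(nn.Module):
    def __init__(self, hidden_dim=64, ffn_dim=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.SiLU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, hidden_dim)
        probs = F.softmax(self.gate(x), dim=-1)                # (tokens, experts)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)   # pick 2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                          # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(5, 64)
print(TinyTop2MoE()(tokens).shape)  # torch.Size([5, 64])
```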
|