renillhuang committed on
Commit
94f9d42
1 Parent(s): 054e6fd

readme: Update inference code


Signed-off-by: eric <[email protected]>

README.md CHANGED
@@ -40,9 +40,11 @@ tags:
40
  - [📖 Model Introduction](#model-introduction)
41
  - [🔗 Model Download](#model-download)
42
  - [🔖 Model Benchmark](#model-benchmark)
 
43
  - [📜 Declarations & License](#declarations-license)
44
  - [🥇 Company Introduction](#company-introduction)
45
 
 
46
  <a name="model-introduction"></a><br>
47
  # 1. Model Introduction
48
 
@@ -52,9 +54,9 @@ tags:
52
  - The model demonstrates excellent performance in comprehensive evaluations compared to other base models of the same parameter scale.
53
  - It has strong multilingual capabilities, leading by a significant margin on Japanese and Korean test sets and also outperforming comparable models on the Arabic, German, French, and Spanish test sets.
54
  - Model Hyper-Parameters
55
- - The architecture of the OrionMOE 8*7B models closely resembles that of Mixtral 8*7B, with specific details shown in the table below.
56
 
57
- |Configuration |OrionMOE 8*7B|
58
  |-------------------|-------------|
59
  |Hidden Size | 4096 |
60
  |# Layers | 32 |
@@ -75,7 +77,7 @@ tags:
75
  - Model pretrain data distribution
76
  - The training dataset is primarily composed of English, Chinese, and other languages, accounting for 50%, 25%, and 12% of the data, respectively. Additionally, code makes up 9%, while mathematical text accounts for 4%. The distribution by topics is detailed in the table below.
77
  <div align="center">
78
- <img src="./assets/imgs/data_src_dist.png" alt="logo" width="80%" />
79
  </div>
80
 
81
 
@@ -84,8 +86,8 @@ tags:
84
 
85
  Model release and download links are provided in the table below:
86
 
87
- | Model Name | HuggingFace Download Links | ModelScope Download Links |
88
- |----------------------|-----------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
89
  | ⚾Orion-MOE8x7B-Base | [Orion-MOE8x7B-Base](https://huggingface.co/OrionStarAI/Orion-MOE8x7B-Base) | [Orion-MOE8x7B-Base](https://modelscope.cn/models/OrionStarAI/Orion-MOE8x7B-Base/summary) |
90
 
91
 
@@ -94,7 +96,7 @@ Model release and download links are provided in the table below:
94
 
95
  ## 3.1. Base Model Orion-MOE8x7B-Base Benchmarks
96
  ### 3.1.1. LLM evaluation results on examination and professional knowledge
97
- |TestSet|Mixtral 8*7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion 8*7B|
98
  | ----------- | ----- | ----- | ----- | ----- | ----- |
99
  |CEval | 54.09 | 83.50 | 87.74 | 72.80 | 89.74 |
100
  |CMMLU | 53.21 | 82.30 | 89.01 | 70.57 | 89.16 |
@@ -146,10 +148,8 @@ Model release and download links are provided in the table below:
146
  ### 3.1.5. Leakage Detection Benchmark
147
  When the pre-training data of a large language model contains content from a specific dataset, the model’s performance on that dataset may be artificially enhanced, leading to inaccurate performance evaluations. To address this issue, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions have proposed a simple and effective method for detecting data leakage. This method leverages the interchangeable nature of multiple-choice options by shuffling the options in the original dataset to generate derived data. The log-probability distribution of the derived dataset is then computed using the model to detect whether the original dataset has been leaked.
148
 
149
- We conducted data leakage detection experiments on three benchmark datasets: MMLU, CMMLU, and C-Eval.
150
-
151
- More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.
152
-
153
  Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
154
 
155
  |Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion 8x7B|Orion 14B|Mixtral 8x7B|
@@ -158,31 +158,29 @@ Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
158
  |CEval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
159
  |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
160
 
161
- ### 3.1.6. Inference speed[Todo]
162
- Based on 8x Nvidia RTX3090, in unit of tokens per second.
163
- |OrionLLM_V2.4.6.1 | 1para_out62 | 1para_out85 | 1para_out125 | 1para_out210 |
164
- |----|----|----|----|----|
165
- |OrionMOE | 33.03544296 | 33.43113606 | 33.53014102 | 33.58693529 |
166
- |Qwen32B | 26.46267188 | 26.72846906 | 26.80413838 | 27.03123611 |
167
- |Orion14B | 41.69121312 | 41.77423491 | 41.76050902 | 42.26096669 |
168
-
169
- |OrionLLM_V2.4.6.1 | 4para_out62 | 4para_out90 | 4para_out125 | 4para_out220 |
170
- |----|----|----|----|----|
171
- |OrionMOE | 29.45015743 | 30.4472947 | 31.03748516 | 31.45783599 |
172
- |Qwen32B | 23.60912215 | 24.30431956 | 24.86132023 | 25.16827535 |
173
- |Orion14B | 38.08240373 | 38.8572788 | 39.50040645 | 40.44875947 |
174
-
175
- |OrionLLM_V2.4.6.1 | 8para_out62 | 8para_out85 | 8para_out125 | 8para_out220 |
176
- |----|----|----|----|----|
177
- |OrionMOE | 25.71006327 | 27.13446743 | 28.89463226 | 29.70440167 |
178
- |Qwen32B | 21.15920951 | 21.92001035 | 23.13867947 | 23.5649106 |
179
- |Orion14B | 34.4151923 | 36.05635893 | 37.0874908 | 37.91705944 |
180
 
181
- <div align="center">
182
- <img src="./assets/imgs/inf_spd_en.png" alt="inf_speed" width="100%" />
183
- </div>
188
  <a name="model-inference"></a><br>
@@ -225,7 +223,21 @@ device, you can use something like `export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`
225
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python demo/text_generation_base.py --model OrionStarAI/Orion-MOE8x7B-Base --tokenizer OrionStarAI/Orion-MOE8x7B-Base --prompt hello
226
 
227
  ```
228
- ## 4.3. [Todo] vLLM inference code
229
 
230
 
231
  <a name="declarations-license"></a><br>
 
40
  - [📖 Model Introduction](#model-introduction)
41
  - [🔗 Model Download](#model-download)
42
  - [🔖 Model Benchmark](#model-benchmark)
43
+ - [📊 Model Inference](#model-inference)
44
  - [📜 Declarations & License](#declarations-license)
45
  - [🥇 Company Introduction](#company-introduction)
46
 
47
+
48
  <a name="model-introduction"></a><br>
49
  # 1. Model Introduction
50
 
 
54
  - The model demonstrates excellent performance in comprehensive evaluations compared to other base models of the same parameter scale.
55
  - It has strong multilingual capabilities, leading by a significant margin on Japanese and Korean test sets and also outperforming comparable models on the Arabic, German, French, and Spanish test sets.
56
  - Model Hyper-Parameters
57
+ - The architecture of the OrionMOE 8x7B models closely resembles that of Mixtral 8x7B, with specific details shown in the table below.
58
 
59
+ |Configuration |OrionMOE 8x7B|
60
  |-------------------|-------------|
61
  |Hidden Size | 4096 |
62
  |# Layers | 32 |
 
77
  - Model pretrain data distribution
78
  - The training dataset is primarily composed of English, Chinese, and other languages, accounting for 50%, 25%, and 12% of the data, respectively. Additionally, code makes up 9%, while mathematical text accounts for 4%. The distribution by topics is detailed in the table below.
79
  <div align="center">
80
+ <img src="./assets/imgs/data_src_dist.png" alt="data source distribution" width="70%" />
81
  </div>
82
 
83
 
 
86
 
87
  Model release and download links are provided in the table below:
88
 
89
+ | Model Name | HuggingFace Download Links | ModelScope Download Links |
90
+ |------------|----------------------------|---------------------------|
91
  | ⚾Orion-MOE8x7B-Base | [Orion-MOE8x7B-Base](https://huggingface.co/OrionStarAI/Orion-MOE8x7B-Base) | [Orion-MOE8x7B-Base](https://modelscope.cn/models/OrionStarAI/Orion-MOE8x7B-Base/summary) |
92
 
93
 
 
96
 
97
  ## 3.1. Base Model Orion-MOE8x7B-Base Benchmarks
98
  ### 3.1.1. LLM evaluation results on examination and professional knowledge
99
+ |TestSet|Mixtral 8x7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion 8x7B|
100
  | ----------- | ----- | ----- | ----- | ----- | ----- |
101
  |CEval | 54.09 | 83.50 | 87.74 | 72.80 | 89.74 |
102
  |CMMLU | 53.21 | 82.30 | 89.01 | 70.57 | 89.16 |
 
148
  ### 3.1.5. Leakage Detection Benchmark
149
  When the pre-training data of a large language model contains content from a specific dataset, the model’s performance on that dataset may be artificially enhanced, leading to inaccurate performance evaluations. To address this issue, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions have proposed a simple and effective method for detecting data leakage. This method leverages the interchangeable nature of multiple-choice options by shuffling the options in the original dataset to generate derived data. The log-probability distribution of the derived dataset is then computed using the model to detect whether the original dataset has been leaked.
150
 
151
+ We conducted data leakage detection experiments on three benchmark datasets: MMLU, CMMLU, and C-Eval.<br>
152
+ More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.<br>
 
 
153
  Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
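The shuffle-and-score idea can be sketched as follows. This is a simplified illustration only: the checkpoint name, the prompt format, and the permutation-rank statistic below are assumptions made for the example, not the exact procedure from the paper or the linked test code.

```python
import itertools

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "OrionStarAI/Orion-MOE8x7B-Base"   # any causal LM checkpoint works for this sketch
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

def sequence_logprob(text):
    """Total log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[:, :-1]
    logps = torch.log_softmax(logits.float(), dim=-1)
    token_logps = logps.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logps.sum().item()

def original_order_rank(question, options):
    """Rank (1 = highest log-prob) of the published option order among all
    shuffled permutations. Ranks that are consistently 1 across a dataset
    suggest the model has seen the questions in their original form."""
    scores = {}
    for perm in itertools.permutations(options):
        lines = [f"{chr(65 + i)}. {opt}" for i, opt in enumerate(perm)]
        scores[perm] = sequence_logprob(question + "\n" + "\n".join(lines))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(tuple(options)) + 1
```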
154
 
155
  |Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion 8x7B|Orion 14B|Mixtral 8x7B|
 
158
  |CEval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
159
  |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
160
 
161
+ ### 3.1.6. Inference speed
162
+ We set up an inference server on 8x Nvidia RTX3090 GPUs and measured client-side throughput in tokens per second.
163
 
164
+ |OrionLLM_V2.4.6.1|1para_out62|1para_out85|1para_out125|1para_out210|
165
+ |---------|-------|-------|-------|-------|
166
+ |OrionMOE | 33.04 | 33.43 | 33.53 | 33.59 |
167
+ |Qwen32B | 26.46 | 26.73 | 26.80 | 27.03 |
168
+
169
+ |OrionLLM_V2.4.6.1|4para_out62|4para_out90|4para_out125|4para_out220|
170
+ |---------|-------|-------|-------|-------|
171
+ |OrionMOE | 29.45 | 30.45 | 31.04 | 31.46 |
172
+ |Qwen32B | 23.61 | 24.30 | 24.86 | 25.17 |
173
 
174
+ |OrionLLM_V2.4.6.1|8para_out62|8para_out85|8para_out125|8para_out220|
175
+ |---------|-------|-------|-------|-------|
176
+ |OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
177
+ |Qwen32B | 21.16 | 21.92 | 23.14 | 23.56 |
178
 
179
+ We found that inference speed varies with the number of concurrent requests and the output length. To facilitate horizontal comparison, we ran multiple sets of tests, each labeled \<n>para_out\<m>: for example, "4para_out220" denotes the speed measured with 4 concurrent client requests and an average output length of 220 tokens.
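For reference, a number such as "4para_out220" could be reproduced with a small client along these lines. The sketch assumes the OpenAI-compatible vLLM service from section 4.3 below is running on port 9999 and that its responses include a token-usage field; the prompt and helper names are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://0.0.0.0:9999/v1/chat/completions"
N_PARALLEL = 4      # the "4para" part of the label
MAX_TOKENS = 220    # target output length, the "out220" part
PROMPT = "Write a short introduction to large language models."

def one_request(_):
    payload = {
        "model": "orion-moe",
        "temperature": 0.2,
        "stream": False,
        "max_tokens": MAX_TOKENS,
        "messages": [{"role": "user", "content": PROMPT}],
    }
    resp = requests.post(URL, json=payload, timeout=600).json()
    return resp["usage"]["completion_tokens"]   # assumes the server reports usage

start = time.time()
with ThreadPoolExecutor(max_workers=N_PARALLEL) as pool:
    total_tokens = sum(pool.map(one_request, range(N_PARALLEL)))
elapsed = time.time() - start
print(f"{N_PARALLEL}para_out{MAX_TOKENS}: {total_tokens / elapsed:.2f} tokens/s")
```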
180
+
181
+ <div align="center">
182
+ <img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="100%" />
183
+ </div>
184
 
185
 
186
  <a name="model-inference"></a><br>
 
223
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python demo/text_generation_base.py --model OrionStarAI/Orion-MOE8x7B-Base --tokenizer OrionStarAI/Orion-MOE8x7B-Base --prompt hello
224
 
225
  ```
226
+ ## 4.3. vLLM Inference Service
227
+ Download the project (https://github.com/OrionStarAI/vllm_server) and follow its instructions to build the vLLM service Docker image.
228
+ ```shell
229
+ git clone git@github.com:OrionStarAI/vllm_server.git
230
+ cd vllm_server
231
+ docker build -t vllm_server:0.0.0.0 -f Dockerfile .
232
+ ```
233
+ Start the Docker service
234
+ ```shell
235
+ docker run --gpus all -it -p 9999:9999 -v $(pwd)/logs:/workspace/logs:rw -v $HOME/Downloads:/workspace/models -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -e MODEL_DIR=Orion-MOE8x7B-Base -e MODEL_NAME=orion-moe vllm_server:0.0.0.0
236
+ ```
237
+ Run inference
238
+ ```shell
239
+ curl http://0.0.0.0:9999/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "orion-moe","temperature": 0.2,"stream": false, "messages": [{"role": "user", "content":"Which company developed you as an AI agent?"}]}'
240
+ ```
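The same request can also be issued from Python, e.g. with the `requests` package. This is a minimal sketch; the response field access assumes the standard OpenAI-compatible schema.

```python
import requests

payload = {
    "model": "orion-moe",
    "temperature": 0.2,
    "stream": False,
    "messages": [{"role": "user", "content": "Which company developed you as an AI agent?"}],
}
resp = requests.post("http://0.0.0.0:9999/v1/chat/completions", json=payload, timeout=300)
# Assumes an OpenAI-compatible response body: choices[0].message.content
print(resp.json()["choices"][0]["message"]["content"])
```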
241
 
242
 
243
  <a name="declarations-license"></a><br>
README_zh.md CHANGED
@@ -31,6 +31,7 @@
31
 - [📖 Model Introduction](#zh_model-introduction)
32
 - [🔗 Model Download](#zh_model-download)
33
 - [🔖 Model Benchmark](#zh_model-benchmark)

34
 - [📜 Declarations & License](#zh_declarations-license)
35
 - [🥇 Company Introduction](#zh_company-introduction)
36
 
@@ -47,7 +48,7 @@
47
 - Orion-MOE8x7B-Base model hyper-parameters
48
 - The Orion-MOE8x7B-Base architecture is close to that of Mixtral 8x7B; see the table below for hyper-parameter details
49
 
50
- |Configuration |OrionMOE 8*7B|
51
  |-------------------|-------------|
52
  |Hidden Size | 4096 |
53
  |# Layers | 32 |
@@ -68,7 +69,7 @@
68
 - Orion-MOE8x7B-Base training data composition
69
 - By language, the pretraining data consists mainly of English, Chinese, and other languages, accounting for 50%, 25%, and 12% respectively; by category, code makes up 9% and mathematical text 4%. The distribution is shown in the figure below.
70
  <div align="center">
71
- <img src="./assets/imgs/data_src_dist.png" alt="logo" width="80%" />
72
  </div>
73
 
74
 
@@ -77,10 +78,9 @@
77
 
78
 Model release and download links are provided in the table below:
79
 
80
- | Model Name | HuggingFace Download Links | ModelScope Download Links |
81
- |---------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
82
- | ⚾ Base Model | [Orion-MOE8x7B-Base](https://huggingface.co/OrionStarAI/Orion-MOE8x7B-Base) | [Orion-MOE8x7B-Base](https://modelscope.cn/models/OrionStarAI/Orion-MOE8x7B-Base/summary) |
83
-
84
 
85
 
86
  <a name="zh_model-benchmark"></a><br>
@@ -89,7 +89,7 @@
89
 ## 3.1. Base Model Orion-MOE8x7B-Base Evaluation
90
 
91
 ### 3.1.1. Base model benchmark comparison
92
- |TestSet|Mixtral 8*7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion 8*7B|
93
  | ----------- | ----- | ----- | ----- | ----- | ----- |
94
  |CEval | 54.09 | 83.50 | 87.74 | 72.80 | 89.74 |
95
  |CMMLU | 53.21 | 82.30 | 89.01 | 70.57 | 89.16 |
@@ -110,8 +110,6 @@
110
  |GSM8K | 47.50 | 77.40 | 80.36 | 52.01 | 59.82 |
111
  |MATH | 28.40 | 36.10 | 48.88 | 7.84 | 23.68 |
112
 
113
-
114
-
115
 ### 3.1.2. Minor languages: Japanese
116
  | Model | JSQuAD | JCommonSenseQA | JNLI | MARC-ja | JAQKET v2 | PAWS-ja | avg |
117
  |--------------|-------|-------|-------|-------|-------|-------|-------|
@@ -121,7 +119,6 @@
121
  |Orion-14B-Base| 74.22 | 88.20 | 72.85 | 94.06 | 66.20 | 49.90 | 74.24 |
122
  |Orion 8x7B | 91.77 | 90.43 | 90.46 | 96.40 | 81.19 | 47.35 | 82.93 |
123
 
124
-
125
 ### 3.1.3. Minor languages: Korean
126
  |Model | HAE-RAE | KoBEST BoolQ | KoBEST COPA | KoBEST HellaSwag | KoBEST SentiNeg | KoBEST WiC | PAWS-ko | avg |
127
  |--------------|-------|-------|-------|-------|-------|-------|-------|-------|
@@ -131,8 +128,6 @@
131
  |Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
132
  |Orion 8x7B | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |
133
 
134
-
135
-
136
 ### 3.1.4. Minor languages: Arabic, German, French, Spanish
137
  | Lang | ar | | de | | fr | | es | |
138
  |----|----|----|----|----|----|----|----|----|
@@ -143,14 +138,11 @@
143
  |Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
144
  |Orion 8x7B | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |
145
 
146
-
147
 ### 3.1.5. Leakage detection results
148
 When the pre-training data of a large language model contains content from a specific dataset, the model's performance on that dataset may be artificially inflated, leading to inaccurate performance evaluations. To address this, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions proposed a simple and effective data leakage detection method. The method exploits the interchangeability of multiple-choice options: the options in the original dataset are shuffled to generate derived data, and the model is then used to compute the log-probability distribution over the derived dataset to detect whether the original dataset was leaked.
149
 
150
- We conducted data leakage detection experiments on three benchmark datasets: MMLU, CMMLU, and C-Eval
151
-
152
- More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.
153
-
154
 Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
155
 
156
  |Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion 8x7B|Orion 14B|Mixtral 8x7B|
@@ -159,32 +151,32 @@
159
  |CEval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
160
  |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
161
 
 
 
162
 
163
- ### 3.1.6. Inference speed [Todo: remove the 14B results, add more description of the results]
164
- Based on 8x Nvidia RTX3090, in tokens per second
165
- |OrionLLM_V2.4.6.1 | 1para_out62 | 1para_out85 | 1para_out125 | 1para_out210 |
166
- |----|----|----|----|----|
167
- |OrionMOE | 33.03544296 | 33.43113606 | 33.53014102 | 33.58693529 |
168
- |Qwen32B | 26.46267188 | 26.72846906 | 26.80413838 | 27.03123611 |
169
- |Orion14B | 41.69121312 | 41.77423491 | 41.76050902 | 42.26096669 |
170
 
171
- |OrionLLM_V2.4.6.1 | 4para_out62 | 4para_out90 | 4para_out125 | 4para_out220 |
172
- |----|----|----|----|----|
173
- |OrionMOE | 29.45015743 | 30.4472947 | 31.03748516 | 31.45783599 |
174
- |Qwen32B | 23.60912215 | 24.30431956 | 24.86132023 | 25.16827535 |
175
- |Orion14B | 38.08240373 | 38.8572788 | 39.50040645 | 40.44875947 |
176
 
177
- |OrionLLM_V2.4.6.1 | 8para_out62 | 8para_out85 | 8para_out125 | 8para_out220 |
178
- |----|----|----|----|----|
179
- |OrionMOE | 25.71006327 | 27.13446743 | 28.89463226 | 29.70440167 |
180
- |Qwen32B | 21.15920951 | 21.92001035 | 23.13867947 | 23.5649106 |
181
- |Orion14B | 34.4151923 | 36.05635893 | 37.0874908 | 37.91705944 |
 
182
 
183
  <div align="center">
184
- <img src="./assets/imgs/inf_spd_zh.png" alt="inf_speed" width="100%" />
185
  </div>
186
 
187
 
 
188
 # 4. Model Inference
189
 
190
 The model weights, source code, and configuration required for inference have been published on Hugging Face; the download links are in the table at the beginning of this document. Here we demonstrate several inference methods. The program will automatically download from
@@ -211,11 +203,9 @@ response = model.chat(tokenizer, messages, streaming=False)
211
  print(response)
212
 
213
  ```
214
-
215
 In the two code snippets above, the model is loaded with `device_map='auto'`
216
 , which uses all available GPUs. To restrict the devices used, set something like `export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` (which uses GPUs 0 through 7).
217
 
218
-
219
 ## 4.2. Direct script inference
220
 
221
  ```shell
@@ -224,7 +214,21 @@ print(response)
224
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python demo/text_generation_base.py --model OrionStarAI/Orion-MOE8x7B-Base --tokenizer OrionStarAI/Orion-MOE8x7B-Base --prompt 你好,你叫什么名字
225
 
226
  ```
227
- ## 4.3. [Todo] vLLM inference code
228
 
229
 
230
  <a name="zh_declarations-license"></a><br>
 
31
 - [📖 Model Introduction](#zh_model-introduction)
32
 - [🔗 Model Download](#zh_model-download)
33
 - [🔖 Model Benchmark](#zh_model-benchmark)
34
+ - [📊 Model Inference](#zh_model-inference)
35
 - [📜 Declarations & License](#zh_declarations-license)
36
 - [🥇 Company Introduction](#zh_company-introduction)
37
 
 
48
 - Orion-MOE8x7B-Base model hyper-parameters
49
 - The Orion-MOE8x7B-Base architecture is close to that of Mixtral 8x7B; see the table below for hyper-parameter details
50
 
51
+ |Configuration |OrionMOE 8x7B|
52
  |-------------------|-------------|
53
  |Hidden Size | 4096 |
54
  |# Layers | 32 |
 
69
 - Orion-MOE8x7B-Base training data composition
70
 - By language, the pretraining data consists mainly of English, Chinese, and other languages, accounting for 50%, 25%, and 12% respectively; by category, code makes up 9% and mathematical text 4%. The distribution is shown in the figure below.
71
  <div align="center">
72
+ <img src="./assets/imgs/data_src_dist.png" alt="data source distribution" width="70%" />
73
  </div>
74
 
75
 
 
78
 
79
 Model release and download links are provided in the table below:
80
 
81
+ | Model Name | HuggingFace Download Links | ModelScope Download Links |
82
+ |---------|-------------------|-------------------|
83
+ | ⚾ Base Model | [Orion-MOE8x7B-Base](https://huggingface.co/OrionStarAI/Orion-MOE8x7B-Base) | [Orion-MOE8x7B-Base](https://modelscope.cn/models/OrionStarAI/Orion-MOE8x7B-Base/summary) |
 
84
 
85
 
86
  <a name="zh_model-benchmark"></a><br>
 
89
 ## 3.1. Base Model Orion-MOE8x7B-Base Evaluation
90
 
91
 ### 3.1.1. Base model benchmark comparison
92
+ |TestSet|Mixtral 8x7B|Qwen1.5-32b|Qwen2.5-32b|Orion 14B|Orion 8x7B|
93
  | ----------- | ----- | ----- | ----- | ----- | ----- |
94
  |CEval | 54.09 | 83.50 | 87.74 | 72.80 | 89.74 |
95
  |CMMLU | 53.21 | 82.30 | 89.01 | 70.57 | 89.16 |
 
110
  |GSM8K | 47.50 | 77.40 | 80.36 | 52.01 | 59.82 |
111
  |MATH | 28.40 | 36.10 | 48.88 | 7.84 | 23.68 |
112
 
 
 
113
 ### 3.1.2. Minor languages: Japanese
114
  | Model | JSQuAD | JCommonSenseQA | JNLI | MARC-ja | JAQKET v2 | PAWS-ja | avg |
115
  |--------------|-------|-------|-------|-------|-------|-------|-------|
 
119
  |Orion-14B-Base| 74.22 | 88.20 | 72.85 | 94.06 | 66.20 | 49.90 | 74.24 |
120
  |Orion 8x7B | 91.77 | 90.43 | 90.46 | 96.40 | 81.19 | 47.35 | 82.93 |
121
 
 
122
 ### 3.1.3. Minor languages: Korean
123
  |Model | HAE-RAE | KoBEST BoolQ | KoBEST COPA | KoBEST HellaSwag | KoBEST SentiNeg | KoBEST WiC | PAWS-ko | avg |
124
  |--------------|-------|-------|-------|-------|-------|-------|-------|-------|
 
128
  |Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
129
  |Orion 8x7B | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |
130
 
 
 
131
  ### 3.1.4. 小语种: 阿拉伯语,德语,法语,西班牙语
132
  | Lang | ar | | de | | fr | | es | |
133
  |----|----|----|----|----|----|----|----|----|
 
138
  |Orion-14B-Base| 69.66 | 80.63 | 77.10 | 58.20 | 92.44 | 51.19 | 44.55 | 67.68 |
139
  |Orion 8x7B | 65.17 | 85.40 | 80.40 | 56.00 | 96.98 | 73.57 | 46.35 | 71.98 |
140
 
 
141
 ### 3.1.5. Leakage detection results
142
 When the pre-training data of a large language model contains content from a specific dataset, the model's performance on that dataset may be artificially inflated, leading to inaccurate performance evaluations. To address this, researchers from the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and other institutions proposed a simple and effective data leakage detection method. The method exploits the interchangeability of multiple-choice options: the options in the original dataset are shuffled to generate derived data, and the model is then used to compute the log-probability distribution over the derived dataset to detect whether the original dataset was leaked.
143
 
144
+ We conducted data leakage detection experiments on three benchmark datasets: MMLU, CMMLU, and C-Eval.<br>
145
+ More details can be found in the paper: https://web3.arxiv.org/pdf/2409.01790.<br>
 
 
146
 Test code: https://github.com/nishiwen1214/Benchmark-leakage-detection.
147
 
148
  |Threshold 0.2|Qwen2.5 32B|Qwen1.5 32B|Orion 8x7B|Orion 14B|Mixtral 8x7B|
 
151
  |CEval | 0.39 | 0.38 | 0.27 | 0.26 | 0.26 |
152
  |CMMLU | 0.38 | 0.39 | 0.23 | 0.27 | 0.22 |
153
 
154
+ ### 3.1.6. Inference speed
155
+ We set up an inference server on 8x Nvidia RTX3090 GPUs and measured client-side throughput in tokens per second.
156
 
157
+ |OrionLLM_V2.4.6.1|1para_out62|1para_out85|1para_out125|1para_out210|
158
+ |---------|-------|-------|-------|-------|
159
+ |OrionMOE | 33.04 | 33.43 | 33.53 | 33.59 |
160
+ |Qwen32B | 26.46 | 26.73 | 26.80 | 27.03 |
 
 
 
161
 
162
+ |OrionLLM_V2.4.6.1|4para_out62|4para_out90|4para_out125|4para_out220|
163
+ |---------|-------|-------|-------|-------|
164
+ |OrionMOE | 29.45 | 30.45 | 31.04 | 31.46 |
165
+ |Qwen32 | 23.61 | 24.30 | 24.86 | 25.17 |
 
166
 
167
+ |OrionLLM_V2.4.6.1|8para_out62|8para_out85|8para_out125|8para_out220|
168
+ |---------|-------|-------|-------|-------|
169
+ |OrionMOE | 25.71 | 27.13 | 28.89 | 29.70 |
170
+ |Qwen32 | 21.16 | 21.92 | 23.14 | 23.56 |
171
+
172
+ We found that inference speed varies with the number of concurrent requests and the model's output length. To facilitate horizontal comparison, we ran multiple groups of tests, each labeled \<n>para_out\<m>, where n is the number of concurrent client requests and m is the average number of output tokens per request. For example, 4para_out220 denotes the inference speed with 4 concurrent client requests and an average output of 220 tokens.
173
 
174
  <div align="center">
175
+ <img src="./assets/imgs/inf_spd.png" alt="inf_speed" width="100%" />
176
  </div>
177
 
178
 
179
+ <a name="zh_model-inference"></a><br>
180
 # 4. Model Inference
181
 
182
 The model weights, source code, and configuration required for inference have been published on Hugging Face; the download links are in the table at the beginning of this document. Here we demonstrate several inference methods. The program will automatically download from
 
203
  print(response)
204
 
205
  ```
 
206
 In the two code snippets above, the model is loaded with `device_map='auto'`
207
 , which uses all available GPUs. To restrict the devices used, set something like `export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` (which uses GPUs 0 through 7).
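For illustration, a minimal loading sketch with `device_map='auto'` might look like the following; the tokenizer and generation calls are assumptions for the example and may differ from the chat-style helpers used in the snippets above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "OrionStarAI/Orion-MOE8x7B-Base"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",            # shard the MoE weights across all visible GPUs
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```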
208
 
 
209
 ## 4.2. Direct script inference
210
 
211
  ```shell
 
214
  CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python demo/text_generation_base.py --model OrionStarAI/Orion-MOE8x7B-Base --tokenizer OrionStarAI/Orion-MOE8x7B-Base --prompt 你好,你叫什么名字
215
 
216
  ```
217
+ ## 4.3. vLLM Inference Service
218
+ Download the project (https://github.com/OrionStarAI/vllm_server) and follow its instructions to build the vLLM-based inference service Docker image.
219
+ ```shell
220
+ git clone git@github.com:OrionStarAI/vllm_server.git
221
+ cd vllm_server
222
+ docker build -t vllm_server:0.0.0.0 -f Dockerfile .
223
+ ```
224
+ Start the Docker service
225
+ ```shell
226
+ docker run --gpus all -it -p 9999:9999 -v $(pwd)/logs:/workspace/logs:rw -v $HOME/Downloads:/workspace/models -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -e MODEL_DIR=Orion-MOE8x7B-Base -e MODEL_NAME=orion-moe vllm_server:0.0.0.0
227
+ ```
228
+ Run inference
229
+ ```shell
230
+ curl http://0.0.0.0:9999/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "orion-moe","temperature": 0.2,"stream": false, "messages": [{"role": "user", "content":"Which company developed you as an AI agent?"}]}'
231
+ ```
232
 
233
 
234
  <a name="zh_declarations-license"></a><br>
assets/imgs/inf_spd.png ADDED
assets/imgs/inf_spd_en.png DELETED
Binary file (140 kB)
 
assets/imgs/inf_spd_zh.png DELETED
Binary file (56.7 kB)