SocialLocalMobile committed
Commit 15cca67 · verified · 1 parent: c285ed8

Update README.md

Files changed (1): README.md (+47 -1)
README.md CHANGED
@@ -119,7 +119,53 @@ TODO
 TODO

 # 6. Model Performance
- TODO
+
+ ## Results (H100 machine)
+
+ | Benchmark                  | Qwen3-32B | Qwen3-32B-float8dq |
+ |----------------------------|-----------|--------------------|
+ | latency (batch_size=1)     | 9.1s      | TODO               |
+ | latency (batch_size=128)   | 12.45s    | TODO               |
+ | serving (num_prompts=1)    | TODO      | TODO               |
+ | serving (num_prompts=1000) | TODO      | TODO               |
+
+ <details>
+ <summary> Reproduce latency and serving benchmarks </summary>
+
+ **1. Setup**
+ ```Shell
+ git clone git@github.com:vllm-project/vllm.git
+ cd vllm
+ VLLM_USE_PRECOMPILED=1 pip install --editable .
+ ```
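+
+ Optionally, confirm the install before benchmarking (a minimal sanity check, not part of the original instructions):
+ ```Shell
+ # Should print the installed vLLM version; an ImportError means the editable install failed.
+ python -c "import vllm; print(vllm.__version__)"
+ ```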
+
+ **2. Latency benchmarking**
+ ```Shell
+ export MODEL=Qwen/Qwen3-32B # or pytorch/Qwen3-32B-float8dq
+ VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
+ ```
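+
+ The `batch_size=128` row in the table above uses the same command with a larger batch (a minimal variant; only the `--batch-size` value changes):
+ ```Shell
+ # Same benchmark as above, matching the latency (batch_size=128) table row.
+ VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 128
+ ```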
+
+ **3. Serving benchmarking**
+
+ Setup:
+ ```Shell
+ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
+ ```
+
+ Server:
+ ```Shell
+ export MODEL=Qwen/Qwen3-32B # or pytorch/Qwen3-32B-float8dq
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer Qwen/Qwen3-32B -O3
+ ```
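+
+ Once the server reports it is ready, a quick check that it is serving (an optional sanity check, assuming vLLM's default port 8000):
+ ```Shell
+ # Lists the models the OpenAI-compatible server is currently serving.
+ curl http://localhost:8000/v1/models
+ ```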
+
+ Client:
+ ```Shell
+ export MODEL=Qwen/Qwen3-32B # or pytorch/Qwen3-32B-float8dq
+ python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer Qwen/Qwen3-32B --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
+ ```
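+
+ The `num_prompts=1000` row in the table above uses the same client command with a larger prompt count (a minimal variant; only `--num-prompts` changes):
+ ```Shell
+ # Same client benchmark, matching the serving (num_prompts=1000) table row.
+ python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer Qwen/Qwen3-32B --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1000
+ ```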
+
+ </details>

 # 7. Disclaimer
 PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.