SocialLocalMobile committed · verified
Commit fd13ccd · Parent(s): 6642812

Update README.md

Files changed (1): README.md (+6 -32)
README.md CHANGED
@@ -15,7 +15,7 @@ pipeline_tag: text-generation

[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized by the PyTorch team with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per-row granularity). Use it directly, or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for TODO VRAM reduction and TODO speedup, with little to no accuracy impact, on H100.

- # 1. Inference with vLLM
+ # Inference with vLLM
```Shell
# Server
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve SocialLocalMobile/Qwen3-32B-float8dq --tokenizer Qwen/Qwen3-32B -O3
@@ -36,10 +36,7 @@ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/jso
```
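
The client half of this example is the truncated `curl` request in the hunk context above. The same request can be sent with the OpenAI Python SDK — a minimal sketch, assuming the server started above is listening on the default port 8000 and the `openai` package is installed; the prompt text is illustrative:

```Python
# Query the vLLM OpenAI-compatible endpoint started by `vllm serve` above.
# vLLM does not check the API key by default, so any placeholder value works.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="SocialLocalMobile/Qwen3-32B-float8dq",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```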


- # 2. Inference with Transformers
- TODO
-
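
(The Transformers section removed above held only a TODO. For completeness, direct inference would look roughly like the sketch below — untested here, and it assumes the checkpoint deserializes through the torchao quantization config embedded in the repo:)

```Python
# Sketch: load the float8-quantized checkpoint directly with transformers.
# Prompt and generation settings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SocialLocalMobile/Qwen3-32B-float8dq"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Give me a short introduction to large language models.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```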
- # 3. Quantization Recipe
+ # Quantization Recipe

Install the required packages:

@@ -112,23 +109,20 @@ quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```
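
The body of the recipe is elided between these hunks; only the final `push_to_hub` calls are visible above. For orientation, a float8 dynamic-activation/float8-weight recipe with per-row granularity, as described in the card intro, typically looks like the sketch below. The config names follow current torchao/transformers APIs and may differ from the exact code in the README:

```Python
# Sketch: quantize Qwen3-32B with torchao float8 dynamic activation +
# float8 weight quantization at per-row granularity, then push to the Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

model_id = "Qwen/Qwen3-32B"
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=TorchAoConfig(quant_type=quant_config),
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

save_to = "SocialLocalMobile/Qwen3-32B-float8dq"  # target repo served above
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```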

- # 4. Model Quality
+ # Model Quality
TODO

- # 5. Peak Memory Usage
+ # Peak Memory Usage
TODO

- # 6. Model Performance
+ # Model Performance

- ## Results (H100 machine)

| Benchmark | Qwen3-32B | Qwen3-32B-float8dq |
|----------------------------------|----------------|-------------------------------|
| latency (batch_size=1) | 9.1s | 5.77s (-36.6%) |
| latency (batch_size=128) | 12.45s | 8.40s (-32.5%) |
- | serving (num_prompts=1) | TODO | TODO |
- | serving (num_prompts=1000) | TODO | TODO |
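
The percentage deltas follow directly from the raw latencies; a quick arithmetic check:

```Python
# Verify the speedup percentages reported in the table above.
for name, baseline_s, quantized_s in [
    ("batch_size=1", 9.10, 5.77),
    ("batch_size=128", 12.45, 8.40),
]:
    delta = (quantized_s - baseline_s) / baseline_s * 100
    print(f"latency ({name}): {delta:+.1f}%")  # -36.6% and -32.5%
```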
 
<details>
<summary> Reproduce latency benchmarks </summary>
@@ -145,29 +139,9 @@ VLLM_USE_PRECOMPILED=1 pip install --editable .
export MODEL=Qwen/Qwen3-32B # or pytorch/Qwen3-32B-float8dq
VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
```
-
- **3. Serving benchmarking**
-
- Setup:
- ```Shell
- wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
- ```
-
- Server:
- ```Shell
- export MODEL=Qwen/Qwen3-32B # or pytorch/Qwen3-32B-float8dq
- VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer Qwen/Qwen3-32B -O3
- ```
-
- Client:
- ```Shell
- export MODEL=Qwen/Qwen3-32B # or pytorch/Qwen3-32B-float8dq
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer Qwen/Qwen3-32B --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
- ```
-
</details>

- # 7. Disclaimer
+ # Disclaimer
PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.
 