SocialLocalMobile committed (verified)
Commit ebfd887 · Parent: fd13ccd

Update README.md

Files changed (1): README.md (+66, -2)
README.md CHANGED
@@ -13,7 +13,7 @@ base_model:
pipeline_tag: text-generation
---

- [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team. Use it directly, or serve using [vLLM](https://docs.vllm.ai/en/latest/) with TODO VRAM reduction, TODO speedup and little to no accuracy impact on H100.
+ [Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per-row granularity), by the PyTorch team. Use it directly, or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for a 47% VRAM reduction, a 32%-36% speedup, and little to no accuracy impact on H100.

# Inference with vLLM
```Shell
@@ -113,7 +113,71 @@ tokenizer.push_to_hub(save_to)
TODO

# Peak Memory Usage
- TODO
+ 
+ |             | Qwen3-32B | Qwen3-32B-float8dq |
+ |-------------|-----------|--------------------|
+ | Peak Memory | 65.72 GB  | 34.54 GB (-47.44%) |
+ 
+ <details>
+ <summary> Reproduce peak memory usage </summary>
+ 
+ Code
+ ```Py
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ model_name = "Qwen/Qwen3-32B" # pytorch/Qwen3-32B-float8dq
+ 
+ # load the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ 
+ torch.cuda.reset_peak_memory_stats()
+ 
+ # prepare the model input
+ prompt = "Give me a short introduction to large language model."
+ messages = [
+     {"role": "user", "content": prompt}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
+ )
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+ 
+ # conduct text completion
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=32768
+ )
+ output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+ 
+ # parsing thinking content
+ try:
+     # rindex finding 151668 (</think>)
+     index = len(output_ids) - output_ids[::-1].index(151668)
+ except ValueError:
+     index = 0
+ 
+ thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+ content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
+ 
+ print("thinking content:", thinking_content)
+ print("content:", content)
+ 
+ mem = torch.cuda.max_memory_reserved() / 1e9
+ print(f"Peak Memory Usage: {mem:.02f} GB")
+ ```
+ </details>
+ 

  # Model Performance
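
The recipe named in the updated intro line, float8 dynamic activation plus float8 weight quantization with per-row granularity, corresponds to torchao's `Float8DynamicActivationFloat8WeightConfig`. The sketch below shows roughly how such a checkpoint can be produced and pushed through transformers' `TorchAoConfig`; it is a minimal sketch assuming a recent transformers and torchao, the `save_to` repo name is illustrative, and the authoritative version lives in the README's quantization section (only its `tokenizer.push_to_hub(save_to)` context line is visible in this diff).

```Py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

model_id = "Qwen/Qwen3-32B"
save_to = "pytorch/Qwen3-32B-float8dq"  # illustrative target repo name

# float8 dynamic activation + float8 weight quantization, per-row granularity
quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)

# quantize the linear layers while loading the bf16 checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# push the quantized checkpoint and tokenizer, as in the README's push_to_hub step
model.push_to_hub(save_to)
tokenizer.push_to_hub(save_to)
```

The pushed checkpoint can then be loaded like any transformers model, as in the peak-memory reproduction code above, or served with vLLM (for example `vllm serve pytorch/Qwen3-32B-float8dq`).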