jerryzh168 committed
Commit 070762d · verified · 1 Parent(s): 04e792e

Update README.md

Files changed (1)
  1. README.md +58 -13
README.md CHANGED
@@ -140,16 +140,65 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
  | **Overall** | **TODO** | **TODO** |


- # Model Performance
-
- Our int4wo (int4 weight-only) quantization is only optimized for batch size 1, so larger batch sizes will see a slowdown. We expect this model to be used for local server deployment with a single user or a few users,
- where decode tokens per second matters more than time to first token.

- You need to install the vLLM nightly build to pick up some recent changes:
  ```
- pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
  ```


  ## Results (A100 machine)
  | Benchmark (Latency) | | |
@@ -163,14 +212,10 @@ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
  Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
  Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.

-
- | Benchmark (Memory, TODO) | | |
- |----------------------------------|----------------|--------------------------|
- | | Phi-4 mini-Ins | phi4-mini-int4wo-hqq |
- | latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
- | latency (batch_size=128) | 6.55s | 17s (60% slowdown) |
- | serving (num_prompts=1) | 0.87 req/s | 1.05 req/s (20% speedup) |
- | serving (num_prompts=1000) | 24.15 req/s | 5.64 req/s (77% slowdown)|

  ## Download dataset
  Download the ShareGPT dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
 
  | **Overall** | **TODO** | **TODO** |


+ # Peak Memory Usage
+
+ We can use the following code to get a sense of peak memory usage during inference:
+
+ ## Results
+
+ | Benchmark | | |
+ |-----------------|----------------|--------------------------------|
+ | | Phi-4 mini-Ins | Phi-4-mini-instruct-int4wo-hqq |
+ | Peak Memory | 8.91 GB | 2.98 GB |
+
+
+ ## Benchmark Peak Memory

  ```
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+
+ # use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-int4wo-hqq"
+ model_id = "microsoft/Phi-4-mini-instruct"
+ quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ torch.cuda.reset_peak_memory_stats()
+
+ prompt = "Hey, are you conscious? Can you talk to me?"
+ messages = [
+     {
+         "role": "system",
+         "content": "",
+     },
+     {"role": "user", "content": prompt},
+ ]
+ templated_prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ print("Prompt:", prompt)
+ print("Templated prompt:", templated_prompt)
+ inputs = tokenizer(
+     templated_prompt,
+     return_tensors="pt",
+ ).to("cuda")
+ generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
+ # decode only the newly generated tokens, skipping the prompt tokens
+ output_text = tokenizer.batch_decode(
+     generated_ids[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print("Response:", output_text[0])
+
+ mem = torch.cuda.max_memory_reserved() / 1e9
+ print(f"Peak Memory Usage: {mem:.02f} GB")
  ```

+ # Model Performance
+
+ Our int4wo (int4 weight-only) quantization is only optimized for batch size 1, so larger batch sizes will see a slowdown. We expect this model to be used for local server deployment with a single user or a few users,
+ where decode tokens per second matters more than time to first token.
+
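For a concrete sense of this tradeoff, latency at different batch sizes can be compared with vLLM's `benchmarks/benchmark_latency.py` script once vLLM is installed (see the install note below). This is a sketch only: the script path and flags follow the vLLM repository and may differ between versions, and the input/output lengths are illustrative.

```
# illustrative: compare the optimized batch size 1 against a large batch (e.g. 128)
python benchmarks/benchmark_latency.py \
    --model pytorch/Phi-4-mini-instruct-int4wo-hqq \
    --input-len 256 --output-len 256 --batch-size 1

python benchmarks/benchmark_latency.py \
    --model pytorch/Phi-4-mini-instruct-int4wo-hqq \
    --input-len 256 --output-len 256 --batch-size 128
```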
 
  ## Results (A100 machine)
  | Benchmark (Latency) | | |
 
  Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
  Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.

+ You need to install the vLLM nightly build to pick up some recent changes:
+ ```
+ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+ ```

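For the serving benchmark, the model first needs to be running behind a server. A minimal sketch, assuming the `vllm serve` entry point from a recent vLLM install; depending on the checkpoint you may need extra flags (for example a tokenizer override):

```
# start an OpenAI-compatible server with the quantized checkpoint (illustrative)
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq
```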
 
  ## Download dataset
  Download the ShareGPT dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
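With the dataset downloaded and a server running (see the serving sketch above), throughput can then be measured roughly as follows with vLLM's `benchmarks/benchmark_serving.py`; the flags are assumptions based on the vLLM repository and may differ between versions.

```
# illustrative: replay ShareGPT prompts against the running server
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model pytorch/Phi-4-mini-instruct-int4wo-hqq \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1
```

Repeating the run with `--num-prompts 1000` exercises the many-request case reported in the serving results.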