Update README.md
README.md
CHANGED
@@ -140,16 +140,65 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
| **Overall** | **TODO** | **TODO** |


-#
-
-
-Need to install vllm nightly to get some recent changes
-```
-pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-```
+# Peak Memory Usage
+
+We can use the following code to get a sense of peak memory usage during inference:
+
+## Results
+
+| Benchmark   | Phi-4 mini-Ins | Phi-4-mini-instruct-int4wo-hqq |
+|-------------|----------------|--------------------------------|
+| Peak Memory | 8.91 GB        | 2.98 GB                        |
+
+## Benchmark Peak Memory
+
+```
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+
+# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-int4wo-hqq"
+model_id = "microsoft/Phi-4-mini-instruct"
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+torch.cuda.reset_peak_memory_stats()
+
+prompt = "Hey, are you conscious? Can you talk to me?"
+messages = [
+    {
+        "role": "system",
+        "content": "",
+    },
+    {"role": "user", "content": prompt},
+]
+templated_prompt = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+print("Prompt:", prompt)
+print("Templated prompt:", templated_prompt)
+inputs = tokenizer(
+    templated_prompt,
+    return_tensors="pt",
+).to("cuda")
+generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
+output_text = tokenizer.batch_decode(
+    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print("Response:", output_text[0][len(prompt):])
+
+mem = torch.cuda.max_memory_reserved() / 1e9
+print(f"Peak Memory Usage: {mem:.02f} GB")
+```
+
+# Model Performance
+
+Our int4wo quantization is only optimized for batch size 1, so larger batch sizes will see a slowdown. We expect this model to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.
+
## Results (A100 machine)
| Benchmark (Latency) | | |
@@ -163,14 +212,10 @@ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Note that the latency results (benchmark_latency) are in seconds and the serving results (benchmark_serving) are in requests per second (example invocations are sketched below, after the diff).
Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.

-
-
-
-
-| latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
-| latency (batch_size=128) | 6.55s | 17s (60% slowdown) |
-| serving (num_prompts=1) | 0.87 req/s | 1.05 req/s (20% speedup) |
-| serving (num_prompts=1000) | 24.15 req/s | 5.64 req/s (77% slowdown)|
+Need to install the vLLM nightly build to pick up some recent changes:
+```
+pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+```
## Download dataset
Download the ShareGPT dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
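For reference, the latency and serving numbers mentioned in the notes above typically come from vLLM's benchmark scripts. The commands below are a hedged sketch rather than part of this commit: the script paths (`benchmarks/benchmark_latency.py`, `benchmarks/benchmark_serving.py`), the flags, and the chosen input/output lengths are assumptions based on a vLLM source checkout and may differ across vLLM versions.

```
# Assumption: run from a vLLM source checkout; the benchmarks/ scripts are not shipped in the pip wheel.

# Latency benchmark (reported in seconds), batch size 1 as discussed above;
# input/output lengths are illustrative choices.
python benchmarks/benchmark_latency.py \
    --model pytorch/Phi-4-mini-instruct-int4wo-hqq \
    --batch-size 1 \
    --input-len 256 \
    --output-len 256

# Serving benchmark (reported in requests per second).
# Terminal 1: start an OpenAI-compatible server for the model.
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq

# Terminal 2: drive the server with the ShareGPT dataset downloaded above.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model pytorch/Phi-4-mini-instruct-int4wo-hqq \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1
```

Raising `--num-prompts` (for example to 1000) pushes the server toward larger effective batch sizes, where the int4 weight-only checkpoint is expected to be slower, per the note above.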