Update README.md
README.md
CHANGED
@@ -140,16 +140,65 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
| **Overall** | **TODO** | **TODO** |


-#
-
-
-Need to install vllm nightly to get some recent changes
-```
-pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
-```
+# Peak Memory Usage
+
+We can use the following code to get a sense of peak memory usage during inference:
+
+## Results
+
+| Benchmark   | Phi-4 mini-Ins | Phi-4-mini-instruct-int4wo-hqq |
+|-------------|----------------|--------------------------------|
+| Peak Memory | 8.91 GB        | 2.98 GB                        |
+
+## Benchmark Peak Memory
+
+```
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
+
+# use "microsoft/Phi-4-mini-instruct" or "pytorch/Phi-4-mini-instruct-int4wo-hqq"
+model_id = "microsoft/Phi-4-mini-instruct"
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+torch.cuda.reset_peak_memory_stats()
+
+prompt = "Hey, are you conscious? Can you talk to me?"
+messages = [
+    {
+        "role": "system",
+        "content": "",
+    },
+    {"role": "user", "content": prompt},
+]
+templated_prompt = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+print("Prompt:", prompt)
+print("Templated prompt:", templated_prompt)
+inputs = tokenizer(
+    templated_prompt,
+    return_tensors="pt",
+).to("cuda")
+generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
+output_text = tokenizer.batch_decode(
+    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print("Response:", output_text[0][len(prompt):])
+
+mem = torch.cuda.max_memory_reserved() / 1e9
+print(f"Peak Memory Usage: {mem:.02f} GB")
+```
+
+# Model Performance
+
+Our int4wo quantization is only optimized for batch size 1, so larger batch sizes will see a slowdown. We expect this model to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.
+
## Results (A100 machine)
| Benchmark (Latency) | | |
@@ -163,14 +212,10 @@ pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Note that the latency results (benchmark_latency) are in seconds and the serving results (benchmark_serving) are in requests per second (example invocations are sketched below, after the diff).
Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.

-
-
-
-
-| latency (batch_size=1) | 2.46s | 2.2s (12% speedup) |
-| latency (batch_size=128) | 6.55s | 17s (60% slowdown) |
-| serving (num_prompts=1) | 0.87 req/s | 1.05 req/s (20% speedup) |
-| serving (num_prompts=1000) | 24.15 req/s | 5.64 req/s (77% slowdown)|
+Need to install the vLLM nightly build to pick up some recent changes:
+```
+pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
+```
## Download dataset
Download the ShareGPT dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
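For reference, the latency and serving numbers mentioned in the notes above typically come from vLLM's benchmark scripts. The commands below are a hedged sketch rather than part of this commit: the script paths (`benchmarks/benchmark_latency.py`, `benchmarks/benchmark_serving.py`), the flags, and the chosen input/output lengths are assumptions based on a vLLM source checkout and may differ across vLLM versions.

```
# Assumption: run from a vLLM source checkout; the benchmarks/ scripts are not shipped in the pip wheel.

# Latency benchmark (reported in seconds), batch size 1 as discussed above;
# input/output lengths are illustrative choices.
python benchmarks/benchmark_latency.py \
    --model pytorch/Phi-4-mini-instruct-int4wo-hqq \
    --batch-size 1 \
    --input-len 256 \
    --output-len 256

# Serving benchmark (reported in requests per second).
# Terminal 1: start an OpenAI-compatible server for the model.
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq

# Terminal 2: drive the server with the ShareGPT dataset downloaded above.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model pytorch/Phi-4-mini-instruct-int4wo-hqq \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1
```

Raising `--num-prompts` (for example to 1000) pushes the server toward larger effective batch sizes, where the int4 weight-only checkpoint is expected to be slower, per the note above.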