Update README.md

[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight-only quantization, by the PyTorch team.

# Quantization Recipe

First, install the required packages:
```
pip install git+https://github.com/huggingface/transformers
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126
```
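
As a quick, optional check that the source-built transformers and the nightly torchao were actually picked up, the versions can be printed from Python (nothing here is specific to this model):
```
# Sanity check: confirm the freshly installed packages import and report their versions.
import torch
import torchao
import transformers

print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
```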

We used the following code to get the quantized model:
```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

# ... (middle of the recipe collapsed in this diff view) ...

quantized_model = torch.compile(quantized_model, mode="max-autotune")
print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_new_tokens=128))
```
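
The collapsed lines above hold the quantization config, the save step, and the `benchmark_fn`/`inputs` helpers. As a rough sketch only, not the exact recipe: the snippet below assumes torchao's `int4_weight_only(group_size=128, use_hqq=True)` and a recent transformers build whose `TorchAoConfig` accepts a torchao config object; the group size and save path are illustrative.
```
# Hedged sketch of an int4 weight-only (HQQ) quantized load and save.
# Assumptions: torchao nightly exposes int4_weight_only(..., use_hqq=True)
# and this transformers build lets TorchAoConfig wrap that config directly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import int4_weight_only

model_id = "microsoft/Phi-4-mini-instruct"
save_to = "Phi-4-mini-instruct-int4wo-hqq"  # illustrative output path

quantization_config = TorchAoConfig(quant_type=int4_weight_only(group_size=128, use_hqq=True))
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# torchao tensor subclasses are not supported by safetensors yet,
# so the checkpoint is saved with safe_serialization=False.
quantized_model.save_pretrained(save_to, safe_serialization=False)
tokenizer.save_pretrained(save_to)
```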

# Serving with vllm
We can use the same command used in the serving benchmarks below to serve the model with vllm:
```
vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
```
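
Once the server is up it exposes vLLM's OpenAI-compatible API, by default on port 8000. A minimal client sketch, assuming the default host/port and that the `openai` package is installed:
```
# Query the OpenAI-compatible endpoint started by `vllm serve` above.
# Assumes the default address http://localhost:8000 and `pip install openai`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-int4wo-hqq",
    messages=[{"role": "user", "content": "Explain int4 weight-only quantization in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```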

# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

lm-eval needs to be installed from source: https://github.com/EleutherAI/lm-evaluation-harness#install

## baseline
```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```
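
The same run can also be driven from Python through lm-eval's `simple_evaluate` entry point; this sketch mirrors the CLI call above (task, device, and batch size) and leaves everything else at defaults:
```
# Python counterpart of the baseline CLI evaluation above.
# Assumes lm-eval is installed from source as described.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-4-mini-instruct",
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
print(results["results"]["hellaswag"])
```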

...

| mathqa (0-shot) | | 42.75 |
| **Overall** | **TODO** | **TODO** |

# Model Performance

Our int4wo is only optimized for batch size 1, so we will see a slowdown at larger batch sizes. We expect it to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.

vllm nightly needs to be installed to pick up some recent changes:
```
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```

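For a quick local look at batch-size-1 generation speed, vLLM's offline Python API can be used directly. This is only a sketch, not the benchmark_latency/benchmark_serving scripts used for the tables below, so the numbers will not match exactly, and the tokens-per-second figure includes prefill for this short prompt:
```
# Rough offline generation timing with vLLM's Python API (illustrative only).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq", tokenizer="microsoft/Phi-4-mini-instruct")
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(["Write a short story about quantization."], params)
elapsed = time.perf_counter() - start

generated = len(outputs[0].outputs[0].token_ids)
print(f"{elapsed:.2f}s total, ~{generated / elapsed:.1f} output tokens/s (batch_size=1)")
```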
## Results (A100 machine)

| Benchmark (Latency)              | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
|----------------------------------|----------------|---------------------------|
| latency (batch_size=1)           | 2.46s          | 2.2s (12% speedup)        |
| latency (batch_size=128)         | 6.55s          | 17s (60% slowdown)        |
| serving (num_prompts=1)          | 0.87 req/s     | 1.05 req/s (20% speedup)  |
| serving (num_prompts=1000)       | 24.15 req/s    | 5.64 req/s (77% slowdown) |

Note: the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
Int4 weight only is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.

| Benchmark (Memory)               | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
|----------------------------------|----------------|---------------------------|
| **TODO**                         | **TODO**       | **TODO**                  |

## Download dataset
Download sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`
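
The downloaded file is the raw ShareGPT dump that benchmark_serving samples prompts from. A quick inspection sketch, assuming the usual ShareGPT layout of a JSON list with a `conversations` field per entry:
```
# Peek at the ShareGPT dataset used by the serving benchmark.
# Assumes each entry is a dict with a "conversations" list; adjust if the layout differs.
import json

with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

print("num entries:", len(data))
print("first entry keys:", list(data[0].keys()))
print("first turn:", data[0]["conversations"][0])
```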

...

Client:
```
python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
```