# Model Performance

Our int4wo checkpoint is only optimized for batch size 1, so we benchmark the batch size 1 performance with vLLM (results at larger batch sizes are included below for comparison).

## Results (A100 machine)

| Benchmark                  | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
|----------------------------|----------------|---------------------------|
| latency (batch_size=1)     | 2.46s          | 2.2s (12% speedup)        |
| latency (batch_size=128)   | 6.55s          | 17s (60% slowdown)        |
| serving (num_prompts=1)    | 0.87 req/s     | 1.05 req/s (20% speedup)  |
| serving (num_prompts=1000) | 24.15 req/s    | 5.64 req/s (77% slowdown) |

Note that latency results (from `benchmark_latency`) are in seconds, and serving results (from `benchmark_serving`) are in requests per second.
Int4 weight-only quantization is optimized for batch size 1 and short input/output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
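
For reference, here is a minimal sketch of how numbers like these can be collected with vLLM's bundled benchmark scripts (`benchmarks/benchmark_latency.py` and `benchmarks/benchmark_serving.py` in the vLLM source tree, installed in the next section). The input/output lengths and the `random` dataset below are illustrative assumptions, not necessarily the settings used for the table above; check each script's `--help` for your vLLM version.

```
# Latency (seconds per batch), batch size 1; input/output lengths are assumptions
python benchmarks/benchmark_latency.py \
  --model jerryzh168/phi4-mini-int4wo-hqq \
  --batch-size 1 --input-len 256 --output-len 256

# Serving (requests per second): start an OpenAI-compatible server first ...
vllm serve jerryzh168/phi4-mini-int4wo-hqq &

# ... then drive it with the serving benchmark; the random dataset avoids
# downloading ShareGPT (an assumption, not necessarily what produced the table)
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model jerryzh168/phi4-mini-int4wo-hqq \
  --dataset-name random \
  --num-prompts 1000
```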

## Download vllm source code and install vllm
```