# Model Performance

Our int4wo checkpoint is only optimized for batch size 1, so we benchmark the batch size 1 performance with vLLM (results at larger batch sizes are included below for comparison).

## Results (A100 machine)

| Benchmark                  | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
|----------------------------|----------------|---------------------------|
| latency (batch_size=1)     | 2.46s          | 2.2s (12% speedup)        |
| latency (batch_size=128)   | 6.55s          | 17s (60% slowdown)        |
| serving (num_prompts=1)    | 0.87 req/s     | 1.05 req/s (20% speedup)  |
| serving (num_prompts=1000) | 24.15 req/s    | 5.64 req/s (77% slowdown) |

Note that latency results (from `benchmark_latency`) are in seconds, and serving results (from `benchmark_serving`) are in requests per second.
Int4 weight-only quantization is optimized for batch size 1 and short input/output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
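
For reference, here is a minimal sketch of how numbers like these can be collected with vLLM's bundled benchmark scripts (`benchmarks/benchmark_latency.py` and `benchmarks/benchmark_serving.py` in the vLLM source tree, installed in the next section). The input/output lengths and the `random` dataset below are illustrative assumptions, not necessarily the settings used for the table above; check each script's `--help` for your vLLM version.

```
# Latency (seconds per batch), batch size 1; input/output lengths are assumptions
python benchmarks/benchmark_latency.py \
  --model jerryzh168/phi4-mini-int4wo-hqq \
  --batch-size 1 --input-len 256 --output-len 256

# Serving (requests per second): start an OpenAI-compatible server first ...
vllm serve jerryzh168/phi4-mini-int4wo-hqq &

# ... then drive it with the serving benchmark; the random dataset avoids
# downloading ShareGPT (an assumption, not necessarily what produced the table)
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model jerryzh168/phi4-mini-int4wo-hqq \
  --dataset-name random \
  --num-prompts 1000
```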

## Download vllm source code and install vllm
```