jerryzh168 committed (verified) · Commit 4e1c7a0 · Parent: a9f7231

Update README.md

Files changed (1): README.md +12 -1
README.md CHANGED
@@ -127,7 +127,17 @@ lm_eval --model hf --model_args pretrained=jerryzh168/phi4-mini-int4wo-hqq --tas
 # Model Performance
 
 Our int4wo is only optimized for batch size 1, so we'll only benchmark the batch size 1 performance with vllm.
-For batch size N, please see our [gemlite checkpoint](https://huggingface.co/jerryzh168/phi4-mini-int4wo-gemlite).
+
+## Results (A100 machine)
+| Benchmark                  | Phi-4 mini-Ins | phi4-mini-int4wo-hqq      |
+|----------------------------|----------------|---------------------------|
+| latency (batch_size=1)     | 2.46s          | 2.2s (12% speedup)        |
+| latency (batch_size=128)   | 6.55s          | 17s (60% slowdown)        |
+| serving (num_prompts=1)    | 0.87 req/s     | 1.05 req/s (20% speedup)  |
+| serving (num_prompts=1000) | 24.15 req/s    | 5.64 req/s (77% slowdown) |
+
+Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
+Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
 
 ## Download vllm source code and install vllm
 ```
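
For context, the note in the added text refers to the benchmark scripts that ship in the vllm source tree (`benchmarks/benchmark_latency.py` and `benchmarks/benchmark_serving.py`). Below is a minimal sketch of how such numbers are typically gathered, assuming a recent vllm checkout; flag names vary across vllm versions, and the input/output lengths shown are illustrative assumptions, not the settings behind the table.

```
# Latency (seconds per batch): run from the vllm repo root.
python benchmarks/benchmark_latency.py \
  --model jerryzh168/phi4-mini-int4wo-hqq \
  --batch-size 1 \
  --input-len 256 --output-len 256    # illustrative lengths (assumption)

# Serving (requests per second): start a server, then replay prompts against it.
vllm serve jerryzh168/phi4-mini-int4wo-hqq &
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model jerryzh168/phi4-mini-int4wo-hqq \
  --num-prompts 1000                  # compare with --num-prompts 1
# Depending on the vllm version, benchmark_serving.py may also require
# --dataset-name/--dataset-path to select the prompt set.
```

Rerunning the latency script with `--batch-size 128` corresponds to the second table row; serving throughput is sensitive to prompt lengths and warm-up, so expect some variance from the figures above.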