jerryzh168 committed
Commit 3737ba9 · verified · 1 Parent(s): 0e7b2e9

Update README.md

Files changed (1): README.md (+11 -11)
README.md CHANGED
@@ -65,17 +65,17 @@ print(f"{save_to} model:", benchmark_fn(quantized_model.generate, **inputs, max_
  # Model Quality
  We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

- # Installing the nightly version to get most recent updates
+ ## Installing the nightly version to get most recent updates
  ```
  pip install git+https://github.com/EleutherAI/lm-evaluation-harness
  ```

- # baseline
+ ## baseline
  ```
  lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
  ```

- # int4wo-hqq
+ ## int4wo-hqq
  ```
  lm_eval --model hf --model_args pretrained=jerryzh168/phi4-mini-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
  ```
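
The same hellaswag comparison can also be driven from Python rather than the CLI. A minimal sketch, not part of the model card, assuming a recent lm-evaluation-harness release that exposes `simple_evaluate` at the package root:

```python
# Sketch: reproduce the two lm_eval CLI runs above via the Python API.
# Assumes lm-evaluation-harness is installed as shown (nightly from GitHub).
import lm_eval

for pretrained in (
    "microsoft/Phi-4-mini-instruct",     # baseline
    "jerryzh168/phi4-mini-int4wo-hqq",   # int4 weight-only (HQQ) checkpoint
):
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={pretrained}",
        tasks=["hellaswag"],
        device="cuda:0",
        batch_size=8,
    )
    # results["results"]["hellaswag"] contains the acc / acc_norm metrics
    print(pretrained, results["results"]["hellaswag"])
```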
@@ -98,38 +98,38 @@ lm_eval --model hf --model_args pretrained=jerryzh168/phi4-mini-int4wo-hqq --tas
  Our int4wo is only optimized for batch size 1, so we'll only benchmark the batch size 1 performance with vllm.
  For batch size N, please see our [gemlite checkpoint](https://huggingface.co/jerryzh168/phi4-mini-int4wo-gemlite).

- # Download vllm source code and install vllm
+ ## Download vllm source code and install vllm
  ```
  git clone git@github.com:vllm-project/vllm.git
  VLLM_USE_PRECOMPILED=1 pip install .
  ```

- # Download dataset
+ ## Download dataset
  Download sharegpt dataset: `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json`

  Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
- # benchmark_latency
+ ## benchmark_latency

  Run the following under `vllm` source code root folder:

- ## baseline
+ ### baseline
  ```
  python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
  ```

- ## int4wo-hqq
+ ### int4wo-hqq
  ```
  python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model jerryzh168/phi4-mini-int4wo-hqq --batch-size 1
  ```

- # benchmark_serving
+ ## benchmark_serving

  We also benchmarked the throughput in a serving environment.


  Run the following under `vllm` source code root folder:

- ## baseline
+ ### baseline
  Server:
  ```
  vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
@@ -140,7 +140,7 @@ Client:
  python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
  ```

- ## int4wo-hqq
+ ### int4wo-hqq
  Server:
  ```
  vllm serve jerryzh168/phi4-mini-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
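
Once either `vllm serve` command above is running, the endpoint can be smoke-tested before launching the benchmark client. A minimal sketch, not part of the model card, assuming the server's default OpenAI-compatible address `localhost:8000` and that the `requests` package is available:

```python
# Sketch: send one chat completion request to a locally running `vllm serve`.
# The served model name defaults to the path passed to `vllm serve`; swap in
# "microsoft/Phi-4-mini-instruct" when testing the baseline server.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "jerryzh168/phi4-mini-int4wo-hqq",
        "messages": [{"role": "user", "content": "Give a one-sentence summary of int4 weight-only quantization."}],
        "max_tokens": 128,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```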