jayr014 committed
Commit 39abe33 · 1 Parent(s): d8b76bf

adding in NO and YES sampling

Files changed (1)
  1. README.md +14 -4
README.md CHANGED
@@ -169,14 +169,24 @@ index fc903d5..5450236 100644
 
 ```
 
-Running command for bf16
+Running command for bf16, NO sampling
 ```
-python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "max_new_tokens": 512}'
 ```
-Running command for int8 (sub optimal performance, but fast inference time):
+Running command for bf16, YES sampling
 ```
-python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": true, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
 ```
+---
+Running command for int8 (sub optimal performance, but fast inference time) NO sampling:
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "max_new_tokens": 512}'
+```
+Running command for int8 (sub optimal performance, but fast inference time) YES sampling:
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": true, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+```
+
 **DISCLAIMER:** When using int8, the results will be subpar compared to bf16 as the model is being [quantized](https://huggingface.co/blog/hf-bitsandbytes-integration#introduction-to-model-quantization).
 
 ### Suggested Inference Parameters
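
For readers running generation outside the inference server, here is a minimal, illustrative sketch (not part of this commit) of how the two `generate_kwargs` variants map onto `transformers`' `model.generate()`. The prompt template, loading options, and variable names below are assumptions for illustration only.

```python
# Illustrative sketch only; the chat prompt and loading options are assumptions,
# not taken from this diff.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/BLOOMChat-176B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Loading a 176B model needs substantial GPU memory; device_map="auto" requires
# the accelerate package.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

# Hypothetical prompt; the template is an assumption, not from this commit.
inputs = tokenizer(
    "<human>: What is quantization?\n<bot>:", return_tensors="pt"
).to(model.device)

# "NO sampling": greedy decoding, matching '{"do_sample": false, "max_new_tokens": 512}'.
# Sampling-only knobs (temperature, top_p) are omitted because greedy decoding ignores them.
greedy_ids = model.generate(**inputs, do_sample=False, max_new_tokens=512)

# "YES sampling": stochastic decoding, matching the do_sample=true kwargs in the diff.
sampled_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.9,
    max_new_tokens=512,
)

print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))
```

Greedy decoding is deterministic, which is why the NO-sampling commands drop temperature/top_p; the YES-sampling commands keep the parameters listed under Suggested Inference Parameters.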