adding in NO and YES sampling
README.md CHANGED
index fc903d5..5450236 100644
@@ -169,14 +169,24 @@
 
 ```
 
-Running command for bf16
+Running command for bf16, NO sampling
 ```
-python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "max_new_tokens": 512}'
 ```
-Running command for
+Running command for bf16, YES sampling
 ```
-python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": true, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
 ```
+---
+Running command for int8 (sub-optimal performance, but fast inference time), NO sampling:
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "max_new_tokens": 512}'
+```
+Running command for int8 (sub-optimal performance, but fast inference time), YES sampling:
+```
+python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": true, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+```
+
 **DISCLAIMER:** When using int8, the results will be subpar compared to bf16 as the model is being [quantized](https://huggingface.co/blog/hf-bitsandbytes-integration#introduction-to-model-quantization).
 
 ### Suggested Inference Parameters
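
For readers comparing the two modes, the sketch below shows roughly what the NO-sampling and YES-sampling `--generate_kwargs` above correspond to when calling `generate()` directly through Hugging Face `transformers`. This is an illustrative assumption, not part of this change: the prompt text and `device_map="auto"` placement are placeholders, and the `inference_server` CLI remains the documented path.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# 176B parameters: loading this checkpoint requires multi-GPU sharding.
model_name = "sambanovasystems/BLOOMChat-176B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder prompt for illustration only.
inputs = tokenizer("A prompt for BLOOMChat goes here.", return_tensors="pt").to(model.device)

# NO sampling: greedy decoding, mirroring '{"do_sample": false, "max_new_tokens": 512}'
greedy_ids = model.generate(**inputs, do_sample=False, max_new_tokens=512)

# YES sampling: mirroring '{"do_sample": true, "temperature": 0.8,
# "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
sampled_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.2,
    top_p=0.9,
    max_new_tokens=512,
)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(sampled_ids[0], skip_special_tokens=True))
```

Greedy decoding (`do_sample=False`) is deterministic for a given prompt; the sampled variant trades that determinism for more varied completions via `temperature`, `top_p`, and `repetition_penalty`.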
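On the int8 disclaimer: below is a rough sketch of 8-bit loading with `bitsandbytes` through `transformers`, under the assumption that this is the mechanism behind `--dtype int8`; the actual loading code inside `inference_server` may differ.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed mechanism only: transformers + bitsandbytes 8-bit weight quantization.
model_8bit = AutoModelForCausalLM.from_pretrained(
    "sambanovasystems/BLOOMChat-176B-v1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # requires the bitsandbytes package
    device_map="auto",
)
# Weights are stored in int8 and dequantized on the fly, which shrinks the
# memory footprint but can cost some output quality, per the disclaimer above.
```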