Update README.md
README.md
@@ -98,6 +98,8 @@ vllm serve scb10x/llama3.1-typhoon2-70b-instruct --tensor-parallel-size 2 --gpu-
 # using at least 2 80GB gpu eg A100, H100 for hosting 70b model
 # to serving longer context (90k), 4 gpu is required (and you can omit --enforce-eager to improve throughput)
 # see more information at https://docs.vllm.ai/
+# If you have access to two H100 GPUs, here is our serving command at opentyphoon.ai, which uses FP8 (enabling larger context lengths and faster performance). On two H100 GPUs, we achieved approximately 2000 tokens/s decoding performance, supporting around 40 concurrent requests.
+# vllm serve scb10x/llama3.1-typhoon2-70b-instruct --max-num-batched-tokens 32768 --enable-chunked-prefill --max-model-len 32768 --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --quantization fp8
 ```
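Once a server started with one of the commands above is running, it exposes an OpenAI-compatible HTTP API. A minimal request sketch, assuming vLLM's default port 8000 and the model name as served above:

```bash
# Query the OpenAI-compatible chat completions endpoint exposed by `vllm serve`
# (port 8000 is vLLM's default; adjust if you pass --host/--port)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "scb10x/llama3.1-typhoon2-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 128
      }'
```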