kunato committed on
Commit 350d9a3 · verified · 1 Parent(s): ea00cf4

Update README.md

Files changed (1)
  1. README.md +2 -0
README.md CHANGED
@@ -98,6 +98,8 @@ vllm serve scb10x/llama3.1-typhoon2-70b-instruct --tensor-parallel-size 2 --gpu-
 # using at least 2 80GB gpu eg A100, H100 for hosting 70b model
 # to serving longer context (90k), 4 gpu is required (and you can omit --enforce-eager to improve throughput)
 # see more information at https://docs.vllm.ai/
+# If you have access to two H100 GPUs, here is our serving command at opentyphoon.ai, which uses FP8 (enabling larger context lengths and faster performance). On two H100 GPUs, we achieved approximately 2000 tokens/s decoding throughput, supporting around 40 concurrent requests.
+# vllm serve scb10x/llama3.1-typhoon2-70b-instruct --max-num-batched-tokens 32768 --enable-chunked-prefill --max-model-len 32768 --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --quantization fp8
 ```
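The serving command added in this commit exposes vLLM's OpenAI-compatible API (by default on localhost port 8000). Below is a minimal sketch of querying the served model with the openai Python client; the port, prompt, and sampling parameters are assumptions for illustration, not part of the commit.

```python
# Minimal sketch: query the vLLM OpenAI-compatible server started by the command above.
# Assumes vLLM's default endpoint http://localhost:8000/v1; the prompt and sampling
# parameters below are placeholders.
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="scb10x/llama3.1-typhoon2-70b-instruct",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```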