Update README.md
README.md
@@ -98,6 +98,8 @@ vllm serve scb10x/llama3.1-typhoon2-70b-instruct --tensor-parallel-size 2 --gpu-
 # using at least 2 80GB gpu eg A100, H100 for hosting 70b model
 # to serving longer context (90k), 4 gpu is required (and you can omit --enforce-eager to improve throughput)
 # see more information at https://docs.vllm.ai/
+# If you have access to two H100 GPUs, here is our serving command at opentyphoon.ai, which uses FP8 (enabling larger context lengths and faster performance). On two H100 GPUs, we achieved approximately 2000 tokens/s decoding performance, supporting around 40 concurrent requests.
+# vllm serve scb10x/llama3.1-typhoon2-70b-instruct --max-num-batched-tokens 32768 --enable-chunked-prefill --max-model-len 32768 --tensor-parallel-size 2 --gpu-memory-utilization 0.90 --quantization fp8
 ```
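Once a server started with one of the commands above is running, it exposes an OpenAI-compatible HTTP API. A minimal request sketch, assuming vLLM's default port 8000 and the model name as served above:

```bash
# Query the OpenAI-compatible chat completions endpoint exposed by `vllm serve`
# (port 8000 is vLLM's default; adjust if you pass --host/--port)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "scb10x/llama3.1-typhoon2-70b-instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 128
      }'
```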