Update README.md
README.md
[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) quantized by the PyTorch team with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per-row granularity). Use it directly, or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for TODO VRAM reduction and TODO speedup with little to no accuracy impact on H100.
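
For reference, a checkpoint like this is typically produced with torchao's float8 config through Transformers. The snippet below is a minimal sketch assuming the standard recipe from the torchao docs; the exact script used for this checkpoint is not shown here, so treat the details as assumptions:

```Python
# Minimal sketch of the (assumed) quantization recipe: float8 dynamic
# activation + float8 weight quantization with per-row granularity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

quant_config = TorchAoConfig(
    quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

# torchao checkpoints are saved without safetensors serialization.
model.push_to_hub("SocialLocalMobile/Qwen3-32B-float8dq", safe_serialization=False)
tokenizer.push_to_hub("SocialLocalMobile/Qwen3-32B-float8dq")
```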

# 1. Inference with vLLM

```Shell
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve SocialLocalMobile/Qwen3-32B-float8dq --tokenizer Qwen/Qwen3-32B -O3
```
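
Here `VLLM_DISABLE_COMPILE_CACHE=1` turns off vLLM's torch.compile cache, which has reportedly had composability issues with torchao checkpoints, and `-O3` selects vLLM's highest optimization level; treat both as recommended settings rather than hard requirements.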

Once the server is running, send an OpenAI-compatible chat completions request:

```Shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "SocialLocalMobile/Qwen3-32B-float8dq",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
```
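
Since the server speaks the OpenAI API, the same request also works from Python. A sketch using the `openai` client (the placeholder API key and the `extra_body` pass-through for `top_k` are vLLM conventions, not part of this model card):

```Python
# Query the running vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="SocialLocalMobile/Qwen3-32B-float8dq",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    extra_body={"top_k": 20},  # top_k is a vLLM extension to the OpenAI schema
)
print(response.choices[0].message.content)
```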

# 2. Inference with Transformers

TODO
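
Until that snippet lands, a minimal sketch of the standard Transformers loading path should apply, assuming torchao is installed so the float8 weights deserialize; the generation settings below simply mirror the vLLM example above:

```Python
# Minimal sketch (assumed): load the float8 checkpoint directly with Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SocialLocalMobile/Qwen3-32B-float8dq"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95, top_k=20
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```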