SocialLocalMobile committed · verified · Commit 09188dc · 1 Parent(s): 392f1ec

Update README.md

Files changed (1)
  1. README.md +17 -1
README.md CHANGED
@@ -13,7 +13,23 @@ pipeline_tag: text-generation
  [Qwen3-32B](https://huggingface.co/Qwen3/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by the PyTorch team. Use it directly, or serve it using [vLLM](https://docs.vllm.ai/en/latest/) with TODO VRAM reduction, TODO speedup, and little to no accuracy impact on H100.

  # 1. Inference with vLLM
- TODO
+ ```Shell
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve SocialLocalMobile/Qwen3-32B-float8dq --tokenizer Qwen/Qwen3-32B -O3
+ ```
+
+ ```Shell
+ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+   "model": "SocialLocalMobile/Qwen3-32B-float8dq",
+   "messages": [
+     {"role": "user", "content": "Give me a short introduction to large language models."}
+   ],
+   "temperature": 0.6,
+   "top_p": 0.95,
+   "top_k": 20,
+   "max_tokens": 32768
+ }'
+ ```
+

  # 2. Inference with Transformers
  TODO
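
The "# 2. Inference with Transformers" section is still marked TODO in this commit. As a minimal, non-authoritative sketch of the direct-use path mentioned in the description, the quantized checkpoint can be loaded with `AutoModelForCausalLM` (torchao must be installed for the float8 weights to deserialize), with the tokenizer taken from the base model, mirroring the `--tokenizer Qwen/Qwen3-32B` flag in the vLLM command above:

```python
# Sketch (assumption, not from this commit): direct inference with Transformers.
# torchao must be installed so the float8-quantized weights can be loaded; the
# tokenizer comes from the base model, as in the vLLM command above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "SocialLocalMobile/Qwen3-32B-float8dq"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sampling settings follow the curl example above; max_new_tokens is kept
# small here just to keep the example quick.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
)
new_tokens = outputs[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```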
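
For context on the description line, the float8 dynamic activation plus float8 weight quantization with per-row granularity maps onto the public torchao and transformers APIs roughly as sketched below. This is an assumption based on those APIs, not code from this commit; the output directory name is illustrative.

```python
# Sketch (assumption, not from this commit): producing a float8 dynamic-activation,
# float8-weight (per-row) checkpoint with torchao through transformers' TorchAoConfig.
# Requires recent torchao and transformers releases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

base_model = "Qwen/Qwen3-32B"

# Dynamic float8 activations and float8 weights, one scale per row.
quant_config = TorchAoConfig(
    quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# torchao tensor subclasses are not supported by safetensors yet, so the
# quantized checkpoint is saved with safe_serialization=False.
model.save_pretrained("Qwen3-32B-float8dq", safe_serialization=False)
tokenizer.save_pretrained("Qwen3-32B-float8dq")
```

Per-row granularity gives each weight row its own float8 scale, which is what "(per row granularity)" in the description refers to.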