Update README.md
README.md
[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) quantized by the PyTorch team with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per-row granularity). Use it directly, or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for TODO VRAM reduction and TODO speedup with little to no accuracy impact on H100.
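
For reference, a checkpoint like this is typically produced with torchao's float8 config through Transformers. The snippet below is a minimal sketch assuming the standard recipe from the torchao docs; the exact script used for this checkpoint is not shown here, so treat the details as assumptions:

```Python
# Minimal sketch of the (assumed) quantization recipe: float8 dynamic
# activation + float8 weight quantization with per-row granularity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow

quant_config = TorchAoConfig(
    quant_type=Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")

# torchao checkpoints are saved without safetensors serialization.
model.push_to_hub("SocialLocalMobile/Qwen3-32B-float8dq", safe_serialization=False)
tokenizer.push_to_hub("SocialLocalMobile/Qwen3-32B-float8dq")
```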

# 1. Inference with vLLM

```Shell
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve SocialLocalMobile/Qwen3-32B-float8dq --tokenizer Qwen/Qwen3-32B -O3
```
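
Here `VLLM_DISABLE_COMPILE_CACHE=1` turns off vLLM's torch.compile cache, which has reportedly had composability issues with torchao checkpoints, and `-O3` selects vLLM's highest optimization level; treat both as recommended settings rather than hard requirements.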

Once the server is running, send an OpenAI-compatible chat completions request:

```Shell
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "SocialLocalMobile/Qwen3-32B-float8dq",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "max_tokens": 32768
}'
```
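
Since the server speaks the OpenAI API, the same request also works from Python. A sketch using the `openai` client (the placeholder API key and the `extra_body` pass-through for `top_k` are vLLM conventions, not part of this model card):

```Python
# Query the running vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="SocialLocalMobile/Qwen3-32B-float8dq",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language models."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=32768,
    extra_body={"top_k": 20},  # top_k is a vLLM extension to the OpenAI schema
)
print(response.choices[0].message.content)
```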

# 2. Inference with Transformers

TODO
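
Until that snippet lands, a minimal sketch of the standard Transformers loading path should apply, assuming torchao is installed so the float8 weights deserialize; the generation settings below simply mirror the vLLM example above:

```Python
# Minimal sketch (assumed): load the float8 checkpoint directly with Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SocialLocalMobile/Qwen3-32B-float8dq"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95, top_k=20
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```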