pytorch/Qwen3-32B-float8dq

SocialLocalMobile committed on May 14

Commit ebfd887 · verified · 1 Parent(s): fd13ccd

Update README.md

Files changed (1)

README.md +66 -2

@@ -13,7 +13,7 @@ base_model:
 pipeline_tag: text-generation
 ---
-[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team. Use it directly, or serve using [vLLM](https://docs.vllm.ai/en/latest/) with TODO VRAM reduction, TODO speedup and little to no accuracy impact on H100.
 # Inference with vLLM
 ```Shell
@@ -113,7 +113,71 @@ tokenizer.push_to_hub(save_to)
 TODO
 # Peak Memory Usage
-TODO
 # Model Performance

 pipeline_tag: text-generation
 ---
+[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team. Use it directly, or serve using [vLLM](https://docs.vllm.ai/en/latest/) with 47% VRAM reduction, 32%-36% speedup and little to no accuracy impact on H100.
 # Inference with vLLM
 ```Shell
 TODO
 # Peak Memory Usage
+|                                  |                |                               |
+|----------------------------------|----------------|-------------------------------|
+|                                  | Qwen3-32B      | Qwen3-32B-float8dq            |
+| Peak Memory                      | 65.72 GB       | 34.54 GB (-47.44%)            |
+<details>
+<summary> Reproduce peak memory usage </summary>
+Code
+```Py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "Qwen/Qwen3-32B" # pytorch/Qwen3-32B-float8dq
+# load the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype="auto",
+    device_map="auto"
+)
+torch.cuda.reset_peak_memory_stats()
+# prepare the model input
+prompt = "Give me a short introduction to large language model."
+messages = [
+    {"role": "user", "content": prompt}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
+)
+model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+# conduct text completion
+generated_ids = model.generate(
+    **model_inputs,
+    max_new_tokens=32768
+)
+output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
+# parsing thinking content
+try:
+    # rindex finding 151668 (</think>)
+    index = len(output_ids) - output_ids[::-1].index(151668)
+except ValueError:
+    index = 0
+thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
+content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
+print("thinking content:", thinking_content)
+print("content:", content)
+mem = torch.cuda.max_memory_reserved() / 1e9
+print(f"Peak Memory Usage: {mem:.02f} GB")
+```
+</details>
 # Model Performance
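
Note on the figures in this change: the "-47.44%" in the new Peak Memory table, and the "47% VRAM reduction" in the updated summary line, follow directly from the two measured values (65.72 GB and 34.54 GB). The snippet below is a minimal sketch that re-derives the percentage; the helper name `percent_change` is hypothetical and is not part of the README or of any library.

```python
def percent_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent (negative means a reduction)."""
    return (new - baseline) / baseline * 100.0


# Values copied from the Peak Memory table added in this commit.
baseline_gb = 65.72   # Qwen3-32B
quantized_gb = 34.54  # Qwen3-32B-float8dq

change = percent_change(baseline_gb, quantized_gb)
print(f"Peak memory change: {change:.2f}%")  # prints "Peak memory change: -47.44%"
```

The reproduction script divides `torch.cuda.max_memory_reserved()` by `1e9`, so both table values are in decimal gigabytes rather than GiB.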