supriyar committed · verified
Commit b6de200 · 1 parent: b548de8

Update README.md

Files changed (1):
  1. README.md (+1 -3)
README.md CHANGED
```diff
@@ -17,7 +17,7 @@ base_model:
 pipeline_tag: text-generation
 ---
 
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, using [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by PyTorch team.
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, using [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by PyTorch team. Use it directly or serve using [vLLM](https://docs.vllm.ai/en/latest/) for 67% VRAM reduction and 12-20% speedup on A100 GPUs.
 
 # Quantization Recipe
 
@@ -149,8 +149,6 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
 
 # Peak Memory Usage
 
-We can use the following code to get a sense of peak memory usage during inference:
-
 ## Results
 
 | Benchmark | | |
```
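The rewritten summary line points at two workflows: producing the int4 weight-only checkpoint with torchao's hqq recipe, and serving it. A minimal sketch of the quantization side, assuming recent torchao/transformers versions that accept a torchao config object, with `group_size=128` as an illustrative choice (neither detail is stated in this diff):

```python
# Sketch, not the model card's exact recipe: int4 weight-only quantization
# with the hqq algorithm via transformers' torchao integration. Assumes
# recent torchao/transformers; group_size=128 is an illustrative choice.
import torch
from torchao.quantization import Int4WeightOnlyConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

quant_config = TorchAoConfig(
    quant_type=Int4WeightOnlyConfig(group_size=128, use_hqq=True)
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-mini-instruct",   # the base model named in the diff
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quant_config,  # weights are quantized on load
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-mini-instruct")
```

And a sketch of the vLLM path the new line advertises, using vLLM's offline Python API; the full repo id is an assumption, since the `lm_eval` command in the second hunk header truncates it:

```python
# Sketch: serving the quantized checkpoint with vLLM's offline Python API.
# The repo id is assumed; the diff's lm_eval command truncates it to
# "pytorch/Phi-4-mini-instruct-int4wo-hq".
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq")  # assumed repo id
params = SamplingParams(temperature=0.8, max_tokens=128)
out = llm.generate(["Explain int4 weight-only quantization."], params)
print(out[0].outputs[0].text)
```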