Update README.md
README.md CHANGED
@@ -17,7 +17,7 @@ base_model:
 pipeline_tag: text-generation
 ---
 
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, using [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by PyTorch team.
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) int4 weight only quantization, using [hqq](https://mobiusml.github.io/hqq_blog/) algorithm for improved accuracy, by PyTorch team. Use it directly or serve using [vLLM](https://docs.vllm.ai/en/latest/) for 67% VRAM reduction and 12-20% speedup on A100 GPUs.
 
 # Quantization Recipe
 
@@ -149,8 +149,6 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hq
 
 # Peak Memory Usage
 
-We can use the following code to get a sense of peak memory usage during inference:
-
 ## Results
 
 | Benchmark | | |
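For context on the first hunk: the int4 weight-only quantization with hqq that the updated line describes can be sketched with transformers' `TorchAoConfig` wrapping a torchao config object (supported in recent transformers releases). This is a minimal sketch, not the card's exact recipe; `group_size=128` is an assumed value.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

model_id = "microsoft/Phi-4-mini-instruct"

# int4 weight-only quantization; use_hqq=True selects the hqq algorithm the line refers to.
# group_size=128 is an assumed value, not taken from this diff.
quant_config = Int4WeightOnlyConfig(group_size=128, use_hqq=True)
quantization_config = TorchAoConfig(quant_type=quant_config)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```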
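The added sentence also points to serving with vLLM. A minimal sketch using vLLM's Python API follows; the repo id completes the truncated `pytorch/Phi-4-mini-instruct-int4wo-hq...` in the second hunk header and is an assumption, as are the prompt and sampling settings.

```python
from vllm import LLM, SamplingParams

# Repo id is assumed (the hunk header truncates it); sampling settings are illustrative only.
llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain int4 weight-only quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```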
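The second hunk removes the sentence that introduced peak-memory measurement code. For reference, such a measurement can be sketched with PyTorch's CUDA memory statistics, assuming a CUDA device and the `quantized_model` / `tokenizer` objects from the first sketch:

```python
import torch

# Reset the peak-memory counter before running generation.
torch.cuda.reset_peak_memory_stats()

inputs = tokenizer("What is int4 quantization?", return_tensors="pt").to(quantized_model.device)
with torch.no_grad():
    quantized_model.generate(**inputs, max_new_tokens=64)

# Peak memory allocated on the current CUDA device during generation, in GB.
print(f"Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```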