Update README.md

README.md CHANGED
@@ -20,6 +20,12 @@ pipeline_tag: text-generation
 [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao), using 8-bit embeddings and 8-bit dynamic activations with int4 weights (8da4w).
 You can export the quantized model to an [ExecuTorch](https://github.com/pytorch/executorch) pte file, or use the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) file directly to run on a mobile device; see [Running in a mobile app](#running-in-a-mobile-app).
 
+# Running in a mobile app
+
+The PTE file can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
+
+On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 MB of memory.
+
 # Quantization Recipe
 
 First, install the required packages:
@@ -211,13 +217,7 @@ python -m executorch.examples.models.llama.export_llama \
   --output_name="phi4-mini-8da4w.pte"
 ```
 
-The PTE file can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
-
-On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 MB of memory.
-
+After that you can run the model in a mobile app (see start of README).
 
 # Disclaimer
 PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
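
As background for the 8da4w scheme named in the README text, here is a minimal pure-Python sketch of what "8-bit dynamic activation, int4 grouped weight" quantization computes. This is an illustration only, not the torchao implementation; the group size, symmetric rounding, and helper names are assumptions for the sketch.

```python
# Toy sketch of 8da4w: int8 dynamic activation quantization plus
# int4 weight quantization with per-group scales. Illustrative only;
# group_size=4 and symmetric rounding are assumptions, not torchao's exact scheme.

def quantize_symmetric(values, num_bits):
    """Symmetric quantization: map floats to signed ints sharing one scale."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8, 7 for int4
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def quantize_weights_int4(weights, group_size=4):
    """The '4w' part: int4 weights, one scale per contiguous group."""
    groups = [weights[i:i + group_size] for i in range(0, len(weights), group_size)]
    return [quantize_symmetric(g, num_bits=4) for g in groups]

def dequantize(q, scale):
    return [x * scale for x in q]

# The '8da' part: activations are quantized to int8 *dynamically*, i.e. the
# scale is computed from the observed range of the current input at runtime.
activations = [0.5, -1.25, 2.0, 0.75]
qa, a_scale = quantize_symmetric(activations, num_bits=8)

weights = [0.1, -0.2, 0.05, 0.4, -0.6, 0.3, 0.2, -0.1]
qw_groups = quantize_weights_int4(weights)

# Round-trip the weights: error stays within one quantization step per group.
recovered = [v for q, s in qw_groups for v in dequantize(q, s)]
max_err = max(abs(a - b) for a, b in zip(weights, recovered))
```

Per-group scales keep the int4 error small when weight magnitudes vary across a row, which is why grouped quantization is the usual choice at 4-bit precision.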