jerryzh168 committed on
Commit 3716f73 · verified · 1 Parent(s): 59a0be3

Update README.md

Files changed (1):
  1. README.md +7 -7
README.md CHANGED
@@ -20,6 +20,12 @@ pipeline_tag: text-generation
  [Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) is quantized by the PyTorch team with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao), using 8-bit embeddings and 8-bit dynamic activation with int4 weights (8da4w).
  You can export the quantized model to an [ExecuTorch](https://github.com/pytorch/executorch) pte file, or use the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-8da4w/blob/main/phi4-mini-8da4w.pte) file directly to run on a mobile device; see [Running in a mobile app](#running-in-a-mobile-app).

+ # Running in a mobile app
+ The PTE file can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this on iOS.
+ On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 MB of memory.
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66049fc71116cebd1d3bdcf4/521rXwIlYS9HIAEBAPJjw.png)
+
  # Quantization Recipe

  First, install the required packages:
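For context on the hunk above: the 8da4w scheme it names corresponds, in recent torchao and transformers releases, to roughly the configuration sketched below. This is a minimal illustrative sketch, not the exact recipe used to produce this checkpoint; in particular, `group_size=32` is an assumed value and the 8-bit embedding quantization step is omitted.

```python
# Minimal 8da4w sketch: int8 dynamic activations + int4 weights, applied
# through the transformers torchao integration. Assumes a transformers
# version whose TorchAoConfig accepts a torchao config object; group_size
# and the skipped embedding quantization are simplifications.
from torchao.quantization import Int8DynamicActivationInt4WeightConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

model_id = "microsoft/Phi-4-mini-instruct"

quant_config = TorchAoConfig(
    quant_type=Int8DynamicActivationInt4WeightConfig(group_size=32)
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```

The quantized model can then be exercised with `model.generate` as usual before moving on to the ExecuTorch export covered later in the diff.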
@@ -211,13 +217,7 @@ python -m executorch.examples.models.llama.export_llama \
  --output_name="phi4-mini-8da4w.pte"
  ```

- ## Running in a mobile app
- The PTE file can be run with ExecuTorch on a mobile phone. See the [instructions](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) for doing this in iOS.
- On iPhone 15 Pro, the model runs at 17.3 tokens/sec and uses 3206 Mb of memory.
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66049fc71116cebd1d3bdcf4/521rXwIlYS9HIAEBAPJjw.png)
-
-
+ After that you can run the model in a mobile app; see [Running in a mobile app](#running-in-a-mobile-app) at the start of the README.

  # Disclaimer
  PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
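Between exporting the pte in the hunk above and wiring it into the mobile app, it can help to confirm the file loads at all. A hedged sketch, assuming an ExecuTorch build recent enough to ship the `executorch.runtime` Python bindings; the file name matches the `--output_name` in the hunk above.

```python
# Sanity-check sketch: load the exported .pte with ExecuTorch's Python
# runtime and list its entry points. Assumes the executorch.runtime
# bindings are available in this ExecuTorch install.
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("phi4-mini-8da4w.pte")
print("methods:", program.method_names)  # an LLM export should expose "forward"
```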
 