guangy10 committed
Commit 008afd2 · verified · 1 parent: 7095ff6

Updated model card

Files changed (1):
  1. README.md +12 −14
README.md CHANGED
@@ -11,25 +11,23 @@ tags:
  - smollm
  ---
 
- # Run on-device with ExecuTorch
-
- This optimized model is exported to ExecuTorch and can run on edge devices.
- Once ExecuTorch is [set-up](https://pytorch.org/executorch/main/getting-started.html), you can directly download the `*.pte` and tokenizer file and run the model in a mobile app (see [Running in a mobile app](#running-in-a-mobile-app)).
-
- ## Export to ExecuTorch
-
- First need to install the required packages:
- ```Shell
- pip install git+https://github.com/huggingface/optimum-executorch@main
- cd optimum-executorch
- ```
- Then update the dependencies to latest in order to work on the SmolLM3-3B:
- ```Py
- python install_dev.py
- ```
-
- Use `optimum-cli` to export the model to ExecuTorch:
+ [HuggingFaceTB/SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) is quantized using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) with 8-bit embeddings and 8-bit dynamically quantized activations with 4-bit weight linears (`8da4w`). It is then lowered to [ExecuTorch](https://github.com/pytorch/executorch) with several optimizations (custom SDPA, custom KV cache, and parallel prefill) to achieve high performance on the CPU backend, making it well suited for mobile deployment.
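To make the `8da4w` weight scheme concrete, here is a minimal pure-Python sketch of symmetric groupwise 4-bit affine quantization (the "4w" half of the name). It illustrates the arithmetic only; torchao's actual implementation, group size, and rounding details may differ.

```python
# Sketch: symmetric groupwise 4-bit quantization of a flat weight list.
# Each group of weights shares one float scale; values are stored as int4
# in [-8, 7]. This is illustrative, not torchao's implementation.

def quantize_4bit_group(weights, group_size=32):
    """Quantize floats to int4 per group; returns (qvals, per-group scales)."""
    qvals, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group) or 1.0  # avoid divide-by-zero
        scale = max_abs / 7.0                        # map largest magnitude to ±7
        scales.append(scale)
        qvals.extend(max(-8, min(7, round(w / scale))) for w in group)
    return qvals, scales

def dequantize_4bit_group(qvals, scales, group_size=32):
    """Reconstruct approximate floats from int4 values and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qvals)]
```

The per-group scale bounds the reconstruction error at roughly half a quantization step, which is why smaller groups trade more scale storage for better accuracy.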
+
+ We provide the [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-8da4w/blob/main/smollm3-3b-8da4w.pte) for direct use in ExecuTorch. *(The provided .pte file is exported with the default max_seq_length/max_context_length of 2k.)*
+
+ # Running in a mobile app
+ The [.pte file](https://huggingface.co/pytorch/SmolLM3-3B-8da4w/blob/main/smollm3-3b-8da4w.pte) can be run with ExecuTorch on a mobile phone. See the instructions for running it on [iOS](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) and [Android](https://docs.pytorch.org/executorch/main/llm/llama-demo-android.html).
+
+ On Google's Pixel 8 Pro, the model runs at 12.7 tokens/s.
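As a rough illustration of what that decode rate means for interactivity (the 256-token response length below is an assumed example, not a measured figure):

```python
# Back-of-envelope latency from the reported decode rate of 12.7 tokens/s
# on the Pixel 8 Pro. The response length is an assumption for illustration.
tokens_per_second = 12.7
response_tokens = 256

latency_seconds = response_tokens / tokens_per_second
print(f"~{latency_seconds:.1f} s to decode {response_tokens} tokens")
# → ~20.2 s to decode 256 tokens
```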
+
+ # Running with ExecuTorch’s sample runner
+ You can also run this model using ExecuTorch’s sample runner by following [steps 3 and 4 of these instructions](https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#step-3-run-on-your-computer-to-validate).
+
+ # Export Recipe
+ You can re-create the `.pte` file from the eager model using the export recipe below.
+
+ First install `optimum-executorch` by following these [installation instructions](https://github.com/huggingface/optimum-executorch?tab=readme-ov-file#-quick-installation), then use `optimum-cli` to export the model to ExecuTorch:
  ```Shell
  optimum-cli export executorch \
  --model HuggingFaceTB/SmolLM3-3B \