🦄 Upload NPU+iGPU unicorn-execution-engine-models model
- README.md +14 -0
- model_placeholder.txt +42 -0
README.md
ADDED
@@ -0,0 +1,14 @@
---
tags:
- unicorn-execution-engine
- npu
- igpu
- framework
- documentation
---

# 🦄 Unicorn Execution Engine Model Collection

Hardware Requirements: NPU Phoenix + AMD Radeon 780M
Size: 0.001GB
Framework: documentation
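
Before pulling any models, it can help to confirm this hardware is actually visible to the OS. The sketch below is a rough Linux-only check; the device paths (`/dev/accel/accel0` for the amdxdna NPU driver, `/dev/kfd` for ROCm compute) are assumptions about a typical driver setup, not requirements stated by this repository:

```python
# Rough Linux check that the Phoenix NPU and the ROCm iGPU stack are visible.
# Device paths are assumptions about common driver setups (amdxdna, ROCm);
# adjust for your system.
import os

def have_npu() -> bool:
    return os.path.exists("/dev/accel/accel0")  # amdxdna accel device

def have_igpu() -> bool:
    return os.path.exists("/dev/kfd")           # ROCm compute interface

if __name__ == "__main__":
    print(f"NPU Phoenix visible: {have_npu()}")
    print(f"Radeon 780M (ROCm):  {have_igpu()}")
```
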
model_placeholder.txt
ADDED
@@ -0,0 +1,42 @@
# Model Placeholder

This repository is ready to host optimized model variants for the Unicorn Execution Engine.

## Planned Model Files

### Gemma 3n E2B Variants
- `gemma3n-e2b-fp16-npu.safetensors` (MatFormer FP16 optimized)
- `gemma3n-e2b-int8-npu.safetensors` (MatFormer INT8 quantized)
- `gemma3n-e2b-config.json` (Model configuration)
- `gemma3n-e2b-tokenizer.json` (Tokenizer configuration)

### Qwen2.5-7B Variants
- `qwen25-7b-fp16-hybrid.safetensors` (Hybrid execution FP16)
- `qwen25-7b-int8-hybrid.safetensors` (Hybrid execution INT8)
- `qwen25-7b-config.json` (Model configuration)
- `qwen25-7b-tokenizer.json` (Tokenizer configuration)

### NPU Optimization Files
- `npu_attention_kernels.mlir` (MLIR-AIE kernels)
- `igpu_optimization_configs.json` (ROCm configurations)
- `performance_profiles.json` (Turbo mode profiles)
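
As a hedged illustration of how these variants would be consumed once uploaded, the sketch below opens one planned file pair with the standard `safetensors` and `json` libraries. The file names come from the lists above; the loading pattern is generic, not a Unicorn Execution Engine API.

```python
# Minimal loading sketch for a planned model variant (files not yet uploaded).
# Uses the standard safetensors API; nothing here is engine-specific.
import json
from safetensors import safe_open

CONFIG = "gemma3n-e2b-config.json"           # planned file from the list above
WEIGHTS = "gemma3n-e2b-fp16-npu.safetensors" # planned file from the list above

with open(CONFIG) as f:
    config = json.load(f)                    # model hyperparameters

with safe_open(WEIGHTS, framework="pt", device="cpu") as st:
    names = st.keys()                        # tensor names in the checkpoint
    tensor = st.get_tensor(names[0])
    print(f"{len(names)} tensors; '{names[0]}' shape {tuple(tensor.shape)}")
```
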

## Model Sizes (Estimated)
- **Gemma 3n E2B FP16**: ~4GB
- **Gemma 3n E2B INT8**: ~2GB
- **Qwen2.5-7B FP16**: ~14GB
- **Qwen2.5-7B INT8**: ~7GB
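
These estimates follow straightforward bytes-per-parameter arithmetic: roughly 2 bytes per weight at FP16 and 1 byte at INT8, ignoring embeddings and file overhead. A quick sanity check, assuming ~2B effective parameters for Gemma 3n E2B and ~7B for Qwen2.5-7B (the parameter counts are assumptions, not values from this repository):

```python
# Back-of-envelope size check: parameter count x bytes per weight.
# Parameter counts (~2B effective, ~7B) are assumptions; real files also
# carry headers, embeddings, and any layers kept at higher precision.
BYTES_PER_WEIGHT = {"FP16": 2, "INT8": 1}

for name, billions in [("Gemma 3n E2B", 2.0), ("Qwen2.5-7B", 7.0)]:
    for precision, nbytes in BYTES_PER_WEIGHT.items():
        gb = billions * nbytes               # 1e9 params x bytes / 1e9 = GB
        print(f"{name} {precision}: ~{gb:.0f}GB")
```
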

## Performance Targets
- **Gemma 3n E2B**: 100+ TPS with turbo mode
- **Qwen2.5-7B**: 60+ TPS with hybrid execution
- **Memory Usage**: <10GB total system budget
- **Latency**: <30ms time to first token
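
Once real weights land, targets like these can be checked with plain timers around any streaming generate loop. In the sketch below, `generate_stream` is a hypothetical stand-in for whatever token-streaming call the engine ends up exposing:

```python
# Measure time-to-first-token (TTFT) and tokens per second (TPS).
# `generate_stream` is a hypothetical placeholder for the engine's real
# streaming API; it should yield one token at a time.
import time

def benchmark(generate_stream, prompt: str) -> None:
    start = time.perf_counter()
    ttft = None
    tokens = 0
    for _ in generate_stream(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # first-token latency
        tokens += 1
    elapsed = time.perf_counter() - start
    assert tokens > 0 and ttft is not None, "generator produced no tokens"
    print(f"TTFT: {ttft * 1e3:.1f} ms")          # target: <30 ms
    print(f"TPS:  {tokens / elapsed:.1f}")       # targets: 100+ / 60+
```
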

To create actual optimized models, run the Unicorn Execution Engine quantization pipeline:

```bash
cd Unicorn-Execution-Engine
python quantization_engine.py --model gemma3n-e2b --precision fp16 --target npu
python quantization_engine.py --model qwen25-7b --precision int8 --target hybrid
```