# Model Placeholder

This repository is ready to host optimized model variants for the Unicorn Execution Engine.

## Planned Model Files

### Gemma 3n E2B Variants

- `gemma3n-e2b-fp16-npu.safetensors` (MatFormer FP16 optimized)
- `gemma3n-e2b-int8-npu.safetensors` (MatFormer INT8 quantized)
- `gemma3n-e2b-config.json` (Model configuration)
- `gemma3n-e2b-tokenizer.json` (Tokenizer configuration)

### Qwen2.5-7B Variants

- `qwen25-7b-fp16-hybrid.safetensors` (Hybrid execution FP16)
- `qwen25-7b-int8-hybrid.safetensors` (Hybrid execution INT8)
- `qwen25-7b-config.json` (Model configuration)
- `qwen25-7b-tokenizer.json` (Tokenizer configuration)
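
Once published, these variants should be consumable with the standard `safetensors` and `tokenizers` libraries. The sketch below is an assumption about how the pieces fit together, not a confirmed API of this repository; it uses the planned Gemma file names, and the Qwen2.5-7B variants would follow the same pattern.

```python
# Hypothetical usage sketch: assumes the planned files follow the standard
# safetensors and Hugging Face `tokenizers` formats. None of these files
# exist in this repository yet.
import json

from safetensors.torch import load_file  # pip install safetensors torch
from tokenizers import Tokenizer         # pip install tokenizers

# Weights: a flat mapping of tensor names to torch.Tensor objects.
state_dict = load_file("gemma3n-e2b-fp16-npu.safetensors")

# Architecture hyperparameters as plain JSON.
with open("gemma3n-e2b-config.json") as f:
    config = json.load(f)

# Tokenizer definition in the `tokenizers` JSON format.
tokenizer = Tokenizer.from_file("gemma3n-e2b-tokenizer.json")

print(f"{len(state_dict)} tensors, vocab size {tokenizer.get_vocab_size()}")
```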

### NPU Optimization Files

- `npu_attention_kernels.mlir` (MLIR-AIE kernels)
- `igpu_optimization_configs.json` (ROCm configurations)
- `performance_profiles.json` (Turbo mode profiles)
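
The schema of these optimization files is not documented here. As a purely illustrative sketch, a turbo-mode entry in `performance_profiles.json` might look like the following; every field name and value is a hypothetical placeholder, not the engine's actual format.

```python
# Purely illustrative: every key below is a hypothetical placeholder, not the
# actual performance_profiles.json schema used by the Unicorn Execution Engine.
import json

turbo_profile = {
    "name": "turbo",
    "npu": {"kernels": "npu_attention_kernels.mlir", "tiles": 4},        # hypothetical
    "igpu": {"backend": "rocm", "config": "igpu_optimization_configs.json"},
    "max_batch_size": 1,
}

print(json.dumps({"profiles": [turbo_profile]}, indent=2))
```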

## Model Sizes (Estimated)

- **Gemma 3n E2B FP16**: ~4 GB
- **Gemma 3n E2B INT8**: ~2 GB
- **Qwen2.5-7B FP16**: ~14 GB
- **Qwen2.5-7B INT8**: ~7 GB
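
These figures follow from the usual back-of-the-envelope rule of parameter count times bytes per parameter: FP16 stores 2 bytes per parameter and INT8 stores 1, and the E2B variant is treated as roughly 2B effective parameters. A quick check:

```python
# Weight-size estimate: parameter count (billions) x bytes per parameter.
# E2B is treated as ~2B effective parameters, per the model name.
def weight_size_gb(params_billions: float, bytes_per_param: int) -> float:
    return params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

for name, params, bpp in [
    ("Gemma 3n E2B FP16", 2.0, 2),  # FP16 = 2 bytes/param
    ("Gemma 3n E2B INT8", 2.0, 1),  # INT8 = 1 byte/param
    ("Qwen2.5-7B FP16", 7.0, 2),
    ("Qwen2.5-7B INT8", 7.0, 1),
]:
    print(f"{name}: ~{weight_size_gb(params, bpp):.0f} GB")
```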

## Performance Targets

- **Gemma 3n E2B**: 100+ tokens/second (TPS) with turbo mode
- **Qwen2.5-7B**: 60+ TPS with hybrid execution
- **Memory Usage**: <10 GB total system budget
- **Latency**: <30 ms time to first token
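
A minimal sketch of how these targets could be verified, assuming a streaming generation interface; `generate_stream` is a hypothetical callable that yields tokens one at a time and is not part of any published API here.

```python
import time

def measure_generation(generate_stream, prompt: str):
    """Report time to first token (ms) and throughput (tokens/second)."""
    start = time.perf_counter()
    ttft_ms = None
    n_tokens = 0
    for _token in generate_stream(prompt):  # hypothetical streaming interface
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000
        n_tokens += 1
    elapsed = time.perf_counter() - start
    tps = n_tokens / elapsed
    print(f"TTFT: {ttft_ms:.1f} ms (target <30), throughput: {tps:.1f} TPS")
    return ttft_ms, tps
```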

To create actual optimized models, run the Unicorn Execution Engine quantization pipeline:

```bash
cd Unicorn-Execution-Engine
python quantization_engine.py --model gemma3n-e2b --precision fp16 --target npu
python quantization_engine.py --model qwen25-7b --precision int8 --target hybrid
```
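
After a run, a quick way to sanity-check an emitted file is to inspect tensor dtypes with the `safetensors` library; the file name below assumes the pipeline writes the artifacts listed above.

```python
# Spot-check the quantized output: dtypes should match the requested precision
# (torch.float16 for the fp16 run). The output file name is an assumption.
from safetensors import safe_open

with safe_open("gemma3n-e2b-fp16-npu.safetensors", framework="pt") as f:
    for key in list(f.keys())[:5]:  # the first few tensors suffice for a check
        print(key, f.get_tensor(key).dtype)
```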