Phi-3 Mini 128k 4-Bit Quantized
- 4-bit quantized version of Microsoft's Phi-3 Mini 128k: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
- Quantized with Hugging Face's 🤗 GPTQQuantizer (see the sketch below)
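A minimal sketch of how the quantization could be reproduced with optimum's GPTQQuantizer. The calibration dataset, group size, sequence length, and output directory are illustrative assumptions, not the exact settings used for this checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "microsoft/Phi-3-mini-128k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 4-bit GPTQ with a C4 calibration set; dataset/group_size/model_seqlen
# are common defaults, assumed here for illustration.
quantizer = GPTQQuantizer(bits=4, dataset="c4", group_size=128, model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Hypothetical output directory.
quantizer.save(quantized_model, "phi3-mini-128k-gptq-4bit")
tokenizer.save_pretrained("phi3-mini-128k-gptq-4bit")
```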
Flash Attention
- The Phi-3 family supports Flash Attention 2, which enables faster inference with lower memory use.
- When quantizing Phi-3 on an RTX 4090 (24 GB) with Flash Attention disabled, quantization failed due to insufficient VRAM.
- Enabling Flash Attention allowed quantization to complete with roughly 10 gigabytes of VRAM to spare on the GPU; see the loading sketch after this list.
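Flash Attention 2 is opt-in at load time in transformers. A sketch of how it could be enabled, assuming the `flash-attn` package is installed and the weights are loaded in half precision (FA2 requires fp16 or bf16):

```python
import torch
from transformers import AutoModelForCausalLM

# Load with Flash Attention 2 enabled; requires a supported GPU,
# the flash-attn package, and fp16/bf16 weights.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="cuda",
)
```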
Metrics
| Metric | Before (original) | After (4-bit) |
|---|---|---|
| Total size | 7.64 GB | 2.28 GB |
| VRAM usage | 11.47 GB | 6.57 GB |
| Average inference time | 12 ms/token | 5 ms/token |
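A rough sketch of how the ms/token and peak-VRAM figures above could be measured; the local model path is hypothetical, and greedy decoding over a fixed number of new tokens is an assumption:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "phi3-mini-128k-gptq-4bit"  # hypothetical local path to the quantized model
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Flash attention lets you", return_tensors="pt").to(model.device)
new_tokens = 128

# Time exactly `new_tokens` greedy decoding steps and record peak VRAM.
torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(
    **inputs,
    max_new_tokens=new_tokens,
    min_new_tokens=new_tokens,
    do_sample=False,
)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"{elapsed / new_tokens * 1000:.1f} ms/token")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
```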