Phi3 Mini 128k 4 Bit Quantized


Flash Attention

  • The Phi3 family supports Flash Attention 2, a mechanism that allows for faster inference with lower resource use.
  • When quantizing Phi3 on an RTX 4090 (24 GB) with Flash Attention disabled, quantization would fail due to insufficient VRAM.
  • Enabling Flash Attention allowed quantization to complete with an extra 10 gigabytes of VRAM still available on the GPU (see the sketch below).

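The exact quantization recipe is not documented here; as a rough sketch, a 4-bit quantization pass of Phi3 with Flash Attention 2 enabled could look like the following, assuming a GPTQ-style setup through transformers' GPTQConfig (the base model id is the public Phi-3 checkpoint, the calibration dataset and output directory are illustrative choices):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Base model to quantize (Phi3 Mini 128k instruct).
model_id = "microsoft/Phi-3-mini-128k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ configuration; "c4" is an illustrative calibration dataset.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# attn_implementation="flash_attention_2" keeps attention memory low enough
# for the quantization pass to fit on a 24 GB GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",
)

# Write out the quantized checkpoint.
model.save_pretrained("phi3-mini-128k-4bit")
tokenizer.save_pretrained("phi3-mini-128k-4bit")
```
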
Metrics

Total Size:
  • Before: 7.64 GB
  • After: 2.28 GB
VRAM Size:
  • Before: 11.47 GB
  • After: 6.57 GB
Average Inference Time (see the timing sketch below):
  • Before: 12 ms/token
  • After: 5 ms/token
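
The per-token latency and VRAM numbers above can be reproduced roughly by timing a generation run against the quantized checkpoint. The snippet below is a minimal sketch, not the measurement harness used here; the repo id is a placeholder for wherever the quantized model is stored.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual quantized checkpoint.
model_id = "your-username/phi3-mini-128k-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # keep FA2 enabled for inference too
)

prompt = "Explain flash attention in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time a fixed-length greedy generation and report ms per generated token.
max_new_tokens = 256
torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed / generated * 1000:.1f} ms/token")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```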