Phi3 Mini 128k 4 Bit Quantized


Flash Attention

  • The Phi3 family supports Flash Attention 2, a mechanism that allows for faster inference with lower resource use.
  • When quantizing Phi3 on an RTX 4090 (24 GB) with Flash Attention disabled, quantization would fail due to insufficient VRAM.
  • Enabling Flash Attention allowed quantization to complete with an extra 10 gigabytes of VRAM still available on the GPU (see the sketch below).

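The exact quantization recipe is not documented here; as a rough sketch, a 4-bit quantization pass of Phi3 with Flash Attention 2 enabled could look like the following, assuming a GPTQ-style setup through transformers' GPTQConfig (the base model id is the public Phi-3 checkpoint, the calibration dataset and output directory are illustrative choices):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Base model to quantize (Phi3 Mini 128k instruct).
model_id = "microsoft/Phi-3-mini-128k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ configuration; "c4" is an illustrative calibration dataset.
quant_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# attn_implementation="flash_attention_2" keeps attention memory low enough
# for the quantization pass to fit on a 24 GB GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=quant_config,
    attn_implementation="flash_attention_2",
)

# Write out the quantized checkpoint.
model.save_pretrained("phi3-mini-128k-4bit")
tokenizer.save_pretrained("phi3-mini-128k-4bit")
```
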
Metrics

Total Size:
  • Before: 7.64 GB
  • After: 2.28 GB
VRAM Size:
  • Before: 11.47 GB
  • After: 6.57 GB
Average Inference Time (see the timing sketch below):
  • Before: 12 ms/token
  • After: 5 ms/token
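
The per-token latency and VRAM numbers above can be reproduced roughly by timing a generation run against the quantized checkpoint. The snippet below is a minimal sketch, not the measurement harness used here; the repo id is a placeholder for wherever the quantized model is stored.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual quantized checkpoint.
model_id = "your-username/phi3-mini-128k-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # keep FA2 enabled for inference too
)

prompt = "Explain flash attention in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time a fixed-length greedy generation and report ms per generated token.
max_new_tokens = 256
torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

generated = output.shape[1] - inputs["input_ids"].shape[1]
print(f"{elapsed / generated * 1000:.1f} ms/token")
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```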