This FP8-quantized model was uploaded to explore high-precision quantization. Traditional int4 quantization, as seen in models like `Qwen/Qwen2.5-Coder-32B-Instruct-int4`, can sometimes produce poor outputs with repeated tokens due to quantization errors. In contrast, FP8 requires no calibration data and achieves robust, effectively lossless compression.

As shown in Neural Magic's recent paper ([arXiv:2411.02355](https://arxiv.org/pdf/2411.02355)), int4 struggles to recover FP16-level fidelity without careful calibration. FP8, especially in the W8A16 format, maintains high-quality outputs with no calibration at all, making it ideal for high-precision applications such as code generation.
### How to Quantize Your Own Models to FP8 W8A16?

Included in this repo is a script that will convert the weights of any HF model to W8A16. (TBH it's kinda glitched and writes two duplicate copies to disk; if anyone wants to fix it, feel free to submit a PR, but since it ain't broke I'm not gonna fix it.)
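
For reference, a data-free FP8 conversion with `llmcompressor`'s `oneshot` API looks roughly like the sketch below. This is a hypothetical reconstruction, not the bundled script: the model-name prompt, the output path, and the `FP8_DYNAMIC` preset are my assumptions, and that preset quantizes activations dynamically (W8A8) rather than leaving them at 16-bit, so a strict W8A16 layout would need a custom weight-only config.

```python
# Hypothetical sketch of a data-free FP8 conversion script using
# llmcompressor's oneshot API -- the actual bundled script may differ.
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = input("HF model name: ")  # e.g. Qwen/Qwen2.5-Coder-32B-Instruct
save_dir = model_id.split("/")[-1] + "-FP8"  # hypothetical output path

# Load into CPU RAM to avoid GPU OOM; use device_map="auto" to load onto GPU.
model = SparseAutoModelForCausalLM.from_pretrained(
    model_id, device_map="cpu", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize every Linear layer except the LM head; no calibration data needed.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

model.save_pretrained(save_dir, save_compressed=True)
tokenizer.save_pretrained(save_dir)
```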
How to use the script:

Have vLLM installed, then run `pip install llmcompressor==0.1.0`.

Then just run the script: it will ask you for the model name; enter it, and it will do the rest. **NOTE:** the script loads the model into CPU RAM to avoid OOM errors, so if you somehow, on God's green earth, have more GPU VRAM than CPU RAM, edit the script to load to GPU.
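
Once it's done, the saved directory should load straight into vLLM, which reads the quantization config the script writes. A minimal, hypothetical example (the path is whatever directory the script saved to):

```python
# Minimal sketch: loading the quantized output with vLLM.
# The model path below is hypothetical -- point it at the script's output dir.
from vllm import LLM, SamplingParams

llm = LLM(model="./Qwen2.5-Coder-32B-Instruct-FP8")
params = SamplingParams(temperature=0.0, max_tokens=64)
out = llm.generate(["def fibonacci(n):"], params)
print(out[0].outputs[0].text)
```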