Vezora committed
Commit b64d635
1 Parent(s): 5b7a88a

Update README.md

Files changed (1)
  1. README.md +29 -8
README.md CHANGED
@@ -1,19 +1,40 @@
---
license: apache-2.0
---
## Overview

- This model can be run using [VLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability > 8.0 (Ampere, A100, A10, 3090, etc.) as a weight-only W8A16 model, utilizing FP8 Marlin.

- The Marlin kernel achieves impressive efficiency by packing 4 8-bit values in an int32 and performing 4xFP8 to 4xFP16/BF16 dequantization using bit arithmetic and SIMT operations. This approach enables nearly a **2x speedup** over FP16 on most models while maintaining almost **completely lossless quality**!

- ### FP8 Marlin Details
- - Introduced by [Michael Goin and the Neural Magic team](https://github.com/vllm-project/vllm/pull/5975), FP8 Marlin leverages NVIDIA's GPU architecture to deliver a compact, high-performance format.
- - FP8 achieves nearly lossless compression, making it ideal for models where quantization errors in traditional formats like int4 or int8 may degrade performance.

- ### Why FP8?

- I uploaded this FP8-quantized model to experiment with high-precision code handling. Traditional int4 quantization on models like `Qwen/Qwen2.5-Coder-32B-Instruct-int4` sometimes resulted in poor outputs, with repeated tokens due to quantization errors. FP8 format, however, **does not require calibration data** and provides robust, lossless compression.

- As demonstrated in Neural Magic's recent paper ([arXiv:2411.02355](https://arxiv.org/pdf/2411.02355)), int4 has difficulty recovering fidelity from FP16 unless it's calibrated carefully. FP8, however, especially in the W8A16 format used here, maintains high-quality outputs without the need for extensive calibration, making it a reliable and performant solution for high-precision applications like code generation.

---
license: apache-2.0
---
+
## Overview

+ This model is optimized for use with [VLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.0 (Ampere: A100, A10, 3090, etc.). It utilizes a weight-only FP8 Marlin kernel, providing an efficient W8A16 configuration (FP8 weights, FP16 activations).
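
For quick reference, here is a minimal vLLM loading sketch. The model id below is a placeholder for this repository's id, and the call pattern assumes a recent vLLM release with FP8 Marlin support; it is illustrative rather than an official example.

```python
# Minimal, illustrative vLLM usage. "<this-repo-id>" is a placeholder for this
# repository's Hugging Face id; on Ampere GPUs a pre-quantized FP8 checkpoint
# should be served through the weight-only FP8 Marlin kernel.
from vllm import LLM, SamplingParams

llm = LLM(model="<this-repo-id>")
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["Write a Python function that checks whether a string is a palindrome."], params)
print(outputs[0].outputs[0].text)
```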

+ ### Key Features of FP8 Marlin

+ The Marlin kernel achieves impressive efficiency by packing 4 8-bit values into an int32 and performing a 4xFP8 to 4xFP16/BF16 dequantization using bit arithmetic and SIMT operations. This approach yields nearly a **2x speedup** over FP16 on most models while maintaining **near-lossless quality**.
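
To make the packing and dequantization concrete, here is a small NumPy sketch of the same idea on the CPU. It is not the kernel code; the helper names are ours, and the bit trick shown (place the 7 exponent/mantissa bits just below the FP16 sign bit, then multiply by 256 to correct the exponent-bias difference) is one standard way to expand E4M3 to FP16.

```python
# Illustrative NumPy sketch of FP8 (E4M3) packing and bit-level dequantization.
# The real Marlin kernel does this in CUDA with SIMT bit arithmetic.
import numpy as np

def pack4_fp8(bytes4):
    """Pack four FP8 (E4M3) bytes into one little-endian uint32, one weight per byte."""
    b = np.asarray(bytes4, dtype=np.uint32)
    return b[0] | (b[1] << 8) | (b[2] << 16) | (b[3] << 24)

def fp8_e4m3_to_fp16(byte):
    """Expand one E4M3 byte to FP16 with bit ops plus a single constant multiply.
    Placing the 7 exponent/mantissa bits just below the FP16 sign bit leaves the value
    low by a factor of 2^8 (the bias difference 15 - 7), so multiplying by 256 restores it."""
    h = np.uint16(((byte & 0x80) << 8) | ((byte & 0x7F) << 7))
    half = np.array([h], dtype=np.uint16).view(np.float16)[0]
    return np.float16(half * np.float16(256.0))

def unpack4_fp8(word):
    """Undo pack4_fp8: pull out each byte and dequantize it to FP16."""
    return [fp8_e4m3_to_fp16((word >> (8 * i)) & 0xFF) for i in range(4)]

packed = pack4_fp8([0x38, 0xB8, 0x40, 0x44])    # 1.0, -1.0, 2.0, 3.0 in E4M3
print([float(x) for x in unpack4_fp8(packed)])  # [1.0, -1.0, 2.0, 3.0]
```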
+
+ #### FP8 Advantages on NVIDIA GPUs
+
+ On newer NVIDIA GPUs (4090/H100 or later), dedicated FP8 tensor cores can operate on FP8 directly, maximizing performance. Older GPUs lack this hardware, so activations cannot usefully be kept in FP8 there. The Marlin kernel addresses this gap by dequantizing only the weights and leaving activations in FP16, enabling real gains on Ampere cards (e.g., 3090, A100) that have no native FP8 support.
+
+ Traditional int8 quantization methods can spend significant time converting between int8 and FP16 during inference, which eats into their efficiency. Marlin's FP8 kernel keeps this cost low: because FP8 shares the exponent/mantissa structure of FP16/BF16, weights can be expanded to FP16 with a handful of bit operations (as sketched above), and all computation then proceeds in FP16.
+
+ ### Optimizations in the Marlin Kernel
+
+ The Marlin kernel is finely tuned for performance, employing several innovative techniques:
+
+ - **Asynchronous Global Weight Loads**: Uses non-blocking `cuda::memcpy_async` (available since Ampere) to load weights directly into shared memory, overlapping data transfers with computation to hide latency.
+
+ - **Circular Shared Memory Queue**: A cyclic buffer lets tiles be loaded, processed, and retired continuously, so compute never stalls waiting for data (a CPU-side sketch of this pipelining idea follows below).
+
+ - **Optimized Task Scheduling and Synchronization**: Utilizes Stream-K parallelization with non-uniform partitioning, spreading work evenly across Streaming Multiprocessors (SMs) and minimizing idle time.
+
+ Together, these optimizations let Ampere GPUs such as the 3090 and A100 capture most of the speedup that FP8 weights offer, making the Marlin kernel highly effective on pre-Ada cards.
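
As a rough, CPU-side analogy only (this is not the kernel's code and ignores GPU specifics), the scheduling idea behind the circular queue can be pictured as a small ring of in-flight tile loads that is refilled while the current tile is being multiplied:

```python
# Toy CPU analogy of circular-queue pipelining: keep `depth` tile "loads" (here, a
# stand-in dequantization step) in flight while computing on the oldest one. The real
# kernel does this on-GPU with cuda::memcpy_async into shared memory.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def dequant_tile(tile_u8):
    """Stand-in for expanding an FP8 weight tile to FP16 while it is being loaded."""
    return tile_u8.astype(np.float16) / np.float16(16.0)

def pipelined_matmul(a_tiles, b, depth=2):
    """Multiply each A tile by B, overlapping the next tile loads with compute."""
    with ThreadPoolExecutor(max_workers=depth) as pool:
        ring = [pool.submit(dequant_tile, t) for t in a_tiles[:depth]]  # prefill the ring
        out = []
        for i in range(len(a_tiles)):
            tile = ring[i % depth].result()                             # drain the oldest slot
            if i + depth < len(a_tiles):
                ring[i % depth] = pool.submit(dequant_tile, a_tiles[i + depth])  # refill it
            out.append(tile @ b)                                        # compute overlaps pending loads
    return np.concatenate(out, axis=0)

a_tiles = [np.random.randint(0, 256, (64, 128), dtype=np.uint8) for _ in range(8)]
b = np.random.randn(128, 32).astype(np.float16)
print(pipelined_matmul(a_tiles, b).shape)  # (512, 32)
```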
+
+ ### FP8 Marlin Details
+
+ - Developed by [Michael Goin and the Neural Magic team](https://github.com/vllm-project/vllm/pull/5975), FP8 Marlin is specifically designed for NVIDIA's GPU architecture, providing a compact, high-performance format.
+ - FP8 achieves nearly lossless compression, making it suitable for scenarios where quantization errors in traditional int4 or int8 formats might degrade performance.
+
+ ### Why FP8?
+
+ This FP8-quantized model was uploaded to explore high-precision quantization for code. Traditional int4 quantization, as seen in models like `Qwen/Qwen2.5-Coder-32B-Instruct-int4`, can sometimes produce poor outputs with repeated tokens due to quantization errors. In contrast, FP8 requires no calibration data and delivers robust, near-lossless compression.
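
As a rough illustration of why no calibration set is needed, the sketch below round-trips a weight matrix through FP8 E4M3 using only per-channel absolute-max scales (requires PyTorch >= 2.1 for the `float8_e4m3fn` dtype). This is a generic recipe for illustration, not necessarily the exact one used to produce this checkpoint.

```python
# Calibration-free, weight-only FP8 sketch: scale each output channel by its own
# abs-max, cast to E4M3, and dequantize back to FP16. Illustrative only.
import torch

def quantize_weight_fp8(w: torch.Tensor):
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 448.0  # 448 = max normal E4M3 value
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)                         # round-to-nearest cast, no data needed
    return w_fp8, scale

def dequantize_weight(w_fp8: torch.Tensor, scale: torch.Tensor):
    return w_fp8.to(torch.float16) * scale.to(torch.float16)

w = torch.randn(1024, 1024)
w_fp8, scale = quantize_weight_fp8(w)
err = (dequantize_weight(w_fp8, scale).float() - w).abs().mean()
print(f"mean abs reconstruction error: {err.item():.6f}")               # small relative to the weights
```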

+ As shown in Neural Magic's recent paper ([arXiv:2411.02355](https://arxiv.org/pdf/2411.02355)), int4 has limited fidelity recovery from FP16 without careful calibration. FP8, especially in the W8A16 format used here, maintains high-quality outputs without calibration, making it ideal for high-precision applications such as code generation.