Upload README.md with huggingface_hub
README.md (CHANGED)
@@ -65,16 +65,40 @@ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**

These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint. A short download-and-run sketch follows the list.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
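As a concrete starting point, the sketch below downloads the IQ3_XS file listed later in this README and runs it CPU-only with `llama-cpp-python`. The `repo_id` is a placeholder for wherever these GGUF files are hosted, and the context size and thread count are illustrative assumptions for a low-memory machine, not recommended settings.

```python
# Minimal sketch: fetch the IQ3_XS quant and run it on a low-memory CPU.
# Assumes `pip install huggingface_hub llama-cpp-python`.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

REPO_ID = "your-namespace/Qwen2.5-7B-Instruct-GGUF"   # placeholder repo id
FILENAME = "Qwen2.5-7B-Instruct-iq3_xs.gguf"          # file described below

model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,    # small context keeps the KV cache (and RAM use) down
    n_threads=4,   # tune to the device's cores
)

out = llm("Briefly explain what 3-bit quantization trades away.", max_tokens=128)
print(out["choices"][0]["text"])
```

Keeping `n_ctx` small matters on these devices: the KV cache grows with context length and can erode the memory savings of the 3-bit weights.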
---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, at the cost of accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
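To make the table concrete, here is an illustrative helper that maps a rough memory budget to one of the quantization levels above. The thresholds are rough assumptions for a ~7B model (not measured sizes from this repository), and the Q6_K/Q4_K filenames are hypothetical, since only some of the files are listed below.

```python
# Illustrative sketch: pick a quant level from a rough memory budget (GiB).
# Thresholds are assumptions for a ~7B model, not measured file sizes.
def pick_quant(mem_gib: float, arm_cpu: bool = False) -> str:
    if mem_gib >= 10:
        return "Qwen2.5-7B-Instruct-q8_0.gguf"       # best accuracy among the quants
    if mem_gib >= 8:
        return "Qwen2.5-7B-Instruct-q6_k.gguf"       # hypothetical filename for the Q6_K build
    if mem_gib >= 6:
        return ("Qwen2.5-7B-Instruct-q4_0.gguf" if arm_cpu
                else "Qwen2.5-7B-Instruct-q4_k.gguf")  # hypothetical filename for the Q4_K build
    return "Qwen2.5-7B-Instruct-iq3_xs.gguf"         # extreme memory efficiency, lower accuracy

print(pick_quant(5.0))                 # -> Qwen2.5-7B-Instruct-iq3_xs.gguf
print(pick_quant(7.0, arm_cpu=True))   # -> Qwen2.5-7B-Instruct-q4_0.gguf
```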

---

## **Included Files & Details**

@@ -109,10 +133,22 @@ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.
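If you want to verify the per-layer scheme described above (Q8_0 for output and embeddings, Q6_K elsewhere), the `gguf` Python package that ships with llama.cpp can read the tensor metadata from a downloaded file. A minimal sketch, with a hypothetical local filename:

```python
# Sketch: list the quantization type of each tensor in a GGUF file.
# Assumes `pip install gguf` and a locally downloaded model file.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("Qwen2.5-7B-Instruct-q6_k.gguf")  # hypothetical local path

# Count tensors by quantization type: expect mostly Q6_K, with Q8_0 (and some
# F32 norm tensors) making up the rest.
print(Counter(t.tensor_type.name for t in reader.tensors))

# Show the output head and token-embedding tensors explicitly.
for t in reader.tensors:
    if t.name in ("output.weight", "token_embd.weight"):
        print(t.name, t.tensor_type.name)
```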

### `Qwen2.5-7B-Instruct-q8_0.gguf`
- Fully **Q8** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.
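For the Q8_0 file, a chat-style call with `llama-cpp-python` might look like the sketch below. Recent versions pick up the chat template from the GGUF metadata; the sampling settings and GPU-offload flag are illustrative, not recommendations.

```python
# Sketch: chat-style inference against the Q8_0 quant with llama-cpp-python.
# The Q8_0 file is the largest of the quants here, so budget RAM/VRAM accordingly.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-7B-Instruct-q8_0.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers if a GPU is available; use 0 for CPU-only
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the trade-off between Q8_0 and Q6_K."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])
```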

### `Qwen2.5-7B-Instruct-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `Qwen2.5-7B-Instruct-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `Qwen2.5-7B-Instruct-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- If you need better accuracy, prefer **IQ4_NL**.
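On an ARM CPU (for example a Raspberry Pi or a small ARM cloud instance), a CPU-only run of the Q4_0 file could look like this. Thread count and context size are assumptions to tune per device; llama.cpp selects its ARM-optimized kernels automatically when the build supports them.

```python
# Sketch: CPU-only run of the Q4_0 quant on an ARM device.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-7B-Instruct-q4_0.gguf",
    n_ctx=1024,      # keep the KV cache small on low-memory boards
    n_threads=4,     # match the number of performance cores
    n_gpu_layers=0,  # CPU only
)

out = llm("Name three on-device uses for a 4-bit quantized LLM.", max_tokens=96)
print(out["choices"][0]["text"])
```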

# <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>