Upload README.md with huggingface_hub
README.md (CHANGED)
@@ -65,16 +65,40 @@ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**

These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint. A short download-and-run sketch follows the list.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
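As a concrete starting point, the sketch below downloads the IQ3_XS file listed later in this README and runs it CPU-only with `llama-cpp-python`. The `repo_id` is a placeholder for wherever these GGUF files are hosted, and the context size and thread count are illustrative assumptions for a low-memory machine, not recommended settings.

```python
# Minimal sketch: fetch the IQ3_XS quant and run it on a low-memory CPU.
# Assumes `pip install huggingface_hub llama-cpp-python`.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

REPO_ID = "your-namespace/Qwen2.5-7B-Instruct-GGUF"   # placeholder repo id
FILENAME = "Qwen2.5-7B-Instruct-iq3_xs.gguf"          # file described below

model_path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

llm = Llama(
    model_path=model_path,
    n_ctx=2048,    # small context keeps the KV cache (and RAM use) down
    n_threads=4,   # tune to the device's cores
)

out = llm("Briefly explain what 3-bit quantization trades away.", max_tokens=128)
print(out["choices"][0]["text"])
```

Keeping `n_ctx` small matters on these devices: the KV cache grows with context length and can erode the memory savings of the 3-bit weights.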
---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Best for memory-constrained environments |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
| **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
| **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency, at the cost of accuracy |
| **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
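To make the table concrete, here is an illustrative helper that maps a rough memory budget to one of the quantization levels above. The thresholds are rough assumptions for a ~7B model (not measured sizes from this repository), and the Q6_K/Q4_K filenames are hypothetical, since only some of the files are listed below.

```python
# Illustrative sketch: pick a quant level from a rough memory budget (GiB).
# Thresholds are assumptions for a ~7B model, not measured file sizes.
def pick_quant(mem_gib: float, arm_cpu: bool = False) -> str:
    if mem_gib >= 10:
        return "Qwen2.5-7B-Instruct-q8_0.gguf"       # best accuracy among the quants
    if mem_gib >= 8:
        return "Qwen2.5-7B-Instruct-q6_k.gguf"       # hypothetical filename for the Q6_K build
    if mem_gib >= 6:
        return ("Qwen2.5-7B-Instruct-q4_0.gguf" if arm_cpu
                else "Qwen2.5-7B-Instruct-q4_k.gguf")  # hypothetical filename for the Q4_K build
    return "Qwen2.5-7B-Instruct-iq3_xs.gguf"         # extreme memory efficiency, lower accuracy

print(pick_quant(5.0))                 # -> Qwen2.5-7B-Instruct-iq3_xs.gguf
print(pick_quant(7.0, arm_cpu=True))   # -> Qwen2.5-7B-Instruct-q4_0.gguf
```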

---

## **Included Files & Details**

@@ -109,10 +133,22 @@ Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

- **Output & embeddings** quantized to **Q8_0**.
- All other layers quantized to **Q6_K**.
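If you want to verify the per-layer scheme described above (Q8_0 for output and embeddings, Q6_K elsewhere), the `gguf` Python package that ships with llama.cpp can read the tensor metadata from a downloaded file. A minimal sketch, with a hypothetical local filename:

```python
# Sketch: list the quantization type of each tensor in a GGUF file.
# Assumes `pip install gguf` and a locally downloaded model file.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("Qwen2.5-7B-Instruct-q6_k.gguf")  # hypothetical local path

# Count tensors by quantization type: expect mostly Q6_K, with Q8_0 (and some
# F32 norm tensors) making up the rest.
print(Counter(t.tensor_type.name for t in reader.tensors))

# Show the output head and token-embedding tensors explicitly.
for t in reader.tensors:
    if t.name in ("output.weight", "token_embd.weight"):
        print(t.name, t.tensor_type.name)
```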

### `Qwen2.5-7B-Instruct-q8_0.gguf`
- Fully **Q8** quantized model for better accuracy.
- Requires **more memory** but offers higher precision.
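For the Q8_0 file, a chat-style call with `llama-cpp-python` might look like the sketch below. Recent versions pick up the chat template from the GGUF metadata; the sampling settings and GPU-offload flag are illustrative, not recommendations.

```python
# Sketch: chat-style inference against the Q8_0 quant with llama-cpp-python.
# The Q8_0 file is the largest of the quants here, so budget RAM/VRAM accordingly.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-7B-Instruct-q8_0.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # offload all layers if a GPU is available; use 0 for CPU-only
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the trade-off between Q8_0 and Q6_K."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])
```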

### `Qwen2.5-7B-Instruct-iq3_xs.gguf`
- **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
- Best for **ultra-low-memory devices**.

### `Qwen2.5-7B-Instruct-iq3_m.gguf`
- **IQ3_M** quantization, offering a **medium block size** for better accuracy.
- Suitable for **low-memory devices**.

### `Qwen2.5-7B-Instruct-q4_0.gguf`
- Pure **Q4_0** quantization, optimized for **ARM devices**.
- Best for **low-memory environments**.
- If you need better accuracy, prefer **IQ4_NL**.
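On an ARM CPU (for example a Raspberry Pi or a small ARM cloud instance), a CPU-only run of the Q4_0 file could look like this. Thread count and context size are assumptions to tune per device; llama.cpp selects its ARM-optimized kernels automatically when the build supports them.

```python
# Sketch: CPU-only run of the Q4_0 quant on an ARM device.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen2.5-7B-Instruct-q4_0.gguf",
    n_ctx=1024,      # keep the KV cache small on low-memory boards
    n_threads=4,     # match the number of performance cores
    n_gpu_layers=0,  # CPU only
)

out = llm("Name three on-device uses for a 4-bit quantized LLM.", max_tokens=96)
print(out["choices"][0]["text"])
```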

# <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>