Mungert committed on
Commit d2f94e6 · verified · 1 Parent(s): 06a02f8

Upload README.md with huggingface_hub

Files changed (1): README.md +43 -7
README.md CHANGED
@@ -65,16 +65,40 @@ Quantization reduces model size and memory usage while maintaining as much accur
 
 ---
 
+ ### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**
+ These models are optimized for **extreme memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.
+ 
+ - **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **extreme memory efficiency**.
+   - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
+   - **Trade-off**: Lower accuracy compared to higher-bit quantizations.
+ 
+ - **IQ3_S**: Small block size for **maximum memory efficiency**.
+   - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.
+ 
+ - **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
+   - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.
+ 
+ - **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
+   - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.
+ 
+ - **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
+   - **Use case**: Best for **ARM-based devices** or **low-memory environments**.
+ 
+ ---
+ 
 ### **Summary Table: Model Format Selection**
 
 | Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
 |--------------|------------|---------------|----------------------|---------------|
- | **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
- | **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
- | **Q4_K** | Low | Very Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
- | **Q6_K** | Medium Low | Low | CPU with more memory | Better accuracy while still being quantized |
- | **Q8** | Medium | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
+ | **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory |
+ | **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn’t available |
+ | **Q4_K** | Medium Low | Low | CPU or Low-VRAM devices | Best for memory-constrained environments |
+ | **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy while still being quantized |
+ | **Q8_0** | High | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models |
+ | **IQ3_XS** | Very Low | Very Low | Ultra-low-memory devices | Extreme memory efficiency and low accuracy |
+ | **Q4_0** | Low | Low | ARM or low-memory devices | llama.cpp can optimize for ARM devices |
 
+ ---
 
 ## **Included Files & Details**
 
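The quantization tiers added in this hunk are easiest to compare by loading one of the files and watching the memory footprint. Below is a minimal sketch, assuming the llama-cpp-python bindings and a hypothetical local path; neither is part of the commit, which only documents llama.cpp-compatible GGUF files.

```python
# Minimal sketch, assuming the llama-cpp-python bindings (pip install llama-cpp-python).
# The model path is hypothetical: point it at whichever quant fits your memory budget,
# e.g. the iq3_xs file for ultra-low-memory devices or the q8_0 file for best accuracy.
from llama_cpp import Llama

llm = Llama(
    model_path="./Qwen2.5-7B-Instruct-q4_0.gguf",  # hypothetical local path
    n_ctx=4096,     # context window; lowering it reduces memory use further
    n_threads=8,    # CPU threads used for inference
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain the trade-off between Q4_K and IQ3_XS."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

Swapping the file name is the only change needed to move between the rows of the summary table above.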
 
 
@@ -109,10 +133,22 @@ Quantization reduces model size and memory usage while maintaining as much accur
 - **Output & embeddings** quantized to **Q8_0**.
 - All other layers quantized to **Q6_K** .
 
-
 ### `Qwen2.5-7B-Instruct-q8_0.gguf`
 - Fully **Q8** quantized model for better accuracy.
- - Requires **more memory** but offers higher precision
+ - Requires **more memory** but offers higher precision.
+ 
+ ### `Qwen2.5-7B-Instruct-iq3_xs.gguf`
+ - **IQ3_XS** quantization, optimized for **extreme memory efficiency**.
+ - Best for **ultra-low-memory devices**.
+ 
+ ### `Qwen2.5-7B-Instruct-iq3_m.gguf`
+ - **IQ3_M** quantization, offering a **medium block size** for better accuracy.
+ - Suitable for **low-memory devices**.
+ 
+ ### `Qwen2.5-7B-Instruct-q4_0.gguf`
+ - Pure **Q4_0** quantization, optimized for **ARM devices**.
+ - Best for **low-memory environments**.
+ - Prefer IQ4_NL for better accuracy.
 
 # <span id="testllm" style="color: #7F7FFF;">🚀 If you find these models useful</span>
 
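Since the commit message notes the upload was made with huggingface_hub, a matching download sketch may help readers fetch one of the files described in this hunk. The repo_id below is a placeholder (the diff does not name the repository); hf_hub_download pulls a single file into the local cache.

```python
# Minimal sketch using huggingface_hub to fetch one of the quantized files listed above.
# The repo_id is a placeholder, not taken from the commit; substitute the actual repository.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="your-namespace/Qwen2.5-7B-Instruct-GGUF",  # placeholder repository id
    filename="Qwen2.5-7B-Instruct-iq3_m.gguf",          # one of the files described in the README
)
print(gguf_path)  # local path of the downloaded GGUF
```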