Crystalcareai
committed on
Update README.md
README.md
CHANGED
@@ -1,171 +1,45 @@
---
---

All quants made using imatrix option with dataset from [here](https://gist.github.com/bartowski1182/eb213dccb3571f863da82e99418f81e8)

## Prompt format

```
<|im_start|>system
{system_prompt}<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
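
As a minimal illustration (not part of the original card), the template above can be filled in from Python before handing the text to llama.cpp or another GGUF runtime; the `CHATML_TEMPLATE` and `build_prompt` names below are purely illustrative:

```
# Minimal sketch: build a prompt in the ChatML-style format shown above.
# CHATML_TEMPLATE and build_prompt are illustrative names, not from the card.
CHATML_TEMPLATE = (
    "<|im_start|>system\n{system_prompt}<|im_end|>\n"
    "<|im_start|>user\n{prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def build_prompt(system_prompt: str, prompt: str) -> str:
    """Return the full prompt string in the format the model expects."""
    return CHATML_TEMPLATE.format(system_prompt=system_prompt, prompt=prompt)

if __name__ == "__main__":
    print(build_prompt("You are a helpful assistant.", "Explain GGUF in one sentence."))
```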

## Download a file (not the whole branch) from below:

| Filename | Quant type | File Size | Split | Description |
| -------- | ---------- | --------- | ----- | ----------- |
| [Virtuoso-Small-f32.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/tree/main/Virtuoso-Small-f32) | f32 | 59.09GB | true | Full F32 weights. |
| [Virtuoso-Small-Q8_0.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q8_0.gguf) | Q8_0 | 15.70GB | false | Extremely high quality, generally unneeded but max available quant. |
| [Virtuoso-Small-Q6_K_L.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q6_K_L.gguf) | Q6_K_L | 12.50GB | false | Uses Q8_0 for embed and output weights. Very high quality, near perfect, *recommended*. |
| [Virtuoso-Small-Q6_K.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q6_K.gguf) | Q6_K | 12.12GB | false | Very high quality, near perfect, *recommended*. |
| [Virtuoso-Small-Q5_K_L.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q5_K_L.gguf) | Q5_K_L | 10.99GB | false | Uses Q8_0 for embed and output weights. High quality, *recommended*. |
| [Virtuoso-Small-Q5_K_M.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q5_K_M.gguf) | Q5_K_M | 10.51GB | false | High quality, *recommended*. |
| [Virtuoso-Small-Q5_K_S.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q5_K_S.gguf) | Q5_K_S | 10.27GB | false | High quality, *recommended*. |
| [Virtuoso-Small-Q4_K_L.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q4_K_L.gguf) | Q4_K_L | 9.57GB | false | Uses Q8_0 for embed and output weights. Good quality, *recommended*. |
| [Virtuoso-Small-Q4_K_M.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q4_K_M.gguf) | Q4_K_M | 8.99GB | false | Good quality, default size for most use cases, *recommended*. |
| [Virtuoso-Small-Q3_K_XL.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q3_K_XL.gguf) | Q3_K_XL | 8.61GB | false | Uses Q8_0 for embed and output weights. Lower quality but usable, good for low RAM availability. |
| [Virtuoso-Small-Q4_K_S.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q4_K_S.gguf) | Q4_K_S | 8.57GB | false | Slightly lower quality with more space savings, *recommended*. |
| [Virtuoso-Small-IQ4_NL.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-IQ4_NL.gguf) | IQ4_NL | 8.55GB | false | Similar to IQ4_XS, but slightly larger. Offers online repacking for ARM CPU inference. |
| [Virtuoso-Small-Q4_0.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q4_0.gguf) | Q4_0 | 8.54GB | false | Legacy format, offers online repacking for ARM CPU inference. |
| [Virtuoso-Small-Q4_0_8_8.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q4_0_8_8.gguf) | Q4_0_8_8 | 8.52GB | false | Optimized for ARM and AVX inference. Requires 'sve' support for ARM (see details below). *Don't use on Mac*. |
| [Virtuoso-Small-Q4_0_4_8.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q4_0_4_8.gguf) | Q4_0_4_8 | 8.52GB | false | Optimized for ARM inference. Requires 'i8mm' support (see details below). *Don't use on Mac*. |
| [Virtuoso-Small-Q4_0_4_4.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q4_0_4_4.gguf) | Q4_0_4_4 | 8.52GB | false | Optimized for ARM inference. Should work well on all ARM chips, not for use with GPUs. *Don't use on Mac*. |
| [Virtuoso-Small-IQ4_XS.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-IQ4_XS.gguf) | IQ4_XS | 8.12GB | false | Decent quality, smaller than Q4_K_S with similar performance, *recommended*. |
| [Virtuoso-Small-Q3_K_L.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q3_K_L.gguf) | Q3_K_L | 7.92GB | false | Lower quality but usable, good for low RAM availability. |
| [Virtuoso-Small-Q3_K_M.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q3_K_M.gguf) | Q3_K_M | 7.34GB | false | Low quality. |
| [Virtuoso-Small-IQ3_M.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-IQ3_M.gguf) | IQ3_M | 6.92GB | false | Medium-low quality, new method with decent performance comparable to Q3_K_M. |
| [Virtuoso-Small-Q3_K_S.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q3_K_S.gguf) | Q3_K_S | 6.66GB | false | Low quality, not recommended. |
| [Virtuoso-Small-Q2_K_L.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q2_K_L.gguf) | Q2_K_L | 6.53GB | false | Uses Q8_0 for embed and output weights. Very low quality but surprisingly usable. |
| [Virtuoso-Small-IQ3_XS.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-IQ3_XS.gguf) | IQ3_XS | 6.38GB | false | Lower quality, new method with decent performance, slightly better than Q3_K_S. |
| [Virtuoso-Small-Q2_K.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-Q2_K.gguf) | Q2_K | 5.77GB | false | Very low quality but surprisingly usable. |
| [Virtuoso-Small-IQ2_M.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-IQ2_M.gguf) | IQ2_M | 5.36GB | false | Relatively low quality, uses SOTA techniques to be surprisingly usable. |
| [Virtuoso-Small-IQ2_S.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-IQ2_S.gguf) | IQ2_S | 5.00GB | false | Low quality, uses SOTA techniques to be usable. |
| [Virtuoso-Small-IQ2_XS.gguf](https://huggingface.co/bartowski/Virtuoso-Small-GGUF/blob/main/Virtuoso-Small-IQ2_XS.gguf) | IQ2_XS | 4.70GB | false | Low quality, uses SOTA techniques to be usable. |

## Embed/output weights

Some of these quants (Q3_K_XL, Q4_K_L etc) are the standard quantization method with the embeddings and output weights quantized to Q8_0 instead of what they would normally default to.
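
If you want to verify this on a file you have downloaded, one way (a sketch, assuming the `gguf` Python package from llama.cpp is installed via `pip install gguf`, and using an example filename) is to print the quantization type of the embedding and output tensors:

```
# Sketch: list tensor quantization types in a GGUF file with the `gguf` package.
# The filename below is an example; point it at whichever quant you downloaded.
from gguf import GGUFReader

reader = GGUFReader("Virtuoso-Small-Q4_K_L.gguf")
for tensor in reader.tensors:
    # Embedding and output tensors are typically named token_embd.weight / output.weight.
    if "token_embd" in tensor.name or tensor.name.startswith("output."):
        print(tensor.name, tensor.tensor_type)
```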

## Downloading using huggingface-cli

<details>
<summary>Click to view download instructions</summary>

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download bartowski/Virtuoso-Small-GGUF --include "Virtuoso-Small-Q4_K_M.gguf" --local-dir ./
```

If the model is bigger than 50GB, it will have been split into multiple files. In order to download them all to a local folder, run:

```
huggingface-cli download bartowski/Virtuoso-Small-GGUF --include "Virtuoso-Small-Q8_0/*" --local-dir ./
```

You can either specify a new local-dir (Virtuoso-Small-Q8_0) or download them all in place (./)
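
If you prefer scripting the download, the same thing can be done from Python with the `huggingface_hub` library (a sketch equivalent to the CLI calls above; here the split f32 folder is used as the multi-file example, per the table):

```
# Sketch: download quants from Python with huggingface_hub instead of the CLI.
from huggingface_hub import hf_hub_download, snapshot_download

# Single file, e.g. the Q4_K_M quant:
path = hf_hub_download(
    repo_id="bartowski/Virtuoso-Small-GGUF",
    filename="Virtuoso-Small-Q4_K_M.gguf",
    local_dir=".",
)
print("downloaded to", path)

# For a quant split across multiple files, grab everything under its folder:
snapshot_download(
    repo_id="bartowski/Virtuoso-Small-GGUF",
    allow_patterns=["Virtuoso-Small-f32/*"],
    local_dir=".",
)
```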

</details>

## Q4_0_X_X information

New: Thanks to efforts made to have online repacking of weights in [this PR](https://github.com/ggerganov/llama.cpp/pull/9921), you can now just use Q4_0 if your llama.cpp has been compiled for your ARM device.

Similarly, if you want to get slightly better performance, you can use IQ4_NL thanks to [this PR](https://github.com/ggerganov/llama.cpp/pull/10541), which will also repack the weights for ARM, though only the 4_4 variant for now. The loading time may be slower, but it will result in an overall speed increase.

<details>
<summary>Click to view Q4_0_X_X information</summary>

These are *NOT* for Metal (Apple) or GPU (nvidia/AMD/intel) offloading, only ARM chips (and certain AVX2/AVX512 CPUs).

If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons [on the original pull request](https://github.com/ggerganov/llama.cpp/pull/5780#pullrequestreview-21657544660)

To check which one would work best for your ARM chip, you can check [AArch64 SoC features](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) (thanks EloyOn!).
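
On Linux you can also read the CPU feature flags directly instead of looking up your SoC. The snippet below is a small, purely illustrative sketch (it assumes `/proc/cpuinfo` is available, so Linux only) that checks for the features mentioned in this section:

```
# Sketch: check Linux CPU flags to see which Q4_0_X_X variant your hardware suits.
# Purely illustrative; /proc/cpuinfo is Linux-only.
def cpu_flags() -> set[str]:
    with open("/proc/cpuinfo") as f:
        for line in f:
            # x86 kernels call the field "flags", ARM kernels call it "Features".
            if line.startswith(("flags", "Features")):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("sve  (Q4_0_8_8 on ARM):", "sve" in flags)
print("i8mm (Q4_0_4_8 on ARM):", "i8mm" in flags)
print("avx2 (Q4_0_8_8 on x86):", "avx2" in flags)
print("avx512f:", "avx512f" in flags)
```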

If you're using a CPU that supports AVX2 or AVX512 (typically server CPUs and AMD's latest Zen 5 CPUs) and are not offloading to a GPU, the Q4_0_8_8 may offer a nice speed boost as well:

<details>
<summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.

</details>

</details>

## Which file should I choose?

<details>
<summary>Click here for details</summary>

A great write-up with charts showing various performances is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.
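
As a back-of-the-envelope example of that rule of thumb (illustrative numbers only; the file sizes referenced come from the table above):

```
# Sketch: apply the 1-2GB headroom rule of thumb with illustrative numbers.
vram_gb = 24.0   # example GPU
ram_gb = 32.0    # example system RAM

speed_budget = vram_gb - 2              # whole model in VRAM: files under ~22GB
quality_budget = vram_gb + ram_gb - 2   # split across RAM+VRAM: files under ~54GB

print(f"Speed-first budget:   ~{speed_budget:.0f}GB  -> Q8_0 (15.70GB) fits comfortably")
print(f"Quality-first budget: ~{quality_budget:.0f}GB -> any quant except f32 (59.09GB)")
```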

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.

If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.

If you want to get more into the weeds, you can check out this extremely useful feature chart:

[llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix)

But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.

These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
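
The guidance above can be condensed into a small rule-of-thumb helper. This is just a sketch that encodes the advice in this section, nothing authoritative:

```
# Sketch: condense the quant-choice advice above into one rule-of-thumb function.
def suggest_family(bits: int, backend: str) -> str:
    """backend: 'cuda' (cuBLAS), 'rocm' (rocBLAS), 'cpu', or 'metal'."""
    if bits < 4 and backend in ("cuda", "rocm"):
        return "I-quant (e.g. IQ3_M) - better quality for the size on these backends"
    if bits < 4 and backend in ("cpu", "metal"):
        return "I-quant works but is slower here; K-quant (e.g. Q3_K_M) if speed matters"
    return "K-quant (e.g. Q5_K_M / Q4_K_M) - the simple default"

print(suggest_family(3, "cuda"))
print(suggest_family(5, "cpu"))
```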

</details>

Want to support my work? Visit my ko-fi page here: https://ko-fi.com/bartowski

---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-14B
---

<div align="center">
<img src="https://i.ibb.co/pXD6Bcv/SW2-U-g-QQLSH1-ZAbxhs-Iu-A.webp" alt="Virtuoso-Small" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 100%; height: auto;">
</div>

# Virtuoso-Small

Virtuoso-Small is the debut public release of the Virtuoso series of models by Arcee.ai, designed to bring cutting-edge generative AI capabilities to organizations and developers in a compact, efficient form. With 14 billion parameters, Virtuoso-Small is an accessible entry point for high-quality instruction-following, complex reasoning, and business-oriented generative AI tasks. Its larger siblings, Virtuoso-Forte and Virtuoso-Prime, offer even greater capabilities and are available via API at [models.arcee.ai](https://models.arcee.ai).

## Performance Benchmarks

| **Groups** | **Metric** | ↑ | **Value** | ± | **Stderr** |
|---------------------------|--------------------------|---|----------:|----|-----------:|
| **Leaderboard** | **Accuracy** | ↑ | 0.5194 | ± | 0.0046 |
| | Normalized Accuracy | ↑ | 0.5814 | ± | 0.0051 |
| | Exact Match | ↑ | 0.3006 | ± | 0.0117 |
| | Instruction-Level Loose Accuracy | ↑ | 0.8489 | ± | N/A |
| | Instruction-Level Strict Accuracy | ↑ | 0.8249 | ± | N/A |
| | Prompt-Level Loose Accuracy | ↑ | 0.7856 | ± | 0.0177 |
| | Prompt-Level Strict Accuracy | ↑ | 0.7523 | ± | 0.0186 |
| **Leaderboard-BBH** | Normalized Accuracy | ↑ | 0.6516 | ± | 0.0058 |
| **Leaderboard-GPQA** | Normalized Accuracy | ↑ | 0.3389 | ± | 0.0137 |
| **Leaderboard-Math-Hard** | Exact Match | ↑ | 0.3006 | ± | 0.0117 |
| **Leaderboard-MuSR** | Normalized Accuracy | ↑ | 0.4286 | ± | 0.0175 |

---
## Key Features

- **Compact and Efficient**: With 14 billion parameters, Virtuoso-Small provides a high-performance solution optimized for smaller hardware configurations without sacrificing quality.
- **Business-Oriented**: Tailored for use cases such as customer support, content creation, and technical assistance, Virtuoso-Small meets the demands of modern enterprises.
- **Scalable Ecosystem**: Part of the Virtuoso series, Virtuoso-Small is fully interoperable with its larger siblings, Forte and Prime, enabling seamless scaling as your needs grow.

---

## Deployment Options

Virtuoso-Small is available under the Apache-2.0 license and can be deployed locally or accessed through an API at [models.arcee.ai](https://models.arcee.ai). For larger-scale or more demanding applications, consider Virtuoso-Forte or Virtuoso-Prime.
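
For local deployment, a minimal Hugging Face Transformers sketch is shown below. The repo id `arcee-ai/Virtuoso-Small` is an assumption for illustration; substitute the actual model id if it differs, and note that `device_map="auto"` assumes the `accelerate` package is installed:

```
# Minimal local-inference sketch with Hugging Face Transformers.
# The repo id below is an assumption; substitute the actual Virtuoso-Small model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/Virtuoso-Small"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Draft a two-sentence product update for our support team."},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```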