---
language:
- en
---

# Status 2024-12-17: huggingface has severely limited my uploads, so requests can be delayed.

HF seems uninterested in doing something about it, so this is likely going to stay.

# To request a quant, open a new discussion in the Community tab (if possible with the full url somewhere in the title or body)

You can see the current quant status at http://hf.tst.eu/status.html

---

# Mini-FAQ

## I miss model XXX

I am not the only one making quants. For example, **Lewdiculous** makes high-quality imatrix quants of many
small models *and has a great presentation*. I often don't bother with imatrix quants for small models (< 30B), or skip them
because I saw that others already did them, to avoid duplicating work.

Some other notable people who do quants are **Nexesenex**, **bartowski**, **RichardErkhov**, **dranger003** and **Artefact2**.
I'm not saying anything about the quality of their quants, because I have probably forgotten some really good folks in this list,
and I wouldn't even know, anyway.
Model creators also often provide their own quants. I sometimes skip models because of that,
even if the creator might provide far fewer quants than me.

As always, feel free to request a quant, even if somebody else already did one, or request an imatrix version
for models where I didn't provide them.

## My community discussion is missing

Most likely you brought up problems with the model and I decided I had to either re-do or simply drop the quants.
In the past, I renamed the model (so you can see my reply), but the huggingface rename function is borked and leaves the files
available under their old name, keeping me from regenerating them (because my scripts can see them already existing).
The only fix seems to be to delete the repo, which unfortunately also deletes the community discussion.

## I miss quant type XXX

The quant types I currently do regularly are:

- static:  (f16) Q8_0 Q4_K_S Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS (Q4_0_4)
- imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M (IQ4_NL) Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S (Q4_0_4_4 Q4_0_4_8 Q4_0_8_8)

And they are generally (but not always) generated in the order above, for which there are deep reasons.

For models under 11B, I experimentally generate f16 versions at the moment (in the static repository).

For models under 19B, imatrix IQ4_NL quants will be generated, mostly for the benefit of ARM,
where they can give a speed benefit.

The (static) IQ3 quants are no longer generated, as they consistently seem to result in *much* lower quality
quants than even static Q2_K, so it would be a disservice to offer them. *Update*: That might no longer be true, and they might come back.

I specifically do not do Q2_K_S, because I generally think it is not worth it (IQ2_M usually being smaller and better, albeit slower),
and IQ4_NL, because it requires a lot of computing and is generally completely superseded by IQ4_XS.
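
For orientation, here is roughly how a static and an imatrix quant are produced with llama.cpp's command-line
tools. This is only a sketch, not my actual pipeline; the binary names and flags have changed between llama.cpp
versions, and the file names (`model-f16.gguf`, `imatrix-train.txt`) are just placeholders:

```python
# Sketch only: producing one static and one imatrix quant via llama.cpp's CLI tools.
# Binary names/flags vary between llama.cpp versions; file names are hypothetical.
import subprocess

SRC = "model-f16.gguf"  # full-precision source gguf

# static quant: just the quantizer, no extra data needed
subprocess.run(["llama-quantize", SRC, "model-Q4_K_M.gguf", "Q4_K_M"], check=True)

# imatrix quant: first compute the importance matrix from calibration text,
# then hand it to the quantizer
subprocess.run(["llama-imatrix", "-m", SRC, "-f", "imatrix-train.txt",
                "-o", "imatrix.dat"], check=True)
subprocess.run(["llama-quantize", "--imatrix", "imatrix.dat",
                SRC, "model-i1-Q4_K_M.gguf", "Q4_K_M"], check=True)
```

My real setup is essentially automation wrapped around invocations like these.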

Q8_0 imatrix quants do not exist - some quanters claim otherwise, but Q8_0 ggufs do not contain any tensor
type that uses the imatrix data, although technically it might be possible to do so.

Older models that pre-date the introduction of new quant types will generally have them retrofitted on request.

You can always try to change my mind about all this, but be prepared to bring convincing data.

## What does the "-i1" mean in "-i1-GGUF"?

"mradermacher imatrix type 1"

Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it
fares. That is, create an imatrix from a bad quant (e.g. static Q2_K), then use the resulting model to generate a
possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better than
changing it. If I make considerable changes to how I create imatrix data I will probably bump it to `-i2` and so on.

Since there is some subjectivity/choice in imatrix training data, this also distinguishes my quants from
quants by other people who made different choices.
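
Purely for illustration, the iteration I had in mind would have looked roughly like this. It was never
implemented, and today's -i1 imatrices are computed in a single pass; binary and file names are again only
placeholders:

```python
# Illustration of the never-implemented iterative imatrix idea behind the "-i1" name.
# In practice the imatrix is computed once; names/flags here are placeholders.
import subprocess

SRC, TRAIN = "model-f16.gguf", "imatrix-train.txt"

def run(*cmd):
    subprocess.run(cmd, check=True)

# pass 1: compute an imatrix cheaply, using a bad static quant as the model
run("llama-quantize", SRC, "model-Q2_K.gguf", "Q2_K")
run("llama-imatrix", "-m", "model-Q2_K.gguf", "-f", TRAIN, "-o", "imatrix.1.dat")

# pass 2: use that imatrix for a better quant, then recompute the imatrix on the
# improved model - and, in theory, repeat until it stops improving
run("llama-quantize", "--imatrix", "imatrix.1.dat", SRC, "model-pass2-Q4_K_S.gguf", "Q4_K_S")
run("llama-imatrix", "-m", "model-pass2-Q4_K_S.gguf", "-f", TRAIN, "-o", "imatrix.2.dat")
```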

## What is the imatrix training data you use, can I have a copy?

My training data consists of about 160k tokens, about half of which are semi-random tokens (sentence fragments)
taken from stories; the other half is kalomaze's groups_merged.txt and a few other things. I have a half and a quarter
set for models that are too big or too stubborn.

Neither my set nor kalomaze's data contain large amounts of non-english training data, which is why I tend to
not generate imatrix quants for models primarily meant for non-english usage. This is a trade-off, emphasizing
english over other languages. But from (sparse) testing data it looks as if this doesn't actually make a big
difference. More data are always welcome.

Unfortunately, I do not have the rights to publish the testing data, but I might be able to replicate an
equivalent set in the future and publish that.

## Why are you doing this?

Because at some point, I found that some new interesting models weren't available as GGUF anymore - my go-to
source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix,
only a few quant types, all of them very fast to generate.

I then looked into huggingface more closely than just as a download source, and decided uploading would be a
good thing, so others wouldn't have to redo the work on their own. I'm used to sharing most of the things I make
(mostly in free software), so it felt natural to contribute, even if only at a minor scale.

Then the number of quant types and their computational complexity exploded, and imatrix calculations became a thing.
This increased the time required to make such quants by an order of magnitude, and also the management overhead.

Since I was slowly improving my tooling, I grew into it at the same pace as these innovations came out. I probably
would not have started doing this a month later, as I would have been daunted by the complexity and work required.

## You have amazing hardware!?!?!

I regularly see people write that, but I probably have worse hardware than they do to create my quants. I currently
have access to eight servers with good upload speed. Five of them are Xeon quad-core class machines from ~2013, three are
Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big
models on the fast(er) servers.

Imatrix generation is done on my home/work/gaming computer, which received an upgrade to 96GB DDR5 RAM, and
originally had an RTX 4070 (now, again, upgraded to a 4090 thanks to a generous investment by the company I work for).

I have good download speeds, but bad upload speeds at home, so it's lucky that model downloads are big and imatrix
uploads are small.

## How do you create imatrix files for really big models?

Through a combination of these ingenious tricks:

1. I am not above using a low quant (e.g. Q4_K_S, IQ3_XS or even Q2_K), reducing the size of the model.
2. An nvme drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and
   then stream the remaining data from disk for every iteration.
3. Patience.

The few evaluations I have suggest that this gives good quality, and my current set-up allows me to
generate imatrix data for most models in fp16, 70B in Q8_0 and almost everything else in Q4_K_S.
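
To make point 2 above more concrete, here is a minimal sketch of the idea in Python: mmap the whole gguf, but
mlock only a fixed-size prefix, so everything past it streams from the nvme on access. It only illustrates the
principle - it is not my actual tooling, assumes Linux, and needs a sufficiently high memlock limit:

```python
# Minimal sketch: map a gguf read-only, but pin only its first LOCK_BYTES in RAM.
# Illustration of the principle only (Linux; needs "ulimit -l" high enough).
import ctypes, ctypes.util, mmap, os

LOCK_BYTES = 80 << 30  # roughly the first 80 GB stay resident

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = (ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long)
libc.mlock.argtypes = (ctypes.c_void_p, ctypes.c_size_t)

def map_with_prefix_lock(path):
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    addr = libc.mmap(None, size, mmap.PROT_READ, mmap.MAP_SHARED, fd, 0)
    os.close(fd)  # the mapping stays valid after closing the fd
    if addr is None or addr == ctypes.c_void_p(-1).value:  # NULL or MAP_FAILED
        raise OSError(ctypes.get_errno(), "mmap failed")
    if libc.mlock(addr, min(size, LOCK_BYTES)) != 0:
        raise OSError(ctypes.get_errno(), "mlock failed")
    return addr, size  # accesses beyond the locked prefix page in from disk
```

The point is simply that only the hot prefix needs to stay resident; the tail is re-read from the nvme on every
pass, which is what makes point 3 (patience) necessary.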

The trick to 3 is not actually having patience; the trick is to automate things to the point where you
normally don't have to wait for anything. For example, if all goes well, quantizing a model requires just
a single command (or less) for static quants, and for imatrix quants I need to select the source gguf
and then run another command which handles download/computation/upload. Most of the time, I only have
to do stuff when things go wrong (which, with llama.cpp being so buggy and hard to use,
is unfortunately very frequent).

## Why don't you use gguf-split?

TL;DR: I don't have the hardware/resources for that.

Long answer: gguf-split requires a full copy for every quant.
Unlike what many people think, my hardware is rather outdated and not very fast. The extra processing that gguf-split requires
either runs out of space on my systems with fast disk, or takes a very long time and a lot of I/O bandwidth on the slower
disks, all of which already run at their limits. Supporting gguf-split would mean making noticeably fewer quants with the same hardware.

While this is the blocking reason, I also find it less than ideal that yet another incompatible file format was created that
requires special tools to manage, instead of supporting the tens of thousands of existing quants, of which the vast majority
could just be mmapped together into memory from split files. That alone wouldn't keep me from supporting it, but it would have
been nice to look at the existing reality and/or consult the community before throwing yet another hard-to-support format out
there without thinking.

There are some developments to make this less of a pain, and I will revisit this issue from time to time to see if it has
become feasible.

Update 2024-07: llama.cpp probably has most of the features needed to make this a reality, but I haven't found time to test and implement it yet.

Update 2024-09: just looked at implementing it, and no, the problems that keep me from doing it are still there :(. Must have fantasized it!!?

## So who is mradermacher?

Nobody has asked this, but since there are people who really deserve mention, I'll put this here. "mradermacher" is just a
pseudonymous throwaway account I created to goof around with, but then started to use to quant models. A few months later, @nicoboss joined
and contributed hardware, power and general support - practically all imatrix computations are done on his computer(s).
Then @Guilherme34 started helping to get access to models, and @RichardErkhov first gave us the wondrous
FATLLAMA-1.7T, followed by access to his server to quant more models, likely to atone for his sins.

So you should consider "mradermacher" to be the team name for a fictional character called Michael Radermacher.
There are no connections to anything else on the internet, other than an mradermacher_hf account on reddit.