---
language:
- en
---
# Status 2024-12-17: huggingface has severely limited my uploads, so requests can be delayed.
HF seems uninterested in doing something about it, so this is likely going to stay.
# To request a quant, open a new discussion in the Community tab (if possible with the full url somewhere in the title or body)
You can see the current quant status at http://hf.tst.eu/status.html
---
# Mini-FAQ
## I miss model XXX
I am not the only one to make quants. For example, **Lewdiculous** makes high-quality imatrix quants of many
small models *and has a great presentation*. I often don't bother with imatrix quants for small models (< 30B), or skip them
because I saw others had already done them, to avoid duplicating work.
Some other notable people who do quants are **Nexesenex**, **bartowski**, **RichardErkhov**, **dranger003** and **Artefact2**.
I'm not saying anything about the quality of their quants, because I have probably forgotten some really good folks in this list,
and I wouldn't know anyway.
Model creators also often provide their own quants. I sometimes skip models because of that,
even if the creator might provide far fewer quants than I do.
As always, feel free to request a quant, even if somebody else already did one, or request an imatrix version
for models where I didn't provide them.
## My community discussion is missing
Most likely you brought up problems with the model and I decided I either have to re-do or simply drop the quants.
In the past, I renamed the model (so you could still see my reply), but the huggingface rename function is borked and leaves the files
available under their old name, keeping me from regenerating them (because my scripts see them as already existing).
The only fix seems to be to delete the repo, which unfortunately also deletes the community discussion.
## I miss quant type XXX
The quant types I currently do regularly are:
- static: (f16) Q8_0 Q4_K_S Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS (Q4_0_4)
- imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M (IQ4_NL) Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S (Q4_0_4_4 Q4_0_4_8 Q4_0_8_8)
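As a rough illustration only (the output paths are made up, and my actual pipeline is far more involved), producing the static quants boils down to one invocation of llama.cpp's `llama-quantize` per type, in the order listed:

```python
import shlex

# Hypothetical sketch: one llama.cpp quantize invocation per static quant
# type, in the order given above. Paths and file naming are assumptions.
STATIC_ORDER = ["Q8_0", "Q4_K_S", "Q2_K", "Q6_K", "Q3_K_M", "Q3_K_S",
                "Q3_K_L", "Q4_K_M", "Q5_K_S", "Q5_K_M", "IQ4_XS"]

def quantize_commands(src_gguf, out_prefix, types=STATIC_ORDER):
    """Build one llama-quantize command line per quant type."""
    return [
        f"llama-quantize {shlex.quote(src_gguf)} "
        f"{shlex.quote(out_prefix + '.' + t + '.gguf')} {t}"
        for t in types
    ]

for cmd in quantize_commands("model.f16.gguf", "model"):
    print(cmd)
```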
And they are generally (but not always) generated in the order above, for which there are deep reasons.
For models under 11B, I am currently experimenting with also generating f16 versions (in the static repository).
For models under 19B, imatrix IQ4_NL quants will be generated, mostly for the benefit of ARM,
where they can give a speed advantage.
The (static) IQ3 quants are no longer generated, as they consistently seemed to result in *much* lower quality
quants than even static Q2_K, so it would be a disservice to offer them. *Update*: That might no longer be true, and they might come back.
I specifically do not do Q2_K_S, because I generally think it is not worth it (IQ2_M usually being smaller and better, albeit slower),
and IQ4_NL, because it requires a lot of computing and is generally completely superseded by IQ4_XS.
Q8_0 imatrix quants do not exist - some quanters claim otherwise, but Q8_0 ggufs do not contain any tensor
type that uses the imatrix data, although technically it might be possible to do so.
Older models that pre-date the introduction of new quant types will generally have them retrofitted on request.
You can always try to change my mind about all this, but be prepared to bring convincing data.
## What does the "-i1" mean in "-i1-GGUF"?
"mradermacher imatrix type 1"
Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it
fares. That is, create an imatrix from a bad quant (e.g. static Q2_K), then use the resulting model to generate a
possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better
than changing it. If I make considerable changes to how I create imatrix data, I will probably bump it to `-i2` and so on.
Since there is some subjectivity/choice in imatrix training data, this also distinguishes my quants from
quants by other people who made different choices.
## What is the imatrix training data you use, can I have a copy?
My training data consists of about 160k tokens, about half of which are semi-random tokens (sentence fragments)
taken from stories; the other half is kalomaze's groups_merged.txt and a few other things. I have half- and quarter-size
sets for models that are too big or too stubborn.
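Purely as a sketch (the file contents and the whitespace tokenization are stand-ins, not my actual process), deriving such half- and quarter-size sets could look like:

```python
# Illustrative sketch only: derive half- and quarter-size calibration sets
# by truncating the full training text. The 160k figure mirrors the text;
# splitting on whitespace is a simplification of real tokenization.
def subsets(text, fractions=(1.0, 0.5, 0.25)):
    """Return progressively smaller prefixes of the training data."""
    toks = text.split()
    return {f: " ".join(toks[: int(len(toks) * f)]) for f in fractions}

full = "tok " * 160_000          # stand-in for ~160k tokens of mixed text
sets = subsets(full)
print(len(sets[0.5].split()), len(sets[0.25].split()))  # -> 80000 40000
```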
Neither my set nor kalomaze's data contain large amounts of non-English training data, which is why I tend
not to generate imatrix quants for models primarily meant for non-English usage. This is a trade-off, emphasizing
English over other languages. But from (sparse) testing data, it looks as if this doesn't actually make a big
difference. More data are always welcome.
Unfortunately, I do not have the rights to publish the testing data, but I might be able to replicate an
equivalent set in the future and publish that.
## Why are you doing this?
Because at some point, I found that some new interesting models weren't available as GGUF anymore - my go-to
source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix,
only a few quant types, all of them very fast to generate.
I then looked into huggingface more closely than just as a download source, and decided uploading would be a
good thing, so others wouldn't have to redo the work on their own. I'm used to sharing most of the things I make
(mostly as free software), so it felt natural to contribute, even at a minor scale.
Then the number of quant types and their computational complexity exploded, and imatrix calculations became a thing.
This increased the time required to make such quants by an order of magnitude, and also the management overhead.
Since I was slowly improving my tooling, I grew into it at the same pace as these innovations came out. I probably
would not have started doing this a month later, as I would have been daunted by the complexity and work required.
## You have amazing hardware!?!?!
I regularly see people write that, but I probably have worse hardware than they do. I currently
have access to eight servers with good upload speed. Five of them are quad-core Xeon-class machines from ~2013, three are
Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big
models on the fast(er) servers.
Imatrix generation is done on my home/work/gaming computer, which received an upgrade to 96GB DDR5 RAM, and
originally had an RTX 4070 (now, again, upgraded to a 4090 due to a generous investment of the company I work for).
I have good download speeds, but bad upload speeds at home, so it's lucky that model downloads are big and imatrix
uploads are small.
## How do you create imatrix files for really big models?
Through a combination of these ingenious tricks:
1. I am not above using a low quant (e.g. Q4_K_S, IQ3_XS or even Q2_K), reducing the size of the model.
2. An nvme drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and
then stream the remaining data from disk for every iteration.
3. Patience.
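Trick 2 could be sketched roughly like this (a toy Python illustration, not my actual tooling; a real setup would pin ~80GB with mlock(), while this harmless stand-in only hints with madvise, and it is POSIX-only):

```python
import mmap
import os
import tempfile

# Toy sketch of trick 2: map the model file and ask the kernel to keep a
# hot prefix resident, while the tail is re-read from disk on each pass.
# In practice the prefix would be ~80GB and pinned with mlock(); here we
# only hint with madvise(MADV_WILLNEED) so the example stays harmless.
def open_model(path, hot_bytes):
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    mm.madvise(mmap.MADV_WILLNEED, 0, min(hot_bytes, size))
    return mm

# demonstrate on a small temporary stand-in for a model file
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 65536)
mm = open_model(f.name, hot_bytes=16384)
print(len(mm))  # -> 65536
```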
The few evaluations I have suggest that this gives good quality, and my current set-up allows me to
generate imatrix data for most models in fp16, 70B models in Q8_0, and almost everything else in Q4_K_S.
The trick to 3 is not actually having patience; the trick is to automate things to the point where you
normally don't have to wait for anything. For example, if all goes well, quantizing a model requires just
a single command (or less) for static quants, and for imatrix quants I need to select the source gguf
and then run another command which handles download/computation/upload. Most of the time, I only have
to do stuff when things go wrong (which, with llama.cpp being so buggy and hard to use,
is unfortunately very frequent).
## Why don't you use gguf-split?
TL;DR: I don't have the hardware/resources for that.
Long answer: gguf-split requires a full extra copy of every quant.
Unlike what many people think, my hardware is rather outdated and not very fast. The extra processing that gguf-split requires
either runs out of space on my systems with fast disks, or takes a very long time and a lot of I/O bandwidth on the slower
disks, all of which already run at their limits. Supporting gguf-split would mean paying that cost for every single quant.
While this is the blocking reason, I also find it less than ideal that yet another incompatible file format was created that
requires special tools to manage, instead of supporting the tens of thousands of existing quants, the vast majority of which
could simply be mmapped together into memory from split files. That alone wouldn't keep me from supporting it, but it would have
been nice to look at the existing reality and/or consult the community before throwing yet another hard-to-support format out
there without thinking.
There are some developments to make this less of a pain, and I will revisit this issue from time to time to see if it has
become feasible.
Update 2024-07: llama.cpp probably has most of the features needed to make this a reality, but I haven't found time to test and implement it yet.
Update 2024-09: just looked at implementing it, and no, the problems that keep me from doing it are still there :(. Must have fantasized it!!?
## So who is mradermacher?
Nobody has asked this, but since there are people who really deserve a mention, I'll put this here. "mradermacher" is just a
pseudonymous throwaway account I created to goof around with, but then I started to quant models. A few months later, @nicoboss joined
and contributed hardware, power and general support - practically all imatrix computations are done on his computer(s).
Then @Guilherme34 started to help getting access to models, and @RichardErkhov first gave us the wondrous
FATLLAMA-1.7T, followed by access to his server to quant more models, likely to atone for his sins.
So you should consider "mradermacher" to be the team name for a fictional character called Michael Radermacher.
There are no connections to anything else on the internet, other than an mradermacher_hf account on reddit.