|
--- |
|
language: |
|
- en |
|
--- |
|
|
|
# To request a quant, open a new discussion in the Community tab (if possible with the full URL somewhere in the title or body) |
|
|
|
**You can search models, compare and download quants at https://hf.tst.eu/** |
|
|
|
**You can see the current quant status at https://hf.tst.eu/status.html** |
|
|
|
# huggingface has severely limited my uploads, so requests can be delayed. |
|
|
|
HF seems uninterested in doing anything about it, so this is likely going to stay. Some |
|
workarounds have been implemented, and they mostly work, but sometimes quants get delayed |
|
because no uploading is possible. |
|
|
|
--- |
|
|
|
# Mini-FAQ |
|
|
|
## I miss model XXX |
|
|
|
I am not the only one to make quants. For example, **Lewdiculous** makes high-quality imatrix quants of many |
|
small models *and has a great presentation*. I either don't bother with imatrix quants for small models (< 30B), or avoid them |
|
because I saw others already did them, avoiding double work. |
|
|
|
Some other notable people who do quants are **Nexesenex**, **bartowski**, **RichardErkhov**, **dranger003** and **Artefact2**. |
|
I'm not saying anything about the quality of their quants, because I probably forgot some really good folks in this list, |
|
and I wouldn't even know, anyways. |
|
Model creators also often provide their own quants. I sometimes skip models because of that, |
|
even if the creator might provide far fewer quants than me. |
|
|
|
As always, feel free to request a quant, even if somebody else already did one, or request an imatrix version |
|
for models where I didn't provide them. |
|
|
|
## My community discussion is missing |
|
|
|
Most likely you brought up problems with the model and I decided I either have to re-do or simply drop the quants. |
|
In the past, I renamed the model (so you can see my reply), but the huggingface rename function is borked and leaves the files |
|
available under their old name, keeping me from regenerating them (because my scripts see that they already exist). |
|
The only fix seems to be to delete the repo, which unfortunately also deletes the community discussion. |
|
|
|
## I miss quant type XXX |
|
|
|
The quant types I currently do regularly are: |
|
|
|
- static: (f16) Q8_0 Q4_K_S Q2_K Q6_K Q3_K_M Q3_K_S Q3_K_L Q4_K_M Q5_K_S Q5_K_M IQ4_XS (Q4_0_4) |
|
- imatrix: Q2_K Q4_K_S IQ3_XXS Q3_K_M (IQ4_NL) Q4_K_M IQ2_M Q6_K IQ4_XS Q3_K_S Q3_K_L Q5_K_S Q5_K_M Q4_0 IQ3_XS IQ3_S IQ3_M IQ2_XXS IQ2_XS IQ2_S IQ1_M IQ1_S (Q4_0_4_4 Q4_0_4_8 Q4_0_8_8) |
|
|
|
And they are generally (but not always) generated in the order above, for which there are deep reasons. |
|
|
|
For models smaller than 11B, I experimentally generate f16 versions at the moment (in the static repository). |
|
|
|
For models smaller than 19B, imatrix IQ4_NL quants will be generated, mostly for the benefit of ARM, |
|
where it can give a speed benefit. |
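
The two size thresholds above can be written down as a small rule. This is only a sketch of the rule as described in this FAQ, not my actual tooling: the function name `extra_quant_types` and the idea of passing the size as "billions of parameters" are assumptions made for illustration.

```python
from typing import Dict, List


def extra_quant_types(params_billion: float) -> Dict[str, List[str]]:
    """Return the size-dependent extra quants described in the FAQ.

    Hypothetical helper: only the 11B and 19B thresholds come from the text.
    """
    extras: Dict[str, List[str]] = {"static": [], "imatrix": []}
    if params_billion < 11:
        # Small models additionally get an f16 version in the static repo.
        extras["static"].append("f16")
    if params_billion < 19:
        # Small and mid-sized models additionally get imatrix IQ4_NL.
        extras["imatrix"].append("IQ4_NL")
    return extras


if __name__ == "__main__":
    for size in (7, 13, 70):
        print(size, extra_quant_types(size))
```

So a 7B model would get both extras, a 13B model only the imatrix IQ4_NL, and a 70B model neither.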
|
|
|
The (static) IQ3 quants are no longer generated, as they consistently seem to result in *much* lower quality |
|
quants than even static Q2_K, so it would be a disservice to offer them. *Update*: That might no longer be true, and they might come back. |
|
|
|
I specifically do not do Q2_K_S, because I generally think it is not worth it (IQ2_M usually being smaller and better, albeit slower), |
|
and IQ4_NL (beyond the smaller models mentioned above), because it requires a lot of computing and is generally completely superseded by IQ4_XS. |
|
|
|
Q8_0 imatrix quants do not exist - some quanters claim otherwise, but Q8_0 ggufs do not contain any tensor |
|
type that uses the imatrix data, although technically it might be possible to do so. |
|
|
|
Older models that pre-date introduction of new quant types generally will have them retrofitted on request. |
|
|
|
You can always try to change my mind about all this, but be prepared to bring convincing data. |
|
|
|
## What does the "-i1" mean in "-i1-GGUF"? |
|
|
|
"mradermacher imatrix type 1" |
|
|
|
Originally, I had the idea of using an iterative method of imatrix generation, and wanted to see how well it |
|
fares. That is, create an imatrix from a bad quant (e.g. static Q2_K), then use the new model to generate a |
|
possibly better imatrix. It never happened, but I think sticking to something, even if slightly wrong, is better |
|
than changing it. If I make considerable changes to how I create imatrix data I will probably bump it to `-i2` and so on. |
|
|
|
Since there is some subjectivity/choice in imatrix training data, this also distinguishes it from |
|
quants by other people who made different choices. |
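
If you want to tell the repositories apart in a script, the suffix can be parsed mechanically. The following is a minimal sketch, assuming that imatrix repositories end in `-iN-GGUF` and that anything else ending in `-GGUF` is a static repository. The function name and the example repository names are made up for illustration.

```python
import re
from typing import Optional

# Matches a trailing "-i<number>-GGUF", e.g. "-i1-GGUF" or a future "-i2-GGUF".
_IMATRIX_SUFFIX = re.compile(r"-i(\d+)-GGUF$")


def imatrix_generation(repo_name: str) -> Optional[int]:
    """Return the imatrix type number of a repo name, or None if there is none."""
    match = _IMATRIX_SUFFIX.search(repo_name)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical repository names, used only to exercise the function.
    for name in ("Example-7B-i1-GGUF", "Example-7B-GGUF"):
        print(name, imatrix_generation(name))
```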
|
|
|
## What is the imatrix training data you use, can I have a copy? |
|
|
|
My training data consists of about 160k tokens, about half of which is semi-random tokens (sentence fragments) |
|
taken from stories, the other half is kalomaze's groups_merged.txt and a few other things. I have a half and a quarter |
|
set for models that are too big or too stubborn. |
|
|
|
Neither my set nor kalomaze's data contain large amounts of non-English training data, which is why I tend to |
|
not generate imatrix quants for models primarily meant for non-English usage. This is a trade-off, emphasizing |
|
English over other languages. But from (sparse) testing data it looks as if this doesn't actually make a big |
|
difference. More data are always welcome. |
|
|
|
Unfortunately, I do not have the rights to publish the training data, but I might be able to replicate an |
|
equivalent set in the future and publish that. |
|
|
|
## Why are you doing this? |
|
|
|
Because at some point, I found that some new interesting models weren't available as GGUF anymore - my go-to |
|
source, TheBloke, had vanished. So I quantized a few models for myself. At the time, it was trivial - no imatrix, |
|
only a few quant types, all of them very fast to generate. |
|
|
|
I then looked into huggingface more closely than just as a download source, and decided uploading would be a |
|
good thing, so others don't have to redo the work on their own. I'm used to sharing most of the things I make |
|
(mostly in free software), so it felt natural to contribute, even at a minor scale. |
|
|
|
Then the number of quant types and their computational complexity exploded, and imatrix calculations became a thing. |
|
This increased the time required to make such quants by an order of magnitude, and the management overhead along with it. |
|
|
|
Since I was slowly improving my tooling I grew into it at the same pace as these innovations came out. I probably |
|
would not have started doing this a month later, as I would have been daunted by the complexity and work required. |
|
|
|
## You have amazing hardware!?!?! |
|
|
|
I regularly see people write that, but I probably have worse hardware than them to create my quants. I currently |
|
have access to eight servers that have good upload speed. Five of them are Xeon quad-core class machines from ~2013, three are |
|
Ryzen 5 hexacores. The faster the server, the less disk space it has, so I can't just put the big |
|
models on the fast(er) servers. |
|
|
|
Imatrix generation is done on my home/work/gaming computer, which received an upgrade to 96GB DDR5 RAM, and |
|
originally had an RTX 4070 (now, again, upgraded to a 4090 due to a generous investment by the company I work for). |
|
|
|
I have good download speeds, but bad upload speeds at home, so it's lucky that model downloads are big and imatrix |
|
uploads are small. |
|
|
|
## How do you create imatrix files for really big models? |
|
|
|
Through a combination of these ingenious tricks: |
|
|
|
1. I am not above using a low quant (e.g. Q4_K_S, IQ3_XS or even Q2_K), reducing the size of the model. |
|
2. An NVMe drive is "only" 25-50 times slower than RAM. I lock the first 80GB of the model in RAM, and |
|
then stream the remaining data from disk for every iteration. |
|
3. Patience. |
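
To put trick 2 into rough numbers, here is a small back-of-the-envelope sketch. Only the 80GB locked in RAM comes from the text above. The function names, the 140GB model size and the number of passes are hypothetical values chosen for illustration.

```python
def streamed_gb_per_pass(model_gb: float, locked_gb: float = 80.0) -> float:
    """GB that must be read from disk in one pass over the model.

    Everything beyond the part locked in RAM has to be streamed again.
    """
    return max(0.0, model_gb - locked_gb)


def total_streamed_gb(model_gb: float, passes: int, locked_gb: float = 80.0) -> float:
    """GB read from disk over a number of passes (hypothetical pass count)."""
    return passes * streamed_gb_per_pass(model_gb, locked_gb)


if __name__ == "__main__":
    # A hypothetical 140GB model: 60GB come from disk on every pass.
    print(streamed_gb_per_pass(140.0))
    # A model that fits entirely into the locked region needs no streaming.
    print(streamed_gb_per_pass(40.0))
    # Ten hypothetical passes over the 140GB model.
    print(total_streamed_gb(140.0, 10))
```

This is also why trick 1 helps so much: a lower quant shrinks the part that has to be re-read from disk on every pass, not just the total size.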
|
|
|
The few evaluations I have suggest that this gives good quality, and my current set-up allows me to |
|
generate imatrix data for most models in fp16, 70B in Q8_0 and almost everything else in Q4_K_S. |
|
|
|
The trick to 3 is not actually having patience, the trick is to automate things to the point where you |
|
don't have to wait for things normally. For example, if all goes well, quantizing a model requires just |
|
a single command (or less) for static quants, and for imatrix quants I need to select the source gguf |
|
and then run another command which handles download/computation/upload. Most of the time, I only have |
|
to do stuff when things go wrong (which, with llama.cpp being so buggy and hard to use, |
|
is unfortunately very frequent). |
|
|
|
## Why don't you use gguf-split? |
|
|
|
TL;DR: I don't have the hardware/resources for that. |
|
|
|
Long answer: gguf-split requires a full copy for every quant. |
|
Unlike what many people think, my hardware is rather outdated and not very fast. The extra processing that gguf-split requires |
|
either runs out of space on my systems with fast disk, or takes a very long time and a lot of I/O bandwidth on the slower |
|
disks, all of which already run at their limits. Supporting gguf-split would mean |
|
|
|
While this is the blocking reason, I also find it less than ideal that yet another incompatible file format was created that |
|
requires special tools to manage, instead of supporting the tens of thousands of existing quants, of which the vast majority |
|
could just be mmapped together into memory from split files. That doesn't keep me from supporting it, but it would have |
|
been nice to look at the existing reality and/or consult the community before throwing yet another hard to support format out |
|
there without thinking. |
|
|
|
There are some developments to make this less of a pain, and I will revisit this issue from time to time to see if it has |
|
become feasible. |
|
|
|
Update 2024-07: llama.cpp probably has most of the features needed to make this a reality, but I haven't found time to test and implement it yet. |
|
|
|
Update 2024-09: just looked at implementing it, and no, the problems that keep me from doing it are still there :(. Must have fantasized it!!? |
|
|
|
## So who is mradermacher? |
|
|
|
Nobody has asked this, but since there are people who really deserve mention, I'll put this here. "mradermacher" is just a |
|
pseudonymous throwaway account I created to goof around, which I then started using to quant models. A few months later, @nicoboss joined |
|
and contributed hardware, power and general support - practically all imatrix computations are done on his computer(s). |
|
Then @Guilherme34 started to help getting access to models, and @RichardErkhov first gave us the wondrous |
|
FATLLAMA-1.7T, followed by access to his server to quant more models, likely to atone for his sins. |
|
|
|
So you should consider "mradermacher" to be the team name for a fictional character called Michael Radermacher. |
|
There are no connections to anything else on the internet, other than an mradermacher_hf account on reddit. |
|
|