imatrix quants

#2
by oobabooga - opened

Is there a technical limitation to creating imatrix quants for this model? It would be nice to have a better 2-bit quant to benchmark.

@oobabooga The imatrix for this model is being computed right now and is scheduled to be completed in approximately 9 hours. You can check http://hf.tst.eu/status.html to follow its progress.

The size of the model makes computing the imatrix more challenging and time consuming than for smaller models. To compute the imatrix for such massive models we have to connect together multiple servers over RPC as my largest server only has 512 GiB of RAM.

@mradermacher Please prioritize DeepSeek-V3 over DeepSeek-V3-Base and let's postpone RPC imatrix computation of my Hermes-3-Llama-3.1-405B-Samantha and Hermes-3-Llama-3.1-405B-Uncensored models for a few days so we can compute imatrix quants for DeepSeek-V3 first as everyone is waiting for them.

That's nice to hear @nicoboss, I look forward to this quant. It will be the first imatrix quant for this model (I have been tracking this for several days and nobody managed to create one so far).

The imatrix measurements completed successfully. We have now started generating imatrix quants for DeepSeek-V3 and DeepSeek-V3-Base. The first imatrix quant that will be generated is i1-Q2_K. You can monitor the progress under http://hf.tst.eu/status.html.

@oobabooga The first imatrix quant for DeepSeek-V3 just got uploaded to https://huggingface.co/mradermacher/DeepSeek-V3-i1-GGUF. Thanks a lot for creating text-generation-webui. It's my favorite UI for LLMs and I love the design of v2.

I recommend downloading i1-Q2_K from https://hf.tst.eu/model#DeepSeek-V3-i1-GGUF so the parts get concatenated while downloading.

Alternatively, download the following files and concatenate them using cat:
DeepSeek-V3.i1-Q2_K.gguf.part1of5
DeepSeek-V3.i1-Q2_K.gguf.part2of5
DeepSeek-V3.i1-Q2_K.gguf.part3of5
DeepSeek-V3.i1-Q2_K.gguf.part4of5
DeepSeek-V3.i1-Q2_K.gguf.part5of5
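
For anyone without cat at hand, the same concatenation can be sketched in Python. This is only a minimal sketch, assuming all five parts were downloaded into one directory; concat_parts is a hypothetical helper name and not part of any tool mentioned in this thread.

```python
import shutil
from pathlib import Path


def concat_parts(parts, output):
    """Append each part, in the given order, to a single output file."""
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in 64 MiB chunks so the file never has to fit in RAM.
                shutil.copyfileobj(src, out, length=64 * 1024 * 1024)
    # Return the final size so it can be compared against the expected size.
    return Path(output).stat().st_size


def part_names(base="DeepSeek-V3.i1-Q2_K.gguf", count=5):
    """Build the part file names in the order they must be joined."""
    return [f"{base}.part{i}of{count}" for i in range(1, count + 1)]
```

Calling concat_parts(part_names(), "DeepSeek-V3.i1-Q2_K.gguf") should produce the same file as cat. Note that it needs free disk space for a full second copy of the data.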

The remaining DeepSeek-V3 imatrix quants are currently being generated and will follow within the next hours/days. You can check http://hf.tst.eu/status.html to see which one comes next.

Edit: i1-IQ3_M is now available as well.

Thanks @nicoboss, I'm glad you like the project!

About the quants, are even smaller ones also planned? Specifically:

  • IQ1_S
  • IQ1_M
  • IQ2_XXS
  • IQ2_XS
  • IQ2_S
  • Q2_K_S

The Q2_K is still quite challenging to load at 244 GB.

Also, I wonder if doing multi-part ggufs would be possible in this case, to avoid the need for cat. Like model-00001-of-00005.gguf, model-00002-of-00005.gguf, etc.

Yes, all these are being done. No, we don't have the disk space for multi-part models. We even attempted to patch the gguf splitter so it wouldn't make full copies, but C++ made it practically impossible to do this in a reasonable way.

@oobabooga The following imatrix quants will be provided. Due to the enormous size of this model it will just take a while for them to all be generated and uploaded:

  • i1-IQ1_S
  • i1-IQ1_M
  • i1-IQ2_XXS
  • i1-IQ2_XS
  • i1-IQ2_S
  • i1-IQ2_M
  • i1-Q2_K_S
  • i1-Q2_K
  • i1-IQ3_XXS
  • i1-IQ3_XS
  • i1-IQ3_S
  • i1-Q3_K_S
  • i1-IQ3_M
  • i1-Q3_K_M
  • i1-Q3_K_L
  • i1-IQ4_XS
  • i1-Q4_K_S
  • i1-Q4_0
  • i1-Q4_K_M
  • i1-Q4_1
  • i1-Q5_K_S
  • i1-Q5_K_M
  • i1-Q6_K

The Q2_K is still quite challenging to load at 244 GB.

No worries, smaller imatrix quants will come soon.
You could use llama.cpp RPC on all your devices to make it easier. If you want prompt processing acceleration using RPC, just start the RPC server with GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 set so you are not limited to physical GPU memory. That's exactly how we did the imatrix computation on the 713.4 GB Q8_0 quants.
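
As a rough illustration of that setup, here is a minimal Python sketch that starts one RPC server with the environment variable set. The binary path and the port are assumptions for illustration; the -p option is taken from the llama.cpp RPC example README, and the helper names are hypothetical.

```python
import os
import subprocess


def rpc_server_invocation(binary="./rpc-server", port=50052):
    """Return (command, environment) for starting one llama.cpp RPC server."""
    env = dict(os.environ)
    # Per the comment above: avoids being limited to physical GPU memory.
    env["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"
    # Binary location and port are assumptions; adjust to your build.
    command = [binary, "-p", str(port)]
    return command, env


def start_rpc_server(binary="./rpc-server", port=50052):
    """Launch the server in the background and return the process handle."""
    command, env = rpc_server_invocation(binary, port)
    return subprocess.Popen(command, env=env)
```

Each device that should hold layers would run one such server, and the client is then pointed at all of them.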

To find the best quant given your hardware limitation I recommend looking at the quality column at https://hf.tst.eu/model#DeepSeek-V3-i1-GGUF. While the quality score is not dependent on this specific model it is based on measurements of 616 quants from the Qwen2.5 series and so should give you a good idea.

I wonder if doing multi-part ggufs would be possible in this case, to avoid the need for cat. Like model-00001-of-00005.gguf, model-00002-of-00005.gguf, etc.
To avoid having to concatenate it just download the model from https://hf.tst.eu/model#DeepSeek-V3-i1-GGUF which uses JavaScript to download all the parts to the correct offset resulting in a concatenated file.

We don't like multi-part GGUFs. They require a tool and a full copy of all data to be split and merged. Split-GGUFs can be created and merged without copying any data on a supported file system like BTRFS. Having non-merged files locally is a pain as you cannot easily soft link them to your models folder. It's also worth considering that HuggingFace acquired XetHub last summer and once they move away from Git LFS the 50 GB file limit will be lifted.

No worries, smaller imatrix quants will come soon.

And they might be great... but... MoE's and very low bpw quantisations tend to not go together so well (probably because it's really just lots of 37Bs).

We even attempted to patch the gguf splitter so it wouldn't make full copies, but C++ made it practically impossible to do this in a reasonable way.
They require a tool and a full copy of all data to be split and merged.

Thanks for the explanation. I wasn't aware of this limitation.

You could use llama.cpp RPC on all your devices to make it easier.

I'm not sure if that would be beneficial vs offloading layers to an SSD swap, as the wifi bandwidth would presumably be a bigger bottleneck than the SSD speed. Also, my pipeline interacts with llama.cpp through llama-cpp-python, and I'm not sure if RPC is implemented there. Still, that's good to know, I hadn't seen anyone using this feature before.

And they might be great... but... MoE's and very low bpw quantisations tend to not go together so well (probably because it's really just lots of 37Bs).

That's a good point, a dense 671B would be severely undertrained and indeed a better candidate for quantization.

I'm not sure if that would be beneficial vs offloading layers to an SSD swap, as the wifi bandwidth would presumably be a bigger bottleneck than the SSD speed.

The beauty of llama.cpp's RPC functionality is that while loading the model you are distributing all layers to RPC servers running on different devices. During inference each node uses its own CPU/GPU to do prompt processing/token generation using the layers it has in RAM/GPU memory. The only data that travels between the llama.cpp client and the RPC servers is the current state, which is relatively small. During imatrix training we had around 200 Mbit/s peak bandwidth usage after initial load and that was with every node being GPU accelerated. The only thing that might take a while is the initial load, but that only happens once, after which you can enjoy a relatively fast AI model for as long as you want.

Offloading layers to an SSD is far slower, even if you use RPC without any GPUs, as having the layers stored in RAM makes a massive performance difference and the overhead added by RPC is low.

my pipeline interacts with llama.cpp through llama-cpp-python, and I'm not sure if RPC is implemented there. Still, that's good to know, I hadn't seen anyone using this feature before.

This is no issue at all. As documented in the llama-cpp-python README.md, to make use of RPC just install llama-cpp-python using CMAKE_ARGS="-DGGML_RPC=on" pip install llama-cpp-python
To use it, construct the Llama class from llama.py with the rpc_servers argument, which must contain a comma-separated list of RPC servers to use for offloading, and n_gpu_layers set to the number of layers you want to offload to the RPC servers.
To host an RPC server, just run the rpc-server provided by llama.cpp, documented under https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc
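
A minimal sketch of the client side, assuming llama-cpp-python was built with RPC support as described above. The helper names are hypothetical, and the server addresses, model path and layer count in the usage note are placeholders rather than a tested configuration.

```python
def rpc_llama_kwargs(model_path, servers, n_gpu_layers):
    """Build keyword arguments for llama_cpp.Llama with RPC offloading."""
    return {
        "model_path": model_path,
        # rpc_servers is one comma-separated string such as "host1:50052,host2:50052".
        "rpc_servers": ",".join(servers),
        # Number of layers to offload to the RPC servers.
        "n_gpu_layers": n_gpu_layers,
    }


def load_model(model_path, servers, n_gpu_layers):
    """Create the Llama object; the import is deferred until it is needed."""
    from llama_cpp import Llama  # requires a build with -DGGML_RPC=on

    return Llama(**rpc_llama_kwargs(model_path, servers, n_gpu_layers))
```

For example, load_model("DeepSeek-V3.i1-Q2_K.gguf", ["192.168.1.10:50052", "192.168.1.11:50052"], 40) would try to offload 40 layers across two hypothetical servers.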

During imatrix training we had around 200 Mbit/s peak bandwidth usage

Actually peak bandwidth usage was >>1000Mbit/s. Imatrix training used more than 1GBit/s when transferring data (and 0 when computing). But that might still mean that inferencing needs less.

18 out of 24 imatrix quants including the 148.9 GB i1-IQ1_M and 174.4 GB i1-IQ2_XXS are already uploaded.

Imatrix training used more than 1GBit/s when transferring data (and 0 when computing). But that might still mean that inferencing needs less.

Here is a screenshot of the Proxmox monitoring from the perspective of one of the RPC servers during DeepSeek-V3 imatrix computation. I excluded incoming traffic as it messed with the y-scale. Incoming bandwidth usage during initial load to download all the layers was 4 Gbit/s (as it just tries to download them as fast as possible) after which there was almost exclusively outgoing traffic. With 200 Mbit/s you have enough bandwidth to not bottleneck RPC, especially considering that token generation on CPU will use much less than the prompt processing we did on GPU during imatrix training.

image.png

You are looking at average bandwidth there, though, not peak. Either that, or proxmox outright gives you wrong numbers.

It's just like quantizing deepseek - 500MBps is probably just enough average speed, but with 1.3GBps peak speed, quantizing is about twice as fast as with 500MBps peak speed. So unless RPC secretly interleaves network and computation (which the CPU graph does not lend itself to), 1GBit would likely be a severe limit to imatrix computation speed, even though average bandwidth requirements are far lower.

From the graph, this looks like you are looking at one minute averages.

I have now measured it so we know for sure. Running inference with 46939.48 MiB of RPC-offloaded layers on an AMD Ryzen Threadripper 1950X CPU without GPU acceleration requires, according to nload, a peak bandwidth of 8.28 Mbit/s and an average of around 0.5 Mbit/s incoming, and a peak of 3.93 Mbit/s and an average of around 0.25 Mbit/s outgoing.

A 10 Mbit/s network should be enough to run CPU inference of DeepSeek-V3 without bottlenecking llama.cpp in any way unless you use faster hardware. To be on the safe side I recommend at least a 25 Mbit/s WLAN, which every WLAN standard except IEEE 802.11b supports. Keep in mind that at such slow speeds the initial load will take a really long time.
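
To put a number on "a really long time", here is a back-of-the-envelope sketch. It assumes the link is fully used and ignores protocol overhead, so real loads will be slower; transfer_seconds is a hypothetical helper.

```python
MIB = 1024 * 1024  # bytes per MiB


def transfer_seconds(size_mib, link_mbit_per_s):
    """Seconds to push size_mib over a link, ignoring protocol overhead."""
    bits = size_mib * MIB * 8
    return bits / (link_mbit_per_s * 1_000_000)


def transfer_hours(size_mib, link_mbit_per_s):
    """Same estimate expressed in hours."""
    return transfer_seconds(size_mib, link_mbit_per_s) / 3600
```

With the 46939.48 MiB of offloaded layers measured above, transfer_hours(46939.48, 25) comes out at roughly 4.4 hours, and at 10 Mbit/s at roughly 11 hours, for that one node alone.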

Wow, so computations really don't have to wait for any network transfers when inferencing? That is impressive, and completely different than imatrix calculations.

So I have benchmarked the IQ1_M version (1.75 bpw, 148.9 GB) using my personal benchmark:

  • The result turned out to be 23/48.
  • That's lower than a 2.06 GB quant, Phi-3-mini-4k-instruct-IQ4_XS, which scored 24/48.
  • This benchmark took 18 hours to run. I didn't try the RPC option yet, and some 50 or 70 GB worth of layers were sent to swap. I used 20 GPU layers.

The full results can be found here.
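
The relation between bits per weight and file size can be sketched as follows. This is only an estimate, assuming 671B total parameters and decimal GB; real GGUF files mix quantization types and carry metadata, so the numbers will not match exactly. Both function names are hypothetical.

```python
def estimate_size_gb(n_params, bpw):
    """Nominal file size in decimal GB for a given bits-per-weight."""
    return n_params * bpw / 8 / 1e9


def effective_bpw(size_gb, n_params):
    """Average bits per weight implied by a file size in decimal GB."""
    return size_gb * 1e9 * 8 / n_params
```

For example, estimate_size_gb(671e9, 1.75) gives about 146.8 GB, close to the 148.9 GB of the IQ1_M file, and effective_bpw(244, 671e9) puts the 244 GB Q2_K at about 2.9 bits per weight.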

For reference, the biggest models I had evaluated before that were Meta-Llama-3.1-405B-Instruct.i1-IQ2_XXS (109 GB) and falcon-180b-chat.Q4_K_M (108.6 GB). I also tried grok-1-IQ2_XS (93.31 GB).

According to lmsys arena, this model is about as good as Claude Sonnet, and I think it should be possible to get better accuracy at ~2 bpw with AQLM/QTIP/QUIP#. But then CPU layers wouldn't be possible unless eg QTIP was ported to llama.cpp, which I think is unlikely.

So the tldr is that this model is kind of hopeless to run locally.

So the tldr is that this model is kind of hopeless to run locally.

The most affordable way to run this locally at reasonable quality is likely to have either two PCs with 192 GB RAM each or three PCs with 128 GiB RAM each. Then add a cheap NVIDIA GPU with at least 8 GiB of GPU memory (with less than 8 GiB of GPU memory llama.cpp RPC crashes based on my testing) to each of them for GPU accelerated prompt processing and make use of llama.cpp RPC to distribute the i1-IQ4_XS model across them. Alternatively, one could get an AMD Threadripper where 512 GiB of RAM is supported. 512 GiB of RAM allows you to run i1-Q5_K_M without the need for RPC.

According to lmsys arena, this model is about as good as Claude Sonnet, and I think it should be possible to get better accuracy at ~2 bpw with AQLM/QTIP/QUIP#. But then CPU layers wouldn't be possible unless eg QTIP was ported to llama.cpp, which I think is unlikely.

I tested the i1-Q5_K_S version earlier today and am really impressed by the model. It beats Llama 3.1 405B, which was my favorite model so far. It is extremely capable and uncensored if you use a Dolphin system prompt. Thanks to it being MoE it runs relatively fast on CPU despite its size.

The result turned out to be 23/48.
That's lower than a 2.06 GB quant, Phi-3-mini-4k-instruct-IQ4_XS, which scored 24/48.

A poor result was expected due to it being a highly quantized MoE model, but this is worse than I thought.

This benchmark took 18 hours to run.

Oh no that's so long. Streaming data from SSD is around 20 times slower than RAM if I remember correctly so it makes sense. If there is any benchmark you or anyone else wants to run on any quant just let me know and I can run it. I'm always willing to help with such projects.

Can I request imatrix quants for DeepSeek-R1 as well? :)

Can I request imatrix quants for DeepSeek-R1 as well? :)

Yes for sure. Nice to see another awesome 671B model.

@mradermacher I will prepare everything for DeepSeek-R1.
There also is DeepSeek-R1-Zero but that is worse and so of less priority so we can do it after DeepSeek-R1.

@mradermacher The DeepSeek-R1 model is now ready for static quants under /tmp/DeepSeek-R1.gguf. I had to put it on spool due to a lack of free SSD storage on other pools. It should be fine as there is still 1.6 TB storage left which is enough to simultaneously store the two largest static DeepSeek-R1 quants plus some other models. Maybe adjust the storage budget or reduce the number of parallel tasks if you are concerned about running out of storage.

By the way, today I have benchmarked your DeepSeek-V3.i1-Q4_K_S quant. It scored 38/48 on my test, paired at the top with Llama-3.1-70B (Q4_K_M) and Mistral-Large-Instruct-2407 (Q4_K_S).

I had to use a server with 512 GB RAM for this benchmark. It took 3 hours to run with 1 layer in the GPU and the remaining ones (63 if I remember correctly) in the CPU.

Once the R1 quant is available I'll try that one as well.

@nicoboss 1.6TB is absolutely not much. It might work out, it might not. Also, the gguf file has another hardlink that I can't see, and will possibly take up unaccounted space? Otherwise, let's queue and see.

@oobabooga Being on par with llama-3.1 doesn't strike me as a breakthrough for 10x the size :)

Being on par with llama-3.1 doesn't strike me as a breakthrough for 10x the size :)

This may be a limitation of my benchmark more than anything.

I see you are already preparing DeepSeek-R1-Zero. Like DeepSeek-R1, the DeepSeek-R1-Zero model will require manual HF to GGUF conversion because it first needs to be converted to BF16. I see you downloaded it to /tmp/quant/DeepSeek-R1-Zero. I will convert it to GGUF as soon as I have freed up enough storage on other pools to do so.

1.6TB is absolutely not much. It might work out, it might not.

I know. It will be tight. Sorry, this is really unlucky timing. I currently have a project running that takes up 8 TB of SSD, and there is still 4 TB of SSD locked up by the Qwen series performance measurement project, for which I would need to pause nico1 to continue. In addition to that, I basically copied all the data from the 18 TB hpool to SSD in preparation for upgrading to the new 54 TB hpool, but because the first store canceled my order and the second one exceeded the estimated delivery date, I have not yet received the new HDDs, which locks up even more SSD storage.

Also, the gguf file has another hardlink that I can't see, and will possibly take up unaccounted space?

I hardlinked it to /root to try out the new model. Hardlinks on the same file system shouldn't take up any additional storage as far as I'm aware, but I deleted it just in case.

Being on par with llama-3.1 doesn't strike me as a breakthrough for 10x the size :)

It is quite remarkable considering the model only has 37B active parameters per token. If one runs a model at scale, I believe the number of active parameters is the most important factor for the cost per token.

This may be a limitation of my benchmark more than anything.

I think so as well. 48 tests are too few to be accurate enough to differentiate models of similar quality. During the eval project I used 3684 benchmark questions to evaluate each quant. Despite this much larger number of questions I still had measurement inaccuracies of a few percent. I really appreciate the measurements, and the results seem to match my observation of DeepSeek-V3 matching or slightly exceeding all current openly available LLMs depending on the benchmark used. I’m confident that with DeepSeek-R1 we will have a new leader. My first tests seem very promising.
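
The effect of the question count can be sketched with a standard binomial margin of error. This is a rough illustration and not how either benchmark was actually analysed; it assumes independent questions and uses the normal approximation, and score_margin is a hypothetical helper.

```python
import math


def score_margin(correct, total, z=1.96):
    """Approximate 95% margin of error for a pass rate (normal approximation)."""
    p = correct / total
    return z * math.sqrt(p * (1 - p) / total)


def clearly_different(a_correct, b_correct, total):
    """True only if the two scores differ by more than both margins combined."""
    gap = abs(a_correct - b_correct) / total
    return gap > score_margin(a_correct, total) + score_margin(b_correct, total)
```

Under these assumptions a score of 23/48 carries a margin of about ±14 percentage points, so 23/48 versus 24/48 is well within noise, while a 50% score on 3684 questions carries about ±1.6 points.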

I see you are already preparing DeepSeek-R1-Zero

Yeah, it was already downloading before I saw your messages. I did block it once I saw your message.

I know. It will be tight.

It will, but at least much less so than the last few weeks. The queue on nico has just cleared out a bit.

It is quite remarkable considering the model only has 37B active parameters per token.

It was somewhat in jest, but I think for (quality) performance one has to take the total number of parameters into account. Both for practical reasons (there are always constraints on resources) as well as the total amount of data available to extract knowledge from. If you could do that with an actual 37B that would be very impressive.

Can I request imatrix quants for DeepSeek-R1 as well? :)

@oobabooga The RPC imatrix computation of DeepSeek-R1 completed successfully. The first i1-Q2_K and i1-IQ3_M imatrix quants were already uploaded to https://huggingface.co/mradermacher/DeepSeek-R1-i1-GGUF/tree/main - the remaining imatrix quants will be generated and uploaded in the following hours/days while we are computing the imatrix of DeepSeek-R1-Zero. As always you can check the current status and progress on http://hf.tst.eu/status.html.