README.md
CHANGED
@@ -27,9 +27,9 @@ Original model: https://huggingface.co/nvidia/Llama-3_1-Nemotron-51B-Instruct-GG
### Assistant:

```
-***Important*** for people who want to do their own quantization.
+***Important*** for people who want to do their own quantization. The convert_hf_to_gguf.py in b4380 of llama.cpp doesn't read the rope_theta parameter, so it can't generate a gguf that works with prompts longer than 4k tokens. There is currently a [PR](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/convert_hf_to_gguf.py) in llama.cpp to update convert_hf_to_gguf.py. If you can't wait for the PR to go through, you can download a working convert_hf_to_gguf.py from [here](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/convert_hf_to_gguf.py) in this repository before you do the gguf conversion yourself.

-Starting from [b4380](https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz) of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is now supported.
+Starting from [b4380](https://github.com/ggerganov/llama.cpp/archive/refs/tags/b4380.tar.gz) of llama.cpp, DeciLMForCausalLM's variable Grouped Query Attention is now supported. Please download it and compile it to run the GGUFs in this repository.

This modification should support Llama-3_1-Nemotron-51B-Instruct fully. However, it may not support future DeciLMForCausalLM models that have no_op or linear ffn layers. Well, I suppose that support can be added when there are actually models using those types of layers.

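If you are doing the conversion yourself, the workflow described above roughly looks like this. This is only a sketch: the download URL is the Hugging Face `resolve/main` form of the file linked above, the checkpoint path is a placeholder, and `--outtype`/`--outfile` are the usual convert_hf_to_gguf.py options, so double-check them against your llama.cpp checkout.

```
# replace the b4380 copy of the conversion script with the fixed one from this repository
wget https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/resolve/main/convert_hf_to_gguf.py -O convert_hf_to_gguf.py
# convert the original safetensors checkpoint to an f16 gguf
python convert_hf_to_gguf.py /path/to/Llama-3_1-Nemotron-51B-Instruct --outtype f16 --outfile Llama-3_1-Nemotron-51B-Instruct.f16.gguf
```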
@@ -47,33 +47,12 @@ Perplexity for f16 gguf is 6.646565 ± 0.040986.
| IQ4_NL | calibration_datav3 | 29.30GB | 0.088279 ± 0.003944 | 0.020314 ± 0.000093 | For 32GB cards, e.g. 5090. Minor performance gain doesn't justify its use over IQ4_XS |
| [IQ4_XS](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ4_XS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 27.74GB | 0.095486 ± 0.004039 | 0.020962 ± 0.000097 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. Recommended. |
| Q4_0 | calibration_datav3 | 29.34GB | 0.543042 ± 0.009290 | 0.077602 ± 0.000389 | For 32GB cards, e.g. 5090. Too slow for CPU and Apple. |
-| [Q4_0_4_8](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.Q4_0_4_8.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 29.25GB | Same as Q4_0 assumed | Same as Q4_0 assumed | For Apple Silicon |
| [IQ3_M](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_M.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 23.49GB | 0.313812 ± 0.006299 | 0.054266 ± 0.000205 | Largest model that can fit a single 3090 at 5k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
-| [IQ3_S](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_S.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 22.65GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit a single 3090 at
-| [IQ3_XXS](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_XXS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 20.19GB | 0.638630 ± 0.009693 | 0.092827 ± 0.000367 | Largest model that can fit a single 3090 at
+| [IQ3_S](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_S.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 22.65GB | 0.434774 ± 0.007162 | 0.069264 ± 0.000242 | Largest model that can fit a single 3090 at 7k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
+| [IQ3_XXS](https://huggingface.co/ymcki/Llama-3_1-Nemotron-51B-Instruct-GGUF/blob/main/Llama-3_1-Nemotron-51B-Instruct.imatrix.IQ3_XXS.gguf) | [calibration_datav3](https://gist.githubusercontent.com/bartowski1182/eb213dccb3571f863da82e99418f81e8/raw/b2869d80f5c16fd7082594248e80144677736635/calibration_datav3.txt) | 20.19GB | 0.638630 ± 0.009693 | 0.092827 ± 0.000367 | Largest model that can fit a single 3090 at 13k context. Not recommended for CPU or Apple Silicon due to high computational cost. |
| Q3_K_S | calibration_datav3 | 22.65GB | 0.698971 ± 0.010387 | 0.089605 ± 0.000443 | Largest model that can fit a single 3090 that performs well in all platforms |
| Q3_K_S | none | 22.65GB | 2.224537 ± 0.024868 | 0.283028 ± 0.001220 | Largest model that can fit a single 3090 without imatrix |

-## How to check i8mm support for Apple devices

-ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures >= ARMv8.6-A support i8mm. That means Apple Silicon from the A15 and M2 onward works best with Q4_0_4_8.

-For Apple devices,

-```
-sysctl hw
-```

-On the other hand, Nvidia 3090 inference speed is significantly faster for Q4_0 than for the other ggufs. That means for GPU inference, you are better off using Q4_0.

-## Which Q4_0 model to use for Apple devices
-| Brand | Series | Model | i8mm | sve | Quant Type |
-| ----- | ------ | ----- | ---- | --- | -----------|
-| Apple | A | A4 to A14 | No | No | Q4_0_4_4 |
-| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 |
-| Apple | M | M1 | No | No | Q4_0_4_4 |
-| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 |

## Convert safetensors to f16 gguf

Make sure you have llama.cpp git cloned:
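For example, a minimal sketch of getting a b4380 build (the tag name comes from the b4380 link above; the cmake steps are the standard llama.cpp build, so add the flags for your GPU backend as needed):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout b4380                     # release tag with DeciLM variable GQA support
cmake -B build
cmake --build build --config Release -j
```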
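As an aside on the removed i8mm section above: `sysctl hw` prints every hardware key, so on recent macOS it is easier to query the flag directly. This assumes the `hw.optional.arm.FEAT_I8MM` key, which may not exist on older OS versions:

```
sysctl hw.optional.arm.FEAT_I8MM    # 1 means the CPU supports i8mm, 0 means it does not
```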