ymcki committed
Commit efe49ab · verified · Parent: c755b90

Upload README.md

Files changed (1)
  1. README.md +19 -16
README.md CHANGED
@@ -31,18 +31,18 @@ Original model: https://huggingface.co/google/gemma-2-2b-jpn-it
  ELIZA-Tasks-100 is a pretty standard benchmark for Japanese LLMs.
  The perfect score is 5.00. As a reference, bartowski's gemma-2-27b-it.Q6_K.gguf scores 4.04.

- | Filename | Quant type | File Size | Split | ELIZA-Tasks-100 | Description |
+ | Filename | Quant type | File Size | Split | ELIZA-Tasks-100 | Nvidia 3090 | Description |
  | -------- | ---------- | --------- | ----- | --------------- | ----------- |
- | [gemma-2-2b-jpn-it.f16.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.f16.gguf) | f16 | 5.24GB | false | 2.90 | Full F16 weights. |
- | [gemma-2-2b-jpn-it.Q8_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q8_0.gguf) | Q8_0 | 2.78GB | false | 3.06 | Extremely high quality, *recommended*. |
- | [gemma-2-2b-jpn-it-imatrix.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0.gguf) | Q4_0 | 1.63GB | false | 2.89 | Good quality, *recommended for edge device <8GB RAM*. |
- | [gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf) | Q4_0_8_8 | 1.63GB | false | 2.78 | Good quality, *recommended for edge device <8GB RAM*. |
- | [gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf) | Q4_0_4_8 | 1.63GB | false | TBD | Good quality, *recommended for edge device <8GB RAM*. |
- | [gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf) | Q4_0_4_4 | 1.63GB | false | TBD | Good quality, *recommended for edge device <8GB RAM*. |
- | [gemma-2-2b-jpn-it.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0.gguf) | Q4_0 | 1.63GB | false | 2.77 | Good quality but imatrix version a bit better. |
- | [gemma-2-2b-jpn-it.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_8_8.gguf) | Q4_0_8_8 | 1.63GB | false | TBD | Poor quality, *not recommended*. |
- | [gemma-2-2b-jpn-it.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_4_8.gguf) | Q4_0_4_8 | 1.63GB | false | TBD | Poor quality, *not recommended*. |
- | [gemma-2-2b-jpn-it.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_4_4.gguf) | Q4_0_4_4 | 1.63GB | false | TBD | Poor quality, *not recommended*. |
+ | [gemma-2-2b-jpn-it.f16.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.f16.gguf) | f16 | 5.24GB | false | 2.90 | 98t/s | Full F16 weights. |
+ | [gemma-2-2b-jpn-it.Q8_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q8_0.gguf) | Q8_0 | 2.78GB | false | 3.06 | 140t/s | Extremely high quality, *recommended*. |
+ | [gemma-2-2b-jpn-it-imatrix.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0.gguf) | Q4_0 | 1.63GB | false | 2.89 | 137t/s | Good quality, *recommended for edge device <8GB RAM*. |
+ | [gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf) | Q4_0_8_8 | 1.63GB | false | 2.78 | 2.79t/s | Good quality, *recommended for edge device <8GB RAM*. |
+ | [gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf) | Q4_0_4_8 | 1.63GB | false | 2.77 | 2.61t/s | Good quality, *recommended for edge device <8GB RAM*. |
+ | [gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf) | Q4_0_4_4 | 1.63GB | false | 2.65 | 3.09t/s | Good quality, *recommended for edge device <8GB RAM*. |
+ | [gemma-2-2b-jpn-it.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0.gguf) | Q4_0 | 1.63GB | false | 2.77 | 159t/s | Good quality, *recommended for edge device <8GB RAM*. |
+ | [gemma-2-2b-jpn-it.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_8_8.gguf) | Q4_0_8_8 | 1.63GB | false | 2.92 | 2.85t/s | Good quality, *recommended for edge device <8GB RAM*. |
+ | [gemma-2-2b-jpn-it.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_4_8.gguf) | Q4_0_4_8 | 1.63GB | false | 2.74 | 2.56t/s | Good quality, *recommended for edge device <8GB RAM*. |
+ | [gemma-2-2b-jpn-it.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_4_4.gguf) | Q4_0_4_4 | 1.63GB | false | 2.70 | 3.10t/s | Poor quality, *not recommended*. |

  ## How to check i8mm and sve support for ARM devices

@@ -54,7 +54,6 @@ For ARM devices without both, it is recommended to use Q4_0_4_4.

  With these features supported, inference speed should be faster, in the order Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on the quality of the response.

-
  This is a [list](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) of ARM devices that support different ARM instructions. Apparently, it is only a partial list. It is better to check for i8mm and sve support yourself.

  For Apple devices,
@@ -70,7 +69,9 @@ cat /proc/cpuinfo

  There are also Android apps that can display /proc/cpuinfo.

- On the other hand, inference speed for the imatrix ggufs on my Nvidia 3090 is 137t/s for Q4_0, 2.8t/s for Q4_0_8_8. That means for Nvidia, you better off using Q4_0.
+ I was told that for Intel/AMD CPU inference, support for AVX2/AVX512 can also improve the performance of Q4_0_8_8.
+
+ On the other hand, Nvidia 3090 inference speed is significantly faster for Q4_0 than for the other ggufs. That means for GPU inference, you are better off using Q4_0.

  ## Which Q4_0 model to use for ARM devices
  | Brand | Series | Model | i8mm | sve | Quant Type |
@@ -84,12 +85,14 @@ On the other hand, inference speed for the imatrix ggufs on my Nvidia 3090 is 13
  | Samsung | Exynos | 2200,2400 | Yes | Yes | Q4_0_8_8 |
  | Mediatek | Dimensity | 9000 | Yes | Yes | Q4_0_8_8 |
  | Mediatek | Dimensity | 9300 | Yes | No | Q4_0_4_8 |
- | Qualcomm Snapdragon | 8 Gen 1 | Yes | Yes | Q4_0_8_8 |
- | Qualcomm Snapdragon | 8 Gen 2,8 Gen 3,X Elite | Yes | No | Q4_0_4_8 |
+ | Qualcomm | Snapdragon | 8 Gen 1 | Yes | Yes | Q4_0_8_8 |
+ | Qualcomm | Snapdragon | 8 Gen 2,8 Gen 3,X Elite | Yes | No | Q4_0_4_8 |

  ## imatrix quantization

- According to this [blog](https://sc-bakushu.hatenablog.com/entry/2024/04/20/050213), adding imatrix to low bit quant can significantly improve performance. The best dataset for Japanese is [MTFMC/imatrix-dataset-for-japanese-llm](https://huggingface.co/datasets/TFMC/imatrix-dataset-for-japanese-llm). Therefore, I also created the imatrix versions of different Q4_0 quants. Indeed, they significantly outperforms the non-imatrix counterparts.
+ According to this [blog](https://sc-bakushu.hatenablog.com/entry/2024/04/20/050213), adding an imatrix to low-bit quants can significantly improve performance. The best dataset for Japanese is [TFMC/imatrix-dataset-for-japanese-llm](https://huggingface.co/datasets/TFMC/imatrix-dataset-for-japanese-llm). Therefore, I also created imatrix versions of the different Q4_0 quants.
+
+ However, based on my benchmarking results, the difference is not significant.

  ## Convert safetensors to f16 gguf

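As a quick illustration of the "How to check i8mm and sve support for ARM devices" section touched by this diff: the commands below are a minimal sketch, not the README's exact instructions (most of that section lies outside the changed hunks), and the sysctl key names for Apple devices are assumptions.

```bash
# Linux / Android (e.g. adb shell or a terminal app):
# the Features line lists "i8mm" and/or "sve" when the CPU supports them.
grep -m1 Features /proc/cpuinfo

# Apple Silicon macOS (assumed key names; 1 = supported, 0 or a missing key = not supported):
sysctl -a | grep -E 'FEAT_(I8MM|SVE)'
```

If neither flag shows up, the README's own advice above applies: fall back to Q4_0_4_4 or plain Q4_0.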
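The imatrix paragraph added in this commit does not show the commands used. For orientation only, a typical llama.cpp imatrix workflow looks roughly like the sketch below; calibration.txt is a placeholder for a text file from the TFMC dataset linked above, and these are not necessarily the exact commands the author ran.

```bash
# Compute an importance matrix from a Japanese calibration text (placeholder file name).
./llama-imatrix -m gemma-2-2b-jpn-it.f16.gguf -f calibration.txt -o imatrix.dat

# Quantize the f16 gguf to Q4_0 using that importance matrix.
./llama-quantize --imatrix imatrix.dat gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it-imatrix.Q4_0.gguf Q4_0

# The non-imatrix counterpart simply omits --imatrix.
./llama-quantize gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it.Q4_0.gguf Q4_0
```

The same pattern applies to the Q4_0_4_4, Q4_0_4_8 and Q4_0_8_8 variants by changing the quantization type argument.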