|
--- |
|
base_model: google/gemma-2-2b-jpn-it |
|
language: |
|
- multilingual |
|
datasets: |
|
- TFMC/imatrix-dataset-for-japanese-llm |
|
library_name: transformers |
|
license: gemma |
|
license_link: https://ai.google.dev/gemma/terms |
|
pipeline_tag: text-generation |
|
tags: |
|
- nlp |
|
- code |
|
quantized_by: ymcki |
|
widget: |
|
- messages: |
|
- role: user |
|
content: Can you provide ways to eat combinations of bananas and dragonfruits? |
|
--- |
|
|
|
Original model: https://huggingface.co/google/gemma-2-2b-jpn-it |
|
|
|
## Prompt format |
|
|
|
``` |
|
<start_of_turn>user |
|
{prompt}<end_of_turn> |
|
<start_of_turn>model |
|
<end_of_turn> |
|
<start_of_turn>model |
|
|
|
``` |
|
|
|
Note that this model does not support a System prompt. |
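
For example, a minimal sketch of applying this template with llama.cpp's `llama-cli` (the prompt text and file choice are just an illustration):

```
# $'...' turns the \n escapes into real newlines before llama-cli sees the prompt
./llama-cli -m gemma-2-2b-jpn-it.Q8_0.gguf -n 256 \
  -p $'<start_of_turn>user\n日本の有名な観光地を教えてください。<end_of_turn>\n<start_of_turn>model\n'
```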
|
|
|
## Download a file (not the whole branch) from below: |
|
|
|
ELYZA-tasks-100 is a fairly standard benchmark for Japanese LLMs. A perfect score is 5.00. For reference, bartowski's gemma-2-27b-it.Q6_K.gguf scores 4.04.
|
|
|
| Filename | Quant type | File Size | ELYZA-tasks-100 score | Nvidia 3090 speed | Description |
|
| -------- | ---------- | --------- | --------------- | ----------- | ----------- | |
|
| [gemma-2-2b-jpn-it.f16.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.f16.gguf) | f16 | 5.24GB | 2.90 | 98t/s | Full F16 weights. | |
|
| [gemma-2-2b-jpn-it.Q8_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q8_0.gguf) | Q8_0 | 2.78GB | 3.06 | 140t/s | Extremely high quality, *recommended*. | |
|
| [gemma-2-2b-jpn-it-imatrix.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0.gguf) | Q4_0 | 1.63GB | 2.89 | 137t/s | Good quality, *recommended for edge devices <8GB RAM*. | |
|
| [gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf) | Q4_0_8_8 | 1.63GB | 2.78 | 2.79t/s | Good quality, *recommended for edge devices <8GB RAM*. | |
|
| [gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_8.gguf) | Q4_0_4_8 | 1.63GB | 2.77 | 2.61t/s | Good quality, *recommended for edge devices <8GB RAM*. | |
|
| [gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it-imatrix.Q4_0_4_4.gguf) | Q4_0_4_4 | 1.63GB | 2.65 | 3.09t/s | Good quality, *recommended for edge devices <8GB RAM*. | |
|
| [gemma-2-2b-jpn-it.Q4_0.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0.gguf) | Q4_0 | 1.63GB | 2.77 | 159t/s | Good quality, *recommended for edge devices <8GB RAM* | |
|
| [gemma-2-2b-jpn-it.Q4_0_8_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_8_8.gguf) | Q4_0_8_8 | 1.63GB | 2.92 | 2.85t/s | Good quality, *recommended for edge devices <8GB RAM* | |
|
| [gemma-2-2b-jpn-it.Q4_0_4_8.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_4_8.gguf) | Q4_0_4_8 | 1.63GB | 2.74 | 2.56t/s | Good quality, *recommended for edge devices <8GB RAM* | |
|
| [gemma-2-2b-jpn-it.Q4_0_4_4.gguf](https://huggingface.co/ymcki/gemma-2-2b-jpn-it-GGUF/blob/main/gemma-2-2b-jpn-it.Q4_0_4_4.gguf) | Q4_0_4_4 | 1.63GB | 2.70 | 3.10t/s | Good quality, *recommended for edge devices <8GB RAM*. | |
|
|
|
## How to check i8mm and sve support for ARM devices |
|
|
|
ARM i8mm support is necessary to take advantage of the Q4_0_4_8 gguf. All ARM architectures from ARMv8.6-A onward support i8mm.
|
|
|
ARM SVE support is necessary to take advantage of the Q4_0_8_8 gguf. SVE is an optional feature introduced in ARMv8.2-A, but the majority of ARM chips do not implement it.
|
|
|
For ARM devices with neither feature, Q4_0_4_4 is recommended.
|
|
|
With the appropriate support, inference speed should improve in the order Q4_0_8_8 > Q4_0_4_8 > Q4_0_4_4 > Q4_0, without much effect on response quality.
|
|
|
This is a [list](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) of ARM CPUs and the ARM instructions they support. Here is another [list](https://raw.githubusercontent.com/ThomasKaiser/sbc-bench/refs/heads/master/sbc-bench.sh). Apparently, they only cover a limited number of ARM CPUs, so it is better to check for i8mm and SVE support yourself.
|
|
|
For Apple devices, |
|
|
|
``` |
|
sysctl hw |
|
``` |
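
The relevant feature flags can also be filtered out of the full output (a sketch; the exact key names, such as `hw.optional.arm.FEAT_I8MM`, may vary by chip and macOS version):

```
# keys reporting 1 indicate the feature is supported
sysctl -a | grep -iE 'i8mm|sve'
```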
|
|
|
For other ARM devices (i.e. most Android devices),
|
``` |
|
cat /proc/cpuinfo |
|
``` |
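
The supported flags appear on the Features line; a quick way to filter them (a sketch):

```
# i8mm and sve show up in the Features line when the CPU supports them
grep -m1 Features /proc/cpuinfo | tr ' ' '\n' | grep -E '^(i8mm|sve)$'
```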
|
|
|
There are also Android apps that can display /proc/cpuinfo.
|
|
|
I was told that for Intel/AMD CPU inference, AVX2/AVX512 support can also improve the performance of Q4_0_8_8.
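
On Linux, an analogous check for the x86 flags (a sketch):

```
# avx2 and the avx512* flags appear in the flags line when supported
grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | grep -E '^avx2$|^avx512' | sort -u
```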
|
|
|
On the other hand, Nvidia 3090 inference is significantly faster with Q4_0 than with the other ggufs, so for GPU inference you are better off using Q4_0.
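
For example, a minimal sketch of running Q4_0 on a GPU with llama.cpp (assumes a CUDA-enabled build; the prompt is just an illustration):

```
# -ngl 99 offloads all layers to the GPU
./llama-cli -m gemma-2-2b-jpn-it.Q4_0.gguf -ngl 99 -n 128 \
  -p $'<start_of_turn>user\nこんにちは<end_of_turn>\n<start_of_turn>model\n'
```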
|
|
|
## Which Q4_0 model to use for ARM devices |
|
| Brand | Series | Model | i8mm | sve | Quant Type | |
|
| ----- | ------ | ----- | ---- | --- | -----------| |
|
| Apple | A | A4 to A14 | No | No | Q4_0_4_4 | |
|
| Apple | A | A15 to A18 | Yes | No | Q4_0_4_8 | |
|
| Apple | M | M1 | No | No | Q4_0_4_4 | |
|
| Apple | M | M2/M3/M4 | Yes | No | Q4_0_4_8 | |
|
| Google | Tensor | G1,G2 | No | No | Q4_0_4_4 | |
|
| Google | Tensor | G3,G4 | Yes | Yes | Q4_0_8_8 | |
|
| Samsung | Exynos | 2200,2400 | Yes | Yes | Q4_0_8_8 | |
|
| Mediatek | Dimensity | 9000,9000+ | Yes | Yes | Q4_0_8_8 | |
|
| Mediatek | Dimensity | 9300 | Yes | No | Q4_0_4_8 | |
|
| Qualcomm | Snapdragon | 7+ Gen 2,8/8+ Gen 1 | Yes | Yes | Q4_0_8_8 | |
|
| Qualcomm | Snapdragon | 8 Gen 2,8 Gen 3,X Elite | Yes | No | Q4_0_4_8 | |
|
|
|
## imatrix quantization |
|
|
|
According to this [blog](https://sc-bakushu.hatenablog.com/entry/2024/04/20/050213), applying an importance matrix (imatrix) to low-bit quants can significantly improve performance. The best dataset for Japanese is [TFMC/imatrix-dataset-for-japanese-llm](https://huggingface.co/datasets/TFMC/imatrix-dataset-for-japanese-llm). Therefore, I also created imatrix versions of the different Q4_0 quants.
|
|
|
However, based on my benchmarking results, the difference is not significant. |
|
|
|
## Convert safetensors to f16 gguf |
|
|
|
Make sure you have llama.cpp cloned and its Python requirements installed.
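
A sketch of the setup, assuming the original safetensors are downloaded into a local `gemma-2-2b-jpn-it/` directory:

```
git clone https://github.com/ggerganov/llama.cpp
pip install -r llama.cpp/requirements.txt
# fetch the original safetensors (the local directory name is an assumption)
huggingface-cli download google/gemma-2-2b-jpn-it --local-dir gemma-2-2b-jpn-it
```

Then run the conversion script: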
|
|
|
``` |
|
python3 convert_hf_to_gguf.py gemma-2-2b-jpn-it/ --outfile gemma-2-2b-jpn-it.f16.gguf --outtype f16 |
|
``` |
|
|
|
## Convert f16 gguf to Q8_0 gguf without imatrix |
|
Make sure you have llama.cpp compiled.
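
One common way to build it (a sketch using CMake; the binary location can differ between versions):

```
cmake -B build
cmake --build build --config Release -j
# the quantize tool typically ends up at build/bin/llama-quantize
```

Then quantize the f16 gguf to Q8_0: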
|
``` |
|
./llama-quantize gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it.Q8_0.gguf q8_0 |
|
``` |
|
|
|
## Convert f16 gguf to other ggufs with imatrix |
|
|
|
First, prepare the imatrix from the f16 gguf and the calibration text c4_en_ja_imatrix.txt:
|
|
|
``` |
|
./llama-imatrix -m gemma-2-2b-jpn-it.f16.gguf -f c4_en_ja_imatrix.txt -o gemma-2-2b-jpn-it.imatrix --chunks 32 |
|
``` |
|
|
|
Then, quantize the f16 gguf with the imatrix to create the imatrix gguf:
|
|
|
``` |
|
./llama-quantize --imatrix gemma-2-2b-jpn-it.imatrix gemma-2-2b-jpn-it.f16.gguf gemma-2-2b-jpn-it-imatrix.Q4_0_8_8.gguf q4_0_8_8 |
|
``` |
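
The other imatrix quants listed above can be produced the same way. A sketch of looping over the remaining types (the loop is just a convenience and not part of the original workflow; `${q^^}` needs bash 4+):

```
for q in q4_0 q4_0_4_4 q4_0_4_8; do
  ./llama-quantize --imatrix gemma-2-2b-jpn-it.imatrix \
    gemma-2-2b-jpn-it.f16.gguf "gemma-2-2b-jpn-it-imatrix.${q^^}.gguf" $q
done
```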
|
|
|
## Downloading using huggingface-cli |
|
|
|
First, make sure you have huggingface-cli installed:
|
|
|
``` |
|
pip install -U "huggingface_hub[cli]" |
|
``` |
|
|
|
Then, you can target the specific file you want: |
|
|
|
``` |
|
huggingface-cli download ymcki/gemma-2-2b-jpn-it-GGUF --include "gemma-2-2b-jpn-it.Q8_0.gguf" --local-dir ./
|
``` |
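
To grab several quants at once, a glob pattern also works (a sketch):

```
huggingface-cli download ymcki/gemma-2-2b-jpn-it-GGUF --include "*Q4_0*.gguf" --local-dir ./
```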
|
|
|
## Credits |
|
|
|
Thank you bartowski for providing a README.md to get me started. |
|
|
|
Thank you YoutechA320U for the ELYZA-tasks-100 auto evaluation tool. |
|
|