dahara1/gemma-2-27b-it-gguf-japanese-imatrix

Update history

2024/09/24

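Rebuilt the model because the gemma-2-9b-it tokenizer was updated on August 8 (there appear to be small changes, such as in the handling of consecutive tabs).
Performed the BF16 conversion on the CPU (this may slightly improve quality in certain situations).
Added more Japanese data to the iMatrix file (verified with imatrix-jpn-test).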

Past update history

2024/07/20 A bug was found in llama.cpp ([llama : fix pre-tokenization of non-special added tokens #8228](https://github.com/ggerganov/llama.cpp/pull/8228)), so Gemma2 models had to be reconverted, and this model has been reconverted accordingly. The bug reportedly caused HTML tags and similar input to be processed inaccurately.

Simply reconverting was not very interesting, so I also tried a conversion that keeps the output tensor and embedding in f16, which some say further improves accuracy for versions of 4 bits and above.
Just to be safe, for the 4-bit version I have uploaded both the conventional conversion and the f16-type conversion.
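
As a rough illustration of the two conversion styles, the sketch below assembles `llama-quantize` command lines in Python. The `--imatrix`, `--output-tensor-type`, and `--token-embedding-type` options exist in llama.cpp's quantize tool, but that the uploaded files were produced with exactly these options is an assumption, and all file names here are hypothetical placeholders.

```python
import shlex


def build_quantize_command(src_gguf, dst_gguf, quant_type, imatrix_path,
                           f16_output_and_embedding=False,
                           binary="llama-quantize"):
    """Return the argument list for one llama-quantize run.

    Assumption: the f16-type conversion corresponds to keeping the output
    tensor and the token embedding in f16 via the two type options below.
    """
    cmd = [binary, "--imatrix", imatrix_path]
    if f16_output_and_embedding:
        cmd += ["--output-tensor-type", "f16",
                "--token-embedding-type", "f16"]
    # Positional arguments: input model, output model, quantization type.
    cmd += [src_gguf, dst_gguf, quant_type]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for f16 in (False, True):
        cmd = build_quantize_command(
            "gemma-2-27b-it-BF16.gguf",
            "gemma-2-27b-it-Q4_K_M-fp16.gguf" if f16 else "gemma-2-27b-it-Q4_K_M.gguf",
            "Q4_K_M",
            "imatrix.dat",
            f16_output_and_embedding=f16,
        )
        print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.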

About this gguf model

This is a gguf version of gemma-2-27b-it, quantized using an importance matrix (iMatrix) built from data that contains a large amount of Japanese. The hope is that it retains more of its Japanese capability.

gemma-2-27b-it-Q4_K_M.gguf has been confirmed to run at about 3 tokens/second on a recent CPU (Ryzen 9 7940HS Processor).

How to use

Browser interface

Japanese characters are garbled in the Windows 11 terminal (CMD, PowerShell), so please use a browser.

Build llama.cpp according to the official manual.

Run the following command, specifying the downloaded model:

llama.cpp\build\bin\Release\llama-server -m .\gemma-2-27b-it-Q4_K_M.gguf

Open http://127.0.0.1:8080 in your browser.
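
If you would rather call the running server from a script than from the browser, a minimal sketch follows. It assumes the llama-server started above is listening on 127.0.0.1:8080 and uses its OpenAI-compatible `/v1/chat/completions` endpoint. The function names and the `max_tokens` default are illustrative choices, not part of this model's distribution.

```python
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8080"  # address used elsewhere in this README


def build_chat_request(user_message, temperature=0.0, max_tokens=256):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def ask(user_message, base_url=SERVER_URL, timeout=600):
    """POST one user message to a running llama-server and return the reply text."""
    body = json.dumps(build_chat_request(user_message)).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # CPU inference of a 27B model is slow, hence the generous timeout.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.loads(resp.read().decode("utf-8"))
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(ask("日本の首都はどこですか？"))` should print the model's answer. Because the request and response are UTF-8 JSON, this route also avoids the terminal encoding problem mentioned above when the output is written to a file or shown in an editor.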

Command line

llama-cli -m gemma-2-27b-it-Q4_K_M.gguf -e --temp 0 --repeat-penalty 1.0 -n -2 -p "<start_of_turn>user\nWrite a hello world program<end_of_turn>\n<start_of_turn>model"
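
The `-p` argument above follows Gemma's turn format. As a sketch, the helper below builds such a prompt string for one or more turns; the function name is hypothetical. Note one difference from the command above: the helper ends the prompt with a newline after `<start_of_turn>model`, as in Gemma's published chat format, whereas the command omits it.

```python
def build_gemma_prompt(turns):
    """Format (role, text) pairs in Gemma's turn format.

    Roles must be "user" or "model". The returned prompt ends with an
    open model turn so that generation continues as the model.
    """
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


if __name__ == "__main__":
    print(build_gemma_prompt([("user", "Write a hello world program")]))
```

The string contains real newline characters, so when it is passed to llama-cli from a script (for example through `subprocess.run` with an argument list), the `-e` escape-processing flag should not be needed.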

Other questions

I have confirmed that the perplexity score of Q4_K_M, measured on wiki.test.raw (English), is better than that of other comparable GGUF quantized models, but I do not yet know why.

Some questions remain unanswered:

Is the perplexity low because this model was created after most of the llama.cpp bugs had been fixed?
Some say an iMatrix has little effect unless the model is heavily quantized, but does that also hold from a multilingual point of view?
How meaningful is it to measure perplexity with wiki.test.raw (English)?
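
For readers unfamiliar with the metric, the sketch below shows the standard definition of perplexity: the exponential of the average negative log-probability the model assigns to each token of the test text. Lower is better. It assumes natural-log probabilities and is only an illustration of the definition, not the measurement procedure used for this model.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that gives every token probability 1/4 is as uncertain
    # as a uniform choice among four options.
    print(perplexity([math.log(0.25)] * 4))
```

Because the score depends entirely on the test text, a result on English Wikipedia says little by itself about how well Japanese capability was preserved, which is the concern raised in the last question above.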

Other versions

A gemma-2-9b-it version made with the same technique is also available.