Convert to ggml format
Hello. I would like to convert this model to ggml format so I can test it on a machine with fewer resources. Is there a procedure or script to do it? Thank you.
I tried to convert it to ggml but ran into some errors due to a missing added_tokens.json in the tokenizer, since the original model uses GPT2's instead of GPT-J-6B's. I just pushed a couple of versions, in float16 and float32, so the smaller float16 model should be able to run on CPU using https://github.com/ggerganov/ggml/tree/master/examples/gpt-j. Just remember that the prompt needs to follow the Spanish version of the Alpaca prompt (roughly: "Below is an instruction that describes a task. Write a response that adequately completes what is asked."). For example, you could create a prompt.txt file:
A continuación hay una instrucción que describe una tarea. Escribe una respuesta que complete adecuadamente lo que se pide.
### Instrucción:
$TASK
### Respuesta:
Download the model, compile and install ggml, and then invoke it:
$ TASK="Escribe un email poniendo una excusa para faltar a la reunión" envsubst < prompt.txt | ./bin/gpt-j -m /path/to/bertin-gpt-j-6B-alpaca/ggml-model-f16.bin
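For reference, the ggml files were generated with the conversion script that ships with the ggml repo. A rough sketch, assuming the script in examples/gpt-j and its arguments are unchanged (the trailing number selects the output type; check the script's usage string):
$ git clone https://huggingface.co/bertin-project/bertin-gpt-j-6B-alpaca   # needs git-lfs for the weights
$ python3 ggml/examples/gpt-j/convert-h5-to-ggml.py bertin-gpt-j-6B-alpaca 1   # 1 = float16, 0 = float32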
Hello. I'm testing the ggml-model-f16.bin, and I have this prompt.txt file:
A continuación hay una instrucción que describe una tarea. Escribe una respuesta que complete adecuadamente lo que se pide.
### Instrucción:
"Cuentame un cuento."
### Respuesta:
But when I launch the model there are many "gpt_tokenize: unknown token" errors and the generated text is not good:
$ cat prompt.txt | ./gpt-j -m /bertin-gpt-j-6b-alpaca/ggml-bertin-gpt-j-6B-alpaca/ggml-model-f16.bin
main: seed = 1681503060
gptj_model_load: loading model from '/bertin-gpt-j-6b-alpaca/ggml-bertin-gpt-j-6B-alpaca/ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 1
gptj_model_load: ggml ctx size = 12438.86 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 11542.79 MB / num tensors = 285
gpt_tokenize: unknown token '
'
gpt_tokenize: unknown token 'A'
gpt_tokenize: unknown token 'a'
gpt_tokenize: unknown token 'c'
gpt_tokenize: unknown token 'i'
gpt_tokenize: unknown token '▒'
...
...
gpt_tokenize: unknown token ':'
'pt_tokenize: unknown token '
main: number of tokens in prompt = 32
continu hayna instrcci que describena tar Escibena respesta que complete aduaamen que se pid### InstrcciCuta cue."###RespestaCuá▒ es el cle fist x si 4▒ - 6 = 18▒▒▒▒### knee▒estaEl cle fist x es 6▒<|endoftext|>
main: mem per token = 15488240 bytes
main: load time = 18915.16 ms
main: sample time = 98.78 ms
main: predict time = 17135.78 ms / 267.75 ms per token
main: total time = 43800.53 ms
I'm probably doing something wrong or I need to configure something...
Thank you.
Strange. I just ran the same prompt and got this:
$ TASK="Cuéntame un cuento" envsubst < es.txt | ./gpt-j -m /../bertin-gpt-j-6B-alpaca/ggml-model-f16.bin
main: seed = 1681808791
gptj_model_load: loading model from '/../bertin-gpt-j-6B-alpaca/ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 1
gptj_model_load: ggml ctx size = 12438.86 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size = 11542.79 MB / num tensors = 285
main: number of tokens in prompt = 73
A continuación hay una instrucción que describe una tarea. Escribe una respuesta que complete adecuadamente lo que se pide.
### Instrucción:
Cuéntame un cuento
### Respuesta:
Había una vez un pequeño ratón de campo que vivía en un pequeño agujero en el tronco de un árbol alto. El ratón era muy astuto, y a menudo se le veía resolviendo problemas para los demás animales del bosque. Un día, el ratón escuchó hablar de un gran concurso en el que se ofrecía una gran recompensa al mejor cuentacuentos. El ratón decidió participar.
El ratón comenzó a practicar su cuento todos los días, pero ninguno de sus intentos parecía bueno. Un día, el ratón escuchó un ruido fuera del agujero en el tronco. Miró por el agujero y
main: mem per token = 15488240 bytes
main: load time = 24641.92 ms
main: sample time = 71.52 ms
main: predict time = 90674.88 ms / 333.36 ms per token
main: total time = 117546.53 ms
Maybe try downloading the model again? The commit hash of the gpt-j binary I'm building is 75824e76bd41818ff8902d244feec1ec6c1d2c86.
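If you want to build from exactly that revision (assuming the hash refers to the ggml checkout), something like:
$ cd ggml
$ git checkout 75824e76bd41818ff8902d244feec1ec6c1d2c86
$ cd build && cmake .. && make gpt-j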
Hello.
The model that I downloaded has the same hash and size:
$ certutil -hashfile ggml-model-f16.bin sha256
SHA256 hash of ggml-model-f16.bin:
8672f372dae6bb7660f070355adb44dbb35747746525192b89a24466aabebbb4
$ ls -l ggml-model-f16.bin
-rw-r--r-- 1 root root 12104027673 Apr 12 23:17 ggml-model-f16.bin
Thank you.
I just tried it again on a totally new VM on GCP using Ubuntu 20.04.1. It all went fine; I don't know how to reproduce it. Do you get the same errors with the f32 model? Do you have an up-to-date version of ggml?
These are the steps I followed:
apt install cmake
git clone https://github.com/ggerganov/ggml
cd ggml/
mkdir build && cd build
cmake ..
make -j16 gpt-2 gpt-j
cd ..
# create prompt file es.txt
wget https://huggingface.co/bertin-project/bertin-gpt-j-6B-alpaca/resolve/main/ggml-model-f16.bin
TASK="Escribe un email poniendo una excusa para faltar a la reunión" envsubst < es.txt | ./build/bin/gpt-j -m ggml-model-f16.bin
Hello.
Yes, I'm having a problem with gpt-j.exe on my computer.
I'm using the https://github.com/LostRuins/koboldcpp utility with your model, and it's working fine!
Here is an example: [screenshot not shown]
I'm still trying to figure out the problem with gpt-j.exe (no errors... but not working); I'll try on another PC.
Thank you very much! :)
Awesome! I'm closing the issue then :)
Is 4-bit quantization possible?
@hwpoison89 Yes, I believe it's now possible using ggml, but there were some issues in the past depending on how the quantization was performed.
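If you want to try, the ggml repo includes a quantize example for gpt-j. A sketch, assuming the binary name and type codes are unchanged (check the usage string; 2 selected q4_0 at the time of writing):
$ ./build/bin/gpt-j-quantize ggml-model-f16.bin ggml-model-q4_0.bin 2   # 2 = q4_0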