Always returns ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I have tried to use the models in this repo, but the ggml files here always return '^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^'. There are f16 and q4_1 versions, and both of them return the same wrong response.
Is there anyone who can run these weights correctly? My environment is macOS Ventura 13.4, Python 3.10.12, and the most recent development version of koboldcpp (branch concedo_experimental, commit hash b9f74db89e1417be171363244aaa6848706266c7).
Thanks.
% python koboldcpp.py --noblas ../models/KoboldAI_GPT-NeoX-20B-ggml/GPT-Neox-20B-Erebus-f16.bin
Welcome to KoboldCpp - Version 1.30
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp.so
==========
Loading model: /Volumes/cuttingedge/large_lang_models/models/KoboldAI_GPT-NeoX-20B-ggml/GPT-Neox-20B-Erebus-f16.bin
[Threads: 4, BlasThreads: 4, SmartContext: False]
---
Identified as GPT-NEO-X model: (ver 401)
Attempting to Load...
---
System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
gpt_neox_v2_model_load: loading model from '/Volumes/cuttingedge/large_lang_models/models/KoboldAI_GPT-NeoX-20B-ggml/GPT-Neox-20B-Erebus-f16.bin' - please wait ...
gpt_neox_v2_model_load: n_vocab = 50432
gpt_neox_v2_model_load: n_ctx = 2048
gpt_neox_v2_model_load: n_embd = 6144
gpt_neox_v2_model_load: n_head = 64
gpt_neox_v2_model_load: n_layer = 44
gpt_neox_v2_model_load: n_rot = 24
gpt_neox_v2_model_load: par_res = 1
gpt_neox_v2_model_load: ftype = 1
gpt_neox_v2_model_load: qntvr = 0
gpt_neox_v2_model_load: ggml ctx size = 49770.77 MB
gpt_neox_v2_model_load: memory_size = 2112.00 MB, n_mem = 90112
gpt_neox_v2_model_load: .................................................................. done
gpt_neox_v2_model_load: model size = 39211.45 MB / num tensors = 532
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 1024, "max_length": 80, "rep_pen": 1.08, "temperature": 0.7, "top_p": 0.92, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 256, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "[Character: Emily; species: Human; age: 24; gender: female; physical appearance: cute, attractive; personality: cheerful, upbeat, friendly; likes: chatting; description: Emily has been your childhood friend for many years. She is outgoing, adventurous, and enjoys many interesting hobbies. She has had a secret crush on you for a long time.]\n[The following is a chat message log between Emily and you.]\n\nEmily: Heyo! You there? I think my internet is kinda slow today.\nYou: Hello Emily. Good to hear from you :)\n\n\nYou: The sun rises from west.\nEmily:", "quiet": true, "stop_sequence": ["You:"]}
Processing Prompt [BLAS] (136 / 136 tokens)
Generating (80 / 80 tokens)
Time Taken - Processing:11.7s (86ms/T), Generation:32.7s (409ms/T), Total:44.5s
Output: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
127.0.0.1 - - [10/Jun/2023 23:06:24] "POST /api/v1/generate/ HTTP/1.1" 200 -
% python koboldcpp.py --noblas ../models/KoboldAI_GPT-NeoX-20B-ggml/GPT-NeoX-20B-Erebus-Q4_1.bin
Welcome to KoboldCpp - Version 1.30
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp.so
==========
Loading model: /Volumes/cuttingedge/large_lang_models/models/KoboldAI_GPT-NeoX-20B-ggml/GPT-NeoX-20B-Erebus-Q4_1.bin
[Threads: 4, BlasThreads: 4, SmartContext: False]
---
Identified as GPT-NEO-X model: (ver 401)
Attempting to Load...
---
System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
gpt_neox_v2_model_load: loading model from '/Volumes/cuttingedge/large_lang_models/models/KoboldAI_GPT-NeoX-20B-ggml/GPT-NeoX-20B-Erebus-Q4_1.bin' - please wait ...
gpt_neox_v2_model_load: n_vocab = 50432
gpt_neox_v2_model_load: n_ctx = 2048
gpt_neox_v2_model_load: n_embd = 6144
gpt_neox_v2_model_load: n_head = 64
gpt_neox_v2_model_load: n_layer = 44
gpt_neox_v2_model_load: n_rot = 24
gpt_neox_v2_model_load: par_res = 1
gpt_neox_v2_model_load: ftype = 3
gpt_neox_v2_model_load: qntvr = 0
gpt_neox_v2_model_load: ggml ctx size = 25272.02 MB
gpt_neox_v2_model_load: memory_size = 2112.00 MB, n_mem = 90112
gpt_neox_v2_model_load: .................................................................. done
gpt_neox_v2_model_load: model size = 14712.70 MB / num tensors = 532
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 1024, "max_length": 80, "rep_pen": 1.08, "temperature": 0.7, "top_p": 0.92, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 256, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "[Character: Emily; species: Human; age: 24; gender: female; physical appearance: cute, attractive; personality: cheerful, upbeat, friendly; likes: chatting; description: Emily has been your childhood friend for many years. She is outgoing, adventurous, and enjoys many interesting hobbies. She has had a secret crush on you for a long time.]\n[The following is a chat message log between Emily and you.]\n\nEmily: Heyo! You there? I think my internet is kinda slow today.\nYou: Hello Emily. Good to hear from you :)\n\n\nYou: What's up today?\nEmily:", "quiet": true, "stop_sequence": ["You:"]}
Processing Prompt [BLAS] (135 / 135 tokens)
Generating (80 / 80 tokens)
Time Taken - Processing:9.7s (72ms/T), Generation:18.7s (234ms/T), Total:28.4s
Output: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
127.0.0.1 - - [10/Jun/2023 23:02:23] "POST /api/v1/generate/ HTTP/1.1" 200 -
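For reference, the requests in the logs above go through the Kobold HTTP API, so the failure can be reproduced without the Kobold Lite UI. Below is a minimal sketch, assuming the server is running on the default port 5001 as shown above; the sampler settings are copied from the Input lines, and the short prompt is a stand-in for the full one.

```python
# Minimal sketch: reproduce the generation request from the logs above.
# Assumes koboldcpp is serving on http://localhost:5001 (default port).
import json
import urllib.request

payload = {
    "n": 1,
    "max_context_length": 1024,
    "max_length": 80,
    "rep_pen": 1.08,
    "temperature": 0.7,
    "top_p": 0.92,
    "top_k": 0,
    "top_a": 0,
    "typical": 1,
    "tfs": 1,
    "rep_pen_range": 256,
    "rep_pen_slope": 0.7,
    "sampler_order": [6, 0, 1, 2, 3, 4, 5],
    "prompt": "You: What's up today?\nEmily:",  # shortened stand-in prompt
    "quiet": True,
    "stop_sequence": ["You:"],
}

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # A healthy model returns prose here; the broken setup returns '^^^^...'.
    print(json.load(resp))
```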
How were you able to run this model? I'm trying to run it on KoboldCpp v1.35, both with --noblas and --useclblast. I'm only getting the following:
Loading model: /home/verbosepanda/koboldcpp/models/GPT-NeoX-20B-Erebus-Q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]
Identified as GPT-NEO-X model: (ver 401)
Attempting to Load...
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
gpt_neox_v2_model_load: loading model from '/home/verbosepanda/koboldcpp/models/GPT-NeoX-20B-Erebus-Q4_1.bin' - please wait ...
gpt_neox_v2_model_load: n_vocab = 50432
gpt_neox_v2_model_load: n_ctx = 2048
gpt_neox_v2_model_load: n_embd = 6144
gpt_neox_v2_model_load: n_head = 64
gpt_neox_v2_model_load: n_layer = 44
gpt_neox_v2_model_load: n_rot = 24
gpt_neox_v2_model_load: par_res = 1
gpt_neox_v2_model_load: ftype = 3
gpt_neox_v2_model_load: qntvr = 0
gpt_neox_v2_model_load: ggml ctx size = 25272.02 MB
GGML_V2_ASSERT: otherarch/ggml_v2.c:3959: ctx->mem_buffer != NULL
Aborted (core dumped)
Just tested the model and had no issues with the 1.35.H3 release, which you can find here: https://github.com/henk717/koboldcpp/releases/download/1.35/koboldcpp.exe
Because the main developer is out of town this week, the H releases are my own bugfixed uploads until he is back and can upload an official 1.35.1.
I tested your version of KoboldCpp and can confirm that it works.
@Henk717 I visited your forked repo, saw what you had changed, and applied it to the development version of koboldcpp (branch: concedo_experimental). It did not solve my problem.
But I would never have guessed that koboldcpp itself might have a problem. Thanks @Henk717 for the good idea.
Now I have found that when I made sched_yield() active by uncommenting it, llama.cpp, the base of koboldcpp, failed as well; it could not even load the model. It looks like llama.cpp has a problem.
That change is not related to your issue; it was a big performance regression from upstream. The fact that the model works so poorly for you leads me to think you have a corrupt download. Verify the hash.
@Henk717 I compared the hash values and they were exactly the same. I think it may be a platform-specific problem, particularly on Apple Silicon.
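For anyone else who wants to check their download, here is a minimal sketch of the comparison, assuming the repo publishes SHA-256 checksums; the algorithm, file name, and expected value here are assumptions, so substitute whatever the repo actually lists.

```python
# Minimal sketch: verify a downloaded model file against a published checksum.
# SHA-256, the path, and EXPECTED are placeholders, not values from the repo.
import hashlib

MODEL_PATH = "GPT-NeoX-20B-Erebus-Q4_1.bin"  # hypothetical local path
EXPECTED = "<checksum from the model repo>"  # placeholder, not a real hash

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a 15-40 GB model file doesn't fill RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    actual = sha256_of(MODEL_PATH)
    print("match" if actual == EXPECTED else f"mismatch: {actual}")
```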
Could be; none of us have Apple hardware, so if regressions happen we can't test them ourselves. If you can find out which commit of KoboldCpp makes it work again, let us know.
Everything I tested today was at the most recent commits:
- koboldcpp
- branch: concedo_experimental, commit hash: 5941514e95809472aca70c3a5c5fab580ff56df3
- branch: concedo, commit hash: 5941514e95809472aca70c3a5c5fab580ff56df3
- llama.cpp
- branch: master, commit hash: 6e7cca404748dd4b1a3affd0d1296e37f4ac0a6f
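If anyone finds a commit where the model still works on Apple Silicon, bisecting between it and the bad commits above should pinpoint the regression. Here is a minimal sketch of driving git bisect from Python; the known-good hash is a placeholder, and the works() probe is a stub you would fill in with a rebuild-and-generate test (a plain `git bisect run` script works just as well).

```python
# Minimal sketch: bisect koboldcpp between a (hypothetical) known-good commit
# and a known-bad commit from this thread. Assumes a local clone at REPO and
# a works() probe you implement yourself.
import subprocess

REPO = "koboldcpp"  # path to a local clone (assumption)
BAD = "5941514e95809472aca70c3a5c5fab580ff56df3"  # bad commit tested above
GOOD = "<known-good-commit>"  # placeholder; no good commit is known yet

def git(*args):
    return subprocess.run(["git", *args], cwd=REPO, check=True,
                          capture_output=True, text=True).stdout

def works():
    # Stub: rebuild koboldcpp at the current checkout, run a short generation,
    # and return True unless the output is the '^^^^' garbage.
    raise NotImplementedError

git("bisect", "start", BAD, GOOD)
# `git bisect log` records "# first bad commit: ..." once bisection finishes.
while "first bad commit" not in git("bisect", "log"):
    git("bisect", "good" if works() else "bad")
print(git("bisect", "log"))
git("bisect", "reset")
```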