Always returns ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I have tried to use the models in this repo, but the ggml files here always return '^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^'. There are f16 and q4_1 versions, and both of them return the same wrong response.
Is there anyone who can run these weights correctly? My environment is macOS Ventura 13.4, Python 3.10.12, and the most recent development version of koboldcpp (branch concedo_experimental, commit hash b9f74db89e1417be171363244aaa6848706266c7).
Thanks.
% python koboldcpp.py --noblas ../models/KoboldAI_GPT-NeoX-20B-ggml/GPT-Neox-20B-Erebus-f16.bin
Welcome to KoboldCpp - Version 1.30
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp.so
==========
Loading model: /Volumes/cuttingedge/large_lang_models/models/KoboldAI_GPT-NeoX-20B-ggml/GPT-Neox-20B-Erebus-f16.bin
[Threads: 4, BlasThreads: 4, SmartContext: False]
---
Identified as GPT-NEO-X model: (ver 401)
Attempting to Load...
---
System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
gpt_neox_v2_model_load: loading model from '/Volumes/cuttingedge/large_lang_models/models/KoboldAI_GPT-NeoX-20B-ggml/GPT-Neox-20B-Erebus-f16.bin' - please wait ...
gpt_neox_v2_model_load: n_vocab = 50432
gpt_neox_v2_model_load: n_ctx = 2048
gpt_neox_v2_model_load: n_embd = 6144
gpt_neox_v2_model_load: n_head = 64
gpt_neox_v2_model_load: n_layer = 44
gpt_neox_v2_model_load: n_rot = 24
gpt_neox_v2_model_load: par_res = 1
gpt_neox_v2_model_load: ftype = 1
gpt_neox_v2_model_load: qntvr = 0
gpt_neox_v2_model_load: ggml ctx size = 49770.77 MB
gpt_neox_v2_model_load: memory_size = 2112.00 MB, n_mem = 90112
gpt_neox_v2_model_load: .................................................................. done
gpt_neox_v2_model_load: model size = 39211.45 MB / num tensors = 532
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 1024, "max_length": 80, "rep_pen": 1.08, "temperature": 0.7, "top_p": 0.92, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 256, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "[Character: Emily; species: Human; age: 24; gender: female; physical appearance: cute, attractive; personality: cheerful, upbeat, friendly; likes: chatting; description: Emily has been your childhood friend for many years. She is outgoing, adventurous, and enjoys many interesting hobbies. She has had a secret crush on you for a long time.]\n[The following is a chat message log between Emily and you.]\n\nEmily: Heyo! You there? I think my internet is kinda slow today.\nYou: Hello Emily. Good to hear from you :)\n\n\nYou: The sun rises from west.\nEmily:", "quiet": true, "stop_sequence": ["You:"]}
Processing Prompt [BLAS] (136 / 136 tokens)
Generating (80 / 80 tokens)
Time Taken - Processing:11.7s (86ms/T), Generation:32.7s (409ms/T), Total:44.5s
Output: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
127.0.0.1 - - [10/Jun/2023 23:06:24] "POST /api/v1/generate/ HTTP/1.1" 200 -
% python koboldcpp.py --noblas ../models/KoboldAI_GPT-NeoX-20B-ggml/GPT-NeoX-20B-Erebus-Q4_1.bin
Welcome to KoboldCpp - Version 1.30
Warning: OpenBLAS library file not found. Non-BLAS library will be used.
Initializing dynamic library: koboldcpp.so
==========
Loading model: /Volumes/cuttingedge/large_lang_models/models/KoboldAI_GPT-NeoX-20B-ggml/GPT-NeoX-20B-Erebus-Q4_1.bin
[Threads: 4, BlasThreads: 4, SmartContext: False]
---
Identified as GPT-NEO-X model: (ver 401)
Attempting to Load...
---
System Info: AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
gpt_neox_v2_model_load: loading model from '/Volumes/cuttingedge/large_lang_models/models/KoboldAI_GPT-NeoX-20B-ggml/GPT-NeoX-20B-Erebus-Q4_1.bin' - please wait ...
gpt_neox_v2_model_load: n_vocab = 50432
gpt_neox_v2_model_load: n_ctx = 2048
gpt_neox_v2_model_load: n_embd = 6144
gpt_neox_v2_model_load: n_head = 64
gpt_neox_v2_model_load: n_layer = 44
gpt_neox_v2_model_load: n_rot = 24
gpt_neox_v2_model_load: par_res = 1
gpt_neox_v2_model_load: ftype = 3
gpt_neox_v2_model_load: qntvr = 0
gpt_neox_v2_model_load: ggml ctx size = 25272.02 MB
gpt_neox_v2_model_load: memory_size = 2112.00 MB, n_mem = 90112
gpt_neox_v2_model_load: .................................................................. done
gpt_neox_v2_model_load: model size = 14712.70 MB / num tensors = 532
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001
Input: {"n": 1, "max_context_length": 1024, "max_length": 80, "rep_pen": 1.08, "temperature": 0.7, "top_p": 0.92, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 256, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "[Character: Emily; species: Human; age: 24; gender: female; physical appearance: cute, attractive; personality: cheerful, upbeat, friendly; likes: chatting; description: Emily has been your childhood friend for many years. She is outgoing, adventurous, and enjoys many interesting hobbies. She has had a secret crush on you for a long time.]\n[The following is a chat message log between Emily and you.]\n\nEmily: Heyo! You there? I think my internet is kinda slow today.\nYou: Hello Emily. Good to hear from you :)\n\n\nYou: What's up today?\nEmily:", "quiet": true, "stop_sequence": ["You:"]}
Processing Prompt [BLAS] (135 / 135 tokens)
Generating (80 / 80 tokens)
Time Taken - Processing:9.7s (72ms/T), Generation:18.7s (234ms/T), Total:28.4s
Output: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
127.0.0.1 - - [10/Jun/2023 23:02:23] "POST /api/v1/generate/ HTTP/1.1" 200 -
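For reference, the requests in the logs above go through the Kobold HTTP API, so the failure can be reproduced without the Kobold Lite UI. Below is a minimal sketch, assuming the server is running on the default port 5001 as shown above; the sampler settings are copied from the Input lines, and the short prompt is a stand-in for the full one.

```python
# Minimal sketch: reproduce the generation request from the logs above.
# Assumes koboldcpp is serving on http://localhost:5001 (default port).
import json
import urllib.request

payload = {
    "n": 1,
    "max_context_length": 1024,
    "max_length": 80,
    "rep_pen": 1.08,
    "temperature": 0.7,
    "top_p": 0.92,
    "top_k": 0,
    "top_a": 0,
    "typical": 1,
    "tfs": 1,
    "rep_pen_range": 256,
    "rep_pen_slope": 0.7,
    "sampler_order": [6, 0, 1, 2, 3, 4, 5],
    "prompt": "You: What's up today?\nEmily:",  # shortened stand-in prompt
    "quiet": True,
    "stop_sequence": ["You:"],
}

req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # A healthy model returns prose here; the broken setup returns '^^^^...'.
    print(json.load(resp))
```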
How were you able to run this model? I'm trying to run it on KoboldCpp v1.35, both with --noblas and --useclblast. I'm only getting the following:
Loading model: /home/verbosepanda/koboldcpp/models/GPT-NeoX-20B-Erebus-Q4_1.bin
[Threads: 5, BlasThreads: 5, SmartContext: False]
Identified as GPT-NEO-X model: (ver 401)
Attempting to Load...
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
gpt_neox_v2_model_load: loading model from '/home/verbosepanda/koboldcpp/models/GPT-NeoX-20B-Erebus-Q4_1.bin' - please wait ...
gpt_neox_v2_model_load: n_vocab = 50432
gpt_neox_v2_model_load: n_ctx = 2048
gpt_neox_v2_model_load: n_embd = 6144
gpt_neox_v2_model_load: n_head = 64
gpt_neox_v2_model_load: n_layer = 44
gpt_neox_v2_model_load: n_rot = 24
gpt_neox_v2_model_load: par_res = 1
gpt_neox_v2_model_load: ftype = 3
gpt_neox_v2_model_load: qntvr = 0
gpt_neox_v2_model_load: ggml ctx size = 25272.02 MB
GGML_V2_ASSERT: otherarch/ggml_v2.c:3959: ctx->mem_buffer != NULL
Aborted (core dumped)
Just tested the model and had no issues with the 1.35.H3 release, which you can find here: https://github.com/henk717/koboldcpp/releases/download/1.35/koboldcpp.exe
Because the main developer is out of town this week, the H releases are my own bugfixed uploads until he is back and can upload an official 1.35.1.
I tested your version of KoboldCpp and can confirm that it works.
@Henk717 I visited your forked repo, saw what you had changed, and applied it to the development version of koboldcpp (branch: concedo_experimental). It did not solve my problem.
But I would never have guessed that koboldcpp itself might have a problem. Thanks @Henk717 for the good idea.
Now I have found that when I made sched_yield() active by uncommenting it, llama.cpp, the base of koboldcpp, failed as well; it could not even load the model. It looks like llama.cpp has a problem.
That change is not related to your issue; it was a big performance regression from upstream. The fact that the model works so poorly for you leads me to think you have a corrupt download. Verify the hash.
@Henk717 I compared the hash values and they were exactly the same. I think it may be a platform-specific problem, particularly on Apple Silicon.
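For anyone else who wants to check their download, here is a minimal sketch of the comparison, assuming the repo publishes SHA-256 checksums; the algorithm, file name, and expected value here are assumptions, so substitute whatever the repo actually lists.

```python
# Minimal sketch: verify a downloaded model file against a published checksum.
# SHA-256, the path, and EXPECTED are placeholders, not values from the repo.
import hashlib

MODEL_PATH = "GPT-NeoX-20B-Erebus-Q4_1.bin"  # hypothetical local path
EXPECTED = "<checksum from the model repo>"  # placeholder, not a real hash

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a 15-40 GB model file doesn't fill RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    actual = sha256_of(MODEL_PATH)
    print("match" if actual == EXPECTED else f"mismatch: {actual}")
```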
Could be; none of us have Apple hardware, so if regressions happen we can't test them ourselves. If you can find out which commit of KoboldCpp makes it work again, let us know.
Everything I tested today was at the most recent commits:
- koboldcpp
- branch: concedo_experimental, commit hash: 5941514e95809472aca70c3a5c5fab580ff56df3
- branch: concedo, commit hash: 5941514e95809472aca70c3a5c5fab580ff56df3
- llama.cpp
- branch: master, commit hash: 6e7cca404748dd4b1a3affd0d1296e37f4ac0a6f
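If anyone finds a commit where the model still works on Apple Silicon, bisecting between it and the bad commits above should pinpoint the regression. Here is a minimal sketch of driving git bisect from Python; the known-good hash is a placeholder, and the works() probe is a stub you would fill in with a rebuild-and-generate test (a plain `git bisect run` script works just as well).

```python
# Minimal sketch: bisect koboldcpp between a (hypothetical) known-good commit
# and a known-bad commit from this thread. Assumes a local clone at REPO and
# a works() probe you implement yourself.
import subprocess

REPO = "koboldcpp"  # path to a local clone (assumption)
BAD = "5941514e95809472aca70c3a5c5fab580ff56df3"  # bad commit tested above
GOOD = "<known-good-commit>"  # placeholder; no good commit is known yet

def git(*args):
    return subprocess.run(["git", *args], cwd=REPO, check=True,
                          capture_output=True, text=True).stdout

def works():
    # Stub: rebuild koboldcpp at the current checkout, run a short generation,
    # and return True unless the output is the '^^^^' garbage.
    raise NotImplementedError

git("bisect", "start", BAD, GOOD)
# `git bisect log` records "# first bad commit: ..." once bisection finishes.
while "first bad commit" not in git("bisect", "log"):
    git("bisect", "good" if works() else "bad")
print(git("bisect", "log"))
git("bisect", "reset")
```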