Very slow response on LM Studio with these settings

#4
by yassersharaf - opened

Hello,

This is my first time trying to run local models with LM Studio, and I downloaded this model:

{
  "name": "codellama_codellama-7b-instruct-hf",
  "arch": "llama",
  "quant": "Q4_K_S",
  "context_length": 16384,
  "embedding_length": 4096,
  "num_layers": 32,
  "rope": {
    "freq_base": 1000000,
    "dimension_count": 128
  },
  "head_count": 32,
  "head_count_kv": 32,
  "parameters": "7B"
}

I have a Windows 11 laptop with a 10th-gen Intel i7, 24GB of RAM, and a dedicated Nvidia GeForce GTX 1650 Ti with 4GB of video memory. These are my settings for the model in LM Studio:

n_gpu_layers (GPU offload): 4
use_mlock (Keep entire model in RAM) set to true
n_threads (CPU Threads): 6
n_batch (Prompt eval batch size): 512
n_ctx (Context Length): 2048
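
For reference, n_ctx also drives memory use: with f16 KV entries, each token stores one K and one V vector of embedding_length floats per layer (head_count_kv equals head_count here, so the cache is full-size). A small sketch of the arithmetic, assuming the 32 layers and 4096 embedding length from the model metadata above:

```python
# Rough KV-cache size for the settings above (f16 entries, full-size KV heads).
def kv_cache_bytes(n_ctx=2048, n_layers=32, n_embd=4096, bytes_per_elem=2):
    # one K and one V vector of n_embd elements per token, per layer
    return 2 * n_ctx * n_embd * bytes_per_elem * n_layers

print(kv_cache_bytes() / 1024**2, "MiB")  # 1024.0 MiB
```

At n_ctx = 2048 that is about 1 GiB, which lands on the GPU alongside the offloaded layers when no_kv_offload is false.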

But it takes very long to return the first token, and it's also slow while writing the answers.
Is there anything wrong with these settings? Is there any other setting I should modify? I tried disabling GPU offload by setting n_gpu_layers to 0, and I also tried increasing the value, but neither solved the problem.

I reverted to the default settings below, and it takes about 40 seconds to start responding:

{
  "name": "Config for Chat ID 1712524642526",
  "load_params": {
    "n_ctx": 2048,
    "n_batch": 512,
    "rope_freq_base": 0,
    "rope_freq_scale": 0,
    "n_gpu_layers": 10,
    "use_mlock": true,
    "main_gpu": 0,
    "tensor_split": [
      0
    ],
    "seed": -1,
    "f16_kv": true,
    "use_mmap": true,
    "no_kv_offload": false,
    "num_experts_used": 0
  },
  "inference_params": {
    "n_threads": 4,
    "n_predict": -1,
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.95,
    "temp": 0.8,
    "repeat_penalty": 1.1,
    "input_prefix": "\n### Instruction:\n",
    "input_suffix": "\n### Response:\n",
    "antiprompt": [
      "### Instruction:"
    ],
    "pre_prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
    "pre_prompt_suffix": "\n",
    "pre_prompt_prefix": "",
    "seed": -1,
    "tfs_z": 1,
    "typical_p": 1,
    "repeat_last_n": 64,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n_keep": 0,
    "logit_bias": {},
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "memory_f16": true,
    "multiline_input": false,
    "penalize_nl": true
  }
}

Please assist.

BR

@yassersharaf try offloading more layers, maybe 10? That might help (offloading 4 layers does not use anywhere near 4GB of VRAM). Also set use_mlock to false.
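
A back-of-the-envelope sketch of why more layers fit: it's the weight bytes per layer, not the layer count, that determine VRAM use. Assuming Q4_K_S averages roughly 4.5 bits per weight and the llama-7B FFN width of 11008 (both assumptions, not read from the model file), and reserving about 1 GB for the KV cache and scratch buffers:

```python
# Rough estimate of how many transformer layers fit in a given amount of VRAM.
# Assumptions: Q4_K_S ~= 4.5 bits/weight, llama-7B FFN width 11008,
# ~1 GB reserved for KV cache, scratch buffers, and the desktop.
def layers_that_fit(vram_gb, hidden=4096, ffn=11008, bits_per_weight=4.5, reserve_gb=1.0):
    attn_params = 4 * hidden * hidden   # q, k, v, o projection weights
    ffn_params = 3 * hidden * ffn       # gate, up, down projection weights
    bytes_per_layer = (attn_params + ffn_params) * bits_per_weight / 8
    usable = (vram_gb - reserve_gb) * 1024**3
    return int(usable // bytes_per_layer)

print(layers_that_fit(4.0))  # 28
```

On this arithmetic roughly 28 of the 32 layers would fit in 4 GB, which is why pushing well past 4 or 10 layers is worth trying; the practical limit is lower once the KV cache and other GPU users take their share.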

@YaTharThShaRma999 that helped; it reduced the response time to 15 seconds. How can I find the best settings to bring it down to 2 or 3 seconds, if that's even possible locally? Thanks a lot.

@yassersharaf offload even more, maybe 20? Try the maximum number of layers you can before you run out of memory.
