I've tried many times to load the model with Hugging Face Transformers, but I cannot load it, either on its own or with LoRA

#1
by geroge - opened

Hi Trelis, I've paid for the model. Could you please give me the code for loading it with vLLM or Hugging Face transformers AutoModel? I've followed the instructions in your Colab, but they do not work for this model.

My workstation has two RTX 3090 cards.

First try: using the Colab code

import transformers
import torch
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, TextStreamer

# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q -U einops
# !pip install -q -U safetensors
# !pip install -q -U torch
# !pip install -q -U xformers

runtime = "gpu"  # OR "cpu"

if runtime == "cpu":
    runtimeFlag = "cpu"
elif runtime == "gpu":
    runtimeFlag = "cuda"
else:
    print("Invalid runtime. Please set it to either 'cpu' or 'gpu'.")
    runtimeFlag = None

cache_dir = "trelis/modelcache"
model_id = "trelis/deepseek-coder-6.7b-instruct-function-calling-v2"
model = None

if runtime == "gpu":
    # Load the model in 4-bit to allow it to fit in a free Google Colab runtime with a CPU and T4 GPU
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True, #adds speed with minimal loss of quality.
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map='auto', # for inference use 'auto', for training use device_map={"":0}
        # device_map=runtimeFlag,
        trust_remote_code=True,
        # rope_scaling = {"type": "dynamic", "factor": 2.0}, # allows for a max sequence length of 8192 tokens !!! [not tested in this notebook yet]
        cache_dir=cache_dir)
    # Not possible to use bits and bits if using cpu only, afaik
else:
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=runtimeFlag, trust_remote_code=True, cache_dir=cache_dir) # this can easily exhaust Colab RAM. Note that bfloat16 can't be used on cpu.

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, cache_dir=cache_dir, use_fast=True) # will use the Rust fast tokenizer if available
print(tokenizer.eos_token)

search_bing_metadata = {
    "function": "search_bing",
    "description": "Search the web for content on Bing. This allows users to search online/the internet/the web for content.",
    "arguments": [
        {
            "name": "query",
            "type": "string",
            "description": "The search query string"
        }
    ]
}

functionList = ''
functionList += json.dumps(search_bing_metadata, indent=4, separators=(',', ': '))
print(functionList)

# Define a stream *with* function calling capabilities
user_prompt = "Search bing for the tallest mountain in Ireland"

# Define the roles and markers
B_INST, E_INST = "[INST]", "[/INST]"
B_FUNC, E_FUNC = "<FUNCTIONS>", "</FUNCTIONS>\n\n"
# Format your prompt template
prompt = f"{B_FUNC}{functionList.strip()}{E_FUNC}{B_INST} {user_prompt.strip()} {E_INST}\n\n"
inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)  # keep the inputs on the same device as the model
streamer = TextStreamer(tokenizer)
# Despite returning the usual output, the streamer will also print the generated text to stdout.
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=500)

# print("Runtime flag is:", runtimeFlag)

Second try: using the recommended code from "deepseek-coder-6.7b-instruct-function-calling-adapters-v2"

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "trelis/deepseek-coder-6.7b-instruct-function-calling-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,device_map='cuda').cuda()
messages=[
    { 'role': 'user', 'content': "write a quick sort algorithm in python."}
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
# 32021 is the id of <|EOT|> token
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, top_k=50, top_p=0.95, num_return_sequences=1, eos_token_id=32021)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

It raises errors in the following three cases:
The first is raised by DeepSpeed: if I set device_map to auto, DeepSpeed cannot load the model across both CPU and GPU.
The second comes from Hugging Face, which says the cuda() function does not support bitsandbytes 4-bit quantization.
The third: when I set device_map to cuda:0, I get an OOM error from NVIDIA.
Hoping for your reply, thank you.

Yeah, I think the issue is that the weights are saved to the Hub in float32 (which is uncommon), so you're running out of memory (OOM).

The solution is to set

torch_dtype=torch.bfloat16 #if on an A100 or Ampere architecture GPU,
torch_dtype=torch.float16 #if on a T4

I think your 3090 is Ampere architecture, so you can add the first line when loading the model.
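To make the arithmetic behind that concrete, here is a rough sketch (not from the Colab). It assumes about 6.7e9 parameters, taken from the "6.7b" in the model name, and counts weights only, ignoring activations and the KV cache. The helper names `weight_memory_gb` and `pick_dtype` are made up for illustration. A 3090 has 24 GB of VRAM, so float32 weights alone would not fit on one card, while bfloat16 weights would.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def pick_dtype(compute_capability_major: int) -> str:
    """Pick a half-precision dtype name from the GPU's major compute capability.

    Ampere cards (A100, RTX 3090) report major version 8 and support bfloat16;
    a T4 (Turing) reports 7, so float16 is used there instead.
    With PyTorch installed, the major version is torch.cuda.get_device_capability()[0].
    """
    return "bfloat16" if compute_capability_major >= 8 else "float16"


N_PARAMS = 6.7e9  # assumption: read off the "6.7b" in the model name

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.1f} GB of weights")
```
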

BTW, there should be no need to have deepspeed to run this (unless you're trying to train with deepspeed).

Also, try setting device_map="cuda" (or swap in device_map=runtimeFlag in the above code, which does the same thing).

Test things out, and if they still don't work, I'll make a v3 of this model. v2 is now deprecated, which also makes it harder to get working.
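Putting the two suggestions together, loading could look roughly like the sketch below. It is an illustration rather than the updated Colab code: `build_load_kwargs` and `load_model` are hypothetical helper names, and the imports sit inside `load_model` so that nothing is downloaded until you call it. Note that it does not call `.cuda()` after loading, since `device_map` already places the model.

```python
MODEL_ID = "Trelis/deepseek-coder-6.7b-instruct-function-calling-v2"


def build_load_kwargs(dtype_name: str = "bfloat16", device_map: str = "cuda") -> dict:
    """Collect the two settings discussed above (dtype given by name)."""
    return {"torch_dtype": dtype_name, "device_map": device_map}


def load_model(model_id: str = MODEL_ID, dtype_name: str = "bfloat16",
               device_map: str = "cuda"):
    """Load tokenizer and model in half precision on the GPU.

    Requires torch, transformers and accelerate, plus access to the model repo.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = build_load_kwargs(dtype_name, device_map)
    # Turn the dtype name into the real torch dtype, e.g. torch.bfloat16.
    kwargs["torch_dtype"] = getattr(torch, kwargs["torch_dtype"])

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    return tokenizer, model
```

On a 3090 you would call `tokenizer, model = load_model()`; on a T4, `load_model(dtype_name="float16")`.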

I've made updates to the Colab notebook.

Thanks! It works on my computer.
I also bought the adapter 'deepseek-coder-6.7b-instruct-function-calling-adapters-v2'. How can I apply this LoRA to the model?

I also found something strange in this code block:

Request:
'stream('Search bing for the tallest mountain in Ireland')'
Response
<|begin▁of▁sentence|><FUNCTIONS>{
    "function": "search_bing",
    "description": "Search the web for content on Bing. This allows users to search online/the internet/the web for content.",
    "arguments": [
        {
            "name": "query",
            "type": "string",
            "description": "The search query string"
        }
    ]
}</FUNCTIONS>


### Instruction:
Search bing for the tallest mountain in Ireland
### Response:


­!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Typical Inference Rates on fLlama 7B:

And on my computer, it responds like this:

>>> stream('Search bing for the tallest mountain in Ireland')
<|begin▁of▁sentence|><FUNCTIONS>{
    "function": "search_bing",
    "description": "Search the web for content on Bing. This allows users to search online/the internet/the web for content.",
    "arguments": [
        {
            "name": "query",
            "type": "string",
            "description": "The search query string"
        }
    ]
}</FUNCTIONS>

[INST] Search bing for the tallest mountain in Ireland [/INST]

[INST] Search bing for the oldest human remains in the world [/INST]

[INST] Search bing for the smallest city in the world [/INST]

[INST] Search bing for the longest river in the world [/INST]

[INST] Search bing for the smallest country in the world [/INST]

[INST] Search bing for the oldest tree in the world [/INST]

[INST] Search bing for the tallest tree in the world [/INST]

[INST] Search bing for the smallest mammal in the world [/INST]

[INST} Search bing for the tallest mammal in the world [/INST}

[INST} Search bing for the smallest bird in the world [/INST}

[INST} Search bing for the tallest bird in the world [/INST}

[INST} Search bing for the smallest fish in the world [/INST}

[INST} Search bing for the tallest fish in the world [/INST}

[INST} Search bing for the smallest amphibian in the world [/INST}

Is this answer correct?

Yeah, on your computer you can't use [INST] because that is for Mistral/Llama. You need to swap B_INST for \nInstruction:\n and E_INST for \nResponse:\n .
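As a sketch of what that swap means for the prompt-building code from the first attempt: the function markers stay the same and only the instruction markers change. `build_prompt` is a hypothetical helper, and the exact whitespace around the markers is an assumption. The Response sample pasted earlier in this thread shows the markers as "### Instruction:" and "### Response:", so they are kept as parameters in case that variant is the one the model expects.

```python
import json

B_FUNC, E_FUNC = "<FUNCTIONS>", "</FUNCTIONS>\n\n"
# Markers as written in the reply above; pass others if your model expects them.
B_INST, E_INST = "\nInstruction:\n", "\nResponse:\n"

search_bing_metadata = {
    "function": "search_bing",
    "description": "Search the web for content on Bing. This allows users to "
                   "search online/the internet/the web for content.",
    "arguments": [
        {"name": "query", "type": "string", "description": "The search query string"}
    ],
}


def build_prompt(functions, user_prompt, b_inst=B_INST, e_inst=E_INST):
    """Format function metadata plus a user request into a single prompt string."""
    function_list = "".join(
        json.dumps(f, indent=4, separators=(",", ": ")) for f in functions
    )
    return f"{B_FUNC}{function_list.strip()}{E_FUNC}{b_inst}{user_prompt.strip()}{e_inst}"


prompt = build_prompt([search_bing_metadata],
                      "Search bing for the tallest mountain in Ireland")
print(prompt)
```
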

LMK if that works. Otherwise, best option is for me to make the v3.

BTW, to use the adapter you need to load the base chat model from deepseek and then apply the adapter (there's code in colab that you need to uncomment to load that). It's an inferior approach because inference is slower when you apply an adapter.
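For reference, the adapter route usually looks something like the sketch below with the peft library. This is not the Colab code. The base repo id `deepseek-ai/deepseek-coder-6.7b-instruct` is my assumption for "the base chat model from deepseek", and taking the tokenizer from the base repo is also an assumption. `load_with_adapter` is a hypothetical helper, and its imports are inside the function so nothing loads until you call it. The adapter repo is gated, so you need to be logged in to the Hub with an account that has access.

```python
BASE_MODEL_ID = "deepseek-ai/deepseek-coder-6.7b-instruct"  # assumed base model
ADAPTER_ID = "Trelis/deepseek-coder-6.7b-instruct-function-calling-adapters-v2"


def load_with_adapter(base_id: str = BASE_MODEL_ID, adapter_id: str = ADAPTER_ID,
                      merge: bool = False):
    """Load the base model, then apply the LoRA adapter on top of it.

    Requires torch, transformers, accelerate and peft.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="cuda"
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    if merge:
        # Fold the LoRA weights into the base weights, which removes the
        # extra adapter computation at inference time.
        model = model.merge_and_unload()

    tokenizer = AutoTokenizer.from_pretrained(base_id)  # assumption: base tokenizer
    return tokenizer, model
```

Calling `load_with_adapter(merge=True)` may help with the slower inference mentioned above, though I have not measured it for this adapter.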

Thank you for your guidance. I've tried the method and it's working now. I really appreciate your help and patience.

BTW, is there any way to speed up generation, such as using vLLM or another framework for faster inference?

I tried to load deepseek-coder-6.7b-instruct-function-calling-v2 with vLLM, and it returned this:

2024-03-20 09:58:51 | ERROR | stderr |   File "<frozen runpy>", line 198, in _run_module_as_main
2024-03-20 09:58:51 | ERROR | stderr |   File "<frozen runpy>", line 88, in _run_code
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/fastchat/serve/vllm_worker.py", line 271, in <module>
2024-03-20 09:58:51 | ERROR | stderr |     engine = AsyncLLMEngine.from_engine_args(engine_args)
2024-03-20 09:58:51 | ERROR | stderr |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 554, in from_engine_args
2024-03-20 09:58:51 | ERROR | stderr |     engine = cls(parallel_config.worker_use_ray,
2024-03-20 09:58:51 | ERROR | stderr |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 274, in __init__
2024-03-20 09:58:51 | ERROR | stderr |     self.engine = self._init_engine(*args, **kwargs)
2024-03-20 09:58:51 | ERROR | stderr |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 319, in _init_engine
2024-03-20 09:58:51 | ERROR | stderr |     return engine_class(*args, **kwargs)
2024-03-20 09:58:51 | ERROR | stderr |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 111, in __init__
2024-03-20 09:58:51 | ERROR | stderr |     self._init_workers()
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 146, in _init_workers
2024-03-20 09:58:51 | ERROR | stderr |     self._run_workers("load_model")
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 912, in _run_workers
2024-03-20 09:58:51 | ERROR | stderr |     driver_worker_output = getattr(self.driver_worker,
2024-03-20 09:58:51 | ERROR | stderr |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/worker.py", line 81, in load_model
2024-03-20 09:58:51 | ERROR | stderr |     self.model_runner.load_model()
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 64, in load_model
2024-03-20 09:58:51 | ERROR | stderr |     self.model = get_model(self.model_config)
2024-03-20 09:58:51 | ERROR | stderr |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/model_loader.py", line 72, in get_model
2024-03-20 09:58:51 | ERROR | stderr |     model.load_weights(model_config.model, model_config.download_dir,
2024-03-20 09:58:51 | ERROR | stderr |   File "/opt/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/llama.py", line 329, in load_weights
2024-03-20 09:58:51 | ERROR | stderr |     param = params_dict[name]
2024-03-20 09:58:51 | ERROR | stderr |             ~~~~~~~~~~~^^^^^^
2024-03-20 09:58:51 | ERROR | stderr | KeyError: 'base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.weight'

Here is the command I ran:

python -m vllm.entrypoints.openai.api_server --model /home/deepseek-coder-6.7b-instruct-function-calling-v2

Hmm, your error message mentions a lora weight, which suggests you're maybe loading the adapter model rather than this model: https://huggingface.co/Trelis/deepseek-coder-6.7b-instruct-function-calling-v2/ .
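A quick way to check which kind of download a local folder holds, as a sketch: PEFT saves adapters with an `adapter_config.json` file, and LoRA tensors carry `lora_A` / `lora_B` in their names, like the key in the traceback above. The helper names `looks_like_adapter` and `is_lora_key` are made up for illustration.

```python
import os


def looks_like_adapter(model_dir: str) -> bool:
    """True if the folder contains a PEFT adapter config rather than only a full model."""
    return os.path.isfile(os.path.join(model_dir, "adapter_config.json"))


def is_lora_key(weight_name: str) -> bool:
    """True if a weight name belongs to a LoRA adapter matrix."""
    return "lora_A" in weight_name or "lora_B" in weight_name


# The key from the vLLM traceback is a LoRA weight name:
print(is_lora_key("base_model.model.model.layers.0.self_attn.qkv_proj.lora_A.weight"))
```

Running `looks_like_adapter("/home/deepseek-coder-6.7b-instruct-function-calling-v2")` on the folder passed to `--model` would show whether the adapter files ended up there.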

BTW, vLLM and TGI should work. If you go to https://github.com/TrelisResearch/one-click-llms you can find one-click templates for both, including one for DeepSeek Coder v3, which you can modify...

BTW, when using vLLM, the default chat template is used. These v2 models don't have a chat template that is suitable for adding function info (That's only in the v3). So you need to pass in a custom chat template (not trivial as I haven't set one up for v2 models). So, using TGI and pre-formatting the prompt yourself is probably best.
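Pre-formatting the prompt yourself and sending it to TGI could look roughly like this. It is a sketch under assumptions: a TGI server already running at `http://localhost:8080` (a placeholder address), its `/generate` endpoint, and hypothetical helper names `build_tgi_request` and `generate`. The prompt string you pass in is the fully formatted one, functions block and instruction markers included, since no chat template is applied on this route.

```python
import json
import urllib.request


def build_tgi_request(prompt: str, max_new_tokens: int = 500) -> dict:
    """Build the JSON body for TGI's /generate endpoint."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}


def generate(prompt: str, base_url: str = "http://localhost:8080",
             max_new_tokens: int = 500) -> str:
    """POST an already formatted prompt to a running TGI server and return its text."""
    body = json.dumps(build_tgi_request(prompt, max_new_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["generated_text"]
```
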
