Is it GPTQ/AWQ?

by yiouyou - opened

I'd like to buy this model. What format is it in? And what was the training loss?

Thanks,

This is a full precision bf16 model. It is not AWQ or GPTQ.

This model was trained with quantized LoRA (bnb nf4), with the adapter then merged into the base model, so I expect the perplexity to be a tiny bit higher than the base model's. For example, if base perplexity were 1.00, there might be a perplexity increase of about 0.05. However, if you load the base model larryvrh/Yi-34B-200K-Llamafied and then apply the adapter, that should eliminate the perplexity increase (the increase comes from the imperfect merging of adapters that were trained under quantization).
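As a rough sketch, that base-plus-adapter path looks like this with transformers and peft. Note the adapter repo id below is a placeholder, since the adapter repo for this model isn't named here:

```python
# Sketch: load the base model in bf16 and apply the LoRA adapter at
# load time, avoiding the merge-related perplexity increase.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "larryvrh/Yi-34B-200K-Llamafied"
adapter_id = "Trelis/function-calling-adapters"  # placeholder, not a confirmed repo

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Keeps the LoRA weights separate rather than baking them into the base.
model = PeftModel.from_pretrained(model, adapter_id)
```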

The main drawback of this current model is that it is fine-tuned from a base model that is not SFT fine-tuned to a chat format. So, I expect this model to do well when function calls are required, but it will only be as good at conversations as the base model. For this reason, I'm deprecating this model in favour of:

"Trelis/Yi-34B-200K-Llamafied-chat-SFT-function-calling-v2"

That is a function-calling fine-tune of a chat fine-tuned model (Trelis/Yi-34B-200K-Llamafied-chat-SFT). Furthermore, all fine-tunes were done in bf16 (not quantized), so there is no perplexity loss as described above. Additionally, I'm making an AWQ version: Trelis/Yi-34B-200K-Llamafied-chat-SFT-function-calling-v2-AWQ. I don't plan to make a GPTQ version because GPTQ's perplexity is poor compared to AWQ's, but let me know if you need GPTQ.
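For anyone wanting to try the AWQ version, a minimal loading sketch with transformers (which picks up the quantization config from the repo, assuming the autoawq package is installed):

```python
# Sketch: loading the AWQ checkpoint with transformers. Weights are
# stored 4-bit, so the 34B model needs far less VRAM than the bf16 original.
from transformers import AutoModelForCausalLM, AutoTokenizer

awq_id = "Trelis/Yi-34B-200K-Llamafied-chat-SFT-function-calling-v2-AWQ"

tokenizer = AutoTokenizer.from_pretrained(awq_id)
model = AutoModelForCausalLM.from_pretrained(awq_id, device_map="auto")
```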

"Trelis/Yi-34B-200K-Llamafied-chat-SFT-function-calling-v2" came out, well done! One question, does this one has the adapter merging issue? Do you plan to generate the adapter version of "chat+function calling"? Based on your suggestion, larryvrh/Yi-34B-200K-Llamafied + adpater seems better than AWQ?

Thanks,

I'm using text-generation-webui to load LLMs. It seems that AWQ takes more GPU memory than GPTQ, and AWQ can't be pinned to a specific GPU, so even though I know AWQ seems better than GPTQ, I still need GPTQ to run the model. And Ollama seems to only support custom LLMs in GGUF format, so I guess GPTQ is what I need at the moment. Thanks~

One more question, "chat-SFT-function-calling-v2" supports both chat and function-calling, doesn't it?

Howdy @yiouyou:

  1. "Trelis/Yi-34B-200K-Llamafied-chat-SFT-function-calling-v2" came out, well done! One question, does this one has the adapter merging issue?

Answer: No. This model was trained in bf16 precision (no quantization), so there is no adapter merging issue. BTW, the model card is now up, allowing purchase access.

  2. Do you plan to release an adapter version of "chat + function calling"? Based on your suggestion, larryvrh/Yi-34B-200K-Llamafied + adapter seems better than AWQ?

Answer: The adapter version is up: "Trelis/Yi-34B-200K-Llamafied-chat-SFT-function-calling-adapters-v2". However, since the model was fine-tuned in bf16 (no quantization), there is no advantage to loading the base model and applying the adapter. The result will be the same, and loading the merged model directly is probably easier.

  3. I'm using text-generation-webui to load LLMs. It seems that AWQ takes more GPU memory than GPTQ, and AWQ can't be pinned to a specific GPU, so even though I know AWQ seems better than GPTQ, I still need GPTQ to run the model. And Ollama seems to only support custom LLMs in GGUF format, so I guess GPTQ is what I need at the moment. Thanks~

Answer: Understood, I'll work on putting up a GPTQ for the chat-SFT-function-calling-v2 model. BTW, it may be worth trying vLLM.
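As a rough sketch (not tested on this exact checkpoint), the AWQ version could be served with vLLM, with GPU pinning handled via CUDA_VISIBLE_DEVICES rather than in the loader:

```python
# Sketch: running the AWQ checkpoint with vLLM. To pin to one GPU,
# launch with e.g.:  CUDA_VISIBLE_DEVICES=1 python serve.py
from vllm import LLM, SamplingParams

llm = LLM(
    model="Trelis/Yi-34B-200K-Llamafied-chat-SFT-function-calling-v2-AWQ",
    quantization="awq",
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is the weather in Dublin?"], params)
print(outputs[0].outputs[0].text)
```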

  4. One more question, "chat-SFT-function-calling-v2" supports both chat and function calling, doesn't it?

Answer: Correct.

RonanMcGovern changed discussion status to closed
