
load using gptq-for-llama?

#1
by iateadonut - opened

When I try to run WizardCoder 4bit, I get this error message:

python server.py --listen --chat --model GodRain_WizardCoder-15B-V1.1-4bit --loader gptq-for-llama
2023-07-25 18:25:26 INFO:Loading GodRain_WizardCoder-15B-V1.1-4bit...
2023-07-25 18:25:26 ERROR:The model could not be loaded because its type could not be inferred from its name.
2023-07-25 18:25:26 ERROR:Please specify the type manually using the --model_type argument.
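I assume the flag just gets appended to the same command, something like the line below, but I don't know which value is actually correct for this model (the "llama" value here is only a guess on my part):

python server.py --listen --chat --model GodRain_WizardCoder-15B-V1.1-4bit --loader gptq-for-llama --model_type llama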

The oobabooga interface says:
On some systems, AutoGPTQ can be 2x slower than GPTQ-for-LLaMa. You can manually select the GPTQ-for-LLaMa loader above.

I'm only getting about 2 tokens/s on a 4090, so I'm trying to see how I can speed it up.

  1. Will GPTQ-for-LLaMa be a better model loader than AutoGPTQ?
  2. If so, how can I run it, will it work with this model, and what value should I pass as the --model_type argument?
