tsumeone
/

stable-vicuna-13B-4bit-128g-cuda

Text Generation

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

stable-vicuna-13B-4bit-128g-cuda / README.md

tsumeone's picture

Update README.md

841c2c6 over 1 year ago

|

history blame contribute delete

730 Bytes

	Quantized version of this: https://huggingface.co/TheBloke/stable-vicuna-13B-HF

	Big thank you to TheBloke for uploading the HF version above. Unfortunately, his GPTQ quant doesn't run on 0cc4m's fork of KAI/GPTQ so I am uploading one that does.

	GPTQ quantization using https://github.com/0cc4m/GPTQ-for-LLaMa for compatibility with 0cc4m's fork of KoboldAI.

	Command used to quantize:
	```python llama.py c:\stable-vicuna-13B-HF c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors 4bit-128g.safetensors```

	This model works best with the following prompting. Also, it really does not like to stop on its own and will likely keep going on forever if you let it.

	```
	### Human:
	What is 2+2?

	### Assistant:


	```