This is an fp16 copy of jarradh/llama2_70b_chat_uncensored for faster downloading and less disk space usage than the fp32 original. I simply imported the model to CPU with torch_dtype=torch.float16 and then exported it again. I also added a chat_template entry derived from the model card to the tokenizer_config.json file, which previously didn't have one. All credit for the model goes to jarradh.
Arguable a better name for this model would be something like Llama-2-70B_Wizard-Vicuna-Uncensored-fp16, but to avoid confusion I'm sticking with jarradh's naming scheme.
Repositories available
- GPTQ models for GPU inference, with multiple quantisation parameter options.
- 2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference
- 2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference, plus fp16 GGUF for requantizing
- Jarrad Hope's unquantised model in fp16 pytorch format, for GPU inference and further conversions
- Jarrad Hope's original unquantised fp32 model in pytorch format, for further conversions
Prompt template: Human-Response
### HUMAN:
{prompt}
### RESPONSE:
- Downloads last month
- 32
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social
visibility and check back later, or deploy to Inference Endpoints (dedicated)
instead.