This repository contains 2-bit quantized LLaMA-v1 models in GGUF format for use with llama.cpp. All tensors are quantized with Q2_K, except for output.weight, which is Q6_K, and, in the case of LLaMA-v2-70B, attn_v, which is Q4_K. (A quick way to inspect the per-tensor quantization types is sketched after the list below.) These quantized models differ from the standard llama.cpp 2-bit quantization in two ways:

  • These are actual 2-bit quantizations, rather than the mostly 3-bit quantization produced by the standard llama.cpp Q2_K quantization method
  • The models were prepared with a refined (but not yet published) k-quants quantization approach
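
The per-tensor quantization types described above can be checked with the `gguf` Python package that ships with llama.cpp. This is only a sketch; the file name is a placeholder for one of the GGUF files in this repository.

```python
# Sketch: list tensor names and quantization types stored in a GGUF file.
# Requires the gguf package (pip install gguf); the file name is hypothetical.
from gguf import GGUFReader

reader = GGUFReader("llama-13b-2bit.gguf")  # substitute an actual file from this repo
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum value, e.g. Q2_K or Q6_K
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```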

The table below shows Wikitext perplexities for a context length of 2048 tokens, computed with these models using llama.cpp.

| Model | Perplexity |
|-------|------------|
| 7B    | 6.4023     |
| 13B   | 5.3967     |
| 30B   | 4.5065     |
| 65B   | 3.9136     |
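
As a minimal usage sketch, one of these files can be loaded through the llama-cpp-python bindings. The package choice, file name, and prompt below are assumptions for illustration, not part of this repository.

```python
# Sketch: load a 2-bit GGUF model and run a short completion with llama-cpp-python.
# The model path is hypothetical; point it at one of the GGUF files from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-13b-2bit.gguf",  # hypothetical file name
    n_ctx=2048,                        # same context length as the perplexity runs above
)

result = llm("The meaning of life is", max_tokens=32)
print(result["choices"][0]["text"])
```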