smpanaro/Llama-2-7b-NuGPTQ

+---
+datasets:
+- wikitext
+metrics:
+- perplexity
+---
+**N**on-**u**niform **GPTQ** (NuGPTQ) combines [GPTQ](https://arxiv.org/abs/2210.17323), [SqueezeLLM](https://arxiv.org/abs/2306.07629) and [output scaling](https://stephenpanaro.com/blog/llm-quantization-for-iphone) into a competitive whole-tensor (no grouping) LLM compression method.
+Wikitext word perplexity for Llama-2-7b-hf (lower is better; Delta is the increase over float16):
+|Method       |Wikitext PPL (↓)|Delta vs. float16|
+|--           |--              |--               |
+|float16      |8.7071          |0                |
+|AWQ          |8.9760          |0.2689           |
+|NuGPTQ (This)|9.2754          |0.5683           |
+|GPTQ†        |9.4686          |0.7615           |
+<sub>† g128, desc_act=True</sub>
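The Delta column is each method's word perplexity minus the float16 baseline. A minimal sketch that recomputes it from the values in the table (the variable names are illustrative only):

```python
# Word perplexities copied from the results table.
ppl = {
    "float16": 8.7071,
    "AWQ": 8.9760,
    "NuGPTQ": 9.2754,
    "GPTQ (g128, desc_act=True)": 9.4686,
}
baseline = ppl["float16"]

# Delta = perplexity increase over the float16 baseline (lower is better).
deltas = {name: round(value - baseline, 4) for name, value in ppl.items()}

# Print the methods from smallest to largest degradation.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:<28} {ppl[name]:.4f}  +{delta:.4f}")
```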
+<details>
+<summary>perplexity reproduction steps</summary>
+```shell
+git clone https://github.com/EleutherAI/lm-evaluation-harness
+cd lm-evaluation-harness
+pip install -e .
+pip install optimum
+huggingface-cli login
+# Set batch size based on your GPU.
+lm_eval --model hf \
+    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype="float16" \
+    --tasks wikitext \
+    --batch_size 1
+# hf (pretrained=meta-llama/Llama-2-7b-hf,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
+# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
+# |--------|------:|------|-----:|---------------|-----:|---|------|
+# |wikitext|      2|none  |     0|word_perplexity|8.7071|±  |N/A   |
+# |        |       |none  |     0|byte_perplexity|1.4989|±  |N/A   |
+# |        |       |none  |     0|bits_per_byte  |0.5839|±  |N/A   |
+lm_eval --model hf \
+    --model_args pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype="float16",use_safetensors=True,trust_remote_code=True \
+    --tasks wikitext \
+    --batch_size 1
+# hf (pretrained=smpanaro/Llama-2-7b-NuGPTQ,dtype=float16,use_safetensors=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
+# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
+# |--------|------:|------|-----:|---------------|-----:|---|------|
+# |wikitext|      2|none  |     0|word_perplexity|9.2754|±  |N/A   |
+# |        |       |none  |     0|byte_perplexity|1.5167|±  |N/A   |
+# |        |       |none  |     0|bits_per_byte  |0.6009|±  |N/A   |
+pip install auto-gptq
+lm_eval --model hf \
+    --model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-128g-actorder_True \
+    --tasks wikitext \
+    --batch_size 1
+# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-128g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
+# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
+# |--------|------:|------|-----:|---------------|-----:|---|------|
+# |wikitext|      2|none  |     0|word_perplexity|9.4686|±  |N/A   |
+# |        |       |none  |     0|byte_perplexity|1.5225|±  |N/A   |
+# |        |       |none  |     0|bits_per_byte  |0.6065|±  |N/A   |
+lm_eval --model hf \
+    --model_args pretrained=TheBloke/Llama-2-7B-GPTQ,dtype="float16",revision=gptq-4bit-32g-actorder_True \
+    --tasks wikitext \
+    --batch_size 1
+# hf (pretrained=TheBloke/Llama-2-7B-GPTQ,dtype=float16,revision=gptq-4bit-32g-actorder_True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
+# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
+# |--------|------:|------|-----:|---------------|-----:|---|------|
+# |wikitext|      2|none  |     0|word_perplexity|9.3801|±  |N/A   |
+# |        |       |none  |     0|byte_perplexity|1.5199|±  |N/A   |
+# |        |       |none  |     0|bits_per_byte  |0.6040|±  |N/A   |
+pip install autoawq
+lm_eval --model hf \
+    --model_args pretrained=TheBloke/Llama-2-7B-AWQ,dtype="float16" \
+    --tasks wikitext \
+    --batch_size 1
+# hf (pretrained=TheBloke/Llama-2-7B-AWQ,dtype=float16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
+# | Tasks  |Version|Filter|n-shot|    Metric     |Value |   |Stderr|
+# |--------|------:|------|-----:|---------------|-----:|---|------|
+# |wikitext|      2|none  |     0|word_perplexity|8.9760|±  |N/A   |
+# |        |       |none  |     0|byte_perplexity|1.5074|±  |N/A   |
+# |        |       |none  |     0|bits_per_byte  |0.5921|±  |N/A   |
+```
+</details>
+The model is fake-quantized: the weight tensor of each linear layer contains at most 16 (2<sup>4</sup>) unique values, but those values are stored as float16. The uniqueness can be checked as follows:
+```python
+from transformers import AutoModelForCausalLM
+model = AutoModelForCausalLM.from_pretrained("smpanaro/Llama-2-7b-NuGPTQ")
+linear_layers = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
+count = 0
+for key, tensor in model.state_dict().items():
+    if "weight" not in key:
+        continue
+    if any(name in key for name in linear_layers):
+        assert tensor.unique().shape[0] <= 16, f"{key} has more than 16 unique values"
+        print("✓", end="", flush=True)
+        count += 1
+print()
+# 32 model layers * 7 linear layers = 224
+print(f"{count} out of 224 linear layers have at most 16 unique values.")
+```
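Because each tensor has at most 16 distinct values, it could in principle be stored as a 16-entry lookup table plus one 4-bit index per weight. The following is a minimal pure-Python sketch of that idea. `pack_4bit` and `unpack_4bit` are hypothetical helpers written for illustration; they are not part of this repository or of any library, and this is not how the checkpoint is actually stored.

```python
def pack_4bit(values):
    """Split floats with <= 16 unique values into (lookup table, packed bytes)."""
    lut = sorted(set(values))
    assert len(lut) <= 16, "more than 16 unique values; not 4-bit representable"
    index_of = {v: i for i, v in enumerate(lut)}
    indices = [index_of[v] for v in values]
    if len(indices) % 2:
        indices.append(0)  # pad to an even count so indices pair up into bytes
    # Two 4-bit indices per byte: even positions in the low nibble, odd in the high.
    packed = bytes((hi << 4) | lo for lo, hi in zip(indices[0::2], indices[1::2]))
    return lut, packed


def unpack_4bit(lut, packed, n):
    """Recover the first n values from the lookup table and packed indices."""
    indices = []
    for byte in packed:
        indices.append(byte & 0x0F)  # low nibble
        indices.append(byte >> 4)    # high nibble
    return [lut[i] for i in indices[:n]]


# Toy "fake-quantized" weights: floats drawn from a small set of values.
weights = [-0.5, 0.25, 0.25, 1.0, -0.5, 0.0, 1.0]
lut, packed = pack_4bit(weights)
restored = unpack_4bit(lut, packed, len(weights))
assert restored == weights  # the round trip is lossless
print(len(lut), "LUT entries,", len(packed), "bytes for", len(weights), "weights")
```

The lookup table does not need evenly spaced entries, which is what lets a non-uniform scheme place its 16 levels wherever the weights need them.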