TheBloke committed on
Commit a6b0939
1 Parent(s): de4a7c0

Update README.md

Files changed (1)
  1. README.md +22 -8
README.md CHANGED
@@ -35,6 +35,16 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for

Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for these quantisations!

+ ## Required: latest version of Transformers
+
+ Before trying these GPTQs, please update Transformers to the latest Github code:
+
+ ```
+ pip3 install git+https://github.com/huggingface/transformers
+ ```
+
+ If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
+
## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ)
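The new section in this hunk only gives the pip command. As an optional sanity check (a minimal sketch, not something the repo requires), you can confirm from Python that the environment you will run inference in actually picked up the Github build of Transformers:

```python
# Optional check: confirm the freshly installed Transformers build is active in
# the same Python environment used for inference (e.g. text-generation-webui's env).
import transformers

print(transformers.__version__)  # installs from Github usually report a ".dev0" suffix

# Llama support must be present for these models; this import fails on very old releases.
from transformers import LlamaForCausalLM  # noqa: F401
```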
@@ -96,19 +106,19 @@ Remember to update Transformers to latest Github version before trying to use th
pip3 install git+https://github.com/huggingface/transformers
```

-
1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-GPTQ`.
- - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True`
+ - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-GPTQ:gptq-4bit-128g-actorder_True`
  - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done"
- 5. In the top left, click the refresh icon next to **Model**.
- 6. In the **Model** dropdown, choose the model you just downloaded: `Llama-2-70B-GPTQ`
- 7. The model will automatically load, and is now ready for use!
- 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
-   * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
- 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
+ 5. Set Loader to AutoGPTQ or GPTQ-for-LLaMA
+   - If you use AutoGPTQ, make sure "No inject fused attention" is ticked
+ 6. In the top left, click the refresh icon next to **Model**.
+ 7. In the **Model** dropdown, choose the model you just downloaded: `Llama-2-70B-GPTQ`
+ 8. The model will automatically load, and is now ready for use!
+ 9. Then click **Save settings for this model** followed by **Reload the Model** in the top right to make sure your settings are persisted.
+ 10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!

## How to use this GPTQ model from Python code
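The download steps in this hunk use text-generation-webui's `repo:branch` syntax. If you would rather fetch a specific quantisation branch from a script, a minimal sketch with `huggingface_hub` (the `local_dir` value here is only an example) would be:

```python
# Sketch: download one GPTQ branch of the repo without the UI, using huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Llama-2-70B-GPTQ",
    revision="gptq-4bit-128g-actorder_True",  # branch name, as listed under Provided Files
    local_dir="Llama-2-70B-GPTQ",             # example destination directory
)
```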
 
@@ -121,6 +131,8 @@ And update Transformers to the latest version:
pip3 install git+https://github.com/huggingface/transformers
```

+ **Note**: you must set `inject_fused_attention=False` for Llama 2 70B models; see below.
+
Then try the following example code:

```python
@@ -136,6 +148,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
+        inject_fused_attention=False,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
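The hunk above only shows the lines around the changed call. For context, a fuller sketch of how that call is typically used follows; the `model_basename` value, the prompt, and the generation settings are illustrative assumptions rather than values shown in this diff, and `model_basename` must match the `.safetensors` file in the branch you downloaded.

```python
# Sketch of end-to-end usage around the from_quantized() call shown above.
# Assumes auto-gptq and the latest Github Transformers are installed.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"
model_basename = "gptq_model-4bit--1g"  # placeholder: check the actual file name in the branch

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        inject_fused_attention=False,  # required for Llama 2 70B, per the note above
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        quantize_config=None)

prompt = "Tell me about AI"  # example prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=128)
print(tokenizer.decode(output[0]))
```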
@@ -147,6 +160,7 @@ To download from a specific branch, use the revision parameter, as in this examp

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
+        inject_fused_attention=False,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
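Each quantisation option lives on its own branch, and the `revision` parameter above selects one of them. If you are unsure which branch names exist, a small sketch with `huggingface_hub` (assuming a version that provides `list_repo_refs`) can enumerate them before you choose a `revision`:

```python
# Sketch: list the branches of the repo to see which revision= values are valid.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("TheBloke/Llama-2-70B-GPTQ")
for branch in refs.branches:
    print(branch.name)  # e.g. main, gptq-4bit-32g-actorder_True, ...
```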
 