Update README.md
README.md CHANGED
@@ -40,6 +40,16 @@ Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for pro
* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ)
* [Original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/meta-llama/Llama-2-70b-hf)

+## Required: latest version of Transformers
+
+Before trying these GPTQs, please update Transformers to the latest Github code:
+
+```
+pip3 install git+https://github.com/huggingface/transformers
+```
+
+If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
+
## Prompt template: None

```
@@ -54,10 +64,10 @@ Each separate quant is in a different branch. See below for instructions on fet

| Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
| ------ | ---- | ---------- | -------------------- | --------- | ------------------- | --------- | ----------- |
-| main | 4 | None | True | 35.33 GB |
-| gptq-4bit-32g-actorder_True | 4 | 32 | True | Still processing |
-| gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB |
-| gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB |
+| main | 4 | None | True | 35.33 GB | False | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |
+| gptq-4bit-32g-actorder_True | 4 | 32 | True | Still processing | False | AutoGPTQ | 4-bit, with Act Order and group size. 32g gives highest possible inference quality, with maximum VRAM usage. Poor AutoGPTQ CUDA speed. |
+| gptq-4bit-64g-actorder_True | 4 | 64 | True | 37.99 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 64g uses less VRAM than 32g, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
+| gptq-4bit-128g-actorder_True | 4 | 128 | True | 36.65 GB | False | AutoGPTQ | 4-bit, with Act Order and group size. 128g uses even less VRAM, but with slightly lower accuracy. Poor AutoGPTQ CUDA speed. |
| gptq-3bit--1g-actorder_True | 3 | None | True | Still processing | False | AutoGPTQ | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
| gptq-3bit-128g-actorder_False | 3 | 128 | False | Still processing | False | AutoGPTQ | 3-bit, with group size 128g but no act-order. Slightly higher VRAM requirements than 3-bit None. |
| gptq-3bit-128g-actorder_True | 3 | 128 | True | Still processing | False | AutoGPTQ | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
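Each branch in the table above is an ordinary branch of the Hugging Face model repo, so a specific quant can also be fetched programmatically rather than through a UI. A minimal sketch using `huggingface_hub` (the branch name and target directory below are illustrative choices, not taken from the README):

```python
# Minimal sketch: download one quant branch from the table above.
# The branch and local directory are illustrative.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Llama-2-70B-GPTQ",
    revision="gptq-4bit-128g-actorder_True",  # any branch listed in the table
    local_dir="Llama-2-70B-GPTQ",
)
```

Since each quant lives in its own branch, the `revision` argument is what selects which set of files is downloaded.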
@@ -78,6 +88,15 @@ Please make sure you're using the latest version of [text-generation-webui](http

It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.

+Note: ExLlama is not currently compatible with Llama 2 70B. Please try GPTQ-for-LLaMa, or AutoGPTQ.
+
+Remember to update Transformers to the latest Github version before trying to use this model:
+
+```
+pip3 install git+https://github.com/huggingface/transformers
+```
+
+
1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-GPTQ`.
  - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True`
@@ -97,6 +116,11 @@ First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) instal

`GITHUB_ACTIONS=true pip install auto-gptq`

+And update Transformers to the latest version:
+```
+pip3 install git+https://github.com/huggingface/transformers
+```
+
Then try the following example code:

```python
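# Note: the README's own example continues outside this diff hunk. What follows
# is a rough, illustrative sketch of loading this model with AutoGPTQ, not the
# README's actual code; the parameter and sampling choices below are assumptions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
    # Assumption: fused attention is disabled because AutoGPTQ's fused kernels
    # did not support the 70B's grouped-query attention at the time.
    inject_fused_attention=False,
)

prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, do_sample=True, temperature=0.7, max_new_tokens=128)
print(tokenizer.decode(output[0]))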