Update README.md
README.md
CHANGED
@@ -35,6 +35,16 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for

Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for these quantisations!

## Required: latest version of Transformers

Before trying these GPTQs, please update Transformers to the latest Github code:

```
pip3 install git+https://github.com/huggingface/transformers
```

If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
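As a quick sanity check (an addition here, not part of the original README), you can confirm which Transformers build is active in the environment you will run from:

```python
# Print the installed Transformers version. A Github build installed as above
# usually carries a ".dev0" suffix; Llama 2 support first landed in the 4.31 series.
import transformers
print(transformers.__version__)
```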

## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ)

@@ -96,19 +106,19 @@ Remember to update Transformers to latest Github version before trying to use th

```
pip3 install git+https://github.com/huggingface/transformers
```

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-GPTQ`.
  - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-GPTQ:gptq-4bit-128g-actorder_True` (a scripted alternative is sketched after this list).
  - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. Set Loader to AutoGPTQ or GPTQ-for-LLaMA.
  - If you use AutoGPTQ, make sure "No inject fused attention" is ticked.
6. In the top left, click the refresh icon next to **Model**.
7. In the **Model** dropdown, choose the model you just downloaded: `Llama-2-70B-GPTQ`.
8. The model will automatically load, and is now ready for use!
9. Click **Save settings for this model**, followed by **Reload the Model** in the top right, to make sure your settings are persisted.
10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
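As a scripted alternative to the webui downloader above, a single branch can also be fetched with the `huggingface_hub` library. This is a minimal sketch rather than part of the original instructions; the `local_dir` value is only an example:

```python
# Sketch: download one quantisation branch of the repo with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Llama-2-70B-GPTQ",
    revision="gptq-4bit-128g-actorder_True",  # branch name, per Provided Files
    local_dir="Llama-2-70B-GPTQ",             # example destination directory
)
```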

## How to use this GPTQ model from Python code

@@ -121,6 +131,8 @@ And update Transformers to the latest version:

```
pip3 install git+https://github.com/huggingface/transformers
```

**Note**: you must set `inject_fused_attention=False` for Llama 2 70B models; see below.

Then try the following example code:

@@ -136,6 +148,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

```python
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        inject_fused_attention=False,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
```

@@ -147,6 +160,7 @@ To download from a specific branch, use the revision parameter, as in this examp

```python
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
        inject_fused_attention=False,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
```
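However the model is loaded, generation then follows the usual Transformers pattern. The snippet below continues from the `model` and `tokenizer` defined in the example above and is only an illustrative sketch; the prompt and sampling settings are arbitrary, not from the original README:

```python
# Illustrative sketch: run a prompt through the loaded GPTQ model.
prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(input_ids=input_ids, do_sample=True, temperature=0.7, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```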