TheBloke committed on
Commit a6b0939
1 Parent(s): de4a7c0

Update README.md

Files changed (1)
  1. README.md +22 -8
README.md CHANGED
@@ -35,6 +35,16 @@ Multiple GPTQ parameter permutations are provided; see Provided Files below for

Many thanks to William Beauchamp from [Chai](https://chai-research.com/) for providing the hardware for these quantisations!

+ ## Required: latest version of Transformers
+
+ Before trying these GPTQs, please update Transformers to the latest Github code:
+
+ ```
+ pip3 install git+https://github.com/huggingface/transformers
+ ```
+
+ If using a UI like text-generation-webui, make sure to do this in the Python environment of text-generation-webui.
+
## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Llama-2-70B-GPTQ)
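The new section in this hunk only gives the pip command. As an optional sanity check (a minimal sketch, not something the repo requires), you can confirm from Python that the environment you will run inference in actually picked up the Github build of Transformers:

```python
# Optional check: confirm the freshly installed Transformers build is active in
# the same Python environment used for inference (e.g. text-generation-webui's env).
import transformers

print(transformers.__version__)  # installs from Github usually report a ".dev0" suffix

# Llama support must be present for these models; this import fails on very old releases.
from transformers import LlamaForCausalLM  # noqa: F401
```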
@@ -96,19 +106,19 @@ Remember to update Transformers to latest Github version before trying to use th
pip3 install git+https://github.com/huggingface/transformers
```

-
1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Llama-2-70B-GPTQ`.
- - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-GPTQ:gptq-4bit-32g-actorder_True`
+ - To download from a specific branch, enter for example `TheBloke/Llama-2-70B-GPTQ:gptq-4bit-128g-actorder_True`
  - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done"
- 5. In the top left, click the refresh icon next to **Model**.
- 6. In the **Model** dropdown, choose the model you just downloaded: `Llama-2-70B-GPTQ`
- 7. The model will automatically load, and is now ready for use!
- 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
-   * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
- 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
+ 5. Set Loader to AutoGPTQ or GPTQ-for-LLaMA
+   - If you use AutoGPTQ, make sure "No inject fused attention" is ticked
+ 6. In the top left, click the refresh icon next to **Model**.
+ 7. In the **Model** dropdown, choose the model you just downloaded: `Llama-2-70B-GPTQ`
+ 8. The model will automatically load, and is now ready for use!
+ 9. Then click **Save settings for this model** followed by **Reload the Model** in the top right to make sure your settings are persisted.
+ 10. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!

## How to use this GPTQ model from Python code
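The download steps in this hunk use text-generation-webui's `repo:branch` syntax. If you would rather fetch a specific quantisation branch from a script, a minimal sketch with `huggingface_hub` (the `local_dir` value here is only an example) would be:

```python
# Sketch: download one GPTQ branch of the repo without the UI, using huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Llama-2-70B-GPTQ",
    revision="gptq-4bit-128g-actorder_True",  # branch name, as listed under Provided Files
    local_dir="Llama-2-70B-GPTQ",             # example destination directory
)
```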
 
@@ -121,6 +131,8 @@ And update Transformers to the latest version:
pip3 install git+https://github.com/huggingface/transformers
```

+ **Note**: you must set `inject_fused_attention=False` for Llama 2 70B models; see below.
+
Then try the following example code:

```python
@@ -136,6 +148,7 @@ tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
+        inject_fused_attention=False,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
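The hunk above only shows the lines around the changed call. For context, a fuller sketch of how that call is typically used follows; the `model_basename` value, the prompt, and the generation settings are illustrative assumptions rather than values shown in this diff, and `model_basename` must match the `.safetensors` file in the branch you downloaded.

```python
# Sketch of end-to-end usage around the from_quantized() call shown above.
# Assumes auto-gptq and the latest Github Transformers are installed.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-GPTQ"
model_basename = "gptq_model-4bit--1g"  # placeholder: check the actual file name in the branch

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        inject_fused_attention=False,  # required for Llama 2 70B, per the note above
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        quantize_config=None)

prompt = "Tell me about AI"  # example prompt
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=128)
print(tokenizer.decode(output[0]))
```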
@@ -147,6 +160,7 @@ To download from a specific branch, use the revision parameter, as in this examp

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
+        inject_fused_attention=False,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
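Each quantisation option lives on its own branch, and the `revision` parameter above selects one of them. If you are unsure which branch names exist, a small sketch with `huggingface_hub` (assuming a version that provides `list_repo_refs`) can enumerate them before you choose a `revision`:

```python
# Sketch: list the branches of the repo to see which revision= values are valid.
from huggingface_hub import list_repo_refs

refs = list_repo_refs("TheBloke/Llama-2-70B-GPTQ")
for branch in refs.branches:
    print(branch.name)  # e.g. main, gptq-4bit-32g-actorder_True, ...
```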
 