TheBloke committed on
Commit
6d2122a
·
1 Parent(s): e7fb300

Update README.md

Files changed (1)
  1. README.md +5 -54
README.md CHANGED
@@ -32,49 +32,25 @@ I haven't specifically tested VRAM requirements yet but will aim to do so at som

  If you want to try CPU inference instead, check out my GGML repo: [TheBloke/alpaca-lora-65B-GGML](https://huggingface.co/TheBloke/alpaca-lora-65B-GGML).

- ## GIBBERISH OUTPUT IN `text-generation-webui`?
-
- Please read the Provided Files section below. You should use `alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors` unless you are able to use the latest Triton branch of GPTQ-for-LLaMa.
-
  ## Provided files

- Three files are provided. **The second and third files will not work unless you use a recent version of the Triton branch of GPTQ-for-LLaMa**
-
- Specifically, the last two files use `--act-order` for maximum quantisation quality and will not work with oobabooga's fork of GPTQ-for-LLaMa. Therefore at this time it will also not work with the CUDA branch of GPTQ-for-LLaMa, or `text-generation-webui` one-click installers.
+ Three files are provided, in separate branches.

- Unless you are able to use the latest Triton GPTQ-for-LLaMa code, please use `medalpaca-13B-GPTQ-4bit-128g.no-act-order.safetensors`
-
- * `alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors`
- * Works with all versions of GPTQ-for-LLaMa code, both Triton and CUDA branches
- * Works with text-generation-webui one-click-installers
- * Works on Windows
+ * `alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors` - branch main
  * Will require ~40GB of VRAM, meaning you'll need an A100 or 2 x 24GB cards.
- * I haven't yet tested how much VRAM is required exactly so it's possible it won't run on an A100 40GB
  * Parameters: Groupsize = 128g. No act-order.
  * Command used to create the GPTQ:
  ```
  CUDA_VISIBLE_DEVICES=0 python3 llama.py alpaca-lora-65B-HF c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors
  ```
- * `alpaca-lora-65B-GPTQ-4bit-128g.safetensors`
- * Only works with the latest Triton branch of GPTQ-for-LLaMa
- * **Does not** work with text-generation-webui one-click-installers
- * **Does not** work on Windows
- * Will require 40+GB of VRAM, meaning you'll need an A100 or 2 x 24GB cards.
- * I haven't yet tested how much VRAM is required exactly so it's possible it won't run on an A100 40GB
+ * `alpaca-lora-65B-GPTQ-4bit-128g.safetensors` - branch gptq-4bit-128g-actorder_True
  * Parameters: Groupsize = 128g. act-order.
- * Offers highest quality quantisation, but requires recent Triton GPTQ-for-LLaMa code and more VRAM
  * Command used to create the GPTQ:
  ```
  CUDA_VISIBLE_DEVICES=0 python3 llama.py alpaca-lora-65B-HF c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors alpaca-lora-65B-GPTQ-4bit-128g.safetensors
  ```
- * `alpaca-lora-65B-GPTQ-4bit-1024g.safetensors`
- * Only works with the latest Triton branch of GPTQ-for-LLaMa
- * **Does not** work with text-generation-webui one-click-installers
- * **Does not** work on Windows
- * Should require less VRAM than the 128g file, so hopefully it will run in an A100 40GB
- * I haven't yet tested how much VRAM is required exactly
+ * `alpaca-lora-65B-GPTQ-4bit-1024g.safetensors` - branch gptq-4bit-1024g-actorder_True
  * Parameters: Groupsize = 1024g. act-order.
- * Offers the benefits of act-order, but at a higher groupsize to reduce VRAM requirements
  * Command used to create the GPTQ:
  ```
  CUDA_VISIBLE_DEVICES=0 python3 llama.py alpaca-lora-65B-HF c4 --wbits 4 --true-sequential --act-order --groupsize 1024 --save_safetensors alpaca-lora-65B-GPTQ-4bit-1024g.safetensors
@@ -82,32 +58,7 @@ Unless you are able to use the latest Triton GPTQ-for-LLaMa code, please use `me

  ## How to run in `text-generation-webui`

- File `alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors` can be loaded the same as any other GPTQ file, without requiring any updates to [oobaboogas text-generation-webui](https://github.com/oobabooga/text-generation-webui).
-
- [Instructions on using GPTQ 4bit files in text-generation-webui are here](https://github.com/oobabooga/text-generation-webui/wiki/GPTQ-models-\(4-bit-mode\)).
-
- The other two `safetensors` model files were created using `--act-order` to give the maximum possible quantisation quality, but this means it requires that the latest Triton GPTQ-for-LLaMa is used inside the UI.
-
- If you want to use the act-order `safetensors` files and need to update the Triton branch of GPTQ-for-LLaMa, here are the commands I used to clone the Triton branch of GPTQ-for-LLaMa, clone text-generation-webui, and install GPTQ into the UI:
- ```
- # Clone text-generation-webui, if you don't already have it
- git clone https://github.com/oobabooga/text-generation-webui
- # Make a repositories directory
- mkdir text-generation-webui/repositories
- cd text-generation-webui/repositories
- # Clone the latest GPTQ-for-LLaMa code inside text-generation-webui
- git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
- ```
-
- Then install this model into `text-generation-webui/models` and launch the UI as follows:
- ```
- cd text-generation-webui
- python server.py --model alpaca-lora-65B-GPTQ-4bit --wbits 4 --groupsize 128 --model_type Llama # add any other command line args you want
- ```
-
- The above commands assume you have installed all dependencies for GPTQ-for-LLaMa and text-generation-webui. Please see their respective repositories for further information.
-
- If you can't update GPTQ-for-LLaMa to the latest Triton branch, or don't want to, you can use `alpaca-lora-65B-GPTQ-4bit-128g.no-act-order.safetensors` as mentioned above, which should work without any upgrades to text-generation-webui.
+ Please see one of my more recent repos for instructions on loading GPTQ models in text-generation-webui.

  <!-- footer start -->
  ## Discord