Update README.md
README.md CHANGED

tags:
- facebook
- meta
- pytorch
- llama-3
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
# Represents

A quantized version of Llama 3.2 1B Instruct with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq).

## Use with transformers/autoawq

Start with the following package versions:
- `transformers==4.45.1`
- `accelerate==0.34.2`
- `torch==2.3.1`
- `numpy==2.0.0`
- `autoawq==0.2.6`

### For CUDA users

AutoAWQ
```python
"""NOTE: this example uses `fuse_layers=True` to fuse attention and mlp layers together for faster inference"""

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

# ...

generation_output = model.generate(
    # ...
)
```
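
The body of the AutoAWQ example is not shown in full here. As an unofficial sketch of a complete flow (assuming the quantized weights live at `ciCic/llama-3.2-1B-Instruct-AWQ`, the repo id used in the Transformers example below), it might look like this:
```python
# Rough sketch, not the verbatim example from this README: load the AWQ
# checkpoint with AutoAWQ and stream a completion.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"  # assumed repo id

# fuse_layers=True fuses attention and MLP layers for faster inference
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "You're standing on the surface of the Earth. "\
    "You walk one mile south, one mile west and one mile north. "\
    "You end up exactly where you started. Where are you?"

tokens = tokenizer(prompt, return_tensors='pt').input_ids.cuda()

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```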

Transformers
```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import torch

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quant_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda"
)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt = "You're standing on the surface of the Earth. "\
    "You walk one mile south, one mile west and one mile north. "\
    "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
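
The `TextStreamer` already prints the completion as it is generated; if you also want the final text as a string, a small follow-up to the example above (not part of the original README) is:
```python
# Continues from the Transformers example above: `generation_output` holds the
# prompt plus completion token ids; decode them back into text.
output_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(output_text)
```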

#### Issue/Solution
- `torch.from_numpy` fails
  - This might be due to issues in the `torch==2.3.1` C++ (.cpp) sources. Because AutoAWQ pins `torch==2.3.1` rather than the most recent release, the failure can surface in `marlin.py -> def _get_perms()`.
  - Module path: `Python\Python311\site-packages\awq\modules\linear\marlin.py`
- Solution:
  - `_get_perms()` converts data to numpy (CPU) and then back to tensors (GPU) several times; these round-trips can be replaced entirely with tensor operations, avoiding numpy and thereby working around the `from_numpy()` issue for now (see the sketch after the code block below):
```python
def _get_perms():
    perm = []
    for i in range(32):
        # ...
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```
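
As a hypothetical sketch of the numpy-to-torch replacement described in the solution (illustrative values, not the verbatim contents of `marlin.py`), the pattern looks like this:
```python
# Hypothetical sketch: build the permutation directly as a torch tensor
# instead of going through numpy and torch.from_numpy().
import torch

perm = list(range(1024))  # stand-in for the index list that _get_perms() builds

# Pure-torch replacement for the numpy reshape/interleave/ravel + torch.from_numpy() steps
interleave = torch.tensor([0, 2, 4, 6, 1, 3, 5, 7])
perm = torch.tensor(perm, dtype=torch.int64).reshape(-1, 8)[:, interleave].ravel()
```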