Update README.md
README.md

tags:
- llama-3
license: llama3.2
---

# Overview

A quantized version of Llama 3.2 1B Instruct with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq).

## Use with transformers

Start with the following package versions:

- `transformers==4.45.1`
- `torch==2.3.1`
- `numpy==2.0.0`
- `autoawq==0.2.6`
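
Optionally, a quick check that the pinned versions are the ones actually installed (a small sketch; it assumes each package exposes `__version__` and that the `autoawq` distribution imports as `awq`):

```python
import numpy
import torch
import transformers
import awq  # provided by the autoawq package

# Versions picked up at import time; they should match the pins above.
print("transformers:", transformers.__version__)  # expected 4.45.1
print("torch:", torch.__version__)                # expected 2.3.1
print("numpy:", numpy.__version__)                # expected 2.0.0
```
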
### For CUDA users

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"

# Load the AWQ-quantized model and its tokenizer
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)

# Stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Declare prompt
prompt = "You're standing on the surface of the Earth. " \
         "You walk one mile south, one mile west and one mile north. " \
         "You end up exactly where you started. Where are you?"

# Tokenize the prompt and move the input ids to the GPU
tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output in a streaming fashion
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
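
Since this is an Instruct checkpoint, you will often get better answers by wrapping the prompt in the tokenizer's chat template instead of passing raw text. A minimal variant of the generation step above (assuming the bundled tokenizer ships the Llama 3.2 chat template):

```python
# Build a single-turn chat and let the tokenizer apply the chat template.
messages = [{"role": "user", "content": prompt}]
tokens = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).cuda()

# Same streaming generation call as above.
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```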

#### Issue/Solution

- Issue: `torch.from_numpy` fails.
  - This appears to be caused by issues in the C++ extension files of `torch==2.3.1`. Since AutoAWQ uses torch 2.3.1 rather than the most recent release, the failure can surface in the module `marlin.py -> def _get_perms()`.
  - Module path: `Python\Python311\site-packages\awq\modules\linear\marlin.py`
- Solution:
  - `_get_perms()` makes several round trips from tensors to NumPy (CPU) and back to tensors (GPU). These can be replaced entirely with tensor operations, which works around the `from_numpy()` issue for now; a patched version is shown below.

```python
# Patched _get_perms() for awq/modules/linear/marlin.py: the permutations are
# built directly as torch tensors instead of round-tripping through NumPy
# (torch is already imported at the top of marlin.py).
def _get_perms():
    perm = []
    for i in range(32):
        perm1 = []
        col = i // 4
        for block in [0, 1]:
            for row in [
                2 * (i % 4),
                2 * (i % 4) + 1,
                2 * (i % 4 + 4),
                2 * (i % 4 + 4) + 1,
            ]:
                perm1.append(16 * row + col + 8 * block)
        for j in range(4):
            perm.extend([p + 256 * j for p in perm1])

    # perm = np.array(perm)
    perm = torch.asarray(perm)
    # interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    interleave = torch.asarray([0, 2, 4, 6, 1, 3, 5, 7])
    perm = perm.reshape((-1, 8))[:, interleave].ravel()
    # perm = torch.from_numpy(perm)
    scale_perm = []
    for i in range(8):
        scale_perm.extend([i + 8 * j for j in range(8)])
    scale_perm_single = []
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```
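
A quick way to sanity-check the patched helper (hypothetical usage; it assumes the patched `marlin.py` still imports cleanly in your environment):

```python
from awq.modules.linear.marlin import _get_perms

perm, scale_perm, scale_perm_single = _get_perms()
print(type(perm))              # torch.Tensor, no NumPy round trip involved
print(perm.shape)              # expected: torch.Size([1024])
print(len(scale_perm))         # expected: 64
print(len(scale_perm_single))  # expected: 32
```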

### For MPS users