ciCic committed
Commit 2cebe1a · verified · 1 parent: be1abad

Update README.md
Files changed (1): README.md (+72 -2)
README.md CHANGED
@@ -18,17 +18,87 @@ tags:
  - llama-3
  license: llama3.2
  ---
+ # Represents
  A quantized version of Llama 3.2 1B Instruct with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq)

- ### Use with transformers
+ ## Use with transformers
  Starting with
  - `transformers==4.45.1`
  - `torch==2.3.1`
  - `numpy==2.0.0`
  - `autoawq==0.2.6`

- you can run conversational inference using the Transformers Auto classes with the `generate()` function.
+ ### For CUDA users

  ```python
+ from awq import AutoAWQForCausalLM
+ from transformers import AutoTokenizer, TextStreamer
+
+ quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
+ model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
+ tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
+
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+
+ # Declare the prompt
+ prompt = "You're standing on the surface of the Earth. " \
+          "You walk one mile south, one mile west and one mile north. " \
+          "You end up exactly where you started. Where are you?"
+
+ # Tokenize the prompt and move the input ids to the GPU
+ tokens = tokenizer(
+     prompt,
+     return_tensors='pt'
+ ).input_ids.cuda()
+
+ # Generate output in a streaming fashion
+ generation_output = model.generate(
+     tokens,
+     streamer=streamer,
+     max_new_tokens=512
+ )
+ ```
+
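A minimal follow-up sketch, not part of the card above: once streaming finishes, the same `generation_output` can also be decoded into a plain string, assuming the `tokens`, `tokenizer`, and `generation_output` variables from the snippet are still in scope.

```python
# Hedged sketch (not in the original README): decode the generated reply.
# generate() returns the prompt tokens followed by the new tokens, so the
# prompt portion is sliced off before decoding.
new_tokens = generation_output[0][tokens.shape[1]:]
reply = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(reply)
```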
+ #### Issue/Solution
+ - `torch.from_numpy` fails
+   - This may be caused by issues in the `.cpp` files shipped with `torch==2.3.1`. Because AutoAWQ pins torch 2.3.1 rather than the most recent release, the failure can surface in `marlin.py -> def _get_perms()`.
+   - Module path: `Python\Python311\site-packages\awq\modules\linear\marlin.py`
+   - Solution:
+     1. `_get_perms()` performs several conversions to NumPy (CPU) and then back to tensors (GPU); these can be replaced entirely with tensor operations, with no NumPy involved, which (temporarily) resolves the `from_numpy()` issue.
+     2. For example:
+ ```python
+ def _get_perms():
+     perm = []
+     for i in range(32):
+         perm1 = []
+         col = i // 4
+         for block in [0, 1]:
+             for row in [
+                 2 * (i % 4),
+                 2 * (i % 4) + 1,
+                 2 * (i % 4 + 4),
+                 2 * (i % 4 + 4) + 1,
+             ]:
+                 perm1.append(16 * row + col + 8 * block)
+
+         for j in range(4):
+             perm.extend([p + 256 * j for p in perm1])
+
+     # perm = np.array(perm)
+     perm = torch.asarray(perm)
+     # interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
+     interleave = torch.asarray([0, 2, 4, 6, 1, 3, 5, 7])
+     perm = perm.reshape((-1, 8))[:, interleave].ravel()
+     # perm = torch.from_numpy(perm)
+     scale_perm = []
+     for i in range(8):
+         scale_perm.extend([i + 8 * j for j in range(8)])
+     scale_perm_single = []
+     for i in range(4):
+         scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
+     return perm, scale_perm, scale_perm_single
+ ```
+
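A small sanity check, not from this commit: the patched, tensor-only `_get_perms()` should still return a permutation of the indices 0-1023 together with the two scale-permutation lists of their usual lengths, which can be verified as a quick smoke test.

```python
import torch

# Assumes the patched _get_perms() shown above is defined in the current scope.
perm, scale_perm, scale_perm_single = _get_perms()

# The interleaved result should still be a permutation of 0..1023.
assert perm.shape == (1024,)
assert torch.equal(torch.sort(perm).values, torch.arange(1024))

# The scale permutation lists keep their usual lengths.
assert len(scale_perm) == 64
assert len(scale_perm_single) == 32
```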
+ ### For MPS users
  ```
+ ```