Update README.md
README.md

tags:
- llama-3
license: llama3.2
---

# Overview

A quantized version of Llama 3.2 1B Instruct with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq).

## Use with transformers

Start with the following package versions:

- `transformers==4.45.1`
- `torch==2.3.1`
- `numpy==2.0.0`
- `autoawq==0.2.6`
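
Optionally, a quick check that the pinned versions are the ones actually installed (a small sketch; it assumes each package exposes `__version__` and that the `autoawq` distribution imports as `awq`):

```python
import numpy
import torch
import transformers
import awq  # provided by the autoawq package

# Versions picked up at import time; they should match the pins above.
print("transformers:", transformers.__version__)  # expected 4.45.1
print("torch:", torch.__version__)                # expected 2.3.1
print("numpy:", numpy.__version__)                # expected 2.0.0
```
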
### For CUDA users

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"

# Load the AWQ-quantized model and its tokenizer
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)

# Stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Declare prompt
prompt = "You're standing on the surface of the Earth. " \
         "You walk one mile south, one mile west and one mile north. " \
         "You end up exactly where you started. Where are you?"

# Tokenize the prompt and move the input ids to the GPU
tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output in a streaming fashion
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
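
Since this is an Instruct checkpoint, you will often get better answers by wrapping the prompt in the tokenizer's chat template instead of passing raw text. A minimal variant of the generation step above (assuming the bundled tokenizer ships the Llama 3.2 chat template):

```python
# Build a single-turn chat and let the tokenizer apply the chat template.
messages = [{"role": "user", "content": prompt}]
tokens = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).cuda()

# Same streaming generation call as above.
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```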

#### Issue/Solution

- Issue: `torch.from_numpy` fails.
  - This appears to be caused by issues in the C++ extension files of `torch==2.3.1`. Since AutoAWQ uses torch 2.3.1 rather than the most recent release, the failure can surface in the module `marlin.py -> def _get_perms()`.
  - Module path: `Python\Python311\site-packages\awq\modules\linear\marlin.py`
- Solution:
  - `_get_perms()` makes several round trips from tensors to NumPy (CPU) and back to tensors (GPU). These can be replaced entirely with tensor operations, which works around the `from_numpy()` issue for now; a patched version is shown below.

```python
# Patched _get_perms() for awq/modules/linear/marlin.py: the permutations are
# built directly as torch tensors instead of round-tripping through NumPy
# (torch is already imported at the top of marlin.py).
def _get_perms():
    perm = []
    for i in range(32):
        perm1 = []
        col = i // 4
        for block in [0, 1]:
            for row in [
                2 * (i % 4),
                2 * (i % 4) + 1,
                2 * (i % 4 + 4),
                2 * (i % 4 + 4) + 1,
            ]:
                perm1.append(16 * row + col + 8 * block)
        for j in range(4):
            perm.extend([p + 256 * j for p in perm1])

    # perm = np.array(perm)
    perm = torch.asarray(perm)
    # interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    interleave = torch.asarray([0, 2, 4, 6, 1, 3, 5, 7])
    perm = perm.reshape((-1, 8))[:, interleave].ravel()
    # perm = torch.from_numpy(perm)
    scale_perm = []
    for i in range(8):
        scale_perm.extend([i + 8 * j for j in range(8)])
    scale_perm_single = []
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```
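
A quick way to sanity-check the patched helper (hypothetical usage; it assumes the patched `marlin.py` still imports cleanly in your environment):

```python
from awq.modules.linear.marlin import _get_perms

perm, scale_perm, scale_perm_single = _get_perms()
print(type(perm))              # torch.Tensor, no NumPy round trip involved
print(perm.shape)              # expected: torch.Size([1024])
print(len(scale_perm))         # expected: 64
print(len(scale_perm_single))  # expected: 32
```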

### For MPS users