Update README.md
README.md CHANGED

tags:
- facebook
- meta
- pytorch
- llama-3
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
# Represents

A quantized version of Llama 3.2 1B Instruct with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq).

## Use with transformers/autoawq

Start with the following package versions:
- `transformers==4.45.1`
- `accelerate==0.34.2`
- `torch==2.3.1`
- `numpy==2.0.0`
- `autoawq==0.2.6`

### For CUDA users

AutoAWQ
```python
"""NOTE: this example uses `fuse_layers=True` to fuse attention and mlp layers together for faster inference"""

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

# ...

generation_output = model.generate(
    # ...
)
```
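
The body of the AutoAWQ example is not shown in full here. As an unofficial sketch of a complete flow (assuming the quantized weights live at `ciCic/llama-3.2-1B-Instruct-AWQ`, the repo id used in the Transformers example below), it might look like this:
```python
# Rough sketch, not the verbatim example from this README: load the AWQ
# checkpoint with AutoAWQ and stream a completion.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"  # assumed repo id

# fuse_layers=True fuses attention and MLP layers for faster inference
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "You're standing on the surface of the Earth. "\
    "You walk one mile south, one mile west and one mile north. "\
    "You end up exactly where you started. Where are you?"

tokens = tokenizer(prompt, return_tensors='pt').input_ids.cuda()

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```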

Transformers
```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import torch

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quant_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda"
)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt = "You're standing on the surface of the Earth. "\
    "You walk one mile south, one mile west and one mile north. "\
    "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
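
The `TextStreamer` already prints the completion as it is generated; if you also want the final text as a string, a small follow-up to the example above (not part of the original README) is:
```python
# Continues from the Transformers example above: `generation_output` holds the
# prompt plus completion token ids; decode them back into text.
output_text = tokenizer.decode(generation_output[0], skip_special_tokens=True)
print(output_text)
```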

#### Issue/Solution
- `torch.from_numpy` fails
  - This might be due to issues in the `torch==2.3.1` C++ (.cpp) sources. Because AutoAWQ pins `torch==2.3.1` rather than the most recent release, the failure can surface in `marlin.py -> def _get_perms()`.
  - Module path: `Python\Python311\site-packages\awq\modules\linear\marlin.py`
- Solution:
  - `_get_perms()` converts data to numpy (CPU) and then back to tensors (GPU) several times; these round-trips can be replaced entirely with tensor operations, avoiding numpy and thereby working around the `from_numpy()` issue for now (see the sketch after the code block below):
```python
def _get_perms():
    perm = []
    for i in range(32):
        # ...
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```
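
As a hypothetical sketch of the numpy-to-torch replacement described in the solution (illustrative values, not the verbatim contents of `marlin.py`), the pattern looks like this:
```python
# Hypothetical sketch: build the permutation directly as a torch tensor
# instead of going through numpy and torch.from_numpy().
import torch

perm = list(range(1024))  # stand-in for the index list that _get_perms() builds

# Pure-torch replacement for the numpy reshape/interleave/ravel + torch.from_numpy() steps
interleave = torch.tensor([0, 2, 4, 6, 1, 3, 5, 7])
perm = torch.tensor(perm, dtype=torch.int64).reshape(-1, 8)[:, interleave].ravel()
```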