ciCic committed on
Commit ef7d746 · verified · 1 Parent(s): 2cebe1a

Update README.md

Files changed (1)
  1. README.md +43 -9
README.md CHANGED

tags:
- facebook
- meta
- pytorch
- llama-3
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
# Represents
A quantized version of Llama 3.2 1B Instruct with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq).

## Use with transformers/autoawq
Starting with:
- `transformers==4.45.1`
- `accelerate==0.34.2`
- `torch==2.3.1`
- `numpy==2.0.0`
- `autoawq==0.2.6`
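
To confirm that the installed packages match these pins, a quick check works (a minimal sketch; `importlib.metadata` is part of the standard library):

```python
# Minimal sketch: print installed versions of the packages pinned above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("transformers", "accelerate", "torch", "numpy", "autoawq"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")
```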

### For CUDA users

AutoAWQ
```python
"""NOTE: this example uses `fuse_layers=True` to fuse attention and mlp layers together for faster inference"""

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

# ...

generation_output = model.generate(
    ...
)
```
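
As a complement to the truncated example above, a minimal end-to-end sketch of the AutoAWQ path, assuming the standard `AutoAWQForCausalLM.from_quantized` API and reusing the `quant_id` from the Transformers example below (the original example's exact arguments may differ):

```python
# Hedged sketch, not the original example: load the AWQ checkpoint with
# autoawq and stream a generation.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"

# fuse_layers=True fuses attention and mlp layers, per the NOTE above
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

tokens = tokenizer("Why is the sky blue?", return_tensors="pt").input_ids.cuda()
generation_output = model.generate(tokens, streamer=streamer, max_new_tokens=512)
```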

Transformers
```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import torch

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quant_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda"
)

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt = "You're standing on the surface of the Earth. " \
         "You walk one mile south, one mile west and one mile north. " \
         "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
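
Because this is an Instruct checkpoint, the prompt can also go through the tokenizer's chat template rather than being passed as raw text. A small variant of the call above (an assumption on my part, not part of the original example):

```python
# Hypothetical variant: let the chat template add Llama 3.2's instruct
# formatting around the prompt before generation.
messages = [{"role": "user", "content": prompt}]
tokens = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).cuda()

generation_output = model.generate(tokens, streamer=streamer, max_new_tokens=512)
```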

#### Issue/Solution
- `torch.from_numpy` fails
  - This might be caused by issues in the C++ extension files of `torch==2.3.1`. Because AutoAWQ pins torch 2.3.1 rather than the most recent release, the failure can surface in `marlin.py -> def _get_perms()`.
  - Module path: `Python\Python311\site-packages\awq\modules\linear\marlin.py`
- Solution:
  - `_get_perms()` makes several round trips from tensors to numpy (CPU) and back to tensors (GPU); these can be replaced entirely with tensor operations, which (temporarily) resolves the `from_numpy()` issue. A fuller sketch follows the excerpt below.
```python
def _get_perms():
    perm = []
    for i in range(32):
        ...
    # ...
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```
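
For reference, a numpy-free version of the whole function could look like the sketch below. The permutation logic follows the upstream Marlin kernel's `_get_perms`; the lines elided in the excerpt above may differ, so treat this as an illustration of the tensor-only workaround rather than this repo's exact code:

```python
import torch

def _get_perms():
    # Build the weight permutation with torch tensors end to end, avoiding the
    # numpy round trip (np.array -> torch.from_numpy) that triggers the failure.
    perm = []
    for i in range(32):
        perm1 = []
        col = i // 4
        for block in [0, 1]:
            for row in [2 * (i % 4), 2 * (i % 4) + 1,
                        2 * (i % 4 + 4), 2 * (i % 4 + 4) + 1]:
                perm1.append(16 * row + col + 8 * block)
        for j in range(4):
            perm.extend([p + 256 * j for p in perm1])

    perm = torch.tensor(perm, dtype=torch.int64)         # was np.array(perm)
    interleave = torch.tensor([0, 2, 4, 6, 1, 3, 5, 7])  # was np.array([...])
    # reshape/index/ravel all exist on torch tensors, so no from_numpy is needed
    perm = perm.reshape((-1, 8))[:, interleave].ravel()

    scale_perm = []
    for i in range(8):
        scale_perm.extend([i + 8 * j for j in range(8)])
    scale_perm_single = []
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```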