|
---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
library_name: transformers
pipeline_tag: text-generation
tags:
- facebook
- meta
- pytorch
- llama-3
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
|
# Llama 3.2 1B Instruct AWQ
|
A quantized version of Llama 3.2 1B Instruct, produced with [Activation-aware Weight Quantization (AWQ)](https://github.com/mit-han-lab/llm-awq).
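
For reference, a checkpoint like this one can be produced with AutoAWQ's `quantize` API. Below is a minimal sketch; the `quant_config` values (4-bit weights, group size 128, GEMM kernel) are common AutoAWQ defaults and are an assumption here, not the confirmed settings of this exact checkpoint.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

base_id = "meta-llama/Llama-3.2-1B-Instruct"
quant_path = "llama-3.2-1B-Instruct-AWQ"

# Assumed quantization settings (typical AutoAWQ defaults),
# not the verified config of this checkpoint
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Run activation-aware calibration and quantize the weights
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized checkpoint alongside its tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```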
|
|
|
## Use with transformers/autoawq |
|
Starting from the following package versions:

- `transformers==4.45.1`
- `accelerate==0.34.2`
- `torch==2.3.1`
- `numpy==2.0.0`
- `autoawq==0.2.6`
|
|
|
Tested on the following setup:

- OS = Windows
- GPU = NVIDIA GeForce RTX 3080 (10 GB)
- CPU = Intel Core i5-9600K
- RAM = 32 GB
|
|
|
### For CUDA users |
|
|
|
**AutoAWQ** |
|
|
|
NOTE: this example uses `fuse_layers=True` to fuse the attention and MLP layers together for faster inference.
|
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, TextStreamer

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"

# Load the quantized model and its tokenizer from the Hub
model = AutoAWQForCausalLM.from_quantized(quant_id, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)

# Stream generated tokens to stdout as they are produced
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Declare prompt
prompt = "You're standing on the surface of the Earth. " \
         "You walk one mile south, one mile west and one mile north. " \
         "You end up exactly where you started. Where are you?"

# Tokenize the prompt and move the input ids to the GPU
tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output in a streaming fashion
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
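
Llama 3.2 1B Instruct is a chat-tuned model, so for conversational use the prompt should be wrapped in the model's chat template rather than passed as raw text. A minimal sketch, reusing the `model`, `tokenizer`, and `streamer` objects from the example above:

```python
# Wrap the user prompt in the Llama 3.2 chat template before generating
messages = [{"role": "user", "content": prompt}]
tokens = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).cuda()

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```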
|
|
|
**Transformers** |
|
|
|
```python
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import torch

quant_id = "ciCic/llama-3.2-1B-Instruct-AWQ"
tokenizer = AutoTokenizer.from_pretrained(quant_id, trust_remote_code=True)

# Load the quantized model directly onto the GPU in half precision
model = AutoModelForCausalLM.from_pretrained(
    quant_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="cuda"
)

# Stream generated tokens to stdout as they are produced
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Convert prompt to tokens
prompt = "You're standing on the surface of the Earth. " \
         "You walk one mile south, one mile west and one mile north. " \
         "You end up exactly where you started. Where are you?"

tokens = tokenizer(
    prompt,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=512
)
```
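
The checkpoint can also be loaded through the high-level `pipeline` API; `transformers` picks up the AWQ quantization config from the repo automatically, provided `autoawq` is installed. A minimal sketch under those assumptions:

```python
from transformers import pipeline
import torch

# Build a text-generation pipeline on top of the quantized checkpoint
pipe = pipeline(
    "text-generation",
    model="ciCic/llama-3.2-1B-Instruct-AWQ",
    torch_dtype=torch.float16,
    device_map="cuda"
)

prompt = "You're standing on the surface of the Earth. " \
         "You walk one mile south, one mile west and one mile north. " \
         "You end up exactly where you started. Where are you?"

print(pipe(prompt, max_new_tokens=512)[0]["generated_text"])
```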
|
|
|
#### Issue/Solution

- `torch.from_numpy` fails
  - This appears to be caused by issues in the C++ sources of `torch==2.3.1`. Because AutoAWQ pins torch 2.3.1 rather than the most recent release, the failure can surface in `marlin.py -> _get_perms()`.
  - Module path: `Python\Python311\site-packages\awq\modules\linear\marlin.py`
- Solution:
  - `_get_perms()` converts data to NumPy (CPU) and then back to a tensor (GPU) several times; those operations can be performed entirely with torch tensors, avoiding NumPy and thus (temporarily) working around the `from_numpy()` issue. The patched version:
|
```python
import torch  # already imported at the top of marlin.py


def _get_perms():
    # Build the weight permutation used by the Marlin kernel
    perm = []
    for i in range(32):
        perm1 = []
        col = i // 4
        for block in [0, 1]:
            for row in [
                2 * (i % 4),
                2 * (i % 4) + 1,
                2 * (i % 4 + 4),
                2 * (i % 4 + 4) + 1,
            ]:
                perm1.append(16 * row + col + 8 * block)
        for j in range(4):
            perm.extend([p + 256 * j for p in perm1])

    # Stay on torch tensors instead of round-tripping through NumPy,
    # which removes the failing torch.from_numpy() call
    # perm = np.array(perm)
    perm = torch.asarray(perm)
    # interleave = np.array([0, 2, 4, 6, 1, 3, 5, 7])
    interleave = torch.asarray([0, 2, 4, 6, 1, 3, 5, 7])
    perm = perm.reshape((-1, 8))[:, interleave].ravel()
    # perm = torch.from_numpy(perm)

    # Permutations applied to the quantization scales
    scale_perm = []
    for i in range(8):
        scale_perm.extend([i + 8 * j for j in range(8)])
    scale_perm_single = []
    for i in range(4):
        scale_perm_single.extend([2 * i + j for j in [0, 1, 8, 9, 16, 17, 24, 25]])
    return perm, scale_perm, scale_perm_single
```
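
Note that this edit lives inside the installed `awq` package, so it will be overwritten whenever `autoawq` is reinstalled or upgraded; treat it as a stopgap until the underlying `torch.from_numpy` issue is fixed upstream.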